Commit 3555d245 authored by David Fuhry's avatar David Fuhry

Changes described in previous commit

parent 34cb0be3
Merge request !6: Use wikipedia api
@@ -35,12 +35,7 @@ python -m rasa_core.run -d models/dialogue -u models/current/nlu
 ### PhysicistsList.R
-Will crawl wikipedias [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and save them in a file *Physicists.txt* in the data directory.
-Use that file to generate xml dump at wikipedias [Export page](https://en.wikipedia.org/wiki/Special:Export)
-### ExtractFromXML.R
-Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory, use `read.table` to import. For convenience will also create a texte.RDS file, load with `texte <- readRDS("../data/texte.RDS")`.
-**NOTE:** For the script to work, the first line of the xml needs to be replaced with `<mediawiki xml:lang="en">`.
+Will crawl Wikipedia's [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and use that list to download the corresponding articles from the Wikipedia API.
+Will generate a CSV containing the gathered articles in the data directory, as well as an RDS object containing the same data in binary form.
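As a rough illustration of the new approach, the snippet below sketches fetching one article's plain text through the MediaWiki API. This is not the repository's actual code: `build_query_url` and `get_article` are hypothetical helper names, and the `jsonlite` package is assumed to be installed.

```r
# Build the MediaWiki API query URL for a given page title
# (hypothetical helper, for illustration only).
build_query_url <- function(title) {
  paste0(
    "https://en.wikipedia.org/w/api.php",
    "?action=query&format=json&prop=extracts&explaintext=true",
    "&titles=", URLencode(title, reserved = TRUE)
  )
}

# jsonlite::fromJSON can read straight from a URL; the page text sits
# under query$pages[[<pageid>]]$extract in the parsed response.
get_article <- function(title) {
  res <- jsonlite::fromJSON(build_query_url(title))
  page <- res$query$pages[[1]]
  data.frame(Title = page$title, Text = page$extract,
             stringsAsFactors = FALSE)
}
```

One article per request keeps the sketch simple; the real script presumably batches titles, since the API accepts multiple `titles=` values separated by `|`.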
@@ -72,4 +72,5 @@ articles <- do.call(rbind, articles)
 write.table(articles, "../data/articles.csv")
+saveRDS(articles, "../data/articles.RDS")
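The added `saveRDS` line means the gathered articles can be restored either from the plain-text table or from the binary RDS file. A minimal round-trip sketch (using a temporary directory and made-up sample rows so it is self-contained, rather than the repo's `../data/` paths):

```r
# Sample data standing in for the crawled articles.
articles <- data.frame(
  Title = c("Albert Einstein", "Marie Curie"),
  Text  = c("German-born theoretical physicist.", "Pioneer of radioactivity research."),
  stringsAsFactors = FALSE
)

csv_path <- file.path(tempdir(), "articles.csv")
rds_path <- file.path(tempdir(), "articles.RDS")

write.table(articles, csv_path)   # plain-text table, as in the script
saveRDS(articles, rds_path)       # binary R object, as in the script

# read.table restores the table from text; readRDS restores the object
# exactly, including column types.
from_csv <- read.table(csv_path)
from_rds <- readRDS(rds_path)
```

`readRDS` is the more faithful of the two, since `read.table` has to re-parse text and re-infer column types.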