Commit 3555d245 authored by David Fuhry's avatar David Fuhry

Changes described in previous commit

parent 34cb0be3
Merge request !6: Use wikipedia api
@@ -35,12 +35,7 @@ python -m rasa_core.run -d models/dialogue -u models/current/nlu
 ### PhysicistsList.R
-Will crawl wikipedias [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and save them in a file *Physicists.txt* in the data directory.
-Use that file to generate xml dump at wikipedias [Export page](https://en.wikipedia.org/wiki/Special:Export)
-### ExtractFromXML.R
-Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory, use `read.table` to import. For convenience will also create a texte.RDS file, load with `texte <- readRDS("../data/texte.RDS")`.
-**NOTE:** For the script to work, the first line of the xml needs to be replaced with `<mediawiki xml:lang="en">`.
+Will crawl Wikipedia's [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and use that list to download the corresponding articles from the Wikipedia API.
+Will generate a CSV containing the gathered articles in the data directory, as well as an RDS object containing the same data in binary form.
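As a rough illustration of the new approach, the snippet below sketches fetching one article's plain text through the MediaWiki API. This is not the repository's actual code: `build_query_url` and `get_article` are hypothetical helper names, and the `jsonlite` package is assumed to be installed.

```r
# Build the MediaWiki API query URL for a given page title
# (hypothetical helper, for illustration only).
build_query_url <- function(title) {
  paste0(
    "https://en.wikipedia.org/w/api.php",
    "?action=query&format=json&prop=extracts&explaintext=true",
    "&titles=", URLencode(title, reserved = TRUE)
  )
}

# jsonlite::fromJSON can read straight from a URL; the page text sits
# under query$pages[[<pageid>]]$extract in the parsed response.
get_article <- function(title) {
  res <- jsonlite::fromJSON(build_query_url(title))
  page <- res$query$pages[[1]]
  data.frame(Title = page$title, Text = page$extract,
             stringsAsFactors = FALSE)
}
```

One article per request keeps the sketch simple; the real script presumably batches titles, since the API accepts multiple `titles=` values separated by `|`.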
@@ -72,4 +72,5 @@ articles <- do.call(rbind, articles)
 write.table(articles, "../data/articles.csv")
+saveRDS(articles, "../data/articles.RDS")
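The added `saveRDS` line means the gathered articles can be restored either from the plain-text table or from the binary RDS file. A minimal round-trip sketch (using a temporary directory and made-up sample rows so it is self-contained, rather than the repo's `../data/` paths):

```r
# Sample data standing in for the crawled articles.
articles <- data.frame(
  Title = c("Albert Einstein", "Marie Curie"),
  Text  = c("German-born theoretical physicist.", "Pioneer of radioactivity research."),
  stringsAsFactors = FALSE
)

csv_path <- file.path(tempdir(), "articles.csv")
rds_path <- file.path(tempdir(), "articles.RDS")

write.table(articles, csv_path)   # plain-text table, as in the script
saveRDS(articles, rds_path)       # binary R object, as in the script

# read.table restores the table from text; readRDS restores the object
# exactly, including column types.
from_csv <- read.table(csv_path)
from_rds <- readRDS(rds_path)
```

`readRDS` is the more faithful of the two, since `read.table` has to re-parse text and re-infer column types.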