From 3555d245cf99dc50fb3274455e9b1ee7ab37b388 Mon Sep 17 00:00:00 2001
From: David Fuhry <david@129a-records.de>
Date: Mon, 26 Nov 2018 17:08:51 +0100
Subject: [PATCH] Changes described in previous commit

---
 README.md   | 9 ++-------
 r/GetData.R | 1 +
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index d888f68..4d448cb 100644
--- a/README.md
+++ b/README.md
@@ -35,12 +35,7 @@ python -m rasa_core.run -d models/dialogue -u models/current/nlu
 
 ### PhysicistsList.R
 
-Will crawl wikipedias [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and save them in a file *Physicists.txt* in the data directory.
-Use that file to generate xml dump at wikipedias [Export page](https://en.wikipedia.org/wiki/Special:Export)
-
-### ExtractFromXML.Rasa
-
-Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory, use `read.table` to import.  For convenience will also create a texte.RDS file, load with `texte <- readRDS("../data/texte.RDS")`.
-**NOTE:** For the script to work, the first line of the xml needs to be replaced with `<mediawiki xml:lang="en">`.
+Will crawl Wikipedia's [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and use that list to download the corresponding articles from the Wikipedia API.
+Will generate a CSV containing the gathered articles in the data directory, as well as an RDS object containing the same data in binary form.
 
 
diff --git a/r/GetData.R b/r/GetData.R
index 7c5c1a9..7863a71 100644
--- a/r/GetData.R
+++ b/r/GetData.R
@@ -72,4 +72,5 @@ articles <- do.call(rbind, articles)
 
 write.table(articles, "../data/articles.csv")
 
+saveRDS(articles, "../data/articles.RDS")
 
-- 
GitLab
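
With this patch, GetData.R writes both a CSV (via `write.table`) and a binary RDS copy of the scraped articles. A minimal sketch of reading them back in R, assuming the relative paths used in the patch:

```r
# Read the binary RDS copy (fast, preserves column types exactly)
articles <- readRDS("../data/articles.RDS")

# Or read the CSV written by write.table(); read.table's defaults
# match write.table's defaults (row names, quoted strings)
articles_csv <- read.table("../data/articles.csv")
```

The RDS route is generally preferable for downstream R scripts, since it round-trips the data frame without any parsing; the CSV remains useful for inspection or non-R tooling.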