From 3555d245cf99dc50fb3274455e9b1ee7ab37b388 Mon Sep 17 00:00:00 2001
From: David Fuhry <david@129a-records.de>
Date: Mon, 26 Nov 2018 17:08:51 +0100
Subject: [PATCH] Changes described in previous commit

---
 README.md   | 9 ++-------
 r/GetData.R | 1 +
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index d888f68..4d448cb 100644
--- a/README.md
+++ b/README.md
@@ -35,12 +35,7 @@ python -m rasa_core.run -d models/dialogue -u models/current/nlu
 
 ### PhysicistsList.R
 
-Will crawl wikipedias [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and save them in a file *Physicists.txt* in the data directory.
-Use that file to generate xml dump at wikipedias [Export page](https://en.wikipedia.org/wiki/Special:Export)
-
-### ExtractFromXML.Rasa
-
-Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory, use `read.table` to import. For convenience will also create a texte.RDS file, load with `texte <- readRDS("../data/texte.RDS")`.
-**NOTE:** For the script to work, the first line of the xml needs to be replaced with `<mediawiki xml:lang="en">`.
+Will crawl Wikipedia's [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and use that list to download the corresponding articles from the Wikipedia API.
+Will generate a CSV containing the gathered articles in the data directory, as well as an RDS object containing the same data in binary form.
 
 
diff --git a/r/GetData.R b/r/GetData.R
index 7c5c1a9..7863a71 100644
--- a/r/GetData.R
+++ b/r/GetData.R
@@ -72,4 +72,5 @@
 articles <- do.call(rbind, articles)
 
 write.table(articles, "../data/articles.csv")
 
+saveRDS(articles, "../data/articles.RDS")
-- 
GitLab
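
The patch above adds an RDS export alongside the existing CSV. A minimal sketch of the round trip through both formats, using a small hypothetical data frame and temporary files so it is self-contained (the script itself writes to `../data/articles.csv` and `../data/articles.RDS`):

```r
# Hypothetical stand-in for the articles data frame built in GetData.R.
articles <- data.frame(title = c("Albert Einstein", "Niels Bohr"),
                       text  = c("...", "..."),
                       stringsAsFactors = FALSE)

# Temporary paths replace ../data/articles.csv and ../data/articles.RDS here.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".RDS")

write.table(articles, csv_path)   # plain-text table; import with read.table()
saveRDS(articles, rds_path)       # binary RDS; import with readRDS()

round_trip <- readRDS(rds_path)
stopifnot(identical(articles, round_trip))
```

The RDS file preserves column types and attributes exactly, which is why loading it back with `readRDS` is more convenient than re-parsing the CSV.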