diff --git a/README.md b/README.md
index 07ce035bf618142d04c2d9daba6593ae58344930..df8de47b0334e8bc08072361a6c289d649834d37 100644
--- a/README.md
+++ b/README.md
@@ -40,7 +40,7 @@ Use that file to generate xml dump at wikipedias [Export page](https://en.wikipe
 
 # ExtractFromXML.Rasa
 
-Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory. For convenience will also create a texte.RDS file, load with `texte <- read.RDS("../data/texte.RDS")`.
+Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory; use `read.table` to import it. For convenience it will also create a texte.RDS file; load it with `texte <- readRDS("../data/texte.RDS")`.
 
 **NOTE:** For the script to work, the first line of the xml needs to be replaced with `<mediawiki xml:lang="en">`.
 
diff --git a/r/ExtractFromXML.R b/r/ExtractFromXML.R
index ac13f0b5f1385ad4989319d50effc069a5824d9d..2af3bc4230664f514f71a2e68189792a0bdcb4cb 100644
--- a/r/ExtractFromXML.R
+++ b/r/ExtractFromXML.R
@@ -13,4 +13,6 @@ texts <- sapply(text.nodes, xml_text)
 
 df.out <- data.frame(Title = titles, Text = texts)
 
-write.csv2(df.out, "../data/texte.csv")
+saveRDS(df.out, "../data/texte.RDS")
+
+write.table(df.out, "../data/texte.csv")
\ No newline at end of file
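
The changed README text describes how to load the two output files back into R. The sketch below is a minimal, illustrative example of that step (it is not part of the patch): it assumes the script has already been run and that the working directory is the r/ folder, so the relative `../data/` paths from the script resolve; the variable name `texte_txt` is only used here for illustration.

```r
## Minimal sketch: re-loading the files written by ExtractFromXML.R.
## Assumes the working directory is the r/ folder, matching the
## relative paths used in the script.

# Fast path: the serialized data frame written with saveRDS()
texte <- readRDS("../data/texte.RDS")

# Plain-text path: write.table() output, re-imported with read.table().
# write.table's defaults produce a whitespace-separated file with quoted
# strings, a header row, and row names; read.table pairs with those
# defaults, using the first column as row names because the header row
# has one field fewer than the data rows.
texte_txt <- read.table("../data/texte.csv", header = TRUE)

str(texte)          # data.frame with Title and Text columns
head(texte$Title)
```

Either path yields the same Title/Text data frame; the RDS route simply avoids re-parsing the text file and restores the data frame exactly as it was saved.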