Commit 783a1ab8 authored by David Fuhry

Minor fixes

parent 52f20fcf
1 merge request: !3 Add xml extraction script
This commit is part of merge request !3.
@@ -40,7 +40,7 @@ Use that file to generate xml dump at wikipedias [Export page](https://en.wikipe
# ExtractFromXML.Rasa
-Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory. For convenience will also create a texte.RDS file, load with `texte <- read.RDS("../data/texte.RDS")`.
+Will read in the xml file from the data directory and extract the title and text of the pages in the dump. Will then write them to *texte.csv* in the data directory, use `read.table` to import. For convenience will also create a texte.RDS file, load with `texte <- readRDS("../data/texte.RDS")`.
**NOTE:** For the script to work, the first line of the xml needs to be replaced with `<mediawiki xml:lang="en">`.
@@ -13,4 +13,6 @@ texts <- sapply(text.nodes, xml_text)
df.out <- data.frame(Title = titles,
Text = texts)
write.csv2(df.out, "../data/texte.csv")
saveRDS(df.out, "../data/texte.RDS")
write.table(df.out, "../data/texte.csv")
\ No newline at end of file
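
The change above pairs `write.table` with `read.table` and `saveRDS` with `readRDS`. A minimal sketch of that round trip follows, assuming a small made-up data frame in place of the real dump and a temporary directory in place of `../data`; the names `out.dir`, `csv.path`, `rds.path`, `from.csv` and `from.rds` are hypothetical and not part of the commit.

```r
# Stand-in for the data frame the script builds (made-up values, not real data).
df.out <- data.frame(Title = c("Page A", "Page B"),
                     Text = c("First text", "Second text"),
                     stringsAsFactors = FALSE)

# Assumption for this sketch: write to a temporary directory, not ../data.
out.dir <- tempdir()
csv.path <- file.path(out.dir, "texte.csv")
rds.path <- file.path(out.dir, "texte.RDS")

# Same calls as in the script, only with different paths.
write.table(df.out, csv.path)
saveRDS(df.out, rds.path)

# read.table treats the first line as a header here because it has one
# field fewer than the following lines (the row-name column is unnamed).
from.csv <- read.table(csv.path, stringsAsFactors = FALSE)
from.rds <- readRDS(rds.path)

print(from.csv$Title)
print(identical(from.rds, df.out))
```

With default arguments, `write.table` also stores the row names, so `read.table` reads them back as row names rather than as a data column. The RDS file keeps the data frame as an R object, which is why `readRDS` is the more convenient way to load it.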