Compare revisions

Lukas Gehrke · David Fuhry · Lucas Schons
--- a/.gitignore
+++ b/.gitignore
+*.csv
+*.RDS
\ No newline at end of file
--- a/README.md
+++ b/README.md
@@ -31,3 +31,16 @@ python -m rasa_core.run -d models/dialogue -u models/current/nlu
 ```


+# R Scripts
+
+### PhysicistsList.R
+
+Crawls Wikipedia's [List of physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and saves them to *physicists.txt* in the data directory.
+Use that file to generate an XML dump via Wikipedia's [Export page](https://en.wikipedia.org/wiki/Special:Export).
+
+### ExtractFromXML.R
+
+Reads the XML file from the data directory and extracts the title and text of each page in the dump, then writes them to *texte.csv* in the data directory (import with `read.table`). For convenience, it also creates *texte.RDS*; load it with `texte <- readRDS("../data/texte.RDS")`.
+**NOTE:** For the script to work, the first line of the XML must be replaced with `<mediawiki xml:lang="en">`.
+
+
--- a/data/Wikipedia-20181120103842.xml
+++ b/data/Wikipedia-20181120103842.xml
-<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
+<mediawiki xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
--- a/docs/protokolle/2018-11-20.md
+++ b/docs/protokolle/2018-11-20.md
+# Meeting on 20 November 2018
+
+* Location: P90x IR-Lab
+* Attendees: everyone
+* Start: 10:00
+* End: 12:00
+
+## Agenda:
+
+    * Get Rasa running locally
+    * Review the data
+    * Collect intents for Rasa
+    * Outlook
+
+
+## Getting Rasa running
+
+    * The group members test a Rasa version on their laptops
+
+## Reviewing the data
+
+    * The bot will be built for English
+    * The topic is "physicists"
+
+    * Goal: download the data automatically
+
+### Acquisition option 1
+
+    * Wikipedia export: one person has to download the entire Wikipedia dump
+        * Extract only the physicists from it and distribute them among the group members
+
+### Acquisition option 2
+
+    * https://en.wikipedia.org/wiki/Special:Export
+        * David will use this interface to download an `xml` file covering roughly 900 physicists
+        * The XML has already been downloaded and is available in the Slack group
+
+## Writing down intents
+
+    * The group discusses intents
+        * What do we mean by what we say to the bot?
+        * What do we want to hear from the bot in response?
+
+### A few ideas for intents
+
+    * Jonas creates a file in the repo in which the intents are to be stored
+        * Group members should add example intents to this file
+        * The examples should follow the [Rasa template](https://rasa.com/docs/nlu/dataformat/)
+
+    * Parents
+    * Place of education
+    * Chair (professorship)
+    * Prizes (Nobel Prize) and honors
+    * Research focus areas
+
+### The group collects intents together
+
+    * The file mentioned above is created with intents using `Albert Einstein` as the example
+
+## Outlook
+
+### Tasks to be assigned
+
+    * Check whether Rasa NLU can correctly recognize the collected intents
+        * possibly also with more example names than just Albert Einstein (Lukas)
+
+    * Look into how to get the XML into R and how to extract knowledge from it or structure the document (David, Leonard)
+        * Goal: build a knowledge base
+
+### Next meeting
+
+    * 26 November, from 17:00, on the ninth floor
+
+
--- a/r/ExtractFromXML.R
+++ b/r/ExtractFromXML.R
+#!/usr/bin/env Rscript
+
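+library(xml2)
+
+# Parse the export. The root element must first be reduced to
+# <mediawiki xml:lang="en"> (see README), since the XPath queries
+# below do not use namespace prefixes.
+data <- read_xml("../data/Wikipedia-20181120103842.xml")
+
+# Collect the title and text of every page in the dump
+title.nodes <- xml_find_all(data, ".//title")
+
+titles <- xml_text(title.nodes)
+
+text.nodes <- xml_find_all(data, ".//text")
+
+texts <- xml_text(text.nodes)
+
+# Keep both columns as character vectors rather than factors
+df.out <- data.frame(Title = titles,
+                     Text = texts,
+                     stringsAsFactors = FALSE)
+
+# Save as RDS for convenient reloading and as a table for read.table()
+saveRDS(df.out, "../data/texte.RDS")
+
+write.table(df.out, "../data/texte.csv")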
\ No newline at end of file
--- a/r/Master.R
+++ b/r/Master.R
+#!/usr/bin/env Rscript
+
+
+### This script consolidates everything
+
+## Libraries
+
+#library(SomeLibrary)
+
+## Load Scripts
+
+cat("Sourcing R scripts... ")
+
+source("r/GetData.R")
+#source("r/getBirthday.R")
+#source("r/getSomethingElse.R")
+
+cat("Done.\n")
+
+## Fetch data
+
+cat("Starting data import...\n")
+
+articles <- getData(use.cache = TRUE)
+
+## Data processing
+
+cat("Processing data...\n")
+
+results <- lapply(articles, function(data) {
+  ## Data cleaning
+  
+  # cleaned.text <- someCleanFunctioN(data$Text)
+  
+  ## Data preprocessing/annotating
+  
+  # annotated.text <- annotationFunction(data$Text)
+  
+  ## Extract information from Text
+  
+  # someFact <- getFactFromTextFunctioN(annotated.text)
+  
+  # someOtherFact <- getOtherFactFromText(data$Text)
+  
+  ## Create Results
+  
+  # data.frame(Name = data$Name,
+  #            FactOne = someFact,
+  #            FactTwo = someOtherFact)
+  
+})
+
+results <- do.call(rbind, results)
+
+## Results are now in results
+
+## Format for Rasa
+
+# someFormatFunction(results)
--- a/r/PhysicistsList.R
+++ b/r/PhysicistsList.R
+#!/usr/bin/env Rscript
+
 ### Extract list of physicists from Wikipedia article

 library(rvest)
@@ -17,5 +19,5 @@ physicists <- physicists[nchar(physicists) > 5]
 length(physicists) <- length(physicists) - 3

 # Done
-write(physicists, "physicists.txt")
+write(physicists, "../data/physicists.txt")