diff --git a/INSTALL.md b/INSTALL.md
new file mode 100644
index 0000000000000000000000000000000000000000..1c196d79b70889b096647caf2abf01f07a87768d
--- /dev/null
+++ b/INSTALL.md
@@ -0,0 +1,127 @@
+# Install instructions
+
+This guide provides instructions for setting up the software on a freshly installed Debian 9 system. It will most likely work on any recent Ubuntu system too, though there may be some hiccups with the Python versions.
+
+## Installing Debian
+
+This assumes a standard install of Debian was made using the [smallcd AMD64](https://www.debian.org/distrib/netinst#smallcd) Debian image. It was tested selecting only the base system with the standard system utilities (which contain Python) and no GUI.
+This guide assumes that a user named rasa was created during setup, though it should not be hard to adapt to a different user name.
+
+### Hypervisor specific steps
+
+#### Hyper-V
+
+Nothing to do, works out of the box.
+
+#### KVM
+
+Not tested.
+
+#### VirtualBox
+
+Works.
+
+## Installing sudo
+
+Though not required, we'll make rasa a sudoer for convenience.
+
+First log in as root and run
+
+```shell
+apt-get install sudo
+```
+
+Next we'll make the `rasa` user a sudoer
+
+```shell
+usermod -aG sudo rasa
+```
+
+All done here. `exit` and log in as rasa.
+
+## Setting up Python for cleanNLP
+
+First we update the system to make sure everything is current. We'll also need gcc and git, so go ahead and install them.
+
+```shell
+sudo apt-get update && sudo apt-get dist-upgrade -y && sudo apt-get install gcc git build-essential python-dev -y
+```
+
+Next, install Miniconda:
+
+```shell
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+bash Miniconda3-latest-Linux-x86_64.sh
+```
+
+Defaults are fine here.
+
+Log out and back in.
+
+Now we create an environment for spaCy and install it:
+
+```shell
+conda create -n spcy python=3
+conda activate spcy
+pip install spacy
+python -m spacy download en
+conda deactivate
+```
+
+## Installing R
+
+_There is a script that will do all of this for you. If you want to use it, skip ahead to **Cloning the project** and be sure to execute the script as described there._
+
+We need to add the CRAN repository to `sources.list`, as the R packages in the Debian repositories are somewhat out of date.
+
+For that we'll need a few packages
+
+```shell
+sudo apt install dirmngr --install-recommends
+sudo apt install software-properties-common apt-transport-https -y
+```
+
+Now we'll add the signing key for the CRAN repository and then the repository itself
+
+```shell
+sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
+sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/debian stretch-cran35/'
+```
+
+Finally we can install R
+
+```shell
+sudo apt-get update
+sudo apt-get install r-base-dev
+```
+
+While we're at it, we install a few system libraries that some R packages need, and also git.
+
+```shell
+sudo apt-get install libcurl4-openssl-dev libssl-dev libxml2-dev git -y
+```
+
+## Cloning the project
+
+Run:
+
+```shell
+git clone https://git.informatik.uni-leipzig.de/text-mining-chatbot/wiki-rasa.git
+cd wiki-rasa
+```
+
+_If you skipped the steps above, run the install script now._
+
+```shell
+./install.sh
+```
+
+## Installing R Packages
+
+This needs to be done from an interactive R console, as R will ask whether to use a personal library the first time packages are installed. To do this, open R and type the following:
+
+```r
+install.packages(readLines("packages.list"))
+```
+
+This will install all the packages required. When asked if you want to use a personal library, say yes and accept the defaults.
\ No newline at end of file
diff --git a/README.md b/README.md
index 4d448cbc7a6ed2727f6816f170bf414575622391..d0b0e4a8cbb46cb8f2d147e3f5b6d1170ea2be5e 100644
--- a/README.md
+++ b/README.md
@@ -1,41 +1,37 @@
 # Wiki Rasa
-### Installation
-2 options:
+## Contributing
-1. Option: Have Python 3.6.6 installed, or downgrade from 3.7 (not yet supported by Tensorflow)
-Then install rasa core with ```pip install rasa_core``` and rasa nlu with ```pip install rasa_nlu```.
-2. Option: Install Anaconda, create a Python 3.6.6 environment, and then install rasa.
+Before merging, please check the following:
+* If your script uses any libraries, check whether they are in `packages.list` and add them if not
+* Does your contribution require any additional configuration? If so, please update `README.md` and `INSTALL.md`
+  * Some R packages require system-level libraries on OS X and Linux; if that is the case, make sure they are added to `INSTALL.md` and also to `install.sh`
-### Getting the example project running
-Download [stories.md](https://github.com/RasaHQ/rasa_core/blob/master/examples/moodbot/data/stories.md), [domain.yml](https://github.com/RasaHQ/rasa_core/blob/master/examples/moodbot/domain.yml), [nlu.md](https://github.com/RasaHQ/rasa_core/blob/master/examples/moodbot/data/nlu.md).
-Create ```nlu_config.yml``` with the following content:
-```{md}
-language: en
-pipeline: tensorflow_embedding
-```
+### Writing custom feature extraction functions
-Then the model can be trained with:
-```
-# rasa core
-python -m rasa_core.train -d domain.yml -s stories.md -o models/dialogue
+When writing a function to extract a feature, use the following as guidelines:
+* Place your file in the `r` folder with an appropriate name
+* Add a function call to `Master.R` within the main apply function
+  * The parameters you pass to your function here determine what it can work with
+  * `article[1]` is the name of the physicist
+  * `article[2]` and `article[3]` contain the page and revision id respectively
+  * `article[4]` contains the raw HTML text of the article
+  * `cleaned.text` contains the cleaned text
+  * `annotations` contains the cleanNLP annotation object; to access it use the `cnlp_get_*` functions. See the [cleanNLP reference manual](https://cran.r-project.org/web/packages/cleanNLP/cleanNLP.pdf) for help.
+  * You may use additional parameters to your liking
+  * Your function will always be given data for a single article, so you do not need to vectorize it
+* Bind the output of your function to the results data frame at the very end of the main apply function
-# Natural Language processing
-python -m rasa_nlu.train -c nlu_config.yml --data nlu.md -o models --fixed_model_name nlu --project current --verbose
-```
-Afterwards you can talk to the bot with:
-```
-python -m rasa_core.run -d models/dialogue -u models/current/nlu
-```
+## Installation
+### General prerequisites
-# R Scripts
+The script assumes all the packages in the `packages.list` file are installed within R. Furthermore, you will need a spaCy installation with the English language data installed. By default the script expects to find this in a conda environment named `spcy`; if you need to change that, do so in the `ProcessNER.R` file.
-### PhysicistsList.R
-
-Will crawl Wikipedia's [List of Physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and use that list to download the corresponding articles from the Wikipedia API.
-Will generate a csv containing the gathered articles in the data directory as well as an RDS object containing the data as binary.
+For a detailed guide on installing on a Debian 9 machine, take a look at [Installation](INSTALL.md).
+## Running
+The data processing is done by the `Master.R` script in the `r` folder. This may be called via `Rscript r/Master.R` from any command line or via `source("r/Master.R")` from within R. The script assumes the working directory to be the base directory `wiki-rasa`, so make sure to either call `Rscript` from within this directory or set the working directory in R to it before sourcing.
diff --git a/install.sh b/install.sh
new file mode 100755
index 0000000000000000000000000000000000000000..c429b7e77a6b5ff8314f9678d79830e739597555
--- /dev/null
+++ b/install.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+sudo apt-get update && sudo apt-get dist-upgrade -y
+sudo apt install dirmngr --install-recommends
+sudo apt install software-properties-common apt-transport-https -y
+sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
+sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/debian stretch-cran35/'
+sudo apt-get update
+sudo apt-get install r-base-dev -y
+sudo apt-get install libcurl4-openssl-dev libssl-dev libxml2-dev -y
\ No newline at end of file
diff --git a/packages.list b/packages.list
new file mode 100644
index 0000000000000000000000000000000000000000..d61e2fb55d01e0b19aaa9b98f33b84328257a7d5
--- /dev/null
+++ b/packages.list
@@ -0,0 +1,10 @@
+pbapply
+rvest
+stringi
+textclean
+stringr
+data.table
+xml2
+WikipediR
+reticulate
+cleanNLP
\ No newline at end of file
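
Note on the README's "Running" section: `Master.R` assumes the working directory is the `wiki-rasa` base directory. One way to guard against running it from the wrong place is a small wrapper. This is only a sketch and not part of this change. The function names `in_project_root` and `run_master` are hypothetical, and the check simply looks for the two files this patch mentions (`r/Master.R` and `packages.list`).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the patch: succeeds only if the given
# directory (default: the current directory) looks like the wiki-rasa base
# directory, i.e. it contains both r/Master.R and packages.list.
in_project_root() {
    local dir="${1:-.}"
    [ -f "$dir/r/Master.R" ] && [ -f "$dir/packages.list" ]
}

# Hypothetical wrapper: run the processing script only from the base
# directory, otherwise print a hint to stderr and return a non-zero status.
run_master() {
    if in_project_root; then
        Rscript r/Master.R
    else
        echo "Please run this from the wiki-rasa base directory." >&2
        return 1
    fi
}
```

Calling `run_master` from inside the cloned `wiki-rasa` directory would then start the processing, while calling it elsewhere fails early with a message instead of an R error about missing files.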