This guide provides instructions for setting up the software on a freshly installed Debian 9 system. It will most likely work on any recent Ubuntu system too, though there may be some hiccups with the Python versions.
## Installing Debian
This assumes a standard install of Debian made using the [smallcd AMD64](https://www.debian.org/distrib/netinst#smallcd) Debian image. It was tested selecting only the base system with the standard system utilities (which include Python) and no GUI.
This guide assumes that a user named `rasa` was created during setup. If you chose a different user name, substitute it wherever `rasa` appears.
### Hypervisor specific steps
#### Hyper-V
Nothing to do, works out of the box.
#### KVM
Not tested.
#### VirtualBox
Works.
## Installing sudo
Though not required, we'll make `rasa` a sudoer for convenience.
First, log in as root and run:
```shell
apt-get install sudo
```
Next, we'll make the `rasa` user a sudoer:
```shell
usermod -aG sudo rasa
```
All done here. Run `exit` and log in as `rasa`.
## Setting up Python for cleanNLP
To be safe, first update the system. We'll also need gcc and git, so go ahead and install them.
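The update and install steps can be sketched as below. This is a minimal sketch, assuming `rasa` already has sudo rights as set up in the previous section; `install_prerequisites` is a hypothetical helper name used only for illustration, and the `-y` flag is an assumption to skip the confirmation prompts.

```shell
# Hypothetical helper, not part of the project: updates the system and
# installs the build tools needed by the later steps.
install_prerequisites() {
    # Refresh the package index and apply pending updates
    sudo apt-get update || return 1
    sudo apt-get upgrade -y || return 1
    # Compiler and version control
    sudo apt-get install -y gcc git
}
```

Paste the function into your shell and run `install_prerequisites` once, or simply run the three `apt-get` commands by hand.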
Now we create a conda environment for spaCy and install it:
```shell
conda create -n spcy python=3
conda activate spcy
pip install spacy
python -m spacy download en
conda deactivate
```
## Installing R
_There is a script that will do all these things for you. If you want to use it, skip ahead to **Cloning the project** and be sure to execute the script as described there._
We need to add the CRAN repository to `sources.list`, as the R packages in the Debian repositories are somewhat out of date.
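Done by hand, the repository step might look roughly like the sketch below. Both helper names are hypothetical, and the `stretch` codename and `cran35` suite in the usage line are assumptions that should be checked against CRAN's Debian instructions. Importing the repository signing key is deliberately left out, because the current key should be taken from CRAN directly.

```shell
# Hypothetical helper: prints the apt source line for CRAN's Debian builds.
# $1 is the Debian codename, $2 the CRAN suite suffix (both assumptions).
cran_source_line() {
    printf 'deb http://cloud.r-project.org/bin/linux/debian %s-%s/\n' "$1" "$2"
}

# Hypothetical helper: appends that line to sources.list and installs R.
add_cran_repo() {
    cran_source_line "$1" "$2" | sudo tee -a /etc/apt/sources.list > /dev/null || return 1
    # The repository signing key must be imported before this succeeds;
    # see CRAN's Debian instructions for the current key.
    sudo apt-get update || return 1
    sudo apt-get install -y r-base r-base-dev
}
```

On Debian 9 this would be called as `add_cran_repo stretch cran35`, assuming that suite is the one you want.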
_If you skipped the steps above, run the install script now:_
```shell
./install.sh
```
## Installing R Packages
This needs to be done from an interactive R console, as R will ask whether to use a personal library the first time packages are installed. To do this, open R and type the following:
```r
install.packages(readLines("packages.list"))
```
This will install all the required packages. When asked if you want to use a personal library, say yes and accept the defaults.
When writing a function to extract a feature, use the following guidelines:
* Place your file in the `r` folder with an appropriate name
* Add a function call to `Master.R` within the main apply function
* The parameters you hand to your function here determine what it can work with
  * `article[1]` is the name of the physicist
  * `article[2]` and `article[3]` contain the page and revision ID respectively
  * `article[4]` contains the raw HTML text of the article
  * `cleaned.text` contains the cleaned text
  * `annotations` contains the cleanNLP annotation object; to access it, use the `cnlp_get_*` functions. See [here](https://cran.r-project.org/web/packages/cleanNLP/cleanNLP.pdf) for help.
* You may use additional parameters to your liking
* Your function will always be given data for a single article, so you do not need to vectorize it
* Bind the output of your function to the results data frame at the very end of the main apply function
The script assumes that all the packages in the `packages.list` file are installed within R. Furthermore, you will need a spaCy installation with the English language data installed. By default the script expects to find this in a conda environment named `spcy`; if you need to change that, do so in the `ProcessNER.R` file.
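Before running anything, it can help to confirm that the expected conda environment exists. The following is a minimal sketch; `has_conda_env` is a hypothetical helper, and it assumes `conda env list` prints the environment name in the first column of each non-comment line.

```shell
# Hypothetical helper: succeeds if a conda environment with the given name exists.
has_conda_env() {
    # Skip header lines starting with '#', keep the first column (the name),
    # and look for an exact match.
    conda env list | awk '!/^#/ && NF { print $1 }' | grep -qxF -- "$1"
}
```

For example, `has_conda_env spcy || echo "conda environment spcy is missing" >&2` prints a warning if the environment has not been created yet.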
### PhysicistsList.R
Crawls Wikipedia's [List of physicists](https://en.wikipedia.org/wiki/List_of_physicists) for all physicist names and uses that list to download the corresponding articles from the Wikipedia API.
Generates a CSV file containing the gathered articles in the `data` directory, as well as an RDS file containing the same data in binary form.
For a detailed guide on installing on a Debian 9 machine, take a look at [Installation](INSTALL.md).
## Running
The data processing is done by the `Master.R` script in the `r` folder. It can be called via `Rscript r/Master.R` from any command line or via `source("r/Master.R")` from within R. The script assumes the working directory is the base directory `wiki-rasa`, so make sure to either call `Rscript` from within this directory or set the working directory in R to it before sourcing.
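To avoid starting the script from the wrong place, the command-line call can be wrapped in a small guard. This is only a sketch; `run_master` is a hypothetical helper name, and it assumes that the presence of `r/Master.R` is a good enough sign that you are in the `wiki-rasa` base directory.

```shell
# Hypothetical helper: runs Master.R only from the project base directory.
run_master() {
    if [ ! -f r/Master.R ]; then
        echo "r/Master.R not found; cd into the wiki-rasa base directory first" >&2
        return 1
    fi
    Rscript r/Master.R
}
```

From the base directory, `run_master` behaves like `Rscript r/Master.R`; anywhere else it prints a hint and returns a non-zero status.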