# Wiki Rasa

## Overview

This repository contains all files required to download data from Wikipedia, process that data to extract facts about physicists, and build a chatbot on top of the Rasa framework with that information.

### Prerequisites

The R script assumes that all packages listed in the `packages.list` file are installed in R. You may install them with:

```{R}
install.packages(readLines('processing/packages.list'))
```

Furthermore, you will need a spaCy installation with the English language data installed. By default the script assumes to find this in a conda environment named `spcy`; if you need to change that, do so in the `Master.R` file.
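
For example, a `spcy` environment could be set up like this (a sketch only; the Python version is borrowed from the Rasa setup below and the spaCy version is not pinned by the repository):

```{bash}
conda create -n spcy python=3.6.7
source activate spcy
pip install spacy
# download the English model and link it under the shortcut "en"
python -m spacy download en
```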

To build the **wikiproc** package, navigate to the `processing` directory and run:

```{bash}
R CMD build wikiproc
R CMD INSTALL wikiproc_<version>.tar.gz
```

_Note: This will require [Rtools](https://cran.r-project.org/bin/windows/Rtools/) on Windows and possibly additional packages on *nix platforms._

To run the bot, Rasa needs to be installed. It is recommended to do this in a conda environment; you may create one with:

```{bash}
conda create -n rasa_env python=3.6.7
source activate rasa_env
```

In this environment, install rasa_nlu, rasa_core, sklearn_crfsuite, and spacy. Also download the spacy en_core_web_md language data:

```{bash}
pip install rasa_nlu
pip install rasa_core
pip install sklearn_crfsuite
pip install spacy
python -m spacy download en_core_web_md
python -m spacy link en_core_web_md en
```

### Running

The data processing side is handled by the `Master.R` script in the `processing/script` folder. The script assumes the working directory to be somewhere within the base directory `wiki-rasa`, so make sure to either call `Rscript` from within this directory or to set the working directory in R accordingly prior to sourcing. The easiest way is to call the script from the base directory of the repository:

```{bash}
Rscript processing/script/Master.R
```

This will download the required data, process it, and generate the data file required for the chatbot. After that, train the bot (don't forget to activate the conda environment if you're using one):

```{bash}
cd rasa/
make train
```

You're now ready to run the bot:

```{bash}
make run
```

### Installing on Debian

For a detailed guide on installing on a Debian 9 machine, take a look at [Installation](INSTALL.md).

### Building the Docker image

**_Work in progress_**

Run the build script for your system, e.g. `build_docker.bat` on Windows or `build_docker.sh` on Linux.

After that you should be good to start the container with:

```{bash}
docker run -it chatbot
```

_Note: This will do all processing, including the data download, inside the Docker container and thus results in a rather large image. The image size will be reduced in the future._

## Contributing
Before merging, please make sure to check the following:
* If your script uses any libraries, check whether they are in `packages.list` and add them if they are not
* Does your contribution require any additional configuration? If so, please update `README.md` and `docs/install_debian.md`
  * If your changes need any system-level changes, make sure to also add these in `Dockerfile` and `install.sh`
* Please make sure the wikiproc package can be built by calling `devtools::document()` as well as `R CMD build wikiproc` and possibly also `devtools::check()`; a command sketch follows this list
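
A quick way to run these checks from the repository root (assuming `devtools` is installed) could be:

```{bash}
cd processing
Rscript -e 'devtools::document("wikiproc")'
R CMD build wikiproc
Rscript -e 'devtools::check("wikiproc")'
```
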
### Writing custom feature extraction functions
When writing a function to extract a feature, use the following as guidelines (a sketch of such a function follows this list):
* Place your file in the `processing/wikiproc/R` folder with an appropriate name
* Add a function call to `Master.R` within the main apply function
  * The parameters you hand to your function here will determine what you may work with
    * `article[1]` is the name of the physicist
    * `article[2]` and `article[3]` contain the page and revision ID, respectively
    * `article[4]` contains the raw HTML text of the article
    * `cleaned.text` contains the cleaned text
    * `annotations` contains the cleanNLP annotation object; to access it, use the `cnlp_get_*` functions. See [here](https://cran.r-project.org/web/packages/cleanNLP/cleanNLP.pdf) for help.
    * You may use additional parameters to your liking
  * Your function will always be given data for a single article; you do not need to make your function vectorized
* Bind the output of your function to the results data frame at the very end of the main apply function
* Please don't use library imports; if possible, call functions explicitly via `::`. If you need to load a library, do so in `import_packages.R`.
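
As an illustration, a minimal feature function might look like the following. The function name `get_birthplace`, its parameters, and the use of spaCy's `GPE` entity type are hypothetical and only serve to demonstrate the guidelines above:

```{R}
#' Extract a possible birthplace for a physicist.
#'
#' Hypothetical example function; not part of the package.
#'
#' @param article.name Name of the physicist, i.e. `article[1]`.
#' @param annotations The cleanNLP annotation object for the article.
#' @return A character vector of length one, or NA if nothing was found.
#' @export
get_birthplace <- function(article.name, annotations) {
  # Access the annotation object via the cnlp_get functions,
  # calling them explicitly with :: instead of loading the library
  entities <- cleanNLP::cnlp_get_entity(annotations)
  places <- entities$entity[entities$entity_type == "GPE"]
  if (length(places) == 0) {
    return(NA_character_)
  }
  # The function receives data for a single article only,
  # so there is no need to vectorize anything here
  places[1]
}
```

In `Master.R` this would then be called once per article inside the main apply function, e.g. `get_birthplace(article[1], annotations)`, and its result bound to the results data frame at the end.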

### Steps to build

* Make sure your functions are properly commented for roxygen
  * If your function is to be visible from the outside, make sure to add `@export` to the roxygen comment
* Set the working directory to `wikiproc` and call `devtools::document()`
* Step into `processing` and use `devtools::install("wikiproc")` to install the package (see the sketch below)
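
Put together, and assuming the repository root as the initial working directory, the build might look like this in an R session:

```{R}
# Generate documentation from the roxygen comments
setwd("processing/wikiproc")
devtools::document()

# Install the package from the processing directory
setwd("..")
devtools::install("wikiproc")
```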