\documentclass[11pt, a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{hyperref}
\usepackage{graphicx}
\graphicspath{{./img/}}

\title{Text Mining Lab \\ Training Rasa-Chatbots with Text \\ Project Report}
\author{David Fuhry \\ Leonard Haas \\ Lukas Gehrke \\ Lucas Schons \\ Jonas Wolff}
\date{Winter Term 2018/2019}

\begin{document}

\maketitle

\tableofcontents

\pagebreak

\section{Project Description}

    \subsection{Conversational AI and Training}
    Conversational AI describes computer systems that users can interact with by having a
    conversation. One important goal is to make the conversation seem as natural as possible.
    Ideally, an interaction with the bot should be indistinguishable from one with a human. This
    makes communicating with a computer pleasant and easy for humans, as they can simply use
    their natural language.
    \\ Conversational AI can be used in voice assistants that communicate through spoken words or
    in chatbots that imitate a human by sending text messages.

    \subsection{Rasa Framework}
    Rasa is a collection of tools for conversational AI software. The \textit{Rasa Stack} consists
    of two open source libraries called \textit{Rasa NLU} and \textit{Rasa Core} that can be used to create contextual
    chatbots.
    \\ A Rasa Bot needs training data to work properly.

    \subsection{Research Question}
    The objective of this project is to find out whether chatbots can be trained with natural
    language texts \textit{automatically}. A chatbot that answers questions about a domain needs
    facts about that domain, which leads to two initial research questions:
    \begin{quotation}
        \noindent Can these facts be extracted from natural language text? \\
        Can this be done automatically?
    \end{quotation}

    \subsection{Project Goals}
    With regard to the given research questions, this project aims at implementing procedures to
    extract information from natural language text and to make that information accessible to a
    chatbot.
   \begin{enumerate}
        \item Define possible intents fitting the given domain.
        \item Configure a chatbot that recognizes these intents and the linked entities.
        \item Acquire data and implement processing that extracts the required information.
        \item Give the chatbot access to the extracted data so it can generate answers to the
        given entities and intents.
   \end{enumerate}

   Development of the bot focuses on a proof of concept rather than a production-ready
   conversation flow.
   Therefore, the natural conversation abilities of the bot will be limited.
    
\section{Data Processing}

    \subsection{R Package 'wikiproc'}
    All functionality to extract facts and to download data from Wikipedia, as well as some utility
    functions, is encapsulated in the \textit{wikiproc} R package.
    This allows for better management of dependencies as well as the inclusion of unit tests for
    the fact extraction methods.

    \begin{table}[h]
        \centering
        \begin{tabular}{| l | l |}
            \hline
            Function & Category \\ \hline \hline
            clean\_html & Utility \\ \hline
            create\_annotations & Utility \\ \hline
            init\_nlp & Utility \\ \hline
            get\_data & Data scraping \\ \hline
            get\_awards & Fact extraction \\ \hline
            get\_birthdate & Fact extraction \\ \hline
            get\_birthplace & Fact extraction \\ \hline
            get\_spouse & Fact extraction \\ \hline
            get\_university & Fact extraction \\ \hline
        \end{tabular}
        \caption{Exported functions of the wikiproc package}
        \label{table:wikiproc_table}
    \end{table}

    \subsection{Data Acquisition}
    Wikipedia was chosen as resource as it provides texts of relatively long length in a
    somewhat uniform manner.
    While Wikipedia does have a \textit{Physicists} category\footnote{\url{https://en.wikipedia.org/wiki/Category:Physicists}}, 
    it is fragmented into somewhat arbitrary subcategories and thus not optimal to use as a
    collection.
    However Wikipedia also has a \textit{List of physicists}\footnote{\url{https://en.wikipedia.org/wiki/List_of_physicists}} which contains 981 physicists and was
    used to build the collection used. \\
    Data scraping was done using the R Package \textit{WikipediR}, a wrapper around the Wikipedia
    API.
    Articles were downloaded as HTML\footnote{HTML was chosen over wikitext to ease text cleaning}
    and afterwards stripped of all HTML tags and quotation marks.
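    The following is a minimal sketch of this scraping and cleaning step. The helper names are
    hypothetical and the structure of the API response is simplified; the actual implementation
    in \textit{wikiproc} (get\_data and clean\_html) handles more edge cases.

\begin{lstlisting}[language=R]
library(WikipediR)

# Fetch the rendered HTML of a single article via the Wikipedia API.
fetch_article <- function(title) {
  res <- page_content("en", "wikipedia", page_name = title,
                      as_wikitext = FALSE)
  res$parse$text$`*`  # the rendered article as raw HTML
}

# Strip HTML tags and quotation marks, then normalize whitespace.
clean_article <- function(html) {
  text <- gsub("<[^>]*>", " ", html)          # drop all HTML tags
  text <- gsub("[\"\u201C\u201D]", "", text)  # drop quotation marks
  gsub("[[:space:]]+", " ", trimws(text))     # collapse whitespace
}

einstein <- clean_article(fetch_article("Albert Einstein"))
\end{lstlisting}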
    
    \subsection{Fact Extraction}
    Fact extraction approaches greatly vary depending on the nature of the fact to extract.
    As all approaches leverage on some form of NER or POS tagging, annotations were created for all
    texts.
    This was done using the R Package \textit{cleanNLP} with a spaCy backend to create NER and POS
    tags, as well as lemmatization. \\
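    A minimal sketch of this annotation step is shown below; the exact calls differ slightly
    between cleanNLP versions.

\begin{lstlisting}[language=R]
library(cleanNLP)

# Initialize the spaCy backend, then annotate a text. The result
# contains tokens (with lemmata and POS tags) and named entities.
cnlp_init_spacy()
anno <- cnlp_annotate("Albert Einstein married Mileva Maric in 1903.")

anno$token   # one row per token: lemma, POS tag, ...
anno$entity  # one row per entity found by the NER tagger
\end{lstlisting}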
    Fact extraction for physicists' spouses was done using pre-defined patterns on word
    lemmata.\footnote{Functionality to use patterns on POS tags is also available but did not yield
    a better outcome.}
    A pattern consists of the word lemmata to be matched (including wildcards) as well as defined
    positions in which to look for the names of the physicist and his or her spouse.
    When a matching phrase is found, the result is verified by checking that the correct
    physicist is mentioned and that the potential spouse is recognized as a person by the NER
    tagger.
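    The following strongly simplified sketch illustrates the idea, reduced to a single
    hypothetical pattern and to single-token names; the actual patterns and verification steps in
    \textit{wikiproc} are more elaborate.

\begin{lstlisting}[language=R]
# Match the lemma pattern "<person> marry <person>" and verify both
# slots against the persons found by the NER tagger. 'lemmata' is a
# sentence as a vector of lemmata, 'persons' the set of tokens the
# NER tagger recognized as persons.
match_spouse <- function(lemmata, persons, physicist) {
  for (i in which(lemmata == "marry")) {
    if (i > 1 && i < length(lemmata)) {
      left  <- lemmata[i - 1]
      right <- lemmata[i + 1]
      if (left %in% persons && right %in% persons &&
          physicist %in% c(left, right)) {
        return(setdiff(c(left, right), physicist))  # the spouse
      }
    }
  }
  NULL  # no verified match found
}

match_spouse(c("Einstein", "marry", "Maric", "in", "1903"),
             persons = c("Einstein", "Maric"),
             physicist = "Einstein")  # returns "Maric"
\end{lstlisting}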
    A different approach is used for the get\_awards() function. It is based on the assumption
    that the NER tagger will tag awards as some kind of entity. A set of keywords is then used to
    extract all entities of interest, namely the awards.
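    A sketch of this filtering is given below, with a hypothetical keyword list and under the
    assumption that the NER results are available as a data frame with one entity per row.

\begin{lstlisting}[language=R]
# Keep only those entities whose text contains an award keyword.
award_keywords <- c("Prize", "Medal", "Award")

filter_awards <- function(entities) {
  pattern <- paste(award_keywords, collapse = "|")
  unique(entities$entity[grepl(pattern, entities$entity)])
}

ents <- data.frame(entity = c("Nobel Prize in Physics",
                              "Copley Medal", "Zurich"))
filter_awards(ents)  # "Nobel Prize in Physics" "Copley Medal"
\end{lstlisting}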

\section{Chatbot Architecture}

    The chatbot built for this project uses both Rasa Stack components, \textit{Rasa Core}
    and \textit{Rasa NLU}. The \textit{Rasa NLU} component takes care of parsing user input and
    matching it to the respective intents. The \textit{Rasa Core} component executes all actions
    associated with the determined intent. The configuration has been organized following
    examples from the Rasa GitHub repository\footnote{\url{https://github.com/RasaHQ/rasa_core/tree/master/examples}}. \\
    Rasa NLU has been trained with example questions in markdown format that contain highlighted
    entities. This ensures that the bot is able to understand intents and to extract the entities
    inside the sentences. One example can be seen in figure \ref{nlu_example}.

    \begin{figure}[h]
        \centering
        \includegraphics[width=\textwidth]{nlu_example}
        \caption{Example of Rasa NLU training data in markdown format}
        \label{nlu_example}
    \end{figure}
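    The markdown format groups example utterances under an intent heading and annotates entities
    inline. The utterances below are hypothetical and only illustrate the format; they are not
    the project's actual training data.

\begin{lstlisting}
## intent:birthdate
- When was [Albert Einstein](physicist) born
- What is the birthdate of [Marie Curie](physicist)

## intent:spouse
- Who was [Albert Einstein](physicist) married to
\end{lstlisting}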

    Rasa Core has been configured with \textit{stories}, which contain example conversation flows
    as training data (figure \ref{stories_example}), and with the \textit{domain} of the bot. The
    domain contains all actions, entities, slots, intents, and templates the bot deals with.
    \textit{Templates} are pattern strings for bot utterances. \textit{Slots} are variables that
    can hold different values; the bot proposed in this project uses a slot to store the name of a
    recognized physicist entity. According to the Rasa website\footnote{\url{https://rasa.com/docs/get_started_step2/}},
    the domain is \textit{the universe the bot is living in}. A hypothetical excerpt of a domain
    definition is sketched after figure \ref{stories_example}.

    \begin{figure}[h]
        \centering
        \includegraphics[width=\textwidth]{stories_example}
        \caption{Example of Rasa Core training stories}
        \label{stories_example}
    \end{figure}
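    The following is a minimal, hypothetical excerpt of such a domain definition in the YAML
    format used by Rasa Core; the project's actual domain file contains more intents, actions,
    and templates.

\begin{lstlisting}
intents:
  - birthdate
  - spouse

entities:
  - physicist

slots:
  physicist:
    type: text

templates:
  utter_greet:
    - "Hello! Ask me something about a physicist."

actions:
  - utter_greet
  - action_birthdate
\end{lstlisting}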

    The bot recognizes the intents shown in table \ref{table:intent_table}. It can be started by
    issuing \textit{make} commands. For further details, refer to the README\footnote{
    \url{https://git.informatik.uni-leipzig.de/text-mining-chatbot/wiki-rasa/blob/master/README.md}}.
    An overview of the overall architecture is shown in figure \ref{fig:architecture}.

    \begin{table}[h]
        \centering
        \begin{tabular}{| c | l | l |}
            \hline
            No. & Intent & Example \\ \hline \hline
            1 & birthdate & When was Albert Einstein born \\ \hline
            2 & nationality & Where was Albert Einstein born \\ \hline
            3 & day of death & When did Albert Einstein die \\ \hline
            4 & place of death & Where did Albert Einstein die \\ \hline
            5 & is alive & Is Albert Einstein still alive \\ \hline
            6 & spouse & Who was Albert Einstein married to \\ \hline
            7 & primary education & Where did Albert Einstein go to school \\ \hline
            8 & university & Which university did Albert Einstein attend \\ \hline
            9 & area of research & What was Albert Einstein's area of research \\ \hline
            10 & workplace & Where did Albert Einstein work \\ \hline
            11 & awards & What awards did Albert Einstein win \\ \hline
        \end{tabular}
        \caption{Intents that are recognized by the bot}
        \label{table:intent_table}
    \end{table}

    \begin{figure}[h]
        \centering
        \includegraphics[width=\textwidth]{Wiki_Chatbot_Architecture}
        \caption{Architecture of the chatbot}
        \label{fig:architecture}
    \end{figure}


\section{Results}
Evaluating the Rasa framework, we come to an ambivalent result.
On the one hand, at the beginning of the project the setup and configuration of the bot caused
considerable problems because of outdated documentation.
Therefore, a lot of time had to be spent on trial-and-error procedures to understand the
functionality of the framework. On the other hand, the NLU functionality of the Rasa Stack
recognizes intents expressed in the input with high precision, even far beyond the provided
training examples. It was possible to configure the bot to meet our needs without any
restrictions. \par
Wikipedia articles are particularly well suited for information extraction because they are
generally composed consistently. However, the varying levels of detail, and therefore of
available information, across articles were an issue. \par
Concluding the text mining part of our project, we can state that the functions relying
mainly on NER tags (get\_awards.R and get\_university.R) have high recall and relatively low
precision, whereas the function get\_spouses.R, which works with pattern matching, has low
recall and high precision. It needs to be emphasized that the quality of the results strongly
depends on the provided data. \par
With regard to our research questions: we were able to extract facts about
pre-determined intents from text and to make them available to the bot. The next logical step,
which we were unable to address within the scope of this project, would be to generate intents
automatically from text.

\end{document}