\documentclass[11pt, a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{hyperref}
\usepackage{graphicx}
\graphicspath{{./img/}}

\title{Text Mining Lab \\ Training Rasa-Chatbots with Text \\ Project Report}
\author{David Fuhry \\ Leonard Haas \\ Lukas Gehrke \\ Lucas Schons \\ Jonas Wolff}
\date{Winter Term 2018/2019}

\begin{document}

\pagenumbering{roman}

\maketitle

\tableofcontents

\pagebreak

\pagenumbering{arabic}

\section{Project Description}

    \subsection{Conversational AI and Training}
    Conversational AI describes computer systems that users can interact with by having a
    conversation. One important goal is to make the conversation seem as natural as possible.
    Ideally, an interaction with the bot should be indistinguishable from one with a human. This
    can make communication with a computer pleasant and easy for humans, as
    they are simply using their natural language. \par
    Conversational AI can be used in voice assistants that communicate through spoken words or
    in chatbots that imitate a human by sending text messages.

    \subsection{Rasa Framework}
    Rasa is a collection of tools for building conversational AI software. The \textit{Rasa Stack} consists
    of two open-source libraries called \textit{Rasa NLU} and \textit{Rasa Core} that can be used to create contextual
    chatbots. \par
    A Rasa bot needs training data for both components to work properly.

    \subsection{Research Question}
    The objective of this project is to find out whether chatbots can be trained with natural
    language texts \textit{automatically}. There are two initial research questions:
    \begin{quotation}
        \noindent Can the required facts be extracted from natural language text? \\
        Can this be done automatically?
    \end{quotation}

    \subsection{Project Goals}
    With regard to the given research questions, this project aims at implementing procedures to
    extract information from natural language text and make that information accessible to a
    chatbot. The concrete goals are:
   \begin{enumerate}
        \item Define possible intents fitting the given domain.
        \item Configure a chatbot that recognizes these intents and the linked entities.
        \item Acquire data and implement processing that
        extracts the required information.
        \item Give the chatbot access to the extracted data so that it can create answers for given entities
        and intents.
   \end{enumerate}

   Development of the bot is focused on a proof of concept rather than a production-ready
   conversation flow.
   Therefore, the natural conversation abilities of the bot will be limited.

\section{Data Processing}

    \subsection{R Package 'wikiproc'}
    All functionality to extract facts and download data from Wikipedia, as well as some utility
    functions, is encapsulated inside the \textit{wikiproc} R package.
    This allows for better dependency management as well as the inclusion of unit tests for the fact
    extraction methods.

    \begin{table}[h]
        \centering
        \begin{tabular}{| l | l |}
            \hline
            Function & Category \\ \hline \hline
            clean\_html & Utility \\ \hline
            create\_annotations & Utility \\ \hline
            init\_nlp & Utility \\ \hline
            get\_data & Data scraping \\ \hline
            get\_awards & Fact extraction \\ \hline
            get\_birthdate & Fact extraction \\ \hline
            get\_birthplace & Fact extraction \\ \hline
            get\_spouse & Fact extraction \\ \hline
            get\_university & Fact extraction \\ \hline
        \end{tabular}
        \caption{Exported functions of the \textit{wikiproc} package}
        \label{table:wikiproc_table}
    \end{table}
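
    To illustrate how these functions are intended to interact, the following sketch shows a
    hypothetical call sequence. The function names are taken from Table \ref{table:wikiproc_table},
    but their signatures are simplified for illustration and may differ from the actual package
    interface.

    \begin{lstlisting}[language=R, basicstyle=\ttfamily\small]
library(wikiproc)

# Initialize the NLP backend once per session (utility function).
init_nlp()

# Download and clean the articles of all physicists on the list page.
# Arguments are omitted here; the real signatures may differ.
articles <- get_data()

# Create NER/POS annotations, then extract individual facts.
annotations <- create_annotations(articles)
awards      <- get_awards(articles, annotations)
spouses     <- get_spouse(articles, annotations)
    \end{lstlisting}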

    \subsection{Data Acquisition}
    Wikipedia was chosen as a resource because it provides relatively long texts in a
    somewhat uniform manner.
    While Wikipedia does have a \textit{Physicists} category\footnote{\url{https://en.wikipedia.org/wiki/Category:Physicists}},
    it is fragmented into somewhat arbitrary subcategories and thus not optimal to use as a
    collection.
    However, Wikipedia also features a ``List of physicists''\footnote{\url{https://en.wikipedia.org/wiki/List_of_physicists}} which contains 981 articles
    that were used to build the corpus. \par
    Data scraping was done using the R package \textit{WikipediR}, which is a wrapper around the Wikipedia
    API.
    Articles were downloaded as HTML\footnote{HTML was chosen over wikitext to ease text cleaning}
    and afterwards stripped of all HTML tags and quotation marks.
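
    A minimal sketch of this step is shown below. It assumes a single article and uses the
    \textit{page\_content} function of \textit{WikipediR}; the exact structure of the returned
    list may vary between package versions.

    \begin{lstlisting}[language=R, basicstyle=\ttfamily\small]
library(WikipediR)

# Fetch one article as parsed HTML via the Wikipedia API.
res <- page_content("en", "wikipedia",
                    page_name = "Albert Einstein",
                    as_wikitext = FALSE)

# The HTML is part of the parse result returned by the API
# (list structure may differ between WikipediR versions).
html <- res$parse$text[["*"]]

# Strip HTML tags and quotation marks, then normalize whitespace.
text <- gsub("<[^>]*>", " ", html)
text <- gsub("\"", "", text)
text <- gsub("\\s+", " ", text)
    \end{lstlisting}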

    \subsection{Fact Extraction}
    Fact extraction approaches vary greatly depending on the nature of the fact to extract.
    As all approaches rely on some form of NER or POS tagging, annotations were created for all
    texts.
    This was done using the R package \textit{cleanNLP} with a spaCy backend to create NER and POS
    tags, as well as lemmatization. \par
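    A minimal sketch of the annotation step, assuming the cleaned article text from above and
    following the \textit{cleanNLP} 2.x interface (accessor names differ in later versions):

    \begin{lstlisting}[language=R, basicstyle=\ttfamily\small]
library(cleanNLP)

# Use spaCy as annotation backend (requires a Python spaCy installation).
cnlp_init_spacy()

# Annotate one cleaned article text.
anno <- cnlp_annotate(text)

tokens   <- cnlp_get_token(anno)   # POS tags and lemmata per token
entities <- cnlp_get_entity(anno)  # NER spans (PERSON, DATE, ...)
    \end{lstlisting}
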
    Fact extraction for physicists' spouses was done using pre-defined patterns on word
    lemmata.\footnote{Functionality to use patterns on POS tags is also available but did not yield
    a better outcome.}
    A pattern consists of word lemmata to be matched (including wildcards) as well as defined
    places to look for the name of the physicist and his/her spouse.
    When a matching phrase is found, the result is verified by checking that the correct
    physicist is mentioned and that the potential spouse is recognized as a person by the NER
    tagger. \par
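    The following sketch illustrates the idea of such a lemma pattern match in a strongly
    simplified form; it is not the actual \textit{wikiproc} implementation. The pattern below
    uses \texttt{*} as a single-token wildcard standing in for the spouse's name.

    \begin{lstlisting}[language=R, basicstyle=\ttfamily\small]
# Return the start positions at which a lemma pattern matches.
# "*" acts as a single-token wildcard (simplified illustration).
match_pattern <- function(lemmata, pattern) {
  hits <- integer(0)
  n <- length(pattern)
  if (length(lemmata) < n) return(hits)
  for (i in seq_len(length(lemmata) - n + 1)) {
    window <- lemmata[i:(i + n - 1)]
    if (all(pattern == "*" | pattern == window)) {
      hits <- c(hits, i)
    }
  }
  hits
}

lemmata <- c("einstein", "marry", "mileva", "maric", "in", "1903")
match_pattern(lemmata, c("einstein", "marry", "*"))  # returns 1
    \end{lstlisting}

    A hit would then still have to pass the verification described above, i.e.\ the matched
    physicist name must be correct and the wildcard token must be tagged as a person. \par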
    A different approach is used for the detection of awards. It is based on the assumption
    that the NER tagger will tag awards as some kind of entity. A set of keywords is
    then used to extract all entities of interest, namely the awards.
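
    The sketch below illustrates this keyword filter over the NER output. The keyword list and
    the column name of the entity table are examples only and do not necessarily match the ones
    used in \textit{wikiproc}.

    \begin{lstlisting}[language=R, basicstyle=\ttfamily\small]
# Keep NER entities whose surface form contains an award-related keyword.
# Keywords and column names are illustrative only.
award_keywords <- c("Prize", "Medal", "Award")

extract_awards <- function(entities) {
  pattern <- paste(award_keywords, collapse = "|")
  unique(entities$entity[grepl(pattern, entities$entity)])
}

# 'entities' is the NER table produced by the annotation step above,
# with one entity mention per row in the column 'entity'.
extract_awards(entities)
    \end{lstlisting}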

\section{Chatbot Architecture}

    The chatbot built for this project uses both \textit{Rasa Stack} components, \textit{Rasa Core}
    and \textit{Rasa NLU}. The \textit{Rasa NLU} component takes user input and
    matches it to the respective intents. The \textit{Rasa Core} component executes all actions
    associated with the determined intent. The configuration has been organized with reference to
    examples from the Rasa GitHub repository\footnote{\url{https://github.com/RasaHQ/rasa_core/tree/master/examples}}. \par
    \textit{Rasa NLU} has been trained with example questions in Markdown format that contain highlighted
    entities. This enables the bot to understand intents and to extract the entities
    contained in the sentences. One example can be seen in Figure \ref{nlu_example}.

    \begin{figure}[ht]
        \includegraphics[width=\textwidth]{nlu_example}
        \caption{Example for intent 'nationality'}
        \label{nlu_example}
    \end{figure}

    \textit{Rasa Core} has been configured with example conversation flows, called \textit{stories}
    (Figure \ref{stories_example}), as training data, together with the \textit{domain} of the bot. The domain
    contains all actions, entities, slots, intents, and templates the bot deals with.
    \textit{Templates} are pattern strings for bot utterances. \textit{Slots} are variables that
    can hold different values. The bot proposed in this project uses a slot to store the name of a
    recognized physicist entity. According to the Rasa website\footnote{\url{https://rasa.com/docs/get_started_step2/}},
    the domain is \textit{the universe the bot is living in}.

    \begin{figure}[ht]
        \includegraphics[width=\textwidth]{stories_example}
        \caption{Example of a story associated with the intent 'nationality'}
        \label{stories_example}
    \end{figure}

    The bot recognizes the intents shown in Table \ref{table:intent_table}. It can be started by issuing \textit{make} commands. For further details,
    refer to the README\footnote{\url{https://git.informatik.uni-leipzig.de/text-mining-chatbot/wiki-rasa/blob/master/README.md}}.

    \begin{table}[ht]
        \centering
            \begin{tabular}{| c | l | l |}
                \hline
                No & Intent & Example \\ \hline
                1 & birthdate & When was Albert Einstein born \\ \hline
                2 & nationality & Where was Albert Einstein born \\ \hline
                3 & day of death & When did Albert Einstein die \\ \hline
                4 & place of death & Where did Albert Einstein die \\ \hline
                5 & is alive & Is Albert Einstein still alive \\ \hline
                6 & spouse & Who was Albert Einstein married to \\ \hline
                7 & primary education & Where did Albert Einstein go to school \\ \hline
                8 & university & Which university did Albert Einstein attend \\ \hline
                9 & area of research & What was Albert Einstein's area of research \\ \hline
                10 & workplace & Where did Albert Einstein work \\ \hline
                11 & awards & What awards did Albert Einstein win \\ \hline
            \end{tabular}
            \caption{Intents that are recognized by the bot}
            \label{table:intent_table}
    \end{table}

    The file \textit{data.tsv} forms the center of the project architecture (Figure~\ref{project_architecture}) and
    links the bot to the functionality in the \textit{wikiproc} package.
    It is produced by the 'Master.R' script inside the '/processing' directory, which uses the \textit{wikiproc} package,
    and contains one column per intent and one row per physicist entity.
    Using custom actions, the bot can iterate over this table and look up the result matching a recognized intent and entity.
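
    The lookup performed by the custom actions can be pictured as follows. The sketch is written
    in R purely for illustration (the actual custom actions are part of the Rasa bot), and the
    column name \textit{name} for the physicist entity is an assumption.

    \begin{lstlisting}[language=R, basicstyle=\ttfamily\small]
# Read the fact table produced by Master.R: one row per physicist,
# one column per intent (column name 'name' assumed for illustration).
facts <- read.delim("data.tsv", stringsAsFactors = FALSE)

lookup_fact <- function(facts, physicist, intent) {
  row <- facts[facts$name == physicist, , drop = FALSE]
  if (nrow(row) == 0) return(NA_character_)
  row[[intent]]
}

lookup_fact(facts, "Albert Einstein", "spouse")
    \end{lstlisting}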

    \begin{figure}[ht]
        \includegraphics[width=\textwidth]{Wiki_Chatbot_Architecture}
        \caption{Overview of the Project Architecture}
        \label{project_architecture}
    \end{figure}

\section{Results}
Evaluating the Rasa framework, we come to a mixed conclusion.
On the one hand, at the beginning of the project the setup and configuration of the bot caused
considerable problems because of outdated documentation.
Therefore, a lot of time had to be spent on trial and error to understand the functionality of
the framework. On the other hand, the NLU functionality of the Rasa stack recognizes intents
expressed in the input with high precision, even far beyond the provided
training examples. It was possible to configure the bot to meet our needs without any
restrictions. \par
Wikipedia articles are particularly well suited for information extraction
because they are generally composed consistently. However, the varying levels of detail, and therefore of available
information, were an issue when using these articles. \par
Concluding the text mining part of our project, we can state that the functions
relying mainly on NER tags (get\_awards.R and get\_university.R) have high recall and relatively low
precision. The function get\_spouses.R, which works with pattern matching, has low recall
and high precision. It needs to be emphasized that the quality of the results depends strongly on
the provided data. \par
We were thus able to demonstrate that extracting facts about pre-defined intents from text so that
they can be used by a chatbot is indeed possible.
We did not address the automatic generation of new intents from text, which was outside the scope of this project
but would make for a logical continuation of our work.

\end{document}