\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{hyperref}
\usepackage{graphicx}
\graphicspath{{./img/}}

\title{Text Mining Lab \\ Training Rasa-Chatbots with Natural Language Texts \\ Project Report}
\author{David Fuhry \\ Leonard Haas \\ Lukas Gehrke \\ Lucas Schons \\ Jonas Wolff}
\date{Winter Term 2018/2019}

\begin{document}

\maketitle

\tableofcontents

\pagebreak

\section{Project Description}

    \subsection{Conversational AI and Training}
    Conversational AI describes computer systems that users interact with by having a
    conversation. One important goal is to make the conversation seem as natural as possible.
    Ideally, a user should not be able to tell that they are talking to a machine. This
    can make communication with a computer pleasant and easy for humans, as
    they simply use their natural language. Moreover, there is no need for menu-based
    interaction with the system and thus no learning curve.
    \\ Conversational AI can be used in voice assistants that communicate through spoken words or
    in chatbots that imitate a human by exchanging text messages.

    \subsection{Rasa Framework}
    Rasa is a collection of tools for building conversational AI software. The Rasa Stack contains two
    open source libraries, Rasa NLU and Rasa Core, that can be used to create contextual
    chatbots. Rasa NLU is a library for natural language understanding that provides intent classification
    and entity extraction. Rasa Core is a chatbot framework with machine-learning-based dialogue
    management. Both can be used independently, but Rasa recommends using them together.
    \\ A Rasa bot needs training data to work properly. The NLU component must be provided with example questions for each \textit{intent} it will have to deal with. Inside these questions, \textit{entities} must be marked in order to teach Rasa where to extract them from.
    The Core component requires example conversation flows and utterance templates for training. Examples can be seen in section \ref{rasa_chatbot}.
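    To illustrate the markup, a minimal example in Rasa's markdown training format is shown below. The intent and entity names are chosen to match this project's setup, but the concrete sentences are hypothetical:

    \begin{lstlisting}[caption={Minimal NLU training example (illustrative)}]
## intent:birthdate
- When was [Albert Einstein](physicist) born
- What is the birthdate of [Marie Curie](physicist)
    \end{lstlisting}

    Each line is one example utterance; the square brackets mark an entity and the parentheses name its type, so the NLU model learns both the intent and where the entity appears.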

    \subsection{Research Question}
    The objective of this project is to find out whether chatbots can be trained with natural
    language texts \textit{automatically}. Given that chatbots need to be trained with
    knowledge, called facts, there are two initial research questions:
    \begin{itemize}
        \item Can these facts be extracted from natural language text?
        \item Can this be done automatically?
    \end{itemize}
    
\section{Approach}

    \subsection{Project Goals}

    \subsection{Rasa Setup and Intents}

    \subsection{Scraping of Source Texts}
    Wikipedia was chosen as the source of texts, as it provides relatively long texts in a fairly uniform format.
    While Wikipedia does have a \textit{Physicists} category\footnote{\url{https://en.wikipedia.org/wiki/Category:Physicists}},
    it is fragmented into somewhat arbitrary subcategories and thus not optimal to use as a collection.
    However, Wikipedia also has a \textit{List of physicists} page, which contains 981 physicists and was used to build the collection. \\
    Data scraping was done using the R package \textit{WikipediR}, a wrapper around the Wikipedia API.
    Articles were downloaded as HTML\footnote{HTML was chosen over wikitext to ease text cleaning.} and afterwards stripped of all HTML tags and quotation marks.
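    The download-and-clean step can be sketched as follows. This is a minimal illustration, not the actual \textit{wikiproc} implementation; the helper names \texttt{fetch\_article} and \texttt{clean\_article} are hypothetical, and the exact shape of WikipediR's response object may differ between versions:

    \begin{lstlisting}[caption={Sketch of article scraping with WikipediR (illustrative)}]
library(WikipediR)

# Fetch the rendered HTML of one article from the English Wikipedia.
fetch_article <- function(title) {
  res <- page_content("en", "wikipedia", page_name = title,
                      as_wikitext = FALSE)
  res$parse$text$`*`  # rendered HTML of the page body
}

# Strip HTML tags and quotation marks, as described above.
clean_article <- function(html) {
  text <- gsub("<[^>]+>", "", html)
  gsub("[\"']", "", text)
}

einstein <- clean_article(fetch_article("Albert Einstein"))
    \end{lstlisting}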
    
    \subsection{Fact Extraction Approaches}
    Fact extraction varies greatly depending on the nature of the fact to extract.
    As all approaches rely on some form of NER or POS tagging, annotations were created for all texts.
    This was done using the R package \textit{cleanNLP} with a spaCy backend to create NER and POS tags as well as lemmatization. \\
    Fact extraction for physicists' spouses was done using pre-defined patterns on word lemmata.\footnote{Functionality to use patterns on POS tags is also available but did not yield a better outcome.}
    A pattern consists of word lemmata to be matched (including wildcards) as well as defined places to look for the names of the physicist and his or her spouse.
    When a matching phrase is found, the result is verified by checking that the corresponding physicist is mentioned and that the potential spouse is detected as a person by the NER tagger.
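    The pattern idea can be sketched as follows. The pattern, helper name, and placeholder syntax are invented for illustration and do not mirror the \textit{wikiproc} internals:

    \begin{lstlisting}[caption={Sketch of lemma-based pattern matching (illustrative)}]
# "*" is a wildcard, "<SPOUSE>" marks where to look for the spouse name.
pattern <- c("marry", "<SPOUSE>")

match_pattern <- function(lemmata, pattern) {
  # Slide the pattern over the lemma sequence of a sentence.
  for (i in seq_len(length(lemmata) - length(pattern) + 1)) {
    window <- lemmata[i:(i + length(pattern) - 1)]
    hit <- all(pattern == "<SPOUSE>" | pattern == "*" | pattern == window)
    if (hit) {
      return(window[pattern == "<SPOUSE>"])  # candidate spouse token
    }
  }
  NULL
}

# Lemmatized form of "He married Mileva in 1903."
candidate <- match_pattern(c("he", "marry", "Mileva", "in", "1903"), pattern)
# The candidate would then be verified against the NER annotations.
    \end{lstlisting}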

\section{Software Architecture}

    \subsection{Rasa Chatbot} \label{rasa_chatbot}
    The chatbot built for this project uses both Rasa Stack components, \textit{Rasa Core}
    and \textit{Rasa NLU}. The configuration was organized in reference to examples from the Rasa
    GitHub repository. \\ Rasa NLU has been trained with example questions in markdown format that
    contain highlighted entities. This ensures that the bot is able to understand intents and
    extract the entities inside the sentences. One example can be seen in listing \ref{nlu_example}. \\

    \lstinputlisting[label={nlu_example}, caption={NLU example}]{nlu_example.md}

    Rasa Core has been configured with \textit{stories} that contain example conversation flows as
    training data (listing \ref{stories_example}) and the \textit{domain} of the bot. The domain
    contains all actions, entities, slots, intents, and templates the bot deals with.
    \textit{Templates} are template strings for bot utterances. \textit{Slots} are variables that can
    hold different values. The bot proposed in this project uses a slot to store the name of a
    recognized physicist entity. According to the Rasa
    website\footnote{\url{https://rasa.com/docs/get_started_step2/}},
    the domain is ``the universe the bot is living in''. \\

    \lstinputlisting[label={stories_example}, caption={Example Story}]{stories_example.md}
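    A domain file combining these elements might look like the following sketch. The concrete slot, template, and action names are illustrative and not taken from the project's actual domain file:

    \begin{lstlisting}[caption={Sketch of a Rasa domain file (illustrative)}]
intents:
  - birthdate
  - spouse

entities:
  - physicist

slots:
  physicist:
    type: text

templates:
  utter_greet:
    - "Hello! Ask me about a physicist."

actions:
  - utter_greet
  - action_find_spouse
    \end{lstlisting}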

    The bot recognizes the intents shown in table \ref{table:intent_table} on page
    \pageref{table:intent_table}. It can be started through
    \textit{make} commands. For further details, please refer to the
    README.\footnote{\url{https://git.informatik.uni-leipzig.de/text-mining-chatbot/wiki-rasa/blob/master/README.md}}

    Development of the bot focused on a proof of concept, so the bot's natural
    conversation abilities are limited.

    \begin{table}
        \centering
            \begin{tabular}{| c | l | l |}
                \hline
                No & Intent & Example \\ \hline
                1 & birthdate & When was Albert Einstein born \\ \hline
                2 & nationality & Where was Albert Einstein born \\ \hline
                3 & day of death & When did Albert Einstein die \\ \hline
                4 & place of death & Where did Albert Einstein die \\ \hline
                5 & is alive & Is Albert Einstein still alive \\ \hline
                6 & spouse & Who was Albert Einstein married to \\ \hline
                7 & primary education & Where did Albert Einstein go to school \\ \hline
                8 & university & Which university did Albert Einstein attend \\ \hline
                9 & area of research & What was Albert Einstein's area of research \\ \hline
                10 & workplace & Where did Albert Einstein work \\ \hline
                11 & awards & What awards did Albert Einstein win \\ \hline
            \end{tabular}
            \caption{Intents that are recognized by the bot}
            \label{table:intent_table}
    \end{table}

    \subsection{R Package 'wikiproc'}
    All functionality to extract facts and download data from Wikipedia, as well as some utility functions,
    is encapsulated in the \textit{wikiproc} R package.
    This allows for better dependency management as well as the inclusion of unit tests for the fact extraction methods.
    

    \begin{table}
        \centering
        \begin{tabular}{| l | l |}
            \hline
            Function & Category \\ \hline \hline
            clean\textunderscore html & Utility \\ \hline
            create\textunderscore annotations & Utility \\ \hline
            init\textunderscore nlp & Utility \\ \hline
            get\textunderscore data & Data scraping \\ \hline
            get\textunderscore awards & Fact extraction \\ \hline
            get\textunderscore birthdate & Fact extraction \\ \hline
            get\textunderscore birthplace & Fact extraction \\ \hline
            get\textunderscore spouse & Fact extraction \\ \hline
            get\textunderscore university & Fact extraction \\ \hline
        \end{tabular}
        \caption{Exported functions of the wikiproc package}
        \label{table:wikiproc_functions}
    \end{table}
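    Using the exported functions listed above, a typical run of the package might look like the following sketch. The exact signatures and return types are assumptions made for illustration, not taken from the package documentation:

    \begin{lstlisting}[caption={Sketch of wikiproc usage (signatures assumed)}]
library(wikiproc)

init_nlp()              # initialize the cleanNLP/spaCy backend
articles <- get_data()  # scrape the articles from the list of physicists

# Annotate one article and run fact extraction on it.
ann <- create_annotations(articles[[1]])
spouse <- get_spouse(ann)
awards <- get_awards(ann)
    \end{lstlisting}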

    \subsection{Interworking of R and Rasa}

    \begin{figure}[h]
        \centering
        \includegraphics[width=\textwidth]{Wiki_Chatbot_Architecture}
        \caption{Architecture of the Wiki chatbot}
    \end{figure}


\section{Results}
    \subsection{Precision}
    \subsection{Recall}
    \subsection{Conclusion}

\section{Restraints}
    \subsection{Thoughts on Rasa}
    \subsection{Thoughts on Knowledge Extraction}

\section{Outlook}
    \subsection{Learning to Ask Approach}


\end{document}