Commit 425a2656 authored by Lucas Schons

Fix some more spelling mistakes

parent df83ccf1
1 merge request: !71 Documentation: Final Report
@@ -57,20 +57,20 @@
\subsection{Rasa Setup and Intents}
\subsection{Scrapping of Source Texts}
-Wikipedia was choosen as resource for texts as it provides texts of relatively long length in a somewhat uniform manner.
+Wikipedia was chosen as resource for texts as it provides texts of relatively long length in a somewhat uniform manner.
While Wikipedia does have a \textit{Physicists} category\footnote{\url{https://en.wikipedia.org/wiki/Category:Physicists}},
it is fragmented into somewhat arbitrary subcategories and thus not optimal to use as a collection.
However Wikipedia also has a \textit{List of physicists} which contains 981 physicists and was used to build the collection used. \\
Data scraping was done using the R Package \textit{WikipediR}, a wrapper around the Wikipedia API.
-Articles were downloaded as HTML\footnote{HTML was choosen over wikitext to ease text cleaning} and afterwards strapped of all HTML Tags and Quotation marks.
+Articles were downloaded as HTML\footnote{HTML was chosen over wikitext to ease text cleaning} and afterwards strapped of all HTML Tags and Quotation marks.
\subsection{Fact Extraction Approaches}
Fact extraction greatly varies depending on the nature of the fact to extract.
As all approaches leverage on some form of NER or POS tagging, annotations were created for all text.
This was done using the R Package \textit{cleanNLP} with an spaCy backend to create NER and POS tags, as well as lemmatization. \\
Fact extraction for physicists spouses was done using pre-defined patterns on word lemmata.\footnote{Functionality to use patterns on POS Tags is also available but did not yield a better outcome.}
-A pattern is consists of word lemmata to be matched (including wildcards) as well as defined places to look for the name of the phisicit as well as his/her spouse.
-When a matching phrase is found the results are verified by checking that the corresponding physicist is mentioned as well as the potential spouse beeing detected as a Person by the NER tagger.
+A pattern is consists of word lemmata to be matched (including wildcards) as well as defined places to look for the name of the physicist as well as his/her spouse.
+When a matching phrase is found the results are verified by checking that the corresponding physicist is mentioned as well as the potential spouse being detected as a Person by the NER tagger.
\section{Software Architecture}
@@ -128,7 +128,7 @@
\subsection{R Package 'wikiproc'}
All functionality to extract facts, download data from wikipedia as well as some utility functions
is encapsulated inside the \textit{wikiproc} R Package.
-This allows for a better management of dependencys as well as inclusion of unit tests for fact extraction methods.
+This allows for a better management of dependencies as well as inclusion of unit tests for fact extraction methods.
\begin{table}
......