Commit d36dfca9 authored by Jonas Richter

Merge remote-tracking branch 'origin/main'

parents 798828cc c45eab59
......@@ -50,3 +50,10 @@ for relation in relations:
The final dataset is located in the `final_dataset` directory in [JSONL](https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction/-/blob/main/final_dataset/resistance.jsonl) and [CoNLL](https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction/-/blob/main/final_dataset/resistance.conll) format.
It also contains the data [split](https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction/-/blob/main/final_dataset/split) we used to train the model.
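A minimal sketch for loading the JSONL file might look like this; the field names `text` and `relations` are assumptions (based on a typical Doccano-style export) and may need adjusting:

```python
import json

# Read one annotated record per line from the JSONL export.
with open("final_dataset/resistance.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Field names are assumptions; adjust if the export differs.
        text = record.get("text", "")
        relations = record.get("relations", [])
        print(text[:80], "->", len(relations), "relations")
```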
## Project submissions
- The final paper for this project can be found [here](https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction/-/blob/main/paper/REsistance.pdf)
- The datasheet can be found [here](https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction/-/blob/main/Datasheet_for_dataset_template.pdf)
- The model card can be found [here](https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction/-/blob/main/Model_card_template.pdf)
\section{Conclusion}
\label{sec:conclusion}
We presented \textit{REsistance}, a German dataset for both \ac{NER} and \ac{RE} in the historical domain. It consists of 2980 sentences with 4312 annotated relations across 46 different relation classes. Especially for the \ac{RE} task, the usefulness of this dataset was validated with a model we built. To fully address the research questions posed at the beginning of this work, this model was then evaluated in terms of its ability to detect the presence of relations and in terms of its generalization. The model achieves an F1 score of 0.466 on our dataset and 0.283 on a comparison dataset of random Wikipedia biographies used to assess generalizability.
Overall, the results of our experiments indicate that training a model with an Active Learning approach takes less work and time than relying on heuristics for the \ac{RE} task, even if no annotated data is available at the beginning. The findings show that this approach not only overcomes the challenges encountered when manually annotating a dataset for \ac{RE}, but also outperforms the baseline. Furthermore, our scores indicate that the presented workflow is able to generalize and can be applied to any domain, given data in a suitable format.
......@@ -10,20 +10,20 @@ Detecting relations raises challenges we had to face as well. Here we identified
Additionally, due to a lack of resources, it was decided not to have a second person check every single sentence, but only those where doubts about the annotation arose.
Furthermore, in the evaluation, the comparability of our model with the heuristic baseline is not fully ensured due to differences in their output.
\subsection{Outlook}
The limitations outlined above point to several directions for further investigation.
Future work will focus on steps to further improve the success rate of our model.
A first step in this direction would be to increase the size of the annotated dataset through more iterations of the Active Learning process described above.
Another option could be to merge some relation classes into coarser classes, since one of our model's errors is the confusion of similar relation classes. As seen in Table \ref{tab:conf_matrix_relation_classes}, the relation classes \textit{Aufenthalt\_in}, \textit{wohnt\_in}, and \textit{arbeitet\_in} are potential merger candidates; merging them would also yield a set of relation classes that roughly matches similar datasets in terms of the number of classes \cite{lai-etal-2021-event,miwa-sasaki-2014-modeling,plum2022biographical}.
In a realistic application environment, cross-sentence relations could be included in future annotation processes as well, as this is closer to the nature of texts. % Hereby not only the \ac{NER} but also an Entity Linking process during the whole \ac{RE} system might be a useful addition.
Also, the domain of the dataset should be extended by including more sources and suitable relation classes, or even open-ended ones for the relevant domain, to make it more useful for generalizing \ac{RE} approaches. However, before this is done, the introduced relation classes and the dataset should be discussed and reviewed by historians to ensure that they are appropriate for this domain and to clarify the information needs around specific domain research questions.
Since coreferences were also annotated during the work on the dataset and a coreference detection model was trained as well, another future step could be to experiment with this dataset for \textit{Coreference Resolution}, the task of resolving various expressions and pronouns to the original reference object.
Additionally, the dataset is also suitable for the \textit{Event Extraction} task since many dates and other time-related terms are included in the biographies.
%Another idea for the future could be the design of an interface between an annotation tool like Doccano and an \ac{NLP} framework like Flair to make the described workflow more accessible.
\section{Experiments}
\label{sec:experiments}
In the following section, we describe the experimental conditions of our method. The initial experiments and trials were conducted in the Google Colab\footnote{\url{https://colab.research.google.com/}} environment and then continued on the Webis Group cluster provided by the Digital Bauhaus Lab\footnote{\url{https://webis.de/facilities.html}}, because GPUs are not permanently available in Google Colab.
% Because the GPU was accessible constantly, the final available models were then trained on the cluster.
The training was performed using the data described in the previous section.
We split the data into three subsets: 80\% for training, 10\% for validation, and 10\% for testing. Training was performed on the training split; during training, the model parameters were adjusted using the validation set, and finally the test set was used to obtain an unbiased evaluation result.
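A minimal sketch of such a split, assuming the annotated sentences are stored as blank-line-separated blocks in a single CoNLL file (file names are illustrative), could look as follows:
\begin{lstlisting}[basicstyle=\small, breaklines]
import random

# Illustrative 80/10/10 split; file names are assumptions.
random.seed(42)  # fixed seed so the split is reproducible
with open("resistance.conll", encoding="utf-8") as f:
    sentences = f.read().strip().split("\n\n")
random.shuffle(sentences)
n = len(sentences)
train = sentences[: int(0.8 * n)]
dev = sentences[int(0.8 * n) : int(0.9 * n)]
test = sentences[int(0.9 * n) :]
for name, part in [("train", train), ("dev", dev), ("test", test)]:
    with open(name + ".conll", "w", encoding="utf-8") as out:
        out.write("\n\n".join(part) + "\n")
\end{lstlisting}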
The experiments were performed separately for the \ac{NER} and \ac{RE} tasks, each of which was carried out both with and without coreferences.
By training our own model on the \ac{NER} task, we expected to simplify the subsequent \ac{RE} task and thus to recognize more entities in our historical domain. By combining both models, our approach provides a holistic solution for recognizing Named Entities and relations in individual sentences and texts.
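A hypothetical sketch of such a combined pipeline is shown below; the model paths and the relation label type are assumptions, and the exact Flair API may differ between versions:
\begin{lstlisting}[basicstyle=\small, breaklines]
from flair.data import Sentence
from flair.models import RelationExtractor, SequenceTagger

# Hypothetical paths to the trained models.
ner_tagger = SequenceTagger.load("models/ner/final-model.pt")
re_model = RelationExtractor.load("models/re/final-model.pt")

sentence = Sentence("Willi Graf war Mitglied der Weissen Rose.")
ner_tagger.predict(sentence)  # first tag the entities ...
re_model.predict(sentence)    # ... then classify relations between entity pairs
for label in sentence.get_labels("relation"):  # label type "relation" is assumed
    print(label)
\end{lstlisting}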
......@@ -63,7 +64,7 @@ To compare the results of the trained \ac{RE} model with this baseline, the test
It has to be noted, however, that not only do the baseline approaches to \ac{NER} and \ac{RE} differ significantly from the approach described here, but also the resulting output format of the base model varies widely from the format of the annotated dataset.
Because of these differences, we decided that it was sufficient to evaluate the baseline method's ability to detect the presence of relations in a given text.
We considered its best-case scenario and assumed that each found relation is correct regarding the involved entities and relation classes. The output relations classified by our baseline were then compared to the actual annotations of the test set in terms of the presence of a relation, which showed that this method had a Recall of 17.4\% and a Precision of 93.5\% with an F1 score of 0.294.
\begin{table}[htbp]
\footnotesize
\centering
......@@ -73,10 +74,10 @@ We considered its best-case scenario and assumed that each found relation is cor
\makecell{Positive} & \makecell{72} & \makecell{341} \\
\makecell{Negative} & \makecell{5} & \makecell{67} \\
\end{tabularx}
\caption{Confusion Matrix of baseline method.}
\label{table:confusion_matrix}
\end{table}
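Reading Table \ref{table:confusion_matrix} as 72 true positives, 5 false positives, and 341 false negatives (consistent with the reported Precision and Recall), the scores follow directly as
\[
\mathrm{P} = \frac{72}{72+5} \approx 0.935, \qquad
\mathrm{R} = \frac{72}{72+341} \approx 0.174, \qquad
\mathrm{F}_1 = \frac{2 \cdot 72}{2 \cdot 72 + 5 + 341} \approx 0.294.
\]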
\subsection{But does it generalize?}
To verify that the introduced \ac{RE} model is indeed able to generalize knowledge from this very specific domain, it was run over the Wikipedia entries of a random subset of a list of German-language biographies\footnote{\url{https://de.wikipedia.org/wiki/Liste_der_Biografien}} (WikiBioData).
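A minimal sketch of how such entries could be retrieved is given below; the MediaWiki API endpoint and parameters as well as the example titles are assumptions rather than the exact procedure we used:
\begin{lstlisting}[basicstyle=\small, breaklines]
import random
import requests

# Placeholder titles; the actual subset was sampled from the linked list page.
titles = ["Willi_Graf", "Sophie_Scholl"]
title = random.choice(titles)
response = requests.get(
    "https://de.wikipedia.org/w/api.php",
    params={"action": "query", "prop": "extracts", "explaintext": 1,
            "titles": title, "format": "json"},
)
pages = response.json()["query"]["pages"]
article_text = next(iter(pages.values()))["extract"]
print(article_text[:300])
\end{lstlisting}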
......@@ -5,7 +5,7 @@ In the following section, we present our process for creating the dataset. Our w
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{Images/Workflow.png}
\caption{Conceptualization of the chronological workflow: first, a dataset is created; second, a model is built and applied to that dataset.}
\label{fig:workflow}
\end{figure}
\subsection{Data Scraping and Preprocessing}
......@@ -49,7 +49,7 @@ For an ergonomic annotation process, the Doccano software \cite{doccano} was cho
\paragraph{Named Entity and Coreference Annotation}\mbox{}\\
%Before annotating the relations, first it is necessary to ensure that all the entities in the examined sentences are marked as such, since relations can only be found between entity pairs. For this process, existing frameworks can be used to facilitate the manual annotation.
In \ac{IE}, the research project FashionBrain has been able to deliver outstanding results. In this context, the team members have focused, among other topics, on \ac{NER} and on the main task of our contribution, \ac{RE} \cite{checco2017fashionbrain}, and the framework \textit{Flair} was developed for both purposes. Thus, Flair offers a state-of-the-art model for \ac{NER} \cite{akbik-etal-2018-contextual} that is able to automatically recognize entities within a sentence and assign them to a type. When applied to our sentences, persons and locations were mostly classified quite accurately, but some organizations (e.g., 'Résistance', 'Volksgerichtshof') and other entities with a specific historical reference were not recognized. In order to obtain the relations associated with the unrecognized entities, we first had to annotate them ourselves.
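An illustrative sketch of applying Flair's pre-trained NER model to a sentence looks roughly as follows; the model name \texttt{de-ner} is an assumption for the German model:
\begin{lstlisting}[basicstyle=\small, breaklines]
from flair.data import Sentence
from flair.models import SequenceTagger

# Model name assumed for Flair's pre-trained German NER model.
tagger = SequenceTagger.load("de-ner")
sentence = Sentence("1943 wurde er vom Volksgerichtshof in Berlin verurteilt.")
tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity)  # prints the span text together with its predicted type
\end{lstlisting}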
Additionally, inspired by other promising approaches in \ac{RE} that involve the development of a dataset as well as the training of a model, whose authors report higher Recall and F1 scores when including coreferences, we extended our annotation process to cover them as well \cite{chan-roth-2010-exploiting,gabbard-etal-2011-coreference,luan-etal-2018-multi}. A coreference can be described as the property of different nominal expressions or pronouns to refer to the same reference object or reference identity \cite{crystal2011dictionary}. For this purpose, we introduced additional tags that indicate that the annotated word is not a single entity but in fact a coreference, together with the type of the entity to which the coreference refers (e.g., Coref\_LOC). This represented another manual effort for us, since persons represented by personal pronouns cannot be recognized directly and require a careful procedure.
\paragraph{Relation Annotation}\mbox{}\\
Simultaneously with the annotation of the entities came the annotation process of the relations. In the initial phase, in which the number of different annotation types was determined, we were faced with the trade-off between information preservation and generalization. We tried to find a balance and consequently settled on 46 different relation types, especially between the entity pairs \textit{PER-PER, PER-ORG, PER-LOC}. Looking at the semantic structure of the chosen classes, we can roughly identify a hierarchy in terms of specificity. Within this hierarchy, we tried to assign the most fine-grained class to a given relation, falling back to a coarser class if none of the specific classes was suitable. Initially, we only annotated a small subset of the entire data from those 945 Wikipedia entries before performing Active Learning.
......
......@@ -33,7 +33,7 @@ MICRO F1 & \textbf{0.852} & 0.803 \\%& 0.798 & \textb
\caption{Micro F1 scores of our \ac{NER} model compared to Flair's \ac{NER} model on our dataset, listed for each tag. The higher value for an entity type is printed in bold.}
\label{tab:f1_ner}
\end{table}
The results show that our model and the Flair \ac{NER} base model perform very similarly on most Named Entity types on the domain-specific test dataset, so the results per type differ only marginally. The exceptions are the coreference entity types we introduced, which the base model cannot recognize but which are crucial for our use case. Since we also consider them in the overall evaluation, our new model achieves a higher F1 score across all entity types.
%\begin{table}
......@@ -69,14 +69,14 @@ The results show that our model and the Flair \ac{NER} base model perform very s
\begin{tabularx}{\linewidth}{X|XX}
& \makecell{Model with Coref} & \makecell{Model no Coref} \\
\hline
REsistance dataset & \makecell{0.466} & \makecell{0.333}\\
WikiBioData & \makecell{0.283} & \makecell{0.336}\\
\end{tabularx}
\caption{Micro F1 scores of the \ac{RE} model on our dataset and on the comparison dataset considering the presence and absence of coreferences.}
\label{tab:f1_relation_extraction}
\end{table}
Table \ref{tab:f1_relation_extraction} shows the results of the \ac{RE} model on the previously unseen test split of \textit{REsistance} and on the generalizability comparison dataset. We trained the \ac{RE} model once with the \textit{REsistance} dataset containing coreferences and once without, to verify whether this has an impact on the performance of the model. This seems to be the case: on our own dataset, the model trained with coreferences achieves the clearly higher score, whereas on the generalizability dataset the model without coreferences performs slightly better.
To put this F1 score into context for the domain-specific \ac{RE} task, we compare our model with other domain-specific \ac{RE} models. The \ac{RE} approach of \citet{DBLP:journals/corr/abs-2004-03283} was applied to a German corpus of traffic and industrial events and achieves a maximum F1 score of 0.28, while domain-specific approaches on English corpora reach maximum F1 scores of 0.474 and 0.6 \cite{10.1145/3038912.3052708, bioMedReEx}. This indicates that our model is in the upper range of models for relation extraction in domain-specific applications, more specifically in the German language.
......@@ -84,7 +84,7 @@ Indicating that our model is in the upper range of models for relation extractio
As presented in Subsection \ref{subsec:baseline}, the F1 score of the baseline method is 0.294, which our trained model exceeds with an F1 score of 0.466.
Even when interpreting the results of the heuristic method as optimistically as possible and those of our \ac{RE} model as pessimistically as possible, assuming the worst possible outcome, we can still conclude that our model is better at recognizing existing relations in sentences.
Table \ref{tab:overview_re} in Appendix \ref{sec:appendix} shows the distribution of the individual relations in the dataset and the respective F1 scores achieved by our final \ac{RE} model on our dataset and on the comparison dataset. On the one hand, this result suggests that relations that occur more frequently in the dataset are also recognized correctly more frequently; on the other hand, this is not true for all relations. This rather leads to the assumption that certain relations are more difficult to detect than others, suggesting that increasing the number of data points for specific poorly performing relation types in the dataset would not lead to much improvement. For instance, there are relations, such as \textit{Ausbildung\_an}, that achieve an F1 score of 0.8 even with a moderate occurrence of 46 samples in the dataset, while \textit{Gruppenmitglied}, \textit{Bekanntschaft}, and \textit{Führungsposition} score much worse despite significantly more frequent occurrences. This observation is in line with the confusion matrix in Figure \ref{fig:cm}, which already suggests that certain relations cannot be distinguished from each other with sufficient Precision. Furthermore, the evaluation on the generalization dataset shows that particular relations are generalizable, while others are either not represented in the new dataset or perform relatively poorly. During the annotation process of the generalization dataset, we also noticed that specific relations such as \textit{kollaboriert\_mit} have different semantic meanings depending on the context, which can also complicate the generalization of relations.
......
......@@ -5,7 +5,7 @@
\documentclass[11pt]{article}
% Remove the "review" option to generate the final version.
\usepackage[]{acl}
% Standard package includes
\usepackage{times}
......@@ -115,8 +115,7 @@ Detecting and classifying semantic relationships into categories between previou
\section{Appendices}
\label{sec:appendix}
Icons of Figure \ref{fig:workflow} were made by Becris, Darius Dan, Freepik, kmg design, Parzival' 1999, and Pixel perfect from flaticon.com.
Our results are publicly accessible at \href{https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction}{https://git.informatik.uni-leipzig.de/nw20hewo/relation-extraction}.
%Appendices are material that can be read, and include lemmas, formulas, proofs, and tables that are not critical to the reading and understanding of the paper.
%Appendices should be \textbf{uploaded as supplementary material} when submitting the paper for review.
%Upon acceptance, the appendices come after the references, as shown here.
......@@ -142,7 +141,7 @@ Icons of figure \ref{fig:workflow} made by Becris, Darius Dan, Freepik, kmg desi
\centering
\scriptsize
\begin{tabular}{l|ccc}
Relation & Samples & REsistance & WikiBioData \\
\hline
Mitglied\_bei & 392 & 0.597 & 0.000 \\
Aufenthalt\_in & 322 & 0.535 & 0.500 \\
......@@ -216,7 +215,7 @@ Coref\_ORG & 35
\newpage
\begin{lstlisting}[basicstyle=\small,label={listing:conll}, breaklines,caption={Example sentence from the \textit{REsistance} dataset in the CoNLL format.},captionpos=b]
# text = Ab 1909 organisierte er sich im Deutschen Metallarbeiterverband ( DMV ) , wurde von 1910 bis 1917 Mitglied der SPD , anschließend trat er der USPD bei und später der KPD .
# relations = 7;8;10;10;Kurzform|4;4;7;8;Mitglied_bei|4;4;20;20;Mitglied_bei|4;4;26;26;Mitglied_bei|4;4;31;31;Mitglied_bei|4;4;20;20;ausgetreten_bei
1 Ab O
......@@ -255,14 +254,16 @@ Coref\_ORG & 35
\begin{figure*}[htbp]
\centering
\includegraphics[width=\linewidth]{Images/confusion_matrix_final.png}
\caption{Confusion matrix of the predictions of the Relation Extraction model and the corresponding gold labels. Horizontally shown are the gold labels, vertically the predictions.}
\label{fig:cm}
\end{figure*}
\newpage
Icons of Figure \ref{fig:workflow} made by Becris, Darius Dan, Freepik, kmg design, Parzival' 1999 and Pixel perfect from flaticon.com
\end{document}
......