Resolve "analyse augmented tfidf"

Tinsaye Abye requested to merge 9-fix-augmented-tfidf into develop

Closes #9 (closed)

The issue with augmented TF-IDF was apparently caused by the fact that image never returns 0 for any a>0. This leads to a similarity value above 0 even if the attributes have completely disjoint token sets. For example, for a=0.4 the minimum similarity value is about 0.7. This can be verified by a frequency analysis of the similarity values returned on a given sample dataset.

image

For a=0 this is not an issue.
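The effect can be reproduced with a minimal sketch. It assumes the common textbook form of augmented term frequency, `a + (1 - a) * tf / max_tf`, evaluated over the union of both token sets and compared with plain cosine similarity. IDF weighting is left out, and the function names are hypothetical, so this is an illustration of the mechanism rather than the project's actual implementation.

```python
import math
from collections import Counter


def augmented_tf(tokens, vocabulary, a):
    """Assumed augmented TF: a + (1 - a) * tf / max_tf for every vocabulary token."""
    counts = Counter(tokens)
    max_count = max(counts.values(), default=1)
    # Tokens absent from `tokens` have tf = 0, so their weight is exactly `a`.
    return {w: a + (1 - a) * counts[w] / max_count for w in vocabulary}


def cosine(u, v):
    """Cosine similarity of two weight dicts that share the same keys."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def augmented_similarity(s, t, a):
    """Similarity of two token lists over their shared (union) vocabulary."""
    vocabulary = set(s) | set(t)
    return cosine(augmented_tf(s, vocabulary, a), augmented_tf(t, vocabulary, a))


# Two attributes with no token in common.
s = ["alpha", "beta"]
t = ["gamma", "delta"]

sim_a0 = augmented_similarity(s, t, 0.0)   # absent tokens weigh 0, so the result is 0
sim_a04 = augmented_similarity(s, t, 0.4)  # absent tokens weigh 0.4, so the result is > 0
```

Under these assumptions, two disjoint token sets of equal size give a similarity of 2a/(1+a²), which is roughly 0.69 for a=0.4 and exactly 0 for a=0.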

As a solution, using Cohen et al.'s definition of TF-IDF, one can overcome the issue of atf(s,t) not being equal to zero for disjoint s and t. Their definition sums only over the intersection of the token sets, which leads to atf(s,t)=0 for two disjoint sets. image
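A rough sketch of the intersection-based idea follows. It is an assumption-laden illustration, not Cohen et al.'s exact formula or the code in this merge request: per-token weights reuse the assumed augmented form `a + (1 - a) * tf / max_tf`, IDF is omitted, and the function names are hypothetical. The point it shows is that summing only over shared tokens yields 0 for disjoint sets regardless of `a`.

```python
import math
from collections import Counter


def normalized_weights(tokens, a):
    """Unit-length weights for the tokens that actually occur in `tokens`."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    raw = {w: a + (1 - a) * c / max_count for w, c in counts.items()}
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {w: x / norm for w, x in raw.items()}


def intersection_similarity(s, t, a):
    """Sum of weight products over tokens present in both s and t."""
    ws = normalized_weights(s, a)
    wt = normalized_weights(t, a)
    shared = ws.keys() & wt.keys()
    return sum(ws[w] * wt[w] for w in shared)


disjoint = intersection_similarity(["alpha", "beta"], ["gamma", "delta"], 0.4)
identical = intersection_similarity(["alpha", "beta"], ["alpha", "beta"], 0.4)
partial = intersection_similarity(["alpha", "beta"], ["alpha", "gamma"], 0.4)
```

With this shape, `disjoint` is 0 even for a=0.4, identical token lists score 1 (up to rounding), and partial overlap falls in between.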

Edited by Tinsaye Abye
