Resolve "analyse augmented tfidf"
Closes #9 (closed)
The issue with augmented TF-IDF was apparently caused by the fact that atf(s,t) never returns 0 for a>0. This leads to a similarity value above 0 even if the attributes are completely distinct (disjoint) token sets. For example, for a=0.4 the minimum similarity value is about 0.7. This can be verified by a frequency analysis of the returned similarity values on a given sample dataset.
For a=0 this is not an issue.
As a solution, the definition of TF-IDF by Cohen et al. avoids the issue of atf(s,t) not being equal to zero for distinct s and t. Their definition sums only over the intersection of the two token sets, so atf(s,t)=0 whenever the sets have no token in common.
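The effect can be reproduced with a small sketch. It is not the code from this merge request: the function names are hypothetical, IDF is left out (treated as 1 for every token), and the augmented term frequency is assumed to have the common form a + (1 - a) * tf / max_tf. The first similarity weights every token of the combined vocabulary, so absent tokens still get weight a; the second sums only over the shared tokens, in the spirit of the Cohen et al. definition.

```python
import math
from collections import Counter


def augmented_weights(tokens, vocab, a):
    """Assumed augmented TF: a + (1 - a) * tf / max_tf for every token in vocab."""
    counts = Counter(tokens)          # tokens must be non-empty
    max_tf = max(counts.values())
    return {w: a + (1 - a) * counts[w] / max_tf for w in vocab}


def norm(weights):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(x * x for x in weights.values()))


def sim_full_vocab(s, t, a):
    """Cosine similarity where absent tokens still receive weight a."""
    vocab = set(s) | set(t)
    u = augmented_weights(s, vocab, a)
    v = augmented_weights(t, vocab, a)
    return sum(u[w] * v[w] for w in vocab) / (norm(u) * norm(v))


def sim_intersection(s, t, a):
    """Each side weights only its own tokens; the sum runs over shared tokens."""
    u = augmented_weights(s, set(s), a)
    v = augmented_weights(t, set(t), a)
    shared = set(s) & set(t)
    return sum(u[w] * v[w] for w in shared) / (norm(u) * norm(v))


s, t = ["a", "b"], ["c", "d"]               # no token in common
print(round(sim_full_vocab(s, t, 0.4), 2))   # 0.69
print(round(sim_full_vocab(s, t, 0.0), 2))   # 0.0
print(round(sim_intersection(s, t, 0.4), 2)) # 0.0
```

For this toy input with equal token counts, the full-vocabulary similarity reduces to 2a / (1 + a^2), which is about 0.69 for a=0.4 and matches the observed lower bound of roughly 0.7.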
Edited by Tinsaye Abye