[feature] implements soft-tfidf (!7) · Merge requests · Tinsaye Abye / TFIDF-FAMER

Tinsaye Abye requested to merge 11-soft-tfidf into develop Jun 23, 2020

Implements Soft-Tfidf as proposed by Cohen et al.

As pointed out by Moreau et al., the original definition contains a small mistake: when a word w from the first document only has a similar word s in the second document (sim(w, s) > threshold), the tfidf weight for the second document is still calculated with w, even though w itself does not necessarily exist there (only a similar, but not equal, word does).

Cohen et al. also suggested a high similarity threshold for the tokens. A threshold of 0.9 seems to be a good starting point for Jaro-Winkler, but further evaluation is needed.
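
The definition and the token threshold discussed above can be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: the function names (`char_sim`, `tfidf_vector`, `soft_tfidf`), the `log(tf + 1) * idf` weighting, and the default idf of 1.0 for unseen tokens are assumptions. To stay dependency-free, `difflib.SequenceMatcher` is used as a stand-in for Jaro-Winkler. Following one reading of the correction by Moreau et al., the weight in the second document is taken from the closest token v rather than from w.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def char_sim(a, b):
    # Stand-in for Jaro-Winkler (assumption): difflib ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def tfidf_vector(tokens, idf):
    # L2-normalised weights; log(tf + 1) * idf is an assumed weighting.
    # Tokens missing from the idf table default to 1.0 (assumption).
    tf = Counter(tokens)
    raw = {w: math.log(tf[w] + 1) * idf.get(w, 1.0) for w in tf}
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {w: x / norm for w, x in raw.items()} if norm else {}


def soft_tfidf(s_tokens, t_tokens, idf, sim=char_sim, threshold=0.9):
    vs = tfidf_vector(s_tokens, idf)
    vt = tfidf_vector(t_tokens, idf)
    score = 0.0
    for w, weight_s in vs.items():
        # Find the token v in the second document that is closest to w.
        best_v, best_sim = None, 0.0
        for v in vt:
            d = sim(w, v)
            if d > best_sim:
                best_v, best_sim = v, d
        # Only sufficiently similar pairs contribute. The weight on the
        # second side comes from v, which does exist there, not from w.
        if best_v is not None and best_sim > threshold:
            score += weight_s * vt[best_v] * best_sim
    return score


if __name__ == "__main__":
    a = ["apple", "ipod"]
    b = ["aple", "ipod"]
    print(soft_tfidf(a, b, idf={}, threshold=0.9))
    print(soft_tfidf(a, b, idf={}, threshold=0.8))
```

With the stand-in similarity, "apple" and "aple" score just below 0.9, so the pair only contributes once the threshold is lowered. This shows how sensitive the measure is to the threshold and to the choice of inner similarity.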

The performance of Soft-Tfidf is very similar to that of plain tfidf on the AmazonGoogle dataset, using either the descriptions or the titles.

Edited Jun 23, 2020 by Tinsaye Abye
