[feature] implements soft-tfidf

Tinsaye Abye requested to merge 11-soft-tfidf into develop

Closes #11 (closed)

Implements Soft-Tfidf as proposed by Cohen et al.

As Moreau et al. point out, the original definition contains a small mistake: when a word w in the first document has a similar word s in the second document with sim(w, s) > threshold, the formula computes the tfidf weight of w in the second document, even though w itself does not necessarily occur there (only a similar, but not identical, word does).
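To make the issue concrete, here is a minimal Python sketch of one reading of the corrected score: for each word w in the first document, it takes the most similar word in the second document and uses that word's weight instead of the weight of w. It is not the code from this merge request. The names `soft_tfidf`, `l2_normalize` and `ratio_similarity` are hypothetical, the inputs are assumed to be precomputed tfidf weights per token, and `difflib.SequenceMatcher` is only a stand-in for JaroWinkler to keep the example dependency-free.

```python
from difflib import SequenceMatcher
from math import sqrt


def ratio_similarity(a, b):
    """Stand-in token similarity in [0, 1]. This is NOT JaroWinkler."""
    return SequenceMatcher(None, a, b).ratio()


def l2_normalize(weights):
    """Scale a token -> weight mapping to unit Euclidean length."""
    norm = sqrt(sum(v * v for v in weights.values()))
    return {t: v / norm for t, v in weights.items()} if norm else dict(weights)


def soft_tfidf(weights_s, weights_t, sim=ratio_similarity, threshold=0.9):
    """Soft-Tfidf score of two documents given as token -> weight mappings.

    Assumption: both mappings already hold L2-normalized tfidf weights.
    """
    if not weights_t:
        return 0.0
    score = 0.0
    for w, weight_w in weights_s.items():
        # Closest token in the second document and its similarity to w.
        best_v = max(weights_t, key=lambda v: sim(w, v))
        best_sim = sim(w, best_v)
        if best_sim > threshold:
            # Use the weight of best_v, not of w: w itself may be
            # missing from the second document.
            score += weight_w * weights_t[best_v] * best_sim
    return score


# Hypothetical toy weights; "aple" is a misspelling of "apple".
doc_s = l2_normalize({"apple": 1.0, "iphone": 2.0})
doc_t = l2_normalize({"aple": 1.0, "iphone": 2.0})
strict = soft_tfidf(doc_s, doc_t, threshold=0.9)
lenient = soft_tfidf(doc_s, doc_t, threshold=0.8)
```

With the stand-in similarity, "apple" and "aple" fall just below 0.9, so `strict` only counts the exact "iphone" match, while `lenient` also credits the misspelled token. The cut-off at which this happens depends on the similarity measure, so these numbers say nothing about JaroWinkler.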

Cohen et al. also suggest a high similarity threshold for the tokens. 0.9 seems to be a good starting point for JaroWinkler, but further evaluation is needed.

The performance of Soft-Tfidf is very similar to that of plain tfidf on the AmazonGoogle dataset, using either the descriptions or the titles.

Edited by Tinsaye Abye
