[feature] implements soft-tfidf
Closes #11
Implements Soft-Tfidf as proposed by Cohen et al.
As pointed out by Moreau et al., the original definition contains a small mistake: when two words are similar, i.e. sim(w,s) > threshold,
the tfidf weight for the second document is calculated with w,
even though w does not necessarily occur in the second document (only a similar, but not identical, word s does).
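To make the point above concrete, here is a minimal Python sketch of one reading of the corrected measure: the weight on the second-document side is taken from the closest token s that actually occurs there, not from w. It is an illustration only and not the code in this merge request. All names (token_sim, tfidf_vector, soft_tfidf) are hypothetical, difflib's SequenceMatcher stands in for JaroWinkler, and the smoothed idf log(1 + N/df) is an assumption.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def token_sim(a, b):
    # Assumption: stand-in for JaroWinkler; returns a ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def tfidf_vector(tokens, doc_freq, n_docs):
    # L2-normalised weights log(tf + 1) * log(1 + N / df).
    # The smoothed idf is an assumption so that weights never vanish.
    tf = Counter(tokens)
    weights = {
        w: math.log(c + 1) * math.log(1 + n_docs / doc_freq[w])
        for w, c in tf.items()
    }
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}


def soft_tfidf(s_tokens, t_tokens, doc_freq, n_docs, threshold=0.9, sim=token_sim):
    vs = tfidf_vector(s_tokens, doc_freq, n_docs)
    vt = tfidf_vector(t_tokens, doc_freq, n_docs)
    if not vs or not vt:
        return 0.0
    score = 0.0
    for w, weight_w in vs.items():
        # Closest token in the second document and its similarity to w.
        best = max(vt, key=lambda v: sim(w, v))
        d = sim(w, best)
        if d > threshold:
            # Corrected form: use the weight of `best`, which is guaranteed
            # to occur in the second document, instead of the weight of w.
            score += weight_w * vt[best] * d
    return score


if __name__ == "__main__":
    docs = [
        ["apple", "iphone", "case"],
        ["aple", "iphone", "case"],
        ["samsung", "galaxy", "charger"],
    ]
    doc_freq = Counter(w for d in docs for w in set(d))
    print(soft_tfidf(docs[0], docs[1], doc_freq, len(docs), threshold=0.8))
```

With a threshold close to 1.0 only identical tokens match and the score reduces to plain tfidf cosine similarity; lowering the threshold lets near-duplicates such as "apple" and "aple" contribute.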
Cohen et al. also suggested a high similarity threshold for the tokens. A threshold of 0.9 seems to be a good starting point for JaroWinkler, but further evaluation is needed.
The performance of stfidf is very similar to that of plain tfidf on the AmazonGoogle dataset, using either the descriptions or the titles.
Edited by Tinsaye Abye