[fix] idf calculation by using pre-calculated documents count
Closes #7 (closed)
Until now the idf calculation contained a serious mistake: N=2 from the exposé draft was hard-coded as the total number of documents in
log(N / df)
Now we count all documents by
final DataSet<Long> documentsCount = Count.count(logicalGraph.getVertices());
and broadcast this value which is then used in TFIDF.java
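To illustrate the effect of the fix, here is a minimal plain-Java sketch of the idf step. It is not the code from TFIDF.java: the class `IdfSketch` and its methods are hypothetical names, and the Flink broadcast is replaced by an ordinary method parameter so the example runs on its own. It only assumes the formula log(N / df) quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the actual TFIDF.java implementation.
final class IdfSketch {

    // idf = log(N / df), with N supplied by the caller instead of being hard-coded.
    static double idf(long documentsCount, long documentFrequency) {
        if (documentsCount <= 0 || documentFrequency <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return Math.log((double) documentsCount / documentFrequency);
    }

    // Counts, for every term, the number of documents that contain it at least once.
    static Map<String, Long> documentFrequencies(List<List<String>> documents) {
        Map<String, Long> df = new TreeMap<>();
        for (List<String> document : documents) {
            // The HashSet removes duplicates so each document counts once per term.
            for (String term : new HashSet<>(document)) {
                df.merge(term, 1L, Long::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("graph", "vertex", "edge"),
                Arrays.asList("graph", "vertex"),
                Arrays.asList("graph", "flink"),
                Arrays.asList("graph"));

        long documentsCount = documents.size(); // stands in for the broadcast count
        for (Map.Entry<String, Long> e : documentFrequencies(documents).entrySet()) {
            double fixed = idf(documentsCount, e.getValue());
            double hardCoded = idf(2, e.getValue()); // the old behaviour with N=2
            System.out.println(e.getKey() + ": idf=" + fixed + " (with N=2: " + hardCoded + ")");
        }
    }
}
```

With N fixed at 2, every term that occurs in more than two documents gets a negative idf, which is one plausible reason the earlier results looked off.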
This affects the first observations made in !3 (merged) and also results in significantly better performance.
- Previous benchmark without documentsCount
- Benchmark with documentsCount
At least the weird spike from Img 1 is gone now, and overall Precision, Recall and F-Measure trend higher.
It remains slightly ambiguous where the best threshold is, but this is under further investigation.
Edited by Tinsaye Abye