[fix] idf calculation by using pre-calculated documents count
Closes #7 (closed)
Until now the idf calculation contained a serious mistake: N=2, taken from the exposé draft, was hard-coded as the total number of documents in
log(N / df)
Now we count all documents with
final DataSet<Long> documentsCount = Count.count(logicalGraph.getVertices());
and broadcast this value, which is then used as N in TFIDF.java.
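To illustrate why the hard-coded value was a problem, here is a minimal sketch in plain Java. It is not the code from TFIDF.java and does not use Flink or broadcast variables. The class `IdfSketch`, its methods, and the sample documents are hypothetical names chosen for this example. It only shows log(N / df) evaluated once with the fixed N=2 and once with a counted N.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IdfSketch {

    /** idf = log(N / df); N is passed in instead of being hard-coded. */
    static double idf(long documentsCount, long documentFrequency) {
        return Math.log((double) documentsCount / documentFrequency);
    }

    /** For every term, counts the number of documents that contain it. */
    static Map<String, Long> documentFrequencies(List<List<String>> documents) {
        Map<String, Long> df = new HashMap<>();
        for (List<String> document : documents) {
            // Use a set so a term counts once per document.
            Set<String> distinctTerms = new HashSet<>(document);
            for (String term : distinctTerms) {
                df.merge(term, 1L, Long::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        // Hypothetical sample corpus with four documents.
        List<List<String>> documents = List.of(
            List.of("graph", "vertex", "edge"),
            List.of("graph", "flink"),
            List.of("graph", "vertex"),
            List.of("tfidf", "flink"));

        long documentsCount = documents.size(); // counted N
        long hardCodedCount = 2L;               // the old, fixed N

        Map<String, Long> df = documentFrequencies(documents);
        for (Map.Entry<String, Long> entry : df.entrySet()) {
            System.out.printf("%s: df=%d, idf(N=2)=%.3f, idf(N=%d)=%.3f%n",
                entry.getKey(), entry.getValue(),
                idf(hardCodedCount, entry.getValue()),
                documentsCount,
                idf(documentsCount, entry.getValue()));
        }
    }
}
```

In this sample, "graph" occurs in three documents, so the fixed N=2 yields log(2 / 3), which is negative. With the counted N the idf of a term never drops below zero, because df cannot exceed N.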
This affects the first observations made in !3 (merged) and also results in significantly better performance.
- Previous benchmark without documentsCount
- Benchmark with documentsCount
At least the odd spike from Img 1 is gone now, and overall Precision, Recall and F-Measure trend higher.
It remains slightly ambiguous where the best threshold lies, but this is under further investigation.
Edited by Tinsaye Abye
Activity
added bug label
mentioned in merge request !3 (merged)
mentioned in merge request !4 (merged)
mentioned in commit 4df0c075