
[fix] idf calculation by using pre-calculated documents count

Tinsaye Abye requested to merge 7-fix-idf-calculation-with-documentscount into develop

Closes #7 (closed)

Until now the idf calculation contained a serious mistake: N=2 from the exposé draft was hard-coded as the total number of documents in

log(N / df)

Now we count all documents by

final DataSet<Long> documentsCount = Count.count(logicalGraph.getVertices());

and broadcast this value, which is then used in TFIDF.java.
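To illustrate why the hard-coded value matters, here is a minimal plain-Java sketch, not the Flink/Gradoop code from this merge request. The class `IdfSketch`, its helper methods and the toy documents are hypothetical, and it assumes the natural logarithm. It shows that with N=2 any term occurring in more than two documents gets a negative idf, while the counted N keeps idf non-negative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; the real job uses Flink DataSets and a broadcast variable.
public class IdfSketch {

    /** Counts, for each term, the number of documents that contain it (df). */
    public static Map<String, Integer> documentFrequencies(List<List<String>> documents) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> document : documents) {
            // A set ensures each document contributes at most once per term.
            Set<String> distinctTerms = new HashSet<>(document);
            for (String term : distinctTerms) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** idf = log(N / df), assuming the natural logarithm; N is the total number of documents. */
    public static double idf(long documentsCount, long documentFrequency) {
        return Math.log((double) documentsCount / documentFrequency);
    }

    public static void main(String[] args) {
        // Toy corpus (assumption): four documents, already tokenized.
        List<List<String>> documents = Arrays.asList(
            Arrays.asList("graph", "vertex", "edge"),
            Arrays.asList("graph", "flink"),
            Arrays.asList("graph", "vertex"),
            Arrays.asList("tfidf", "flink", "vertex"));

        // Counted from the data instead of being hard-coded.
        long documentsCount = documents.size();
        Map<String, Integer> df = documentFrequencies(documents);

        for (Map.Entry<String, Integer> entry : df.entrySet()) {
            double hardCoded = idf(2, entry.getValue());
            double counted = idf(documentsCount, entry.getValue());
            System.out.printf("%-6s df=%d  idf(N=2)=%.3f  idf(N=%d)=%.3f%n",
                entry.getKey(), entry.getValue(), hardCoded, documentsCount, counted);
        }
    }
}
```

In this toy corpus "graph" and "vertex" each occur in three documents, so the hard-coded variant yields log(2/3) < 0 for them, whereas the counted variant yields log(4/3) > 0.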


This affects the first observations made in !3 (merged) and also results in significantly better performance.

image

  1. Previous benchmark without documentsCount

image

  2. Benchmark with documentsCount

At least the odd spike from Img 1 is gone now, and overall Precision, Recall and F-Measure trend higher.

It still remains slightly ambiguous where the best threshold is, but this is under further investigation.

Edited by Tinsaye Abye
