[fix] idf calculation by using pre-calculated documents count (!5) · Merge requests · Tinsaye Abye / TFIDF-FAMER · GitLab


Merged Tinsaye Abye requested to merge 7-fix-idf-calculation-with-documentscount into develop 4 years ago

Closes #7 (closed)

Until now the idf calculation contained a serious bug: N = 2, a value taken from the exposé draft, was hard-coded as the total number of documents in

log(N / df)

Now we count all documents by

final DataSet<Long> documentsCount = Count.count(logicalGraph.getVertices());

and broadcast this value, which is then used in TFIDF.java.

This affects the first observations made in !3 (merged) and also results in significantly better performance.
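To illustrate why the hard-coded N = 2 distorted the weights, here is a minimal plain-Java sketch of log(N / df). It is not the code from TFIDF.java: the class `IdfSketch`, the method `idf`, the natural logarithm, and the collection size of 1000 are all assumptions made for illustration. With N fixed at 2, any term that occurs in more than two documents gets a negative idf, whereas with the real document count the idf stays non-negative as long as df <= N.

```java
import java.util.List;

// Hypothetical helper for illustration only; not the project's TFIDF.java.
final class IdfSketch {

    // idf = log(N / df); the natural logarithm is an assumption.
    static double idf(long documentsCount, long documentFrequency) {
        if (documentsCount <= 0 || documentFrequency <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return Math.log((double) documentsCount / documentFrequency);
    }

    public static void main(String[] args) {
        long hardCodedN = 2L;   // the value that was hard-coded before this fix
        long actualN = 1000L;   // assumed collection size, for illustration only

        // Compare both variants for a few document frequencies.
        for (long df : List.of(1L, 2L, 10L, 500L)) {
            System.out.printf("df=%d  idf(N=2)=%.4f  idf(N=%d)=%.4f%n",
                    df, idf(hardCodedN, df), actualN, idf(actualN, df));
        }
    }
}
```

In the merge request itself the count is not a constant but the broadcast `documentsCount` data set, so N always matches the number of vertices in the graph.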

Previous benchmark without documentsCount

Benchmark with documentsCount

At least the odd spike from Img 1 is gone now, and overall we see higher Precision, Recall and F-Measure.

It is still somewhat unclear where the best threshold lies; this is under further investigation.

Edited 4 years ago by Tinsaye Abye

Activity

Tinsaye Abye added bug label 4 years ago

Tinsaye Abye mentioned in merge request !3 (merged) 4 years ago

Tinsaye Abye mentioned in merge request !4 (merged) 4 years ago

Tinsaye Abye changed the description 4 years ago

Tinsaye Abye mentioned in commit 4df0c075 4 years ago

Tinsaye Abye merged 4 years ago
