Tinsaye Abye requested to merge 3-pre-and-post-processing into develop May 29, 2020

EDIT: due to a miscalculation of the idf, some of the concerns below are no longer accurate. See !5 (merged)

Runs a full deduplication on the amazonGoogle dataset as a proof of concept (PoC).

Blocking uses StandardBlocking with PrefixLength 1 for the key generator
Builds the document frequencies as described in #2 (closed)
Clustering is done by CLIP, since it has significantly better precision than ConnectedComponents
Performance measurement is done by ClusteringQualityMeasures
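
As a rough illustration of the blocking step listed above, the following Python sketch groups records by the first character of their title and only compares records that share a block. The function names and sample records are hypothetical; this is not the StandardBlocking implementation used in this merge request, only a minimal model of prefix-key blocking with prefix length 1.

```python
from collections import defaultdict
from itertools import combinations


def prefix_key(title, prefix_length=1):
    """Blocking key: the first `prefix_length` characters of the lowercased title."""
    return title.strip().lower()[:prefix_length]


def standard_blocking(records, prefix_length=1):
    """Return the candidate pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, title in records.items():
        key = prefix_key(title, prefix_length)
        if key:  # assumption: records with an empty key are left out of all blocks
            blocks[key].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample records (id -> title), not taken from the dataset.
records = {
    "a1": "Adobe Photoshop CS3",
    "g1": "adobe photoshop cs3 upgrade",
    "g2": "Photoshop CS3 by Adobe",
    "a2": "",
}
pairs = standard_blocking(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} of {all_pairs} possible pairs are compared: {sorted(pairs)}")
```

In this toy input, "g2" describes the same product as "a1" but starts with a different letter, so the pair is never compared.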

First results for threshold 0.70

Recall = 0.024596464258262875
Precision = 0.16326530612244897
FMeasure = 0.04275217100868404
AllPositives = 196
MaxClusterSize = 2
AverageClusterSize = 1.0446164352378784
TruePositives = 32
PerfectCompleteClusterNo = 196
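
The reported values can be cross-checked with a few lines of Python. This sketch assumes that AllPositives is the number of predicted match pairs and that GtRecordNo (1301, listed with the MongeElkanJaroWinkler results) is the number of ground-truth pairs for the same dataset. That interpretation is an assumption, but it reproduces the numbers above. The function name `quality` is hypothetical.

```python
def quality(true_positives, all_positives, gt_record_no):
    """Pairwise precision, recall and F-measure from raw counts."""
    precision = true_positives / all_positives if all_positives else 0.0
    recall = true_positives / gt_record_no if gt_record_no else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from the tf-idf run at threshold 0.70; 1301 is assumed to be the
# number of ground-truth pairs (GtRecordNo).
precision, recall, f_measure = quality(32, 196, 1301)
print(f"Precision = {precision}")
print(f"Recall    = {recall}")
print(f"FMeasure  = {f_measure}")
```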

First observation

Recall

Half of the expected perfect matches are lost during blocking because their titles do not start with the same letter -> adding "manufacturer" as a second blocking key could lead to a small improvement
A portion of the titles are empty -> this leads to a simValue of 0
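
How much recall blocking costs can be estimated by checking which ground-truth pairs survive it (pair completeness). The sketch below compares a title-prefix key alone against the union of a title-prefix key and a manufacturer key. The record layout, the field name `manufacturer`, and all helper names are assumptions made for illustration, and the sample records are invented.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fns):
    """Union of the candidate pairs produced by each blocking key function."""
    pairs = set()
    for key_fn in key_fns:
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            key = key_fn(rec)
            if key:  # empty keys do not form a block
                blocks[key].append(rec_id)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def pair_completeness(candidates, true_pairs):
    """Fraction of ground-truth pairs that are still candidates after blocking."""
    truth = {tuple(sorted(p)) for p in true_pairs}
    return len(candidates & truth) / len(truth) if truth else 0.0


def title_key(rec):
    return rec["title"].strip().lower()[:1]


def manufacturer_key(rec):
    return rec["manufacturer"].strip().lower()


# Invented records; "manufacturer" is an assumed field name.
records = {
    "a1": {"title": "Adobe Photoshop CS3", "manufacturer": "adobe"},
    "g1": {"title": "photoshop cs3", "manufacturer": "adobe"},
    "a2": {"title": "Norton Antivirus", "manufacturer": "symantec"},
    "g2": {"title": "norton antivirus 2007", "manufacturer": ""},
}
true_pairs = [("a1", "g1"), ("a2", "g2")]

title_only = pair_completeness(candidate_pairs(records, [title_key]), true_pairs)
with_manufacturer = pair_completeness(
    candidate_pairs(records, [title_key, manufacturer_key]), true_pairs
)
print(title_only, with_manufacturer)
```

Running the same check on the real ground truth would show how many of the lost matches a second key actually recovers, and how many extra comparisons it adds.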

Precision

Some non-perfect matches are still similar (for example, Adobe products)
A quick analysis of the perfect matches, measuring their similarity directly with tf-idf, shows

               sim
count  1300.000000
mean      0.339630
std       0.264138
min       0.000000
25%       0.083027
50%       0.340256
75%       0.544598
max       1.000000

that the average similarity is about 0.34 (0.4 after removing empty strings). Lowering the threshold that far would also lead to many false positives.
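
For reference, a minimal pure-Python sketch of a tf-idf cosine similarity between two titles is shown below. It uses whitespace tokenization and a smoothed idf, log((1 + N) / (1 + df)) + 1. Both are assumptions and may differ from the formula used in this project. All names and sample titles are hypothetical. The sketch also shows why an empty title always yields a similarity of 0.

```python
import math
from collections import Counter


def tokenize(title):
    return title.lower().split()


def build_idf(titles):
    """Smoothed inverse document frequency per token (assumed formula)."""
    n = len(titles)
    df = Counter()
    for title in titles:
        df.update(set(tokenize(title)))  # count each token once per title
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf_vector(title, idf):
    tf = Counter(tokenize(title))
    return {tok: count * idf.get(tok, 0.0) for tok, count in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty title has no vector, so the similarity is 0
    return dot / (norm_u * norm_v)


def similarity(title_a, title_b, idf):
    return cosine(tfidf_vector(title_a, idf), tfidf_vector(title_b, idf))


# Invented titles, not taken from the dataset.
titles = ["adobe photoshop cs3", "adobe illustrator cs3", "norton antivirus 2007", ""]
idf = build_idf(titles)
for other in titles[1:]:
    print(repr(other), round(similarity(titles[0], other, idf), 3))
```

Two different Adobe products share most of their tokens and therefore get a non-zero score, which is the kind of near-match that hurts precision when the threshold is lowered.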

In comparison, MongeElkanJaroWinkler, which is also a sentence-based similarity measure, seems at first glance to have better recall but lower precision.

Recall = 0.05073020753266718
Precision = 0.07399103139013453
FMeasure = 0.06019151846785226
AllPositives = 892
MaxClusterSize = 2
AverageClusterSize = 1.2412767108466325
TruePositives = 66
PerfectCompleteClusterNo = 892
GtRecordNo = 1301
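
To make the comparison more concrete, here is a small pure-Python sketch of a Monge-Elkan similarity that uses Jaro-Winkler as the inner token measure. It follows the textbook definitions (prefix scale 0.1, prefix length capped at 4, mean of the best match per token of the first string) and is not the MongeElkanJaroWinkler implementation used in this merge request. Function names and sample titles are hypothetical.

```python
def jaro(s, t):
    """Jaro similarity of two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)


def monge_elkan(a, b):
    """Mean over the tokens of `a` of the best Jaro-Winkler match in `b`."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    return sum(max(jaro_winkler(x, y) for y in tokens_b)
               for x in tokens_a) / len(tokens_a)


# Invented titles: two different products that share brand and version tokens.
print(monge_elkan("adobe photoshop cs3", "adobe illustrator cs3"))
```

Because each token only needs one good partner in the other title, shared brand and version tokens push the score up even for different products, which is consistent with the higher recall and lower precision reported above.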

Edited Jun 01, 2020 by Tinsaye Abye

[feature] adds a simple full deduplication of the amazonGoogle dataset
