[feature] adds a simple full deduplication of the amazonGoogle dataset
EDIT: due to a miscalculation of idf, some of the concerns below are no longer accurate. See !5 (merged)
Closes #3 (closed)
Runs a full deduplication on the amazonGoogle dataset as a proof of concept (PoC).
- Blocking uses StandardBlocking withPrefixLength 1 for the key generator
- Builds document frequencies as described in #2 (closed)
- Clustering is done by CLIP, since it has significantly better precision than ConnectedComponents
- Performance measurement is done by ClusteringQualityMeasures
First results for threshold 0.70
Recall = 0.024596464258262875
Precision = 0.16326530612244897
FMeasure = 0.04275217100868404
AllPositives = 196
MaxClusterSize = 2
AverageClusterSize = 1.0446164352378784
TruePositives = 32
PerfectCompleteClusterNo = 196
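The precision, recall and F-measure figures can be reproduced from the raw counts. The following is a minimal sketch, not the ClusteringQualityMeasures implementation: the function name `quality` is hypothetical, and it assumes the ground truth holds 1301 matching pairs, the GtRecordNo value reported at the end of this description.

```python
def quality(true_positives: int, all_positives: int, gt_pairs: int) -> dict:
    """Pair-based precision, recall and F-measure from raw counts."""
    precision = true_positives / all_positives if all_positives else 0.0
    recall = true_positives / gt_pairs if gt_pairs else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Counts from the threshold 0.70 run; 1301 is the assumed ground-truth size.
result = quality(true_positives=32, all_positives=196, gt_pairs=1301)
print(result)
```

With only 32 true positives against 1301 expected pairs, recall is bounded far more by the pairs that never reach the comparison step than by the threshold itself.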
First observation
Recall
- Half of the expected perfect matches are lost during blocking because their titles do not start with the same letter -> adding "manufacturer" as a second blocking key could lead to a small improvement
- Some of the titles are empty, which leads to a simValue of 0
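To illustrate the two recall points, here is a minimal sketch of prefix blocking with prefix length 1. It is not the StandardBlocking implementation: `prefix_key`, `candidate_pairs`, the record layout, the `manufacturer` field name and the sample records are all made up for illustration. Records are only compared when they share a blocking key, so two titles for the same product that start with different letters never become a candidate pair, and an empty title produces no key at all.

```python
from collections import defaultdict
from itertools import combinations


def prefix_key(value: str, length: int = 1) -> str:
    """Blocking key: the first `length` characters, lower-cased."""
    return value.strip().lower()[:length]


def candidate_pairs(records: dict, fields: list, length: int = 1) -> set:
    """Return id pairs that share at least one blocking key.

    `records` maps a record id to a dict of attributes; `fields` lists
    the attributes used as blocking keys (hypothetical layout).
    """
    blocks = defaultdict(set)
    for rec_id, attrs in records.items():
        for field in fields:
            key = prefix_key(attrs.get(field, ""), length)
            if key:  # empty values produce no blocking key
                blocks[(field, key)].add(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up records: a1 and g1 describe the same product, but the titles
# start with different letters; g2 has an empty title.
records = {
    "a1": {"title": "adobe photoshop cs3", "manufacturer": "adobe"},
    "g1": {"title": "photoshop cs3 by adobe", "manufacturer": "adobe"},
    "g2": {"title": "", "manufacturer": "microsoft"},
}

title_only = candidate_pairs(records, ["title"])
with_manufacturer = candidate_pairs(records, ["title", "manufacturer"])
print(title_only)
print(with_manufacturer)
```

In this toy data the title key alone yields no candidate pair, while adding the second key recovers the a1/g1 pair. How much a second key helps on the real dataset depends on how consistently that attribute is filled.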
Precision
- Some non-perfect matches are still similar (for example, Adobe products)
- A quick analysis of the perfect matches, directly measuring their similarity with tf-idf, shows
sim
count 1300.000000
mean 0.339630
std 0.264138
min 0.000000
25% 0.083027
50% 0.340256
75% 0.544598
max 1.000000
that the average similarity is about 0.34 (0.4 after removing empty strings). Lowering the threshold that far would also lead to many false positives.
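As a rough illustration of the similarity measured above, the sketch below computes a tf-idf cosine similarity in plain Python. It is a stand-in under stated assumptions, not the project's implementation: whitespace tokenization, raw term counts, the smoothed idf variant `log((1 + N) / (1 + df)) + 1`, and the sample titles are all choices made for this example. It shows why an empty title always yields a similarity of 0.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list) -> list:
    """Build one sparse tf-idf vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumption; other variants exist).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return dot / (norm_a * norm_b)


# Made-up titles; the last one is empty.
titles = ["adobe photoshop cs3", "adobe photoshop cs3 upgrade", ""]
vecs = tfidf_vectors(titles)
sim_similar = cosine(vecs[0], vecs[1])
sim_empty = cosine(vecs[0], vecs[2])
print(sim_similar, sim_empty)
```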
In comparison, MongeElkanJaroWinkler, which is also a sentence-based similarity measure, at first glance seems to have better recall but lower precision.
Recall = 0.05073020753266718
Precision = 0.07399103139013453
FMeasure = 0.06019151846785226
AllPositives = 892
MaxClusterSize = 2
AverageClusterSize = 1.2412767108466325
TruePositives = 66
PerfectCompleteClusterNo = 892
GtRecordNo = 1301
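For reference, Monge-Elkan scores two strings by matching each token of the first string to its most similar token in the second and averaging those best scores. The sketch below follows that definition but swaps the inner Jaro-Winkler measure for `difflib.SequenceMatcher.ratio()` from the Python standard library to stay self-contained, so its values will differ from those of MongeElkanJaroWinkler. The function names and sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Stand-in inner similarity (NOT Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(left: str, right: str, inner=token_sim) -> float:
    """Average, over tokens of `left`, of the best match in `right`.

    The measure is asymmetric: swapping the arguments can change the score.
    """
    left_tokens = left.lower().split()
    right_tokens = right.lower().split()
    if not left_tokens or not right_tokens:
        return 0.0
    best = [max(inner(l, r) for r in right_tokens) for l in left_tokens]
    return sum(best) / len(best)


# Made-up titles: same product with reordered tokens, then a different product.
same_product = monge_elkan("adobe photoshop cs3", "photoshop cs3 by adobe")
other_product = monge_elkan("adobe photoshop cs3", "adobe acrobat 8")
print(same_product, other_product)
```

Token order does not matter here, which helps with reordered titles. A shared brand token alone already lifts the score for different products, which may be one reason for the lower precision, though that would need to be checked against the actual false positives.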
Edited by Tinsaye Abye