
[feature] adds a simple full deduplication of the amazonGoogle dataset

Tinsaye Abye requested to merge 3-pre-and-post-processing into develop

EDIT: due to a miscalculation of the IDF, some of the concerns below are no longer accurate. See !5 (merged)

Closes #3 (closed)

Runs a full deduplication on the amazonGoogle dataset as a proof of concept (PoC).

  • Blocking uses StandardBlocking with a PrefixLength of 1 for the key generator
  • Builds the document frequencies as described in #2 (closed)
  • Clustering is done by CLIP, since it has significantly better precision than ConnectedComponents
  • Performance measurement is done by ClusteringQualityMeasures
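The blocking step in the first bullet can be pictured with a minimal Python sketch. It is not the StandardBlocking implementation used in this merge request; the function names, the lowercasing, and the sample titles are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def prefix_key(title, prefix_length=1):
    """Blocking key: the first `prefix_length` characters of the lowercased title."""
    return title.strip().lower()[:prefix_length]


def standard_blocking(records, prefix_length=1):
    """Group record ids by blocking key and return all candidate pairs per block."""
    blocks = defaultdict(list)
    for record_id, title in records.items():
        blocks[prefix_key(title, prefix_length)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        # Only records that share a key are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample titles, not taken from the dataset.
records = {
    "a1": "Adobe Photoshop CS3",
    "g1": "adobe photoshop cs3 upgrade",
    "a2": "Microsoft Office 2007",
    "g2": "Office 2007 by Microsoft",
}
candidate_pairs = standard_blocking(records)
```

With a prefix length of 1, `a2` and `g2` end up in different blocks ("m" and "o") and are never compared, even though they describe the same product.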

First results for threshold 0.70

Recall = 0.024596464258262875
Precision = 0.16326530612244897
FMeasure = 0.04275217100868404
AllPositives = 196
MaxClusterSize = 2
AverageClusterSize = 1.0446164352378784
TruePositives = 32
PerfectCompleteClusterNo = 196
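As a sanity check, the three quality values can be recomputed from the counts. The sketch below assumes that the ground truth contains 1301 pairs (the GtRecordNo reported with the MongeElkanJaroWinkler run), which reproduces the recall above; the function name `quality` is made up for this example and is not part of ClusteringQualityMeasures.

```python
def quality(true_positives, all_positives, gt_pairs):
    """Pairwise precision, recall and F-measure from raw counts."""
    precision = true_positives / all_positives
    recall = true_positives / gt_pairs
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from the threshold 0.70 run; 1301 is the assumed ground truth size.
precision, recall, f_measure = quality(32, 196, 1301)
print(f"Precision = {precision:.4f}, Recall = {recall:.4f}, FMeasure = {f_measure:.4f}")
```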

First observations

Recall

  • Half of the expected perfect matches are lost during blocking because their titles do not start with the same letter -> adding "manufacturer" as a second blocking key could lead to a small improvement
  • A portion of the titles are empty -> leading to a simValue of 0
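The effect of a second blocking key can be estimated before rerunning the whole pipeline by checking how many ground truth pairs survive blocking. The sketch below is an illustration under assumptions: the field names `title` and `manufacturer`, the helper names, and the sample records are hypothetical, and empty values are skipped rather than placed in a shared block.

```python
from itertools import combinations


def candidate_pairs(records, fields, prefix_length=1):
    """Pairs of records that share a prefix key on at least one of `fields`."""
    blocks = {}
    for record_id, record in records.items():
        for field in fields:
            value = (record.get(field) or "").strip().lower()
            if value:  # empty values produce no key
                blocks.setdefault((field, value[:prefix_length]), []).append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def pair_completeness(pairs, ground_truth):
    """Share of ground truth pairs that are still candidates after blocking."""
    return len(pairs & ground_truth) / len(ground_truth)


# Hypothetical records, not taken from the dataset.
records = {
    "a1": {"title": "Adobe Photoshop CS3", "manufacturer": "Adobe"},
    "g1": {"title": "photoshop cs3", "manufacturer": "adobe systems"},
    "a2": {"title": "Microsoft Office 2007", "manufacturer": "Microsoft"},
    "g2": {"title": "microsoft office 2007 full", "manufacturer": ""},
}
ground_truth = {("a1", "g1"), ("a2", "g2")}

title_only = pair_completeness(candidate_pairs(records, ["title"]), ground_truth)
with_manufacturer = pair_completeness(
    candidate_pairs(records, ["title", "manufacturer"]), ground_truth
)
```

In this toy example the title key alone keeps one of the two true pairs, and adding the manufacturer key recovers the second. The gain on the real data depends on how often the manufacturer field is filled in.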

Precision

  • Some non-perfect matches are still similar (for example, Adobe products)
  • A quick analysis of the perfect matches, measuring their similarity directly with TF-IDF, shows
               sim
count  1300.000000
mean      0.339630
std       0.264138
min       0.000000
25%       0.083027
50%       0.340256
75%       0.544598
max       1.000000

that the average similarity is about 0.34 (0.4 after removing empty strings). Decreasing the threshold that far would also lead to many false positives.
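For reference, a TF-IDF cosine similarity between two titles can be sketched in plain Python as follows. This is not the implementation from #2: the whitespace tokenizer, the IDF variant `log(N / df) + 1`, and the sample titles are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def idf_table(documents):
    """IDF per term, using the assumed variant log(N / df) + 1."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    counts = Counter(tokenize(text))
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty title has no terms, so the similarity is 0
    return dot / (norm_u * norm_v)


def tfidf_similarity(a, b, idf):
    return cosine(tfidf_vector(a, idf), tfidf_vector(b, idf))


# Hypothetical titles, not taken from the dataset.
titles = ["adobe photoshop cs3", "adobe photoshop cs3", "microsoft office 2007", ""]
idf = idf_table(titles)
same = tfidf_similarity(titles[0], titles[1], idf)
disjoint = tfidf_similarity(titles[0], titles[2], idf)
empty = tfidf_similarity(titles[0], titles[3], idf)
```

The last case shows why empty titles pull the mean down: a pair with an empty title always scores 0, whatever the threshold is.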

(Figure: perfect_tfidf)

In comparison, MongeElkanJaroWinkler, which is also a sentence-based similarity measure, at first glance seems to have better recall but lower precision.

Recall = 0.05073020753266718
Precision = 0.07399103139013453
FMeasure = 0.06019151846785226
AllPositives = 892
MaxClusterSize = 2
AverageClusterSize = 1.2412767108466325
TruePositives = 66
PerfectCompleteClusterNo = 892
GtRecordNo = 1301
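To make the comparison easier to follow, here is a minimal sketch of a Monge-Elkan similarity with Jaro-Winkler as the inner token measure. It is a textbook formulation, not the MongeElkanJaroWinkler class used in this run; the whitespace tokenization, the prefix scale of 0.1, and the sample strings are assumptions.

```python
def jaro(s, t):
    """Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s)):
        if s_matched[i]:
            while not t_matched[k]:
                k += 1
            if s[i] != t[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def monge_elkan(a, b, inner=jaro_winkler):
    """Average, over the tokens of `a`, of the best inner match in `b`."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    return sum(max(inner(x, y) for y in tokens_b) for x in tokens_a) / len(tokens_a)


# Hypothetical titles, not taken from the dataset.
score = monge_elkan("adobe photoshop", "adobe photoshop elements")
```

In this formulation the measure is asymmetric, and a title whose tokens all occur in a longer title scores 1.0 against it, so two different products from the same product family can look identical from one side.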
Edited by Tinsaye Abye
