[feature] document frequency count functions and broadcasting (!2) · Merge requests · Tinsaye Abye / TFIDF-FAMER

Merged Tinsaye Abye requested to merge 2-document-frequency-pre-calculation into develop 4 years ago

Implements two functions to calculate the document frequency:

ExtractWords is a FlatMap function that maps the sentences of each vertex to words;
the result can then be grouped by word.
CountDocumentFrequency is a GroupReduce function that counts the number of documents in which a word occurs.
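
To illustrate what the two functions compute together, here is a minimal plain-Java sketch without Flink. The class name `DocumentFrequencySketch`, the whitespace tokenizer and the lower-casing are assumptions for illustration only; the functions in this merge request work on vertices and use the configured tokenizer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the merge request.
final class DocumentFrequencySketch {

  /** Counts, for every word, the number of documents that contain it at least once. */
  static Map<String, Integer> countDocumentFrequency(List<String> documents) {
    Map<String, Integer> documentFrequency = new HashMap<>();
    for (String document : documents) {
      // "Extract words" step: split on whitespace (assumed tokenizer) and
      // de-duplicate, so a word repeated within one document counts only once.
      Set<String> distinctWords =
        new HashSet<>(Arrays.asList(document.trim().toLowerCase().split("\\s+")));
      distinctWords.remove(""); // an empty document yields a single empty token
      // "Count" step: one increment per (document, word) pair.
      for (String word : distinctWords) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }
    return documentFrequency;
  }

  public static void main(String[] args) {
    List<String> documents = List.of("the quick fox", "the lazy dog", "the fox the fox");
    System.out.println(countDocumentFrequency(documents));
  }
}
```

The per-document de-duplication is what makes this a document frequency rather than a total term count. In the Flink pipeline the equivalent guarantee has to come from either the extraction or the reduce step.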

SimilarityMeasurerWithDocumentFrequencyBroadcast reads the broadcast variable and sets it on TFIDFSimilarityComponent.

The Linker should calculate the document frequency and broadcast it under the name DOCUMENT_FREQUENCY_BROADCAST.

    // Collect the document frequency (word -> number of documents) on the client.
    Map<String, Integer> wordsInDoc =
      benchmarkDataCollection.getVertices()
        .flatMap(new WordExtractor(sourceAttribute, tokenizer))
        .groupBy(1) // group by tuple field 1 (the word)
        .reduceGroup(new CountDocumentFrequency())
        .collect().stream()
        .collect(Collectors.toMap(t -> t.f0, t -> t.f1)); // could be implemented better

    // ....
    // Ship the map to the similarity measurer as a single-element broadcast set.
    blockedVertices.flatMap(new SimilarityMeasurerWithDocumentFrequencyBroadcast(similarityComponents))
      .withBroadcastSet(getExecutionEnvironment().fromElements(wordsInDoc),
        TFIDFSimilarityComponent.DOCUMENT_FREQUENCY_BROADCAST);

Also changed TFIDF to use the pre-calculated document frequency.
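
As a rough illustration of why the pre-calculated map is useful, the sketch below computes a TF-IDF weight from a ready-made document-frequency map instead of rescanning the corpus. The merge request does not state which weighting variant the project uses, so the textbook form `tf * ln(N / df)` with a raw term count is an assumption here, as are the class and method names.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the project's TFIDF implementation.
final class TfIdfSketch {

  /**
   * Textbook TF-IDF weight of a word in one tokenized document.
   *
   * @param documentFrequency pre-calculated map: word -> number of documents containing it
   * @param totalDocuments    number of documents in the corpus (N)
   */
  static double tfIdf(String word, List<String> documentTokens,
                      Map<String, Integer> documentFrequency, int totalDocuments) {
    // Term frequency: raw number of occurrences in this document (assumed variant).
    long termFrequency = documentTokens.stream().filter(word::equals).count();
    Integer df = documentFrequency.get(word);
    if (termFrequency == 0 || df == null || df == 0) {
      return 0.0; // word absent from the document or unknown to the corpus
    }
    // Inverse document frequency from the pre-calculated map.
    double inverseDocumentFrequency = Math.log((double) totalDocuments / df);
    return termFrequency * inverseDocumentFrequency;
  }

  public static void main(String[] args) {
    Map<String, Integer> documentFrequency = Map.of("the", 3, "fox", 2, "dog", 1);
    List<String> tokens = List.of("the", "fox", "the", "fox");
    System.out.println(tfIdf("fox", tokens, documentFrequency, 3));
    System.out.println(tfIdf("the", tokens, documentFrequency, 3));
  }
}
```

A word that occurs in every document gets weight zero under this variant, because `ln(N / N) = 0`.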

Also closes #5 (closed)

Edited 4 years ago by Tinsaye Abye

Activity

Tinsaye Abye changed milestone to %POC 4 years ago
Tinsaye Abye added enhancement label 4 years ago
Tinsaye Abye mentioned in commit c5dae6ba 4 years ago
Tinsaye Abye merged 4 years ago
Tinsaye Abye changed the description 4 years ago

Merged by Tinsaye Abye 4 years ago (May 29, 2020 12:36am UTC)
