[feature] document frequency count functions and broadcasting

Merged Tinsaye Abye requested to merge 2-document-frequency-pre-calculation into develop

Closes #2 (closed)

Implements two functions to calculate the document frequency:

  • ExtractWords is a FlatMap function that maps the sentences of each vertex to words
  • the result can then be grouped by word
  • CountDocumentFrequency is a GroupReduce function that counts the number of documents in which a word occurs.
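The two steps above can be illustrated without Flink. The following is a minimal plain-Java sketch, not the code from this merge request: the class `DocumentFrequencySketch`, the whitespace tokenizer, and the lower-casing are hypothetical stand-ins, and it assumes a word is counted at most once per document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of the merge request.
public class DocumentFrequencySketch {

  /** Step 1 (the role of the word-extracting FlatMap): distinct words of one document. */
  static List<String> extractWords(String text) {
    // Assumption: whitespace tokenization and lower-casing stand in for the real tokenizer.
    return Arrays.stream(text.toLowerCase().split("\\s+"))
      .filter(w -> !w.isEmpty())
      .distinct() // a word counts once per document, however often it repeats
      .collect(Collectors.toList());
  }

  /** Step 2 (the role of the GroupReduce): group by word and count the documents. */
  static Map<String, Integer> documentFrequency(List<String> documents) {
    return documents.stream()
      .flatMap(doc -> extractWords(doc).stream())
      .collect(Collectors.groupingBy(
        Function.identity(), TreeMap::new, Collectors.summingInt(w -> 1)));
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList("big data big graphs", "graph data", "linking data");
    // prints {big=1, data=3, graph=1, graphs=1, linking=1}
    System.out.println(documentFrequency(docs));
  }
}
```

Note that "big" appears twice in the first document but has a document frequency of 1, which is the difference between document frequency and a plain word count.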

SimilarityMeasurerWithDocumentFrequencyBroadcast reads the broadcast variable and sets it on TFIDFSimilarityComponent.

The Linker should calculate the document frequency and broadcast it under the name DOCUMENT_FREQUENCY_BROADCAST.

    // Document frequency per word, collected to the client as a Map
    Map<String, Integer> wordsInDoc =
      benchmarkDataCollection.getVertices()
        .flatMap(new WordExtractor(sourceAttribute, tokenizer))
        .groupBy(1) // tuple field 1 (the word, per the description above)
        .reduceGroup(new CountDocumentFrequency())
        .collect().stream()
        .collect(Collectors.toMap(t -> t.f0, t -> t.f1)); // could be implemented better

    // ....
    // Ship the map to every parallel instance as a broadcast set
    blockedVertices
      .flatMap(new SimilarityMeasurerWithDocumentFrequencyBroadcast(similarityComponents))
      .withBroadcastSet(getExecutionEnvironment().fromElements(wordsInDoc),
        TFIDFSimilarityComponent.DOCUMENT_FREQUENCY_BROADCAST);

Also changed TFIDF to use the precalculated document frequency.
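To show why a precalculated document frequency map is enough for the weighting step, here is a minimal plain-Java sketch. It uses the textbook weight tf · ln(N / df), which may differ from the variant implemented in TFIDFSimilarityComponent; the class `TfIdfSketch`, the method `tfIdf`, and all parameter names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of the merge request.
public class TfIdfSketch {

  /**
   * Textbook TF-IDF weight: tf * ln(N / df).
   * Assumption: the real component may use a different normalization or smoothing.
   */
  static double tfIdf(String word,
                      Map<String, Integer> termCounts,        // word counts of one document
                      Map<String, Integer> documentFrequency, // precalculated, e.g. broadcast
                      int totalDocuments) {
    int tf = termCounts.getOrDefault(word, 0);
    int df = documentFrequency.getOrDefault(word, 0);
    if (tf == 0 || df == 0) {
      return 0.0; // word absent from the document or from the corpus statistics
    }
    return tf * Math.log((double) totalDocuments / df);
  }

  public static void main(String[] args) {
    Map<String, Integer> df = new HashMap<>();
    df.put("data", 4);  // occurs in every document
    df.put("graph", 1); // occurs in a single document

    Map<String, Integer> counts = new HashMap<>();
    counts.put("data", 3);
    counts.put("graph", 2);

    System.out.println(tfIdf("data", counts, df, 4));  // ln(4/4) = 0, so the weight is 0.0
    System.out.println(tfIdf("graph", counts, df, 4)); // 2 * ln(4)
  }
}
```

Because the lookup only needs the map and the total document count, each parallel task can score its own records without another pass over the whole data set.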

Also closes #5 (closed)

Edited by Tinsaye Abye

Merged by Tinsaye Abye 4 years ago (May 29, 2020 12:36am UTC)
