
[feature] document frequency count functions and broadcasting

Tinsaye Abye requested to merge 2-document-frequency-pre-calculation into develop

Closes #2 (closed)

Implements two functions to calculate the document frequency:

  • ExtractWords is a FlatMap that maps sentences to words per vertex
  • the result can then be grouped by word
  • CountDocumentFrequency is a groupReduce function that counts the number of documents in which a word occurs.
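
As a rough illustration of the two steps above, here is a plain-Java sketch that does not use Flink. The class name DocumentFrequencySketch, the whitespace tokenizer, and counting each word at most once per document are assumptions made for this example, not the code in this merge request.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Flink functions; names and tokenization are assumptions.
public class DocumentFrequencySketch {

  // Step 1 (role of the FlatMap): split one document into its distinct lowercase words.
  public static List<String> extractWords(String text) {
    return Arrays.stream(text.toLowerCase().split("\\s+"))
      .filter(word -> !word.isEmpty())
      .distinct() // assumption: a word counts once per document
      .collect(Collectors.toList());
  }

  // Step 2 (role of groupBy + groupReduce): count the documents each word occurs in.
  public static Map<String, Integer> countDocumentFrequency(List<String> documents) {
    return documents.stream()
      .flatMap(document -> extractWords(document).stream())
      .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
  }

  public static void main(String[] args) {
    List<String> documents =
      Arrays.asList("the cat sat", "the dog sat", "the cat and the bird");
    // prints {and=1, bird=1, cat=2, dog=1, sat=2, the=3}
    System.out.println(countDocumentFrequency(documents));
  }
}
```

The TreeMap is only there to make the printed order deterministic; a HashMap would work equally well for lookups.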

SimilarityMeasurerWithDocumentFrequencyBroadcast reads the broadcast variable and sets it on the TFIDFSimilarityComponent.

The Linker should calculate the document frequency and broadcast it as the DOCUMENT_FREQUENCY_BROADCAST variable:

    Map<String, Integer> wordsInDoc =
      benchmarkDataCollection.getVertices()
        .flatMap(new WordExtractor(sourceAttribute, tokenizer))
        .groupBy(1)
        .reduceGroup(new CountDocumentFrequency())
        // could be implemented better
        .collect().stream().collect(Collectors.toMap(t -> t.f0, t -> t.f1));

    // ...
    blockedVertices.flatMap(new SimilarityMeasurerWithDocumentFrequencyBroadcast(similarityComponents))
      .withBroadcastSet(getExecutionEnvironment().fromElements(wordsInDoc),
        TFIDFSimilarityComponent.DOCUMENT_FREQUENCY_BROADCAST);

Also changed TFIDF to use the precalculated document frequency.
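
To illustrate why a precalculated document frequency helps, the following sketch scores a word with the common textbook weighting tf * ln(N / df), looking up df in a ready-made map instead of rescanning all documents. The class TfIdfSketch, its method signature, and the exact weighting are assumptions for illustration; the formula used by TFIDFSimilarityComponent may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example; not the TFIDFSimilarityComponent implementation.
public class TfIdfSketch {

  // Scores one word of one document using a precalculated document-frequency map.
  public static double tfIdf(String word,
                             Map<String, Integer> termCounts,
                             Map<String, Integer> documentFrequency,
                             long totalDocuments) {
    int tf = termCounts.getOrDefault(word, 0);
    int df = documentFrequency.getOrDefault(word, 0);
    if (tf == 0 || df == 0) {
      return 0.0; // word absent from the document or from the corpus statistics
    }
    // Assumed weighting: term frequency times natural log of (N / df).
    return tf * Math.log((double) totalDocuments / df);
  }

  public static void main(String[] args) {
    Map<String, Integer> documentFrequency = new HashMap<>();
    documentFrequency.put("the", 3);
    documentFrequency.put("cat", 2);

    Map<String, Integer> termCounts = new HashMap<>();
    termCounts.put("the", 2);
    termCounts.put("cat", 1);

    // "the" occurs in all 3 documents, so its score is 0.0.
    System.out.println(tfIdf("the", termCounts, documentFrequency, 3));
    // "cat" occurs in 2 of 3 documents, so its score is 1 * ln(1.5).
    System.out.println(tfIdf("cat", termCounts, documentFrequency, 3));
  }
}
```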

Also closes #5 (closed)

Edited by Tinsaye Abye
