[feature] document frequency count functions and broadcasting
Closes #2 (closed)
Implements two functions to calculate the document frequency:
- `ExtractWords` is a FlatMap that maps sentences to words per vertex; the result can then be grouped by word.
- `CountDocumentFrequency` is a groupReduce function that counts the number of documents in which a word occurs.
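
For reviewers who want to check the counting logic without running a Flink job, the two steps above can be sketched in plain Java. This is only an illustration and not the code in this merge request. The class `DocumentFrequencySketch`, its method names, whitespace tokenization, lower-casing, and counting each word at most once per document are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical plain-Java stand-in for the flatMap + groupReduce pipeline.
public class DocumentFrequencySketch {

    // Step 1 (role of the FlatMap): split one document into its distinct words.
    // Assumption: whitespace tokenization and lower-casing.
    static Set<String> extractWords(String text) {
        Set<String> words = new HashSet<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Step 2 (role of the groupReduce): count in how many documents each word occurs.
    // Because extractWords returns a set, a word repeated within one document counts once.
    static Map<String, Integer> countDocumentFrequency(List<String> documents) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (String document : documents) {
            for (String word : extractWords(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        return documentFrequency;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("the red car", "the blue car car", "a red bike");
        System.out.println(countDocumentFrequency(documents));
    }
}
```

In this sketch "car" appears twice in the second document but contributes only one to its document frequency, which is the difference between document frequency and a plain term count.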
`SimilarityMeasurerWithDocumentFrequencyBroadcast` gets the broadcast variable and sets it on `TFIDFSimilarityComponent`.
The Linker should calculate the DocumentFrequency and broadcast it as the `DOCUMENT_FREQUENCY_BROADCAST` variable:

    Map<String, Integer> wordsInDoc =
        benchmarkDataCollection.getVertices()
            .flatMap(new WordExtractor(sourceAttribute, tokenizer))
            .groupBy(1)
            .reduceGroup(new CountDocumentFrequency())
            .collect().stream()
            .collect(Collectors.toMap(t1 -> t1.f0, t2 -> t2.f1)); // could be implemented better
    // ....
    blockedVertices
        .flatMap(new SimilarityMeasurerWithDocumentFrequencyBroadcast(similarityComponents))
        .withBroadcastSet(getExecutionEnvironment().fromElements(wordsInDoc),
            TFIDFSimilarityComponent.DOCUMENT_FREQUENCY_BROADCAST);

Also changed TFIDF to use the pre-calculated `documentsFrequency`.
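
As a rough illustration of what using a pre-calculated document frequency map can look like, here is a textbook TF-IDF weighting in plain Java. It is a sketch under assumptions, not the implementation in `TFIDFSimilarityComponent`. The class `TfIdfSketch`, its methods, the `ln(N / df)` weighting, and the choice to return 0 for unseen words are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: TF-IDF that looks up document frequencies instead of recounting them.
public class TfIdfSketch {

    // idf(word) = ln(totalDocuments / df(word)).
    // Assumed convention: words missing from the map get an idf of 0.
    static double inverseDocumentFrequency(String word,
                                           Map<String, Integer> documentFrequency,
                                           int totalDocuments) {
        Integer df = documentFrequency.get(word);
        if (df == null || df == 0) {
            return 0.0;
        }
        return Math.log((double) totalDocuments / df);
    }

    // tf(word) = occurrences of word in tokens / number of tokens.
    static double tfIdf(String word,
                        String[] tokens,
                        Map<String, Integer> documentFrequency,
                        int totalDocuments) {
        if (tokens.length == 0) {
            return 0.0;
        }
        int occurrences = 0;
        for (String token : tokens) {
            if (token.equals(word)) {
                occurrences++;
            }
        }
        double termFrequency = (double) occurrences / tokens.length;
        return termFrequency * inverseDocumentFrequency(word, documentFrequency, totalDocuments);
    }

    public static void main(String[] args) {
        // Stand-in for the map received through the broadcast variable.
        Map<String, Integer> documentFrequency = new HashMap<>();
        documentFrequency.put("car", 2);
        documentFrequency.put("bike", 1);
        String[] tokens = {"the", "car", "car", "bike"};
        System.out.println(tfIdf("car", tokens, documentFrequency, 4));
    }
}
```

The point of the lookup is that the per-word document counts are computed once for the whole data set and then reused for every vertex pair being compared.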
Also Closes #5 (closed)
Edited by Tinsaye Abye