Changes

Matthias Taeschner · 4ad02f60
--- a/Home/Configuration/Clustering-Configuration-(JSON).md
+++ b/Home/Configuration/Clustering-Configuration-(JSON).md
+## Overall structure
+
+The JSON configuration for clustering and cluster postprocessing has the following overall structure:
+```json
+{
+  "clustering":{
+    "clusteringMethod":"CENTER",
+    "prioritySelection":"MIN",
+    "isEdgesBiDirected":false,
+    "clusteringOutputType":"GRAPH",
+    "maxIteration":"MAX_VALUE"
+  },
+  "postprocessing":{
+    "postprocessingMethod":"OVERLAP_RESOLVE_NO_MERGE",
+    "delta":"0.5",
+    "runPhase2":false
+  }
+}
+```
+
+Both the clustering entry and the postprocessing entry are optional, so clustering and postprocessing can be run separately. The configuration may therefore look like this:
+
+```json
+{
+  "clustering":{
+    ...
+  }
+}
+```
+or this:
+
+```json
+{
+  "postprocessing":{
+    ...
+  }
+}
+```
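
Because both top-level entries are optional, a consumer of this file has to decide per key which stage to run. The following Python sketch illustrates that decision only; `stages_to_run` is a hypothetical helper and not part of FAMER, whose own parser may behave differently.

```python
import json


def stages_to_run(config_text):
    """Return the stages named by the top-level keys, in pipeline order.

    Hypothetical helper for illustration; not FAMER's actual parser.
    """
    config = json.loads(config_text)
    # Clustering comes before postprocessing when both entries are present.
    return [key for key in ("clustering", "postprocessing") if key in config]


both = json.dumps({
    "clustering": {"clusteringMethod": "CENTER"},
    "postprocessing": {"postprocessingMethod": "OVERLAP_RESOLVE_NO_MERGE"},
})
only_post = json.dumps({
    "postprocessing": {"postprocessingMethod": "OVERLAP_RESOLVE_NO_MERGE"},
})

print(stages_to_run(both))
print(stages_to_run(only_post))
```

With only a "postprocessing" entry, the sketch reports a single stage, which matches the separate-execution use case described above.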
+
+## Helper structures
+
+### ClusteringOutputType
+A clustering algorithm can return the clustered graph in one of the following output types:
+```json
+{
+  GRAPH,
+  GRAPH_COLLECTION,
+  VERTEX_SET
+}
+```
+`GRAPH` is of type `LogicalGraph`, and edges between two different clusters are allowed. `GRAPH_COLLECTION` is also of type `LogicalGraph`, but edges between two different clusters are removed. `VERTEX_SET` is a `LogicalGraph` consisting only of the vertices of the clustered graph.
+
+**All clustering algorithms therefore take a `LogicalGraph` as input and return a `LogicalGraph` as output.**
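
The difference between the three output types comes down to which edges survive. The Python sketch below models a graph as plain lists and a cluster assignment as a dictionary; these structures and the function `shape_output` are assumptions made for illustration and do not reflect the `LogicalGraph` API.

```python
def shape_output(vertices, edges, cluster_of, output_type):
    """Return (vertices, edges) as they would look for the given output type.

    vertices:   list of vertex ids
    edges:      list of (source, target) pairs
    cluster_of: dict mapping vertex id -> cluster id
    All of these are illustrative stand-ins, not the LogicalGraph API.
    """
    if output_type == "GRAPH":
        kept = list(edges)  # edges between different clusters are allowed
    elif output_type == "GRAPH_COLLECTION":
        # edges between different clusters are removed
        kept = [(s, t) for s, t in edges if cluster_of[s] == cluster_of[t]]
    elif output_type == "VERTEX_SET":
        kept = []  # vertices only
    else:
        raise ValueError(f"unknown clusteringOutputType: {output_type}")
    return list(vertices), kept


vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
cluster_of = {"a": 1, "b": 1, "c": 2}

for output_type in ("GRAPH", "GRAPH_COLLECTION", "VERTEX_SET"):
    print(output_type, shape_output(vertices, edges, cluster_of, output_type)[1])
```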
+
+### PrioritySelection
+Selects the cluster center by comparing vertex priorities:
+```json
+{
+  MIN,
+  MAX
+}
+```
+
+## Parallel clustering algorithms
+
+This is an overview of all available parallel clustering algorithms. All of them use "maxIteration", which can usually be left at `Integer.MAX_VALUE`. In the JSON configuration, "maxIteration" can therefore be set to either "MAX_VALUE" or a specific integer value.
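
A reader of the configuration has to map the two allowed forms of "maxIteration" to one integer. The Python sketch below shows one way to do that. `resolve_max_iteration` is a hypothetical helper, and two behaviors are assumptions not stated on this page: accepting both JSON numbers and numeric strings, and rejecting non-positive values.

```python
JAVA_INT_MAX = 2**31 - 1  # numeric value of Java's Integer.MAX_VALUE


def resolve_max_iteration(raw):
    """Map the "maxIteration" entry to an integer (illustrative helper)."""
    if raw == "MAX_VALUE":
        return JAVA_INT_MAX
    value = int(raw)  # accepts 25 as well as "25" (assumption)
    if value <= 0:
        # Assumption: an iteration limit below 1 is treated as an error.
        raise ValueError(f"maxIteration must be positive, got {value}")
    return value


print(resolve_max_iteration("MAX_VALUE"))
print(resolve_max_iteration("25"))
```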
+
+### Center algorithm
+```json
+{
+  "clusteringMethod":"CENTER",
+  "prioritySelection":"MIN",
+  "isEdgesBiDirected":false,
+  "clusteringOutputType":"GRAPH",
+  "maxIteration":"MAX_VALUE"
+}
+```
+
+### CLIP algorithm
+```json
+{
+  "clusteringMethod":"CLIP",
+  "clipConfig":{
+    "delta":"0.0",
+    "sourceNumber":"1",
+    "removeSourceConsistentVertices":false,
+    "simValueCoef":"0.5",
+    "degreeCoef":"0.2",
+    "strengthCoef":"0.3"
+  },
+  "clusteringOutputType":"GRAPH",
+  "maxIteration":"MAX_VALUE"
+}
+```
+
+### ConnectedComponents algorithm
+```json
+{
+  "clusteringMethod":"CONNECTED_COMPONENTS",
+  "clusterIdPrefix":"cc", //optional
+  "similarityEdgeLabel":"similarityEdge", //optional
+  "clusteringOutputType":"GRAPH",
+  "maxIteration":"MAX_VALUE"
+}
+```
+Note: "similarityEdgeLabel" defines the label of similarity edges. If a value is given, only edges with this label are considered when building the connected components.
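
The effect of restricting components to one edge label can be shown with a small union-find sketch in Python. It is not FAMER's implementation: the edge tuples, the function name, and the choice to consider all edges when no label is given are assumptions for illustration, and "clusterIdPrefix" is left out.

```python
def connected_components(vertices, edges, similarity_edge_label=None):
    """Group vertices into components using only edges with the given label.

    edges: list of (source, target, label) tuples (illustrative structure).
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for source, target, label in edges:
        if similarity_edge_label is not None and label != similarity_edge_label:
            continue  # edge has a different label and is ignored
        parent[find(source)] = find(target)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(group) for group in groups.values())


vertices = ["a", "b", "c", "d"]
edges = [
    ("a", "b", "similarityEdge"),
    ("b", "c", "other"),
    ("c", "d", "similarityEdge"),
]

print(connected_components(vertices, edges, "similarityEdge"))
print(connected_components(vertices, edges))
```

In this example the edge labeled "other" would join the two groups, so filtering by "similarityEdge" yields two components instead of one.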
+
+### CorrelationClustering algorithm
+```json
+{
+  "clusteringMethod":"CORRELATION_CLUSTERING",
+  "epsilon":"0.9",
+  "isEdgesBiDirected":false,
+  "clusteringOutputType":"GRAPH",
+  "maxIteration":"MAX_VALUE"
+}
+```
+
+### LimitedCorrelationClustering algorithm
+```json
+{
+  "clusteringMethod":"LIMITED_CORRELATION_CLUSTERING",
+  "epsilon":"0.9",
+  "centerType":"graph source label for vertices allowed to be cluster center",
+  "isEdgesBiDirected":false,
+  "clusteringOutputType":"GRAPH",
+  "maxIteration":"MAX_VALUE"
+}
+```
+
+### MergeCenter algorithm
+```json
+{
+  "clusteringMethod":"MERGE_CENTER",
+  "prioritySelection":"MIN",
+  "simDegMergeThreshold":"0.5",
+  "isEdgesBiDirected":false,
+  "clusteringOutputType":"GRAPH",
+  "maxIteration":"MAX_VALUE"
+}
+```
+
+### Multi-Source Clean-Dirty Hierarchical Affinity Propagation algorithm (MSCD-HAP)
+
+```json
+{
+  "clusteringMethod":"MSCD_HAP",
+  "isEdgesBiDirected": false,
+  "clusteringOutputType": "GRAPH",
+  "maxIteration": "MAX_VALUE",
+  "hapConfig": {
+    "maxApIteration": 20000,
+    "maxAdaptionIteration": 150,
+    "convergenceIter": 15,
+    "dampingFactor": 0.5,
+    "dampingAdaptionStep": 0.1,
+    "allSameSimClusteringThreshold": 0.7,
+    "noiseDecimalPlace": 3,
+    "allSourcesClean": false,
+    "sourceDirtinessVertexProperty": "isSrcDirty",
+    "cleanSources": [ "ebay.com", "amazon.com" ],
+    "maxPartitionSize": 1000,
+    "maxHierarchyDepth": 10,
+    "hapExemplarAssignmentStrategy": "HUNGARIAN", // or "HIGHEST_SIMILARITY"
+    "preferenceConfig": {
+      "preferenceUseMinSimilarityDirtySrc": true,
+      "preferenceUseMinSimilarityCleanSrc": false,
+      "preferenceFixValueDirtySrc": -1,
+      "preferenceFixValueCleanSrc": -1,
+      "preferencePercentileDirtySrc": -1,
+      "preferencePercentileCleanSrc": 30,
+      "preferenceAdaptionStep": 0.05
+    }
+  }
+}
+```
+For the preference config, a value of -1 marks the parameter as disabled (false for MinSimilarity). There must be exactly one parameter enabled for CleanSrc and one for DirtySrc. Noise can be disabled by setting noiseDecimalPlace to -1.
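
The "exactly one enabled parameter per source kind" rule can be checked mechanically. The Python sketch below is illustrative only: `enabled_preferences` and `validate_preference_config` are hypothetical helpers, and treating a missing key as disabled is an assumption.

```python
def enabled_preferences(preference_config, source_kind):
    """List the preference parameters enabled for "CleanSrc" or "DirtySrc"."""
    enabled = []
    # MinSimilarity is a boolean switch: false means disabled.
    if preference_config.get(f"preferenceUseMinSimilarity{source_kind}", False):
        enabled.append("UseMinSimilarity")
    # FixValue and Percentile are numeric: -1 means disabled.
    for name in ("FixValue", "Percentile"):
        if preference_config.get(f"preference{name}{source_kind}", -1) != -1:
            enabled.append(name)
    return enabled


def validate_preference_config(preference_config):
    """Raise ValueError unless exactly one parameter is enabled per source kind."""
    for source_kind in ("CleanSrc", "DirtySrc"):
        enabled = enabled_preferences(preference_config, source_kind)
        if len(enabled) != 1:
            raise ValueError(
                f"{source_kind}: expected exactly one enabled parameter, got {enabled}"
            )


preference_config = {
    "preferenceUseMinSimilarityDirtySrc": True,
    "preferenceUseMinSimilarityCleanSrc": False,
    "preferenceFixValueDirtySrc": -1,
    "preferenceFixValueCleanSrc": -1,
    "preferencePercentileDirtySrc": -1,
    "preferencePercentileCleanSrc": 30,
    "preferenceAdaptionStep": 0.05,
}

validate_preference_config(preference_config)  # passes for the values shown above
print(enabled_preferences(preference_config, "CleanSrc"))
print(enabled_preferences(preference_config, "DirtySrc"))
```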
+
+### Multi-Source Clean-Dirty Sparse Affinity Propagation algorithm (MSCD-AP)
+Two implementations are available:
+- MSCD_AP_SPARSE_GELLY = Sparse MSCD AP Gelly implementation (slow)
+- MSCD_AP_SPARSE_DS = Sparse MSCD AP DataSet-API implementation (very slow, not recommended)
+
+```json
+{
+  "clusteringMethod":"MSCD_AP_SPARSE_GELLY",
+  "isEdgesBiDirected": false,
+  "clusteringOutputType": "GRAPH",
+  "maxIteration": "MAX_VALUE",
+  "apConfig": {
+    "maxApIteration": 20000,
+    "maxAdaptionIteration": 150,
+    "convergenceIter": 15,
+    "dampingFactor": 0.5,
+    "dampingAdaptionStep": 0.1,
+    "allSameSimClusteringThreshold": 0.7,
+    "noiseDecimalPlace": 3,
+    "allSourcesClean": false,
+    "sourceDirtinessVertexProperty": "isSrcDirty",
+    "cleanSources": [ "ebay.com", "amazon.com" ],
+    "preferenceConfig": {
+      "preferenceUseMinSimilarityDirtySrc": true,
+      "preferenceUseMinSimilarityCleanSrc": false,
+      "preferenceFixValueDirtySrc": -1,
+      "preferenceFixValueCleanSrc": -1,
+      "preferencePercentileDirtySrc": -1,
+      "preferencePercentileCleanSrc": 30,
+      "preferenceAdaptionStep": 0.05
+    }
+  }
+}
+```
+For the preference config, a value of -1 marks the parameter as disabled (false for MinSimilarity). There must be exactly one parameter enabled for CleanSrc and one for DirtySrc. Noise can be disabled by setting noiseDecimalPlace to -1.
+
+### Star algorithm
+```json
+{
+  "clusteringMethod":"STAR",
+  "prioritySelection":"MIN",
+  "starType":"ONE", // or "TWO"
+  "isEdgesBiDirected":false,
+  "clusteringOutputType":"GRAPH",
+  "maxIteration":"MAX_VALUE"
+}
+```
+
+## Cluster postprocessing
+
+This is an overview of all available cluster postprocessing algorithms.
+
+### OverlapResolveNoMerge algorithm
+```json
+{
+  "postprocessingMethod":"OVERLAP_RESOLVE_NO_MERGE",
+  "delta":"0.5",
+  "runPhase2":false
+}
+```
+-----------------
+[Back](https://git.informatik.uni-leipzig.de/dbs/FAMER/wikis/home)
\ No newline at end of file