Both the `clustering` entry and the `postprocessing` entry are optional, so clustering and postprocessing can be run separately. The overall structure may therefore look like this:
```json
{
"clustering":{
...
}
}
```
or this:
```json
{
"postprocessing":{
...
}
}
```
## Helper structures
### ClusteringOutputType
A clustering algorithm can return different types of clustered graph:
```json
{
GRAPH,
GRAPH_COLLECTION,
VERTEX_SET
}
```
`GRAPH` is a `LogicalGraph` in which edges between two different clusters are kept. `GRAPH_COLLECTION` is also a `LogicalGraph`, but edges between two different clusters are removed. `VERTEX_SET` is a `LogicalGraph` consisting only of the vertices of the clustered graph.
**So all clustering algorithms take a `LogicalGraph` as input and return a `LogicalGraph` as output.**
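The difference between the three output types comes down to which edges survive. The following is a minimal sketch of that rule, not the library's implementation: `ClusteredEdge` and `OutputTypeSketch` are hypothetical stand-ins, the enum is a local copy of the values listed above, and each edge is assumed to already know the cluster ids of its two endpoints.

```java
import java.util.ArrayList;
import java.util.List;

// Local copy of the values documented above, for illustration only.
enum ClusteringOutputType { GRAPH, GRAPH_COLLECTION, VERTEX_SET }

// Hypothetical edge representation: only the cluster ids of both endpoints.
final class ClusteredEdge {
    final String sourceClusterId;
    final String targetClusterId;

    ClusteredEdge(String sourceClusterId, String targetClusterId) {
        this.sourceClusterId = sourceClusterId;
        this.targetClusterId = targetClusterId;
    }

    boolean isIntraCluster() {
        return sourceClusterId.equals(targetClusterId);
    }
}

final class OutputTypeSketch {
    // Returns the edges that remain in the result for the given output type.
    static List<ClusteredEdge> keptEdges(List<ClusteredEdge> edges, ClusteringOutputType type) {
        List<ClusteredEdge> kept = new ArrayList<>();
        for (ClusteredEdge edge : edges) {
            switch (type) {
                case GRAPH:
                    kept.add(edge); // edges between clusters are allowed
                    break;
                case GRAPH_COLLECTION:
                    if (edge.isIntraCluster()) {
                        kept.add(edge); // edges between clusters are dropped
                    }
                    break;
                case VERTEX_SET:
                    break; // vertices only, no edges at all
            }
        }
        return kept;
    }
}
```

In all three cases the vertices themselves are unchanged; only the edge set differs.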
### PrioritySelection
Cluster center selection based on a comparison of vertex priorities:
```json
{
MIN,
MAX
}
```
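As a rough illustration of what the two values mean, the sketch below picks the winning priority among a set of candidate vertices. How priorities are assigned to vertices and how ties are broken is not described here, so `CenterSelectionSketch` and its method are hypothetical, and the enum is a local copy of the values above.

```java
import java.util.Collections;
import java.util.List;

// Local copy of the values documented above, for illustration only.
enum PrioritySelection { MIN, MAX }

final class CenterSelectionSketch {
    // Returns the priority of the candidate that would be chosen as cluster center.
    static long selectCenterPriority(List<Long> candidatePriorities, PrioritySelection selection) {
        if (candidatePriorities.isEmpty()) {
            throw new IllegalArgumentException("at least one candidate is required");
        }
        return selection == PrioritySelection.MIN
            ? Collections.min(candidatePriorities)  // MIN: lowest priority value wins
            : Collections.max(candidatePriorities); // MAX: highest priority value wins
    }
}
```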
## Parallel clustering algorithms
This is an overview of all available parallel clustering algorithms. Every algorithm takes a "maxIteration" parameter, which in most cases can be set to `Integer.MAX_VALUE`. In the JSON configuration, "maxIteration" is therefore either "MAX_VALUE" or a specific integer value.
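A minimal sketch of how such a value could be resolved to an `int`, assuming the raw JSON value is available as a string. `MaxIterationSketch` and `parseMaxIteration` are hypothetical names, not part of the library.

```java
final class MaxIterationSketch {
    // Maps the literal "MAX_VALUE" to Integer.MAX_VALUE and parses anything else as an integer.
    static int parseMaxIteration(String raw) {
        String value = raw.trim();
        if (value.equals("MAX_VALUE")) {
            return Integer.MAX_VALUE;
        }
        // Throws NumberFormatException if the value is not a valid integer.
        return Integer.parseInt(value);
    }
}
```

For example, `parseMaxIteration("MAX_VALUE")` yields `Integer.MAX_VALUE` and `parseMaxIteration("100")` yields `100`.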
### Center algorithm
```json
{
"clusteringMethod":"CENTER",
"prioritySelection":"MIN",
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
### CLIP algorithm
```json
{
"clusteringMethod":"CLIP",
"clipConfig":{
"delta":"0.0",
"sourceNumber":"1",
"removeSourceConsistentVertices":false,
"simValueCoef":"0.5",
"degreeCoef":"0.2",
"strengthCoef":"0.3"
},
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
### ConnectedComponents algorithm
```json
{
"clusteringMethod":"CONNECTED_COMPONENTS",
"clusterIdPrefix":"cc",//optional
"similarityEdgeLabel":"similarityEdge",//optional
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
Note: "similarityEdgeLabel" defines the label of similarity edges. If a value is given, only edges with this label are considered when building the connected components.
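The effect of the label filter can be sketched with a small single-machine union-find. This is not the parallel implementation used by the library: `LabeledEdge` and `ConnectedComponentsSketch` are hypothetical, and passing `null` as the label is assumed here to mean "consider all edges".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical edge representation: endpoint ids plus an edge label.
final class LabeledEdge {
    final String source;
    final String target;
    final String label;

    LabeledEdge(String source, String target, String label) {
        this.source = source;
        this.target = target;
        this.label = label;
    }
}

final class ConnectedComponentsSketch {
    // Maps every vertex id to a representative vertex id of its component.
    static Map<String, String> components(List<String> vertices, List<LabeledEdge> edges,
                                          String similarityEdgeLabel) {
        Map<String, String> parent = new HashMap<>();
        for (String vertex : vertices) {
            parent.put(vertex, vertex);
        }
        for (LabeledEdge edge : edges) {
            // With a label given, edges carrying any other label are ignored.
            if (similarityEdgeLabel != null && !similarityEdgeLabel.equals(edge.label)) {
                continue;
            }
            String rootA = find(parent, edge.source);
            String rootB = find(parent, edge.target);
            if (!rootA.equals(rootB)) {
                parent.put(rootA, rootB); // merge the two components
            }
        }
        Map<String, String> result = new HashMap<>();
        for (String vertex : vertices) {
            result.put(vertex, find(parent, vertex));
        }
        return result;
    }

    private static String find(Map<String, String> parent, String vertex) {
        parent.putIfAbsent(vertex, vertex);
        String root = vertex;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path directly at the root.
        while (!parent.get(vertex).equals(root)) {
            String next = parent.get(vertex);
            parent.put(vertex, root);
            vertex = next;
        }
        return root;
    }
}
```

With vertices `a`, `b`, `c`, an `a`-`b` edge labeled `similarityEdge` and a `b`-`c` edge with another label, filtering on `similarityEdge` puts `a` and `b` in one component and leaves `c` on its own.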
### MSCD AP algorithm
- MSCD_AP_SPARSE_GELLY = Sparse MSCD AP Gelly implementation (slow)
- MSCD_AP_SPARSE_DS = Sparse MSCD AP DataSet-API implementation (very slow, not recommended)
```json
{
"clusteringMethod":"MSCD_AP_SPARSE_GELLY",
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE",
"apConfig":{
"maxApIteration":20000,
"maxAdaptionIteration":150,
"convergenceIter":15,
"dampingFactor":0.5,
"dampingAdaptionStep":0.1,
"allSameSimClusteringThreshold":0.7,
"noiseDecimalPlace":3,
"allSourcesClean":false,
"sourceDirtinessVertexProperty":"isSrcDirty",
"cleanSources":["ebay.com","amazon.com"],
"preferenceConfig":{
"preferenceUseMinSimilarityDirtySrc":true,
"preferenceUseMinSimilarityCleanSrc":false,
"preferenceFixValueDirtySrc":-1,
"preferenceFixValueCleanSrc":-1,
"preferencePercentileDirtySrc":-1,
"preferencePercentileCleanSrc":30,
"preferenceAdaptionStep":0.05
}
}
}
```
In the preference config, a value of -1 marks a parameter as disabled (for the MinSimilarity flags, `false` does). Exactly one parameter must be enabled for CleanSrc and exactly one for DirtySrc. Noise can be disabled by setting "noiseDecimalPlace" to -1.
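The "exactly one enabled" rule can be checked before a job is started. The sketch below is only an illustration of that rule: `PreferenceConfigSketch` is a hypothetical helper, and the parameter types (`boolean`, `double`, `int`) are assumptions inferred from the example values above.

```java
final class PreferenceConfigSketch {
    // Counts how many of the three preference parameters are enabled for one source kind.
    static int enabledCount(boolean useMinSimilarity, double fixValue, int percentile) {
        int enabled = 0;
        if (useMinSimilarity) {
            enabled++; // the flag is enabled by true
        }
        if (fixValue != -1) {
            enabled++; // -1 marks the fixed value as disabled
        }
        if (percentile != -1) {
            enabled++; // -1 marks the percentile as disabled
        }
        return enabled;
    }

    // Valid only if exactly one parameter is enabled for DirtySrc and exactly one for CleanSrc.
    static boolean isValid(boolean minSimDirty, double fixDirty, int percentileDirty,
                           boolean minSimClean, double fixClean, int percentileClean) {
        return enabledCount(minSimDirty, fixDirty, percentileDirty) == 1
            && enabledCount(minSimClean, fixClean, percentileClean) == 1;
    }
}
```

The example configuration above passes this check: DirtySrc uses only the MinSimilarity flag, and CleanSrc uses only the percentile (30).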
### Star algorithm
```json
{
"clusteringMethod":"STAR",
"prioritySelection":"MIN",
"starType":"ONE",//or "TWO"
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
## Cluster postprocessing
This is an overview of all available cluster postprocessing algorithms.