Update Home/Configuration/Clustering Configuration (JSON), authored by Matthias Taeschner
## Overall structure
The overall structure of the JSON config for clustering and cluster postprocessing looks like this:
```json
{
"clustering":{
"clusteringMethod":"CENTER",
"prioritySelection":"MIN",
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
},
"postprocessing":{
"postprocessingMethod":"OVERLAP_RESOLVE_NO_MERGE",
"delta":"0.5",
"runPhase2":false
}
}
```
Both the clustering entry and the postprocessing entry are optional, allowing clustering and postprocessing to be performed separately. So the overall structure may look like this:
```json
{
"clustering":{
...
}
}
```
or this:
```json
{
"postprocessing":{
...
}
}
```
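Because either entry may be missing, whatever reads the file has to check which stages are configured. The following is a minimal Python sketch of that check, not FAMER code; the function name `configured_stages` is hypothetical and only the two top-level keys come from the structure above.

```python
import json


def configured_stages(config_text):
    """Return the top-level stages present in a config string.

    Hypothetical helper: the result is listed in the order used by the
    full structure shown above ("clustering", then "postprocessing").
    """
    config = json.loads(config_text)
    return [key for key in ("clustering", "postprocessing") if key in config]


# A config that only requests postprocessing.
example = """
{
  "postprocessing": {
    "postprocessingMethod": "OVERLAP_RESOLVE_NO_MERGE",
    "delta": "0.5",
    "runPhase2": false
  }
}
"""
print(configured_stages(example))
```

Note that Python's `json` module rejects `//` comments, so the annotated examples further down would need their comments removed before being parsed this way.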
## Helper structures
### ClusteringOutputType
A clustering algorithm can return different types of clustered graphs.
```json
{
GRAPH,
GRAPH_COLLECTION,
VERTEX_SET
}
```
`GRAPH` is a `LogicalGraph` in which edges between two different clusters are allowed. `GRAPH_COLLECTION` is also a `LogicalGraph`, but edges between two different clusters are removed. `VERTEX_SET` is a `LogicalGraph` consisting only of the vertices of the clustered graph.
**So all clustering algorithms take a `LogicalGraph` as input and return a `LogicalGraph` as output.**
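The difference between the three output types can be illustrated on a toy graph. The sketch below is an assumption-laden illustration in Python, not the FAMER implementation: a plain dict and a list of pairs stand in for a `LogicalGraph`, each vertex is assumed to belong to exactly one cluster, and `apply_output_type` is a hypothetical name.

```python
def apply_output_type(vertex_clusters, edges, output_type):
    """Return (vertex ids, kept edges) for the given output type.

    vertex_clusters: dict mapping vertex id -> cluster id (toy stand-in)
    edges: list of (source, target) pairs
    """
    if output_type == "GRAPH":
        # Edges between different clusters are allowed, so keep everything.
        kept = list(edges)
    elif output_type == "GRAPH_COLLECTION":
        # Edges between different clusters are removed.
        kept = [(s, t) for s, t in edges
                if vertex_clusters[s] == vertex_clusters[t]]
    elif output_type == "VERTEX_SET":
        # Only the vertices remain.
        kept = []
    else:
        raise ValueError(f"unknown clusteringOutputType: {output_type}")
    return sorted(vertex_clusters), kept


vertex_clusters = {"a": "c1", "b": "c1", "c": "c2"}
edges = [("a", "b"), ("b", "c")]  # ("b", "c") crosses cluster borders
for output_type in ("GRAPH", "GRAPH_COLLECTION", "VERTEX_SET"):
    print(output_type, apply_output_type(vertex_clusters, edges, output_type))
```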
### PrioritySelection
Cluster centers are selected by comparing vertex priorities:
```json
{
MIN,
MAX
}
```
## Parallel clustering algorithms
This is an overview of all available parallel clustering algorithms. All algorithms use "maxIteration", which can usually be set to `Integer.MAX_VALUE`. In the JSON configuration, "maxIteration" can therefore be set either to "MAX_VALUE" or to a specific integer value.
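As a rough illustration of the two accepted forms, the following Python sketch maps a "maxIteration" value to an integer. It is not the FAMER parser; `resolve_max_iteration` is a hypothetical name, and accepting both `100` and `"100"` is an assumption made here because the examples quote their values.

```python
# Value of Integer.MAX_VALUE in Java (largest 32-bit signed integer).
JAVA_INTEGER_MAX_VALUE = 2**31 - 1


def resolve_max_iteration(raw):
    """Map the JSON value of "maxIteration" to an integer."""
    if raw == "MAX_VALUE":
        return JAVA_INTEGER_MAX_VALUE
    # Assumption: a specific value may be given as a number or a numeric string.
    return int(raw)


print(resolve_max_iteration("MAX_VALUE"))
print(resolve_max_iteration("50"))
```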
### Center algorithm
```json
{
"clusteringMethod":"CENTER",
"prioritySelection":"MIN",
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
### CLIP algorithm
```json
{
"clusteringMethod":"CLIP",
"clipConfig":{
"delta":"0.0",
"sourceNumber":"1",
"removeSourceConsistentVertices":false,
"simValueCoef":"0.5",
"degreeCoef":"0.2",
"strengthCoef":"0.3"
},
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
### ConnectedComponents algorithm
```json
{
"clusteringMethod":"CONNECTED_COMPONENTS",
"clusterIdPrefix":"cc", //optional
"similarityEdgeLabel":"similarityEdge", //optional
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
Note: "similarityEdgeLabel" defines the label of similarity edges. If a value is given, only edges with this label are considered when building the connected components.
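The effect of the label filter can be sketched with a small union-find in Python. This is a sequential illustration only, not the parallel FAMER implementation; the function name, the `(source, target, label)` edge tuples, and the behaviour of using all edges when no label is given are assumptions.

```python
def connected_components(vertex_ids, edges, similarity_edge_label=None):
    """Group vertices into connected components.

    edges: list of (source, target, label) tuples.
    If similarity_edge_label is set, edges with another label are ignored.
    """
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Follow parent links to the root, shortening the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for source, target, label in edges:
        if similarity_edge_label is not None and label != similarity_edge_label:
            continue  # not a similarity edge, skip it
        parent[find(source)] = find(target)

    groups = {}
    for v in vertex_ids:
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(group) for group in groups.values())


vertex_ids = [1, 2, 3, 4]
edges = [(1, 2, "similarityEdge"), (2, 3, "other"), (3, 4, "similarityEdge")]
print(connected_components(vertex_ids, edges, "similarityEdge"))
print(connected_components(vertex_ids, edges))
```

With the label set, the edge labelled "other" no longer joins the two pairs, so the toy graph splits into two components instead of one.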
### CorrelationClustering algorithm
```json
{
"clusteringMethod":"CORRELATION_CLUSTERING",
"epsilon":"0.9",
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
### LimitedCorrelationClustering algorithm
```json
{
"clusteringMethod":"LIMITED_CORRELATION_CLUSTERING",
"epsilon":"0.9",
"centerType":"graph source label for vertices allowed to be cluster center",
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
### MergeCenter algorithm
```json
{
"clusteringMethod":"MERGE_CENTER",
"prioritySelection":"MIN",
"simDegMergeThreshold":"0.5",
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
### Multi-Source Clean-Dirty Hierarchical Affinity Propagation algorithm (MSCD-HAP)
```json
{
"clusteringMethod":"MSCD_HAP",
"isEdgesBiDirected": false,
"clusteringOutputType": "GRAPH",
"maxIteration": "MAX_VALUE",
"hapConfig": {
"maxApIteration": 20000,
"maxAdaptionIteration": 150,
"convergenceIter": 15,
"dampingFactor": 0.5,
"dampingAdaptionStep": 0.1,
"allSameSimClusteringThreshold": 0.7,
"noiseDecimalPlace": 3,
"allSourcesClean": false,
"sourceDirtinessVertexProperty": "isSrcDirty",
"cleanSources": [ "ebay.com", "amazon.com" ],
"maxPartitionSize": 1000,
"maxHierarchyDepth": 10,
"hapExemplarAssignmentStrategy": "HUNGARIAN", // OR: HIGHEST_SIMILARITY
"preferenceConfig": {
"preferenceUseMinSimilarityDirtySrc": true,
"preferenceUseMinSimilarityCleanSrc": false,
"preferenceFixValueDirtySrc": -1,
"preferenceFixValueCleanSrc": -1,
"preferencePercentileDirtySrc": -1,
"preferencePercentileCleanSrc": 30,
"preferenceAdaptionStep": 0.05
}
}
}
```
For the preference config, a value of -1 marks the parameter as disabled (false for MinSimilarity). There must be exactly one parameter enabled for CleanSrc and one for DirtySrc. Noise can be disabled by setting noiseDecimalPlace to -1.
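The "exactly one enabled parameter" rule can be checked before a job is submitted. The following Python sketch is one possible check, not FAMER code; the function names are hypothetical, and treating a missing key as disabled is an assumption.

```python
def enabled_preference_options(preference_config, source_kind):
    """List the enabled preference options for "DirtySrc" or "CleanSrc"."""
    enabled = []
    # MinSimilarity is a boolean flag: false means disabled.
    if preference_config.get(f"preferenceUseMinSimilarity{source_kind}", False):
        enabled.append("UseMinSimilarity")
    # FixValue and Percentile are numeric: -1 means disabled.
    for option in ("FixValue", "Percentile"):
        if preference_config.get(f"preference{option}{source_kind}", -1) != -1:
            enabled.append(option)
    return enabled


def check_preference_config(preference_config):
    """Raise ValueError unless exactly one option is enabled per source kind."""
    for source_kind in ("DirtySrc", "CleanSrc"):
        enabled = enabled_preference_options(preference_config, source_kind)
        if len(enabled) != 1:
            raise ValueError(
                f"{source_kind}: expected exactly one enabled option, got {enabled}"
            )


# The preferenceConfig from the example above.
preference_config = {
    "preferenceUseMinSimilarityDirtySrc": True,
    "preferenceUseMinSimilarityCleanSrc": False,
    "preferenceFixValueDirtySrc": -1,
    "preferenceFixValueCleanSrc": -1,
    "preferencePercentileDirtySrc": -1,
    "preferencePercentileCleanSrc": 30,
    "preferenceAdaptionStep": 0.05,
}
check_preference_config(preference_config)  # passes without raising
```

In the example config, the dirty sources use the minimum similarity and the clean sources use the 30th percentile, so the check passes.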
### Multi-Source Clean-Dirty Sparse Affinity Propagation algorithm (MSCD-AP)
Two implementations are available:
- MSCD_AP_SPARSE_GELLY = Sparse MSCD AP Gelly implementation (slow)
- MSCD_AP_SPARSE_DS = Sparse MSCD AP DataSet-API implementation (very slow, not recommended)
```json
{
"clusteringMethod":"MSCD_AP_SPARSE_GELLY",
"isEdgesBiDirected": false,
"clusteringOutputType": "GRAPH",
"maxIteration": "MAX_VALUE",
"apConfig": {
"maxApIteration": 20000,
"maxAdaptionIteration": 150,
"convergenceIter": 15,
"dampingFactor": 0.5,
"dampingAdaptionStep": 0.1,
"allSameSimClusteringThreshold": 0.7,
"noiseDecimalPlace": 3,
"allSourcesClean": false,
"sourceDirtinessVertexProperty": "isSrcDirty",
"cleanSources": [ "ebay.com", "amazon.com" ],
"preferenceConfig": {
"preferenceUseMinSimilarityDirtySrc": true,
"preferenceUseMinSimilarityCleanSrc": false,
"preferenceFixValueDirtySrc": -1,
"preferenceFixValueCleanSrc": -1,
"preferencePercentileDirtySrc": -1,
"preferencePercentileCleanSrc": 30,
"preferenceAdaptionStep": 0.05
}
}
}
```
For the preference config, a value of -1 marks the parameter as disabled (false for MinSimilarity). There must be exactly one parameter enabled for CleanSrc and one for DirtySrc. Noise can be disabled by setting noiseDecimalPlace to -1.
### Star algorithm
```json
{
"clusteringMethod":"STAR",
"prioritySelection":"MIN",
"starType":"ONE", // OR: TWO
"isEdgesBiDirected":false,
"clusteringOutputType":"GRAPH",
"maxIteration":"MAX_VALUE"
}
```
## Cluster postprocessing
This is an overview of all available cluster postprocessing algorithms.
### OverlapResolveNoMerge algorithm
```json
{
"postprocessingMethod":"OVERLAP_RESOLVE_NO_MERGE",
"delta":"0.5",
"runPhase2":false
}
```
-----------------
[Back](https://git.informatik.uni-leipzig.de/dbs/FAMER/wikis/home)