# PRIMAT: Private Matching Toolbox

<img src="primat_logo.png" width="250">


PRIMAT is an open-source (ALv2) toolbox for the definition and execution of privacy-preserving record linkage (PPRL) workflows.
It offers modules for data owners and the linkage unit that provide state-of-the-art PPRL methods,
including Bloom-filter-based encoding and hardening techniques, LSH-based blocking, post-processing (clustering) and more.


[PRIMAT](https://dl.acm.org/citation.cfm?doid=3352063.3360392) is developed by the [Database Group](https://dbs.uni-leipzig.de/research/projects/pper_big_data) of the University of Leipzig, Germany.

## Using PRIMAT

To use PRIMAT in your project, add the following Maven dependency to your build configuration:

```xml
<dependency>
    <groupId>de.uni-leipzig.dbs.pprl</groupId>
    <artifactId>primat-data-owner</artifactId>
    <version>1.0.3</version>
</dependency>
```

for data owner components, including pre-processing and encoding methods, or

```xml
<dependency>
    <groupId>de.uni-leipzig.dbs.pprl</groupId>
    <artifactId>primat-linkage-unit</artifactId>
    <version>1.0.3</version>
</dependency>
```

for linkage unit components, including linkage and post-processing (clustering) methods.

## PRIMAT Modules

- `primat-common` - Contains the shared data model and various utility functions, e.g., for input file handling, hashing, and feature extraction.

- `primat-data-owner` - Contains typical pre-processing functions as well as techniques to encode/mask records for PPRL.

- `primat-linkage-unit` - Provides functionality for batch and incremental linkage workflows, including blocking, similarity calculation, classification, post-processing (clustering) and evaluation.

- `primat-examples` - Contains example workflows showing use cases for PRIMAT. 

- `primat-analysis` - Contains tools for analyzing records and error types.

## Privacy-preserving Record Linkage

- Task of identifying records in different databases that refer to the same person
- Protection of sensitive personal information
- Applications in medicine & healthcare, national security and marketing analysis

<img src="https://user-images.githubusercontent.com/20927034/118960531-acfb8e00-b963-11eb-894e-ecffafbd8f87.png" width="500">

### Key Challenges

- Guarantee privacy by minimizing disclosure risk
- Scalability to millions of records
- High linkage quality

## PRIMAT: Overview

- PPRL tool covering the entire PPRL life-cycle
- Flexible definition and execution of PPRL workflows
- Comparative evaluation of PPRL approaches
- Modules for both the data owners and the trusted linkage unit

<img src="https://user-images.githubusercontent.com/20927034/118961272-707c6200-b964-11eb-9ed9-8264e04cc840.png" width="500">

### State-of-the-art PPRL Methods

#### Bloom filter encodings & hardening techniques

<img src="https://user-images.githubusercontent.com/20927034/118971359-9c511500-b96f-11eb-8f41-986724c7db92.png" width="400">
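
To illustrate the general idea behind Bloom-filter-based encoding, here is a minimal, self-contained sketch. It does not use PRIMAT's API: the class `BloomFilterSketch` and all of its methods are hypothetical names chosen for this example. It assumes bigram tokenization, double hashing with SHA-1 and MD5, and the Dice coefficient as similarity measure. A real deployment would typically use keyed hash functions and additional hardening.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only; names are hypothetical and not part of PRIMAT.
class BloomFilterSketch {

    // Split a lower-cased, padded value into overlapping bigrams.
    static Set<String> bigrams(String value) {
        String padded = "_" + value.toLowerCase() + "_";
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 2));
        }
        return grams;
    }

    // Hash a token to a non-negative int using the first four digest bytes.
    static int hash(String algorithm, String token) {
        try {
            byte[] d = MessageDigest.getInstance(algorithm)
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                    | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            return h & 0x7FFFFFFF;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Map every bigram to k positions in a filter of m bits
    // using double hashing: (h1 + i * h2) mod m.
    static BitSet encode(String value, int m, int k) {
        BitSet bits = new BitSet(m);
        for (String gram : bigrams(value)) {
            long h1 = hash("SHA-1", gram);
            long h2 = hash("MD5", gram);
            for (int i = 0; i < k; i++) {
                bits.set((int) ((h1 + i * h2) % m));
            }
        }
        return bits;
    }

    // Dice coefficient of two bit vectors: 2 * |a AND b| / (|a| + |b|).
    static double dice(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        int total = a.cardinality() + b.cardinality();
        return total == 0 ? 0.0 : 2.0 * common.cardinality() / total;
    }

    public static void main(String[] args) {
        BitSet x = encode("smith", 1024, 10);
        BitSet y = encode("smyth", 1024, 10);
        System.out.println("Dice similarity: " + dice(x, y));
    }
}
```

Similar values share most of their bigrams and therefore most of their set bits, so the linkage unit can compare encoded records without seeing the plaintext.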

#### Fast & private blocking/filtering techniques

<img src="https://user-images.githubusercontent.com/20927034/118971495-ca365980-b96f-11eb-88fb-b7478288a2dc.png" width="400">
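
As a rough illustration of LSH-based blocking on bit vectors, the following sketch implements a simple Hamming LSH scheme: each hash table samples a fixed set of bit positions, and records whose bits agree on all sampled positions of at least one table land in the same block. This is not PRIMAT's implementation. The class `HammingLshSketch`, its parameters and the seeded `Random` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only; names are hypothetical and not part of PRIMAT.
class HammingLshSketch {

    // positions[t] holds the sampled bit indices of hash table t.
    final int[][] positions;

    HammingLshSketch(int filterLength, int tables, int keyLength, long seed) {
        Random rnd = new Random(seed);
        positions = new int[tables][keyLength];
        for (int[] table : positions) {
            for (int i = 0; i < keyLength; i++) {
                table[i] = rnd.nextInt(filterLength);
            }
        }
    }

    // Blocking key of a record for one table: table index plus the sampled bits.
    String key(BitSet record, int table) {
        StringBuilder sb = new StringBuilder().append(table).append(':');
        for (int p : positions[table]) {
            sb.append(record.get(p) ? '1' : '0');
        }
        return sb.toString();
    }

    // Assign every record (identified by its list index) to one block per table.
    Map<String, List<Integer>> block(List<BitSet> records) {
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int id = 0; id < records.size(); id++) {
            for (int t = 0; t < positions.length; t++) {
                blocks.computeIfAbsent(key(records.get(id), t), k -> new ArrayList<>())
                      .add(id);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(16);
        a.set(1); a.set(5); a.set(9);
        BitSet b = (BitSet) a.clone();
        b.flip(12); // one differing bit
        HammingLshSketch lsh = new HammingLshSketch(16, 4, 8, 42L);
        System.out.println(lsh.block(List.of(a, b)));
    }
}
```

Only records that share a block are compared later, which reduces the number of comparisons. Longer keys make blocks more selective, while more tables raise the chance that true matches meet in at least one block.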

#### Post-processing methods for one-to-one link restriction

<img src="https://user-images.githubusercontent.com/20927034/118971617-f05bf980-b96f-11eb-8bcc-d1d0a4a0114e.png" width="400">
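
One simple way to enforce a one-to-one link restriction is a greedy selection: sort candidate links by similarity and accept a link only if neither of its records has been linked yet. The sketch below shows this strategy under the assumption of two duplicate-free sources. It is not necessarily the method PRIMAT uses, and the class `OneToOneSketch` and its `Link` type are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names are hypothetical and not part of PRIMAT.
class OneToOneSketch {

    // A candidate link between a record of source A and a record of source B.
    static final class Link {
        final String a;
        final String b;
        final double similarity;

        Link(String a, String b, double similarity) {
            this.a = a;
            this.b = b;
            this.similarity = similarity;
        }

        @Override
        public String toString() {
            return a + "-" + b + " (" + similarity + ")";
        }
    }

    // Greedily keep the most similar links so that each record occurs at most once.
    static List<Link> select(List<Link> candidates, double threshold) {
        List<Link> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Link l) -> l.similarity).reversed());

        Set<String> usedA = new HashSet<>();
        Set<String> usedB = new HashSet<>();
        List<Link> result = new ArrayList<>();
        for (Link link : sorted) {
            if (link.similarity < threshold) {
                break; // all remaining links are below the threshold
            }
            if (!usedA.contains(link.a) && !usedB.contains(link.b)) {
                usedA.add(link.a);
                usedB.add(link.b);
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Link> candidates = List.of(
                new Link("a1", "b1", 0.95),
                new Link("a1", "b2", 0.90),
                new Link("a2", "b1", 0.85),
                new Link("a2", "b2", 0.80));
        System.out.println(select(candidates, 0.7));
    }
}
```

The greedy strategy is easy to follow but not guaranteed to maximize the overall similarity of the selected links. Assignment-based methods can do better at higher cost.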


### Functional Overview 

|Component/Module | Function/Feature | Status |
|-----------------|------------------|--------|
| Data generator & corruptor | - Data generation<br> - Data corruption | Integration outstanding<br>Planned |
| Data cleaning | - Split/merge/remove attributes<br>- Replace/remove unwanted values<br>- OCR transformation | Implemented<br>Implemented<br>Implemented |
| Encoding | - Bloom filter encoding<br>- Bloom filter hardening techniques<br>- Support of alternative encoding schemes| Implemented<br>Implemented<br>Partially implemented |
| Blocking | - Standard blocking<br> - LSH-based blocking| Implemented<br>Implemented |
| Matching | - Threshold-based classification<br>- Post-processing<br>- Multi-threaded execution<br>- Distributed matching<br>- Multi-Party support, match cluster management<br>- Incremental Matching | Implemented<br>Implemented<br>Partially implemented<br>Integration outstanding<br>Implemented<br>Implemented |
| Evaluation | - Measures for assessing quality & scalability<br>- Masked match result visualization | Implemented<br>Integration outstanding |

### Requirements

- Java 11+ 
- Maven
- Ubuntu (recommended)
- PostgreSQL (for incremental matching)

#### Database Setup

- Required for incremental matching 
- Create a new PostgreSQL database named `primat`
- Edit the file `/primat-linkage-unit/src/main/resources/META-INF/persistence.xml` and set the username and password fields to match your PostgreSQL configuration

## Future Plans

We plan to gradually add new features related to our ongoing research.

## Contributors

- Florens Rohde
- Victor Christen
- Ziad Sehili
- Thomas Hoppe
- Duc Dung Dao
- Marcel Gladbach