# (Deprecated) Now moved to [https://github.com/dbpedia/marvin-config](https://github.com/dbpedia/marvin-config)

# MARVIN-config

MARVIN is the release bot that performs automated monthly DBpedia releases on three different servers for the generic, mappings, wikidata, and abstract extractions.
[This repository](https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config) can be used to fork the architecture for creating extensions, developing new extractors or debugging old ones. 
Fixes and patches will be deployed on the DBpedia servers each month via a fresh `git clone` from the `master` branch of the [DIEF (DBpedia Information Extraction Framework)](https://github.com/dbpedia/extraction-framework/). 

## Contributions & License
All scripts and config files in this repo are CC-0 (Public Domain). 
We accept pull requests to improve the config files; all contributions will be merged as CC-0.
Marvin-config is intended to bootstrap developing fixes for the DIEF.

## Run a MARVIN extraction

Implementation note: the scripts create a folder `marvin-extraction` that holds the code, results, and logs.

```
# check out this repo with all config files
git clone https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
cd marvin-config

# (optional) delete previous versions of the DIEF
rm -rf marvin-extraction/extraction-framework

# (~10 minutes) install the DIEF in marvin-extraction/extraction-framework
# if you installed it already you can run `git pull && mvn clean install` to update
./setup-or-reset-dief.sh

# test run Romanian extraction, very small
./marvin_extraction_run.sh test
```

To run the other extractions, use any of the following:
```
# around 4-7 days
./marvin_extraction_run.sh generic
# around 4-7 days
./marvin_extraction_run.sh mappings
# around 7-14 days
./marvin_extraction_run.sh wikidata
```
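
Because these runs take several days, it can help to validate the argument before starting one. The wrapper below is a minimal sketch and not part of this repository: the function name `run_extraction` and the `MARVIN_RUN_SCRIPT` override are assumptions introduced for illustration. Only the four arguments (`test`, `generic`, `mappings`, `wikidata`) are taken from the commands above.

```bash
# Hypothetical wrapper (not part of marvin-config): reject unknown extraction
# groups before a multi-day run is started.
run_extraction() {
  local runner
  # MARVIN_RUN_SCRIPT is an assumed override, useful for dry runs.
  runner="${MARVIN_RUN_SCRIPT:-./marvin_extraction_run.sh}"
  case "$1" in
    test|generic|mappings|wikidata)
      "$runner" "$1"
      ;;
    *)
      echo "usage: run_extraction {test|generic|mappings|wikidata}" >&2
      return 2
      ;;
  esac
}
```

For a dry run, `MARVIN_RUN_SCRIPT=echo run_extraction generic` only echoes the argument instead of starting an extraction.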

To specify a different dump date:
```
# Set it in extractionConfiguration/{download|extraction}.*.properties
dump-date=20200301
```
If the specified dump date is newer than the current local dumps, adding it to `extractionConfiguration/download.*.properties` is enough.
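
Editing several property files by hand is error-prone, so the change could be scripted. The following is a sketch under stated assumptions, not a script shipped with this repository: the helper name `set_dump_date` is hypothetical, and it assumes each file holds at most one uncommented `dump-date=` line. It updates both the `download.*.properties` and `extraction.*.properties` files in the given directory.

```bash
# Hypothetical helper (not part of marvin-config): write dump-date=<date> into
# the download.*.properties and extraction.*.properties files of a directory.
set_dump_date() {
  local config_dir new_date file
  config_dir="$1"
  new_date="$2"
  # Accept only eight digits (YYYYMMDD), e.g. 20200301.
  case "$new_date" in
    *[!0-9]*|"") echo "invalid dump date: $new_date" >&2; return 1 ;;
  esac
  [ "${#new_date}" -eq 8 ] || { echo "invalid dump date: $new_date" >&2; return 1; }
  for file in "$config_dir"/download.*.properties "$config_dir"/extraction.*.properties; do
    [ -f "$file" ] || continue   # the glob matched nothing
    if grep -q '^dump-date=' "$file"; then
      # Replace the existing entry; a temp file avoids non-portable `sed -i`.
      sed "s/^dump-date=.*/dump-date=$new_date/" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    else
      # No entry yet: append one (the leading newline guards against a missing final newline).
      printf '\ndump-date=%s\n' "$new_date" >> "$file"
    fi
  done
}
```

Example call from the repository root: `set_dump_date extractionConfiguration 20200301`.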
## Cronjobs

Monthly cronjobs for the Databus group releases, which include the MARVIN and DBpedia (re-)releases. Both jobs start at 00:00 on the 7th of each month:

```
# Full Wikidata
0 0 7 * * /bin/bash -c '/data/marvin-config/release-monthly-cron.sh wikidata' >/dev/null 2>&1

# Full Generic & Mappings
0 0 7 * * /bin/bash -c '/data/marvin-config/release-monthly-cron.sh generic && /data/marvin-config/release-monthly-cron.sh mappings' >/dev/null 2>&1
```

## Acknowledgements
We thank Sören Auer and the Technische Informationsbibliothek (TIB) for providing three servers to run:

* the main DBpedia extraction on a monthly basis 
* community-provided extractors on Wikipedia, Wikidata or other sources 
* enrichment, cleaning and parsing services, so-called [Databus mods](https://github.com/dbpedia/databus-mods/) for open data on the Databus

This contribution by TIB to DBpedia & its community is a great push towards incentivizing Open Data and establishing a global and national research and innovation data infrastructure. 

# Workflow Description

## Update and Run the extraction

To run a generic, mappings, or wikidata extraction, use the following script; it handles all remaining steps.
By default, it creates all folders relative to the directory it is executed from.
To adapt paths, edit them in `functions.sh`.

```bash
./marvin_extraction_run.sh generic   # or: mappings, wikidata
```
### Post-Processing
Some extractions require post-processing. The exact setup can be found in [functions.sh](https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/-/blob/master/functions.sh#L41).
For more information, see [post-processing](http://dev.dbpedia.org/Post-Processing).

## Deploy MARVIN on Databus


The `databus-release.sh` script contains the workflow for renaming and copying the extracted files into a structure readable by the databus-maven-plugin (e.g. artifacts and content variants).
It also defines the parameters used to deploy the MARVIN releases on the Databus.

```bash
./databus-release.sh
```
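
To illustrate the kind of renaming involved, here is a minimal sketch that maps one extraction output file name to a target path. It is not taken from `databus-release.sh`: the helper name `databus_target_path`, the input pattern `<lang>wiki-<YYYYMMDD>-<artifact>.<extension>`, the output layout `<artifact>/<version>/<artifact>_lang=<lang>.<extension>`, and deriving the version from the dump date are all assumptions made for illustration.

```bash
# Hypothetical helper (not taken from databus-release.sh).
# Assumed input name:    <lang>wiki-<YYYYMMDD>-<artifact>.<extension>
# Assumed output layout: <artifact>/<YYYY.MM.DD>/<artifact>_lang=<lang>.<extension>
databus_target_path() {
  local name lang rest date artifact_ext artifact ext version
  name=$(basename "$1")
  lang=${name%%wiki-*}          # text before "wiki-", the language code
  rest=${name#*wiki-}           # "<YYYYMMDD>-<artifact>.<extension>"
  date=${rest%%-*}              # "<YYYYMMDD>"
  artifact_ext=${rest#*-}       # "<artifact>.<extension>"
  artifact=${artifact_ext%%.*}  # artifact name, may contain hyphens
  ext=${artifact_ext#*.}        # everything after the first dot
  # Rewrite YYYYMMDD as YYYY.MM.DD to serve as the version folder.
  version=$(printf '%s' "$date" | sed 's/^\(....\)\(..\)\(..\)$/\1.\2.\3/')
  printf '%s/%s/%s_lang=%s.%s\n' "$artifact" "$version" "$artifact" "$lang" "$ext"
}
```

Under these assumptions, `databus_target_path rowiki-20200301-instance-types.ttl.bz2` prints `instance-types/2020.03.01/instance-types_lang=ro.ttl.bz2`. The real script remains the reference for the actual mapping.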

> **NOTE:** This script still uses absolute paths and depends on the private key of the publisher's WebID.
>
> **TODO:** Refactor `./databus-release.sh`

## [Manual] Run Databus-Derive (clone and parse)
On the respective server, there is a user `marvin-fetch` with access to `/data/derive`, which contains the `pom.xml` of https://github.com/dbpedia/databus-maven-plugin/tree/master/dbpedia.

```
# query to get all versions for derive in XML syntax to paste directly into pom.xml
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?derive WHERE {

    ?dataset dataid:group <https://databus.dbpedia.org/marvin/generic> .
    ?dataset dataid:artifact ?artifact .
    ?dataset dataid:version ?version .
    ?dataset dct:hasVersion "2019.08.30"^^xsd:string .
    BIND (CONCAT("<version>", STR(?artifact), "/${databus.deriveversion}</version>") AS ?derive)
}
ORDER BY ASC(?derive)
```
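
If the artifact IRIs are already available as a plain list, the same `<version>` lines can be produced locally. This is a small sketch mirroring the `CONCAT` in the query above; the function name `derive_version_entries` is hypothetical and not part of this repository. The `${databus.deriveversion}` placeholder is printed literally so that Maven can resolve it later.

```bash
# Hypothetical helper: read artifact IRIs (one per line) from stdin and print
# sorted <version> entries for pom.xml, mirroring the CONCAT in the query.
derive_version_entries() {
  local artifact
  while IFS= read -r artifact; do
    [ -n "$artifact" ] || continue   # skip empty lines
    # Single quotes keep ${databus.deriveversion} literal for Maven.
    printf '<version>%s/${databus.deriveversion}</version>\n' "$artifact"
  done | sort
}
```

It could be fed from a file of artifact IRIs, for example `derive_version_entries < artifacts.txt`, where `artifacts.txt` is an assumed file name.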


```
su marvin-fetch
tmux a -t derive
WHAT=mappings
NEWVERSION=2019.08.30
# prepare
cd /data/derive/databus-maven-plugin/dbpedia/$WHAT
git pull
mvn versions:set -DnewVersion=$NEWVERSION
# run
mvn databus-derive:clone -Ddatabus.deriveversion=$NEWVERSION
```

## [Manual] Pull data to the downloads.dbpedia.org server
Run the `marvin-fetch.sh` script in the `dbpedia` folder of the databus-maven-plugin checkout:

```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia
./marvin-fetch.sh wikidata 2019.08.01

```

## Deploy cleaned files to DBpedia

```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia/mappings
mvn clean 
mvn validate
mvn -T 8 deploy
```