# MARVIN-config

MARVIN is the release bot that runs the automated monthly DBpedia releases on three different servers, covering the generic, mappings, wikidata and abstract extractions.
[This repository](https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config) can be used to fork the architecture for creating extensions, developing new extractors or debugging old ones. 
Fixes and patches will be deployed on the DBpedia servers each month via a fresh `git clone` from the `master` branch of the [DIEF (DBpedia Information Extraction Framework)](https://github.com/dbpedia/extraction-framework/). 

## Contributions & License
All scripts and config files in this repo are CC-0 (Public Domain). 
We accept pull requests that improve the config files; all contributions will be merged as CC-0.
marvin-config is intended to bootstrap the development of fixes for the DIEF.

## Run a MARVIN extraction
Implementation note: the scripts create a folder `marvin-extraction` that holds the code, results and logs.

```
# check out this repo with all config files
git clone https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
cd marvin-config


# (optional) delete previous versions of the DIEF
rm -rf marvin-extraction/extraction-framework

# (~10 minutes) install the DIEF in marvin-extraction/extraction-framework
# if it is already installed, you can update it with `git pull && mvn clean install` instead
./setup-dief.sh

# test run Romanian extraction, very small
./marvin_extraction_run.sh --group=test
```

To run the other extractions, use one of the following commands:
```
# around 4-7 days
./marvin_extraction_run.sh --group=generic
# around 4-7 days
./marvin_extraction_run.sh --group=mappings
# around 7-14 days
./marvin_extraction_run.sh --group=wikidata
```
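
A mistyped group name is easy to miss before a multi-day run. The following is a minimal sketch of a wrapper that checks the name first. `run_group` and the `MARVIN_RUN` variable are hypothetical helpers, not part of marvin-config, and the accepted names are only the groups shown in this README.

```shell
# Sketch: validate the group name, then start the extraction.
# MARVIN_RUN and run_group are hypothetical names used for illustration.
MARVIN_RUN="${MARVIN_RUN:-./marvin_extraction_run.sh}"

run_group() {
    group="$1"
    case "$group" in
        test|generic|mappings|wikidata) ;;   # groups listed in this README
        *)
            echo "unknown group: $group" >&2
            return 2
            ;;
    esac
    "$MARVIN_RUN" --group="$group"
}
```

For example, `run_group test && run_group generic` would start the generic extraction only if the small test run succeeds.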

## Cronjobs

Below is a list of cronjobs we use on the different servers:

```
TODO
```


## Acknowledgements
We thank Sören Auer and the Technische Informationsbibliothek (TIB) for providing three servers to run:

* the main DBpedia extraction on a monthly basis 
* community-provided extractors on Wikipedia, Wikidata or other sources 
* enrichment, cleaning and parsing services, so-called [Databus mods](https://github.com/dbpedia/databus-mods/) for open data on the Databus

This contribution by TIB to DBpedia & its community is a great push towards incentivizing Open Data and establishing a global and national research and innovation data infrastructure. 

# Workflow Description


## Downloading the Wikimedia dumps
TODO

## Update and Run the extraction
TODO

## Deploy MARVIN on Databus
TODO

## [Manual] Run Databus-Derive (clone and parse)
On the respective server, the user `marvin-fetch` has access to `/data/derive`, which contains the pom.xml of https://github.com/dbpedia/databus-maven-plugin/tree/master/dbpedia

```
# query to get all versions for derive in XML syntax, to paste directly into pom.xml
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?derive WHERE {

    ?dataset dataid:group <https://databus.dbpedia.org/marvin/generic> .
    ?dataset dataid:artifact ?artifact .
    ?dataset dataid:version ?version .
    ?dataset dct:hasVersion "2019.08.30"^^xsd:string .
    BIND (CONCAT("<version>", STR(?artifact), "/${databus.deriveversion}</version>") AS ?derive)
}
ORDER BY ASC(?derive)
```
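
The query can also be sent from the command line instead of a web form. This is a sketch only: `build_derive_query`, `fetch_derive_versions` and `DATABUS_SPARQL` are hypothetical names, and the endpoint URL is an assumption that should be replaced with the SPARQL endpoint you normally use.

```shell
# Assumption: adjust this URL to the Databus SPARQL endpoint you actually query.
DATABUS_SPARQL="${DATABUS_SPARQL:-https://databus.dbpedia.org/repo/sparql}"

# Print the derive query for a group (e.g. generic) and a version (e.g. 2019.08.30).
build_derive_query() {
    group="$1"
    version="$2"
    # \${...} keeps the Maven property literal instead of expanding it in the shell
    cat <<EOF
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?derive WHERE {
    ?dataset dataid:group <https://databus.dbpedia.org/marvin/${group}> .
    ?dataset dataid:artifact ?artifact .
    ?dataset dct:hasVersion "${version}"^^xsd:string .
    BIND (CONCAT("<version>", STR(?artifact), "/\${databus.deriveversion}</version>") AS ?derive)
}
ORDER BY ASC(?derive)
EOF
}

# Send the query and print the result as tab-separated values.
fetch_derive_versions() {
    curl -sS --fail \
        -H 'Accept: text/tab-separated-values' \
        --data-urlencode "query=$(build_derive_query "$1" "$2")" \
        "$DATABUS_SPARQL"
}
```

Calling `fetch_derive_versions generic 2019.08.30` would then print one `<version>` line per artifact, ready to be pasted into the pom.xml.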


```
su marvin-fetch
tmux a -t derive
WHAT=mappings
NEWVERSION=2019.08.30
# prepare
cd /data/derive/databus-maven-plugin/dbpedia/$WHAT
git pull
mvn versions:set -DnewVersion=$NEWVERSION
# run
mvn databus-derive:clone -Ddatabus.deriveversion=$NEWVERSION
```
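
Because the version string is passed to Maven twice, a typo in it affects both the pom.xml and the derive run. A minimal sketch of a guard is given here; `valid_version` and `derive_clone` are hypothetical helpers, and the `YYYY.MM.DD` pattern is an assumption inferred from the versions used in this README.

```shell
# Assumption: versions look like 2019.08.30 (YYYY.MM.DD).
valid_version() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{4}\.[0-9]{2}\.[0-9]{2}$'
}

# Run the prepare and run steps for one group, stopping at the first failure.
derive_clone() {
    what="$1"
    newversion="$2"
    if ! valid_version "$newversion"; then
        echo "unexpected version format: $newversion" >&2
        return 2
    fi
    cd "/data/derive/databus-maven-plugin/dbpedia/$what" || return 1
    git pull &&
    mvn versions:set -DnewVersion="$newversion" &&
    mvn databus-derive:clone -Ddatabus.deriveversion="$newversion"
}
```

Inside the tmux session, `derive_clone mappings 2019.08.30` would replace the individual commands.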

## [Manual] Pull data to the downloads.dbpedia.org server
Run the `marvin-fetch.sh` script in the databus/dbpedia folder:

```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia
./marvin-fetch.sh wikidata 2019.08.01

```

## Deploy cleaned files to DBpedia

```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia/mappings
mvn clean 
mvn validate
mvn -T 8 deploy
```
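
When the three Maven steps are pasted as separate lines, a failed `validate` does not stop the following `deploy`. One possible way to chain them is sketched here; `deploy_steps` and the `MVN` variable are hypothetical names, and the default of 8 threads simply mirrors the command above.

```shell
# MVN can be overridden (e.g. MVN=echo) to print the steps instead of running them.
MVN="${MVN:-mvn}"

# Run clean, validate and deploy in order; stop as soon as one step fails.
deploy_steps() {
    threads="${1:-8}"
    "$MVN" clean &&
    "$MVN" validate &&
    "$MVN" -T "$threads" deploy
}
```

Run `deploy_steps` from the group folder, for example after `cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia/mappings`.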