# MARVIN-config

MARVIN is the release bot that produces the automated monthly DBpedia releases. It runs on three different servers and covers the generic, mappings, wikidata and abstract extractions.
The repository at https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config can be used to fork the architecture for creating extensions, developing new extractors or debugging old ones. 
Fixes and patches will be manually deployed via a fresh `git clone` from the `master` branch of the [DIEF (DBpedia Information Extraction Framework)](https://github.com/dbpedia/extraction-framework/). 

## Contributions & License
All scripts and config files in this repo are CC-0 (Public Domain). 
We accept pull requests that improve the config files. All contributions will be merged as CC-0.

## Run a MARVIN extraction

```
git clone https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
cd marvin-config
# delete previous versions of the DIEF
rm -rf marvin-config/extraction-framework
./setup-dief.sh
# test Romanian extraction, very small
./marvin_extraction_run.sh --group=test
```

To run the other extractions, use one of the following:
```
# around 4-7 days
./marvin_extraction_run.sh --group=generic
# around 4-7 days
./marvin_extraction_run.sh --group=mappings
# around 7-14 days
./marvin_extraction_run.sh --group=wikidata
```
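
If more than one group should run on the same machine, the calls above can be chained so that a failed group stops the sequence. The following is only a sketch: the function name `run_groups` and the `MARVIN_RUN` override (useful for trying the loop with a stub) are assumptions and not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch: run several extraction groups one after another and stop at the first failure.
# MARVIN_RUN is a hypothetical override for the runner script, handy for dry runs.
run_groups() {
    local runner="${MARVIN_RUN:-./marvin_extraction_run.sh}"
    local group
    for group in "$@"; do
        echo "starting group: $group"
        "$runner" "--group=$group" || {
            echo "group $group failed, aborting" >&2
            return 1
        }
    done
}

# Example (each group takes several days, see the estimates above):
#   run_groups generic mappings
```

Setting `MARVIN_RUN=echo` prints the commands instead of running them, which is a cheap way to check the group list first.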

## Cronjobs

Below is a list of cronjobs we use on the different servers

## Acknowledgements
We thank Sören Auer and the Technische Informationsbibliothek (TIB) for providing three servers to run:

* the main DBpedia extraction on a monthly basis 
* community-provided extractors on Wikipedia, Wikidata or other sources 
* enrichment, cleaning and parsing services, so-called [Databus mods](https://github.com/dbpedia/databus-mods/) for open data on the Databus

This contribution by TIB to DBpedia & its community is a great push towards incentivizing Open Data and establishing a global and national research and innovation data infrastructure. 

# Workflow

## Downloading the Wikimedia dumps
TODO

## Update and Run the extraction
TODO

## Deploy MARVIN on Databus
TODO

## [Manual] Run Databus-Derive (clone and parse)
On the respective server, the user `marvin-fetch` has access to `/data/derive`, which contains the `pom.xml` of https://github.com/dbpedia/databus-maven-plugin/tree/master/dbpedia

```
# Query to get all versions for derive in XML syntax, ready to paste into pom.xml
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?derive WHERE {
    ?dataset dataid:group <https://databus.dbpedia.org/marvin/generic> .
    ?dataset dataid:artifact ?artifact .
    ?dataset dataid:version ?version .
    ?dataset dct:hasVersion "2019.08.30"^^xsd:string .
    # STR() turns the artifact IRI into a string so that CONCAT is valid in standard SPARQL
    BIND (CONCAT("<version>", STR(?artifact), "/${databus.deriveversion}</version>") AS ?derive)
}
ORDER BY ASC(?derive)
```
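
Since the group and the version change with every release, it can help to generate the query rather than edit it by hand. The helper below is a sketch under assumptions: the name `derive_query` is made up for illustration, and it only prints the query text. Sending it to a SPARQL endpoint is left to the reader.

```shell
#!/usr/bin/env bash
# Sketch: print the derive query for one Databus group and version.
# derive_query is a hypothetical helper, not a script shipped with this repo.
derive_query() {
    local group="$1" version="$2"
    # Unquoted heredoc: ${group} and ${version} are expanded by the shell,
    # while \${databus.deriveversion} stays literal for Maven to resolve later.
    cat <<EOF
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?derive WHERE {
    ?dataset dataid:group <https://databus.dbpedia.org/marvin/${group}> .
    ?dataset dataid:artifact ?artifact .
    ?dataset dct:hasVersion "${version}"^^xsd:string .
    BIND (CONCAT("<version>", STR(?artifact), "/\${databus.deriveversion}</version>") AS ?derive)
}
ORDER BY ASC(?derive)
EOF
}

# Example: derive_query mappings 2019.08.30 > derive.rq
```

The output can be pasted into any SPARQL client, or posted with `curl --data-urlencode "query=$(derive_query generic 2019.08.30)"` against an endpoint URL of your choice.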

```
su marvin-fetch
tmux a -t derive
WHAT=mappings
NEWVERSION=2019.08.30
# prepare
cd /data/derive/databus-maven-plugin/dbpedia/$WHAT
git pull
mvn versions:set -DnewVersion=$NEWVERSION
# run
mvn databus-derive:clone -Ddatabus.deriveversion=$NEWVERSION
```
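
Because `NEWVERSION` is passed to both `mvn versions:set` and `databus-derive:clone`, a typo in it is costly. A small format check before the `prepare` step can catch that early. This is a sketch: `is_release_version` is a hypothetical helper, and it assumes versions always follow the `YYYY.MM.DD` pattern used in the examples here.

```shell
#!/usr/bin/env bash
# Sketch: return 0 if the argument looks like YYYY.MM.DD.
# Only the format is checked, not whether the date exists in the calendar.
is_release_version() {
    [[ "$1" =~ ^[0-9]{4}\.[0-9]{2}\.[0-9]{2}$ ]]
}

# Example, placed before "# prepare" in the steps above:
#   is_release_version "$NEWVERSION" || { echo "bad version: $NEWVERSION" >&2; exit 1; }
```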

## [Manual] pull data to downloads.dbpedia.org server
Run the `marvin-fetch.sh` script in the `databus/dbpedia` folder, passing the group and the version:

```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia
./marvin-fetch.sh wikidata 2019.08.01
```

## Deploy cleaned files to dbpedia

```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia/mappings
mvn clean 
mvn validate
mvn -T 8 deploy
```
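
When the three Maven commands are pasted as separate lines, `deploy` still runs if `validate` fails. One way to avoid that is to chain them with `&&`. The wrapper below is a sketch: `deploy_group` is a hypothetical name, and the directory is passed in rather than hard-coded.

```shell
#!/usr/bin/env bash
# Sketch: clean, validate and deploy one group directory, stopping at the first failing step.
# The body runs in a subshell so the caller's working directory is left unchanged.
deploy_group() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        mvn clean && mvn validate && mvn -T 8 deploy
    )
}

# Example:
#   deploy_group /media/bigone/25TB/releases/databus-maven-plugin/dbpedia/mappings
```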