# MARVIN-config

MARVIN is the release bot that runs automated DBpedia releases each month on three different servers, covering the generic, mappings, wikidata and abstract extractions.
[This repository](https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config) can be used to fork the architecture for creating extensions, developing new extractors or debugging old ones. 
Fixes and patches will be deployed on the DBpedia servers each month via a fresh `git clone` from the `master` branch of the [DIEF (DBpedia Information Extraction Framework)](https://github.com/dbpedia/extraction-framework/). 

## Contributions & License
All scripts and config files in this repo are CC-0 (Public Domain). 
We accept pull requests to improve the config files. All contributions will be merged as CC-0.
Marvin-config is intended to bootstrap developing fixes for the DIEF.

## Run a MARVIN extraction
Implementation note: the scripts create a folder `marvin-extraction` that holds the code, results and logs.

```
# check out this repo with all config files
git clone https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
cd marvin-config
# (optional) delete previous versions of the DIEF
rm -rf marvin-extraction/extraction-framework
# (~10 minutes) install the DIEF in marvin-extraction/extraction-framework
# if you installed it already, you can run `git pull && mvn clean install` there to update
./setup-dief.sh
# test run Romanian extraction, very small
./marvin_extraction_run.sh test
```

To run the other extractions, use any of the following:
```
# around 4-7 days
./marvin_extraction_run.sh generic
# around 4-7 days
./marvin_extraction_run.sh mappings
# around 7-14 days
./marvin_extraction_run.sh wikidata
```

## Cronjobs

Below is a list of cronjobs we use on the different servers:

```
# extraction and release for wikidata
0 0 7 * * /bin/bash -c 'cd /data/marvin-config && ./marvin-extraction-run.sh wikidata && ./databus-release.sh'
```

## Acknowledgements
We thank Sören Auer and the Technische Informationsbibliothek (TIB) for providing three servers to run:

* the main DBpedia extraction on a monthly basis 
* community-provided extractors on Wikipedia, Wikidata or other sources 
* enrichment, cleaning and parsing services, so-called [Databus mods](https://github.com/dbpedia/databus-mods/) for open data on the Databus

This contribution by TIB to DBpedia & its community is a great push towards incentivizing Open Data and establishing a global and national research and innovation data infrastructure. 
# Workflow Description

## Update and Run the extraction
To run a generic, mappings, or wikidata extraction, use the script below.
By default it creates all folders relative to the directory it is executed from.
If you want to adapt some paths, you can edit them inside `functions.sh`.
```bash
./marvin-extraction-run.sh
```
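
To illustrate what "relative to the execution directory" means, here is a minimal sketch. The helper names `resolve_extraction_dir` and `ensure_extraction_dir` are hypothetical and are not taken from `functions.sh`; only the folder name `marvin-extraction` comes from this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of functions.sh): resolve the working
# folder against a base directory, defaulting to the directory the
# script was started from.
resolve_extraction_dir() {
    local base_dir="${1:-$PWD}"
    # strip one trailing slash so the result has no double slash
    printf '%s/marvin-extraction\n' "${base_dir%/}"
}

# Create the folder (including parents) if it does not exist yet
# and print its path.
ensure_extraction_dir() {
    local dir
    dir="$(resolve_extraction_dir "$1")"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}
```

Calling `ensure_extraction_dir` without an argument from `/data/marvin-config` would create `/data/marvin-config/marvin-extraction`; starting the same call from another directory would create the folder there instead.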
## Deploy MARVIN on Databus


The `databus-release.sh` script contains the workflow that renames the extracted files and copies them into a structure the databus-maven-plugin can read (e.g. artifact and content variants).
It also defines the parameters used to deploy the MARVIN releases on the Databus.

```bash
./databus-release.sh
```

> **NOTE:** This script still uses absolute paths and depends on the private key of the publisher's WebID.
>
> **TODO:** Refactor `./databus-release.sh`
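
The rename-and-copy step could look roughly like the sketch below. It assumes a target layout of `<artifact>/<version>/<artifact>_<content variant>.<extension>`; the function names, the example artifact `labels` and the variant `lang=ro` are illustrative assumptions and are not taken from `databus-release.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of databus-release.sh): build the
# relative target path <artifact>/<version>/<artifact>_<variant>.<extension>.
databus_target_path() {
    local artifact="$1" version="$2" variant="$3" extension="$4"
    printf '%s/%s/%s_%s.%s\n' "$artifact" "$version" "$artifact" "$variant" "$extension"
}

# Copy one extracted file into that layout below a release root.
# Arguments: source file, release root, artifact, version, variant, extension.
stage_for_release() {
    local source_file="$1" release_root="$2"
    local target
    target="$release_root/$(databus_target_path "$3" "$4" "$5" "$6")"
    mkdir -p "$(dirname "$target")" && cp "$source_file" "$target"
}
```

Passing the release root as an argument, rather than hard-coding it, is one way to avoid the absolute paths mentioned in the note above.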
## [Manual] Run Databus-Derive (clone and parse)
On the respective server, the user `marvin-fetch` has access to `/data/derive`, which contains the `pom.xml` of https://github.com/dbpedia/databus-maven-plugin/tree/master/dbpedia

```
# query to get all versions for derive in XML syntax to paste directly into pom.xml
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?derive WHERE {
    ?dataset dataid:group <https://databus.dbpedia.org/marvin/generic> .
    ?dataset dataid:artifact ?artifact .
    ?dataset dataid:version ?version .
    ?dataset dct:hasVersion "2019.08.30"^^xsd:string .
    BIND (CONCAT("<version>", STR(?artifact), "/${databus.deriveversion}</version>") AS ?derive)
}
ORDER BY ASC(?derive)
```


```
su marvin-fetch
tmux a -t derive
WHAT=mappings
NEWVERSION=2019.08.30
# prepare
cd /data/derive/databus-maven-plugin/dbpedia/$WHAT
git pull
mvn versions:set -DnewVersion=$NEWVERSION
# run
mvn databus-derive:clone -Ddatabus.deriveversion=$NEWVERSION
```
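
Because `NEWVERSION` is typed by hand in the steps above, a small guard can catch typos before Maven runs. The sketch assumes that versions always follow the `YYYY.MM.DD` shape seen in the examples (`2019.08.30`); the function name `is_valid_version` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: accept only versions shaped like 2019.08.30.
# It validates the shape only, not whether the date exists.
is_valid_version() {
    case "$1" in
        [0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}
```

It could be used as `is_valid_version "$NEWVERSION" || exit 1` right after setting the variable.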
## [Manual] pull data to downloads.dbpedia.org server
Run the `marvin-fetch.sh` script in the `dbpedia` folder of the databus-maven-plugin checkout:

```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia
./marvin-fetch.sh wikidata 2019.08.01

```

## Deploy cleaned files to dbpedia
```
cd /media/bigone/25TB/releases/databus-maven-plugin/dbpedia/mappings
mvn clean 
mvn validate
mvn -T 8 deploy
```
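
The three Maven calls above can be chained so that a failed `clean` or `validate` stops the deploy. This is a minimal sketch; the function name `deploy_group` and its arguments are assumptions, not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run clean, validate and deploy in one group
# folder and stop at the first failing step.
# Arguments: group directory, number of Maven threads (default 8).
deploy_group() {
    local group_dir="$1" threads="${2:-8}"
    # the subshell keeps the caller's working directory unchanged
    ( cd "$group_dir" && mvn clean && mvn validate && mvn -T "$threads" deploy )
}
```

For example, `deploy_group /media/bigone/25TB/releases/databus-maven-plugin/dbpedia/mappings` would run the same three steps as the block above.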