Commit 43d11c65 authored by Jerome Wuerf's avatar Jerome Wuerf

Improve readme

parent 9b0ca61c
# Code
## Basic setup
This is the argument retrieval system of team hitgirl. The source code is located under `./python/src`.
### System architecture
![system_architecture](./system_architecture.png)
## CLI
The CLI of our system has two subcommands: `indexing` and `retrieval`.
The entry point to the runtime can be found in `./docker/docker-compose.dev.yaml`.
### Indexing
```bash
usage: app.py indexing [-h] [--elastic-host ELASTIC_HOST] [--create]
                       sentences_path embeddings_path

positional arguments:
  sentences_path        The file path to the csv file containing the
                        sentences.
  embeddings_path       The file path to the embeddings of the argument units.

optional arguments:
  -h, --help            show this help message and exit
  --elastic-host ELASTIC_HOST
                        The hostname of the server/docker container that runs
                        elastic search.
  --create              If flag is present, two new indices are created,
                        overriding existing ones.
```
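Before indexing, each row of the sentences CSV has to be paired with its embedding vector. The following is a minimal sketch of that pairing step on toy data; the column names (`id`, `sentence`) and the embedding format are assumptions for illustration, not taken from the repo.

```python
# Hypothetical sketch: pair rows of a sentences CSV with embedding vectors
# to build the documents that would be sent to Elasticsearch.
# Column names and the embedding layout are assumptions, not the repo's schema.
import csv
import io

sentences_csv = io.StringIO(
    "id,sentence\n"
    "s1,Tuition-free college increases enrollment.\n"
    "s2,Free college should be a universal right.\n"
)
embeddings = {
    "s1": [0.12, 0.80],
    "s2": [0.33, 0.45],
}

documents = []
for row in csv.DictReader(sentences_csv):
    documents.append({
        "_id": row["id"],
        "sentence": row["sentence"],
        "embedding": embeddings[row["id"]],
    })

print(len(documents))       # 2
print(documents[0]["_id"])  # s1
```

In the real system these documents would then be bulk-indexed into the two Elasticsearch indices that `--create` (re)creates.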
### Retrieval
```bash
usage: app.py retrieval [-h] [--topic-nrb TOPIC_NRB]
                        [--nrb-conclusions-per-topic NRB_CONCLUSIONS_PER_TOPIC]
                        [--nrb-premises-per-conclusion NRB_PREMISES_PER_CONCLUSION]
                        [--min-length-factor MIN_LENGTH_FACTOR]
                        [--reranking {maximal-marginal-relevance,structural-distance,argument-rank,word-mover-distance}]
                        [--reuse-unranked REUSE_UNRANKED]
                        [--lambda-conclusions LAMBDA_CONCLUSIONS]
                        [--lambda-premises LAMBDA_PREMISES]
                        [--mu-conclusions MU_CONCLUSIONS]
                        [--mu-premises MU_PREMISES] [--wait-for-es]
                        run_name input_path output_path

positional arguments:
  run_name              The run name that will be included in the last column
                        of the trec file.
  input_path            The file path to the directory containing the input
                        files.
  output_path           The file path to the directory containing the output
                        files.

optional arguments:
  -h, --help            show this help message and exit
  --topic-nrb TOPIC_NRB
                        Restrict the current indexing and/or reranking to a
                        given topic number.
  --nrb-conclusions-per-topic NRB_CONCLUSIONS_PER_TOPIC
                        The number of conclusions that should be retrieved
                        from the index per topic.
  --nrb-premises-per-conclusion NRB_PREMISES_PER_CONCLUSION
                        The number of premises that should be retrieved from
                        the index per conclusion.
  --min-length-factor MIN_LENGTH_FACTOR
  --reranking {maximal-marginal-relevance,structural-distance,argument-rank,word-mover-distance}
  --reuse-unranked REUSE_UNRANKED
  --lambda-conclusions LAMBDA_CONCLUSIONS
  --lambda-premises LAMBDA_PREMISES
  --mu-conclusions MU_CONCLUSIONS
  --mu-premises MU_PREMISES
  --wait-for-es
```
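For the `maximal-marginal-relevance` reranking option, the `--lambda-*` weights presumably trade off query relevance against redundancy among already-selected results. Below is a generic MMR sketch on toy scores to illustrate that trade-off; it is not the repo's actual implementation, and the similarity values are made up.

```python
# Minimal maximal-marginal-relevance (MMR) sketch: lam weighs query relevance
# against redundancy with already-selected documents. Generic illustration
# only; the repo's scoring functions are not reproduced here.
def mmr(relevance, similarity, lam=0.5, k=2):
    """relevance: {doc: score}; similarity: {(a, b): score}, symmetric lookup."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(d):
            # Redundancy = highest similarity to anything already picked.
            redundancy = max((similarity.get((d, s), similarity.get((s, d), 0.0))
                              for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

relevance = {"d1": 0.9, "d2": 0.85, "d3": 0.3}
similarity = {("d1", "d2"): 0.95, ("d1", "d3"): 0.1, ("d2", "d3"): 0.1}
print(mmr(relevance, similarity, lam=0.5, k=2))  # ['d1', 'd3']
```

With `lam=0.5` the near-duplicate `d2` loses to the less relevant but more diverse `d3`; with `lam=1.0` pure relevance would pick `d1` then `d2`.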
## Development setup
The setup is optimized for Microsoft's VSCode.
### Create python env for dev tools
Dev tools live in their own virtual env. This works better on macOS systems.
```bash
$ pip install -r requirements.devtools.txt
$ pre-commit install
```
### Enable linter and formatter in VSCode
1. Open the settings file `.vscode/settings.dist.json`
2. Change `INSERT_USERNAME_HERE` to your current username
3. Rename the file to `settings.json`
### Debugging
The debugger is attached to the running containers via port `5678`. See `.vscode/launch.json` for
details.
1. Right click on `./docker/docker-compose.dev.yaml`
2. Wait until the container setup is ready (watch the magic happening in the terminal)
3. Open `python/src/prototype/app.py`
4. Set a breakpoint
5. Open the debugger window
6. Execute the debugger (click the green play button)

Hope it helps, see you soon!
### Data Set: Processed Args.me
The provided dataset is quite messy. We have a csv with the following schema:
`id, conclusion, premises, context, sentences`
- `sourceTitle`
- `sourceUrl`
- `sentences` is a string of a list of json objects
  - one json object in the list corresponds to one sentence
  - the last object is the conclusion
  - all preceding objects are premises
  - one object has the following keys
    - `sent_id`
    - `sent_text`
- in the `sentences` attribute, 692 conclusions occur word for word in one of the premises
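Given the convention above (last object is the conclusion, the rest are premises), splitting one `sentences` field can be sketched as follows; the sentence texts are invented toy data.

```python
# The `sentences` column stores a JSON list as a string; by the schema above,
# the last object is the conclusion and all preceding objects are premises.
import json

sentences_field = (
    '[{"sent_id": 0, "sent_text": "College debt burdens graduates."},'
    ' {"sent_id": 1, "sent_text": "Education is a public good."},'
    ' {"sent_id": 2, "sent_text": "College should be tuition-free."}]'
)

objects = json.loads(sentences_field)
premises = [o["sent_text"] for o in objects[:-1]]
conclusion = objects[-1]["sent_text"]

print(conclusion)     # College should be tuition-free.
print(len(premises))  # 2
```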
@@ -19,28 +19,28 @@ services:
- elastic
elastic:
image: "docker.elastic.co/elasticsearch/elasticsearch:7.15.2"
restart: always
networks:
- tira
ports:
- "9200:9200"
- "9300:9300"
volumes:
- /mnt/data/elastic:/usr/share/elasticsearch/data
- ./conifg:/conifg
environment:
- discovery.type=single-node
- logger.level=DEBUG
healthcheck:
test:
[
"CMD",
"curl",
"-s",
"-f",
"http://localhost:9200/_cat/health"
]
networks:
tira: null
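The compose healthcheck curls Elasticsearch's `/_cat/health` endpoint. A small helper for interpreting that response might look like the sketch below; the assumption (based on the default `_cat/health` column order) is that the cluster status is the fourth whitespace-separated field, and that `green` or `yellow` counts as healthy for a single-node setup.

```python
# Hedged helper for the /_cat/health output used by the compose healthcheck.
# Assumes the default column order, where the cluster status ("green",
# "yellow", or "red") is the fourth field of the one-line response.
def cluster_is_healthy(cat_health_line: str) -> bool:
    fields = cat_health_line.split()
    status = fields[3]
    return status in ("green", "yellow")

# Toy response line, shaped like the default _cat/health output.
sample = "1640000000 12:00:00 docker-cluster yellow 1 1 5 5 0 0 0 0 - 100.0%"
print(cluster_is_healthy(sample))  # True
```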
@@ -10,6 +10,7 @@ import time
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', level='INFO')
logging.Logger.manager.loggerDict["elastic_transport.transport"].disabled = True
class App:
def __init__(self, configuration: Configuration):
@@ -74,13 +75,17 @@ class App:
self.config['LAMBDA_PREMISES']
)
elif self.config['RERANKING'] == RerankingOptions.STRUCTURAL_DISTANCE.value:
reranker = StructuralDistanceReranking(
retrieved_results, self.config['RUN_NAME'],
topics, self.config['MU_PREMISES'],
self.config['MU_CONCLUSIONS'])
elif self.config['RERANKING'] == RerankingOptions.ARGUMENT_RANK.value:
reranker = ArgumentRankReranking(retrieved_results, self.config['RUN_NAME'], topics)
elif self.config['RERANKING'] == RerankingOptions.WORD_MOVER_DISTANCE.value:
reranker = WordMoverDistanceReranking(
retrieved_results, self.config['RUN_NAME'],
topics, self.config['MU_PREMISES'],
self.config['MU_CONCLUSIONS'])
else:
reranker = NoReranking(retrieved_results, self.config['RUN_NAME'])
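The if/elif chain above maps the `--reranking` value to a reranker class. An equivalent table-driven sketch is shown below; the enum values match the CLI choices, but the factory lambdas are simplified stand-ins, not the repo's actual reranker constructors.

```python
# Table-driven variant of the reranker dispatch above. The factories are
# simplified stand-ins; the real classes also take run name, topics, and
# the mu/lambda parameters.
from enum import Enum

class RerankingOptions(Enum):
    MAXIMAL_MARGINAL_RELEVANCE = "maximal-marginal-relevance"
    STRUCTURAL_DISTANCE = "structural-distance"
    ARGUMENT_RANK = "argument-rank"
    WORD_MOVER_DISTANCE = "word-mover-distance"

def make_reranker(name, results):
    factories = {
        RerankingOptions.MAXIMAL_MARGINAL_RELEVANCE.value: lambda r: ("mmr", r),
        RerankingOptions.STRUCTURAL_DISTANCE.value: lambda r: ("structural", r),
        RerankingOptions.ARGUMENT_RANK.value: lambda r: ("argument-rank", r),
        RerankingOptions.WORD_MOVER_DISTANCE.value: lambda r: ("wmd", r),
    }
    # Unknown or missing option falls through to no reranking.
    return factories.get(name, lambda r: ("none", r))(results)

print(make_reranker("structural-distance", [])[0])  # structural
```

A dict dispatch keeps the mapping in one place and makes the fallback (`NoReranking`) explicit.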
@@ -27,6 +27,8 @@ class Configuration():
'REUSE_UNRANKED',
'LAMBDA_CONCLUSIONS',
'LAMBDA_PREMISES',
'MU_CONCLUSIONS',
'MU_PREMISES',
'WAIT_FOR_ES'],
}
@@ -62,6 +64,8 @@ class Configuration():
args.reuse_unranked,
args.lambda_conclusions,
args.lambda_premises,
args.mu_conclusions,
args.mu_premises,
args.wait_for_es
]
@@ -10,7 +10,7 @@ class Text:
"""
TODO
"""
description = "Hit-Girl's Argument Retrieval System"
indexing = 'Sub command to create a semantic index with elastic search.'
elastic_host = 'The hostname of the server/docker container that runs elastic search.'
create = 'If flag is present two new indices are created, overriding existing ones.'
@@ -85,8 +85,16 @@ def parse_cli_args() -> argparse.Namespace:
type=float,
required=False,
default=0.5)
parser_retrieval.add_argument('--mu-conclusions',
type=float,
required=False,
default=0.9)
parser_retrieval.add_argument('--mu-premises',
type=float,
required=False,
default=0.75)
parser_retrieval.add_argument('--wait-for-es',
action='store_true')
parser_retrieval.add_argument('run_name', type=str, help=Text.run_name)
parser_retrieval.add_argument('input_path', type=str, help=Text.input_path)
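The new `--mu-*` options added above default to `0.9` (conclusions) and `0.75` (premises). A self-contained sketch of how they parse, using only those two arguments:

```python
# Standalone argparse sketch mirroring the --mu-* options added above,
# with the same defaults (0.9 for conclusions, 0.75 for premises).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--mu-conclusions', type=float, required=False, default=0.9)
parser.add_argument('--mu-premises', type=float, required=False, default=0.75)

args = parser.parse_args([])
print(args.mu_conclusions, args.mu_premises)  # 0.9 0.75

args = parser.parse_args(['--mu-premises', '0.5'])
print(args.mu_premises)  # 0.5
```

Note that argparse converts the dashes in the flag names to underscores in the namespace attributes.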
......