Commit 86c23ae7 authored by Robert Sachunsky

initial release…

first functional version; fixed a-priori 1:1:1 model:port:server dependency

model files kept separate (only copied into worktree to build Docker image)
# cf. github.com/OCR-D/ocrd_all
# ocrd/all # ocrd/core # ubuntu:18.04
ARG VERSION=maximum-cuda-git
FROM ocrd/all:$VERSION
LABEL maintainer="sachunsky@informatik.uni-leipzig.de"
# keep PREFIX and VIRTUAL_ENV from ocrd/all
# but export them for COPY etc
ENV PREFIX=$PREFIX
ENV VIRTUAL_ENV=$VIRTUAL_ENV
# make apt run non-interactive during build
ENV DEBIAN_FRONTEND noninteractive
# make apt system functional
RUN apt-get update && \
    apt-get install -y apt-utils wget git && \
    apt-get clean
WORKDIR /build
RUN ln /usr/bin/python3 /usr/bin/python
ENV TESSDATA=${VIRTUAL_ENV}/share/tessdata
RUN mkdir -p $TESSDATA $TESSDATA/script
RUN wget -P $TESSDATA https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata
# replace Tesseract Docker build with version from PPA (with OpenMP):
RUN make -W tesseract install-tesseract TESSERACT_CONFIG='CXXFLAGS="-g -O2 -fPIC"'
ENV LD_LIBRARY_PATH $PREFIX/lib
RUN cd tesserocr && pip install -e .
# workaround for imgaug#473 (opencv-python/headless)
RUN pip install --no-binary imgaug imgaug
# ensure Tensorflow is configured to use CUDA
RUN pip install tensorflow_gpu==1.15.4
# let h5py match TF 1.15 generated models
RUN pip install h5py==2.10
# remove ocrd_segment from sub-venv
RUN rm $PREFIX/bin/ocrd-segment*
# add ocrd_segment address detection branch
RUN git -C ocrd_segment fetch origin maskrcnn-cli
RUN git -C ocrd_segment checkout maskrcnn-cli
# install in top-level, not in sub-venv
RUN pip install -e ocrd_segment
# update to ocrd_cis#77
RUN git -C ocrd_cis fetch origin pull/77/head:fix-resegment
RUN git -C ocrd_cis checkout fix-resegment
RUN pip install -e ocrd_cis
# update to core#652
RUN git -C core fetch origin pull/652/head:workflow-server
RUN git -C core checkout workflow-server
RUN make -C core install PIP_INSTALL="pip install -e"
# update `make server` bridge
RUN git -C workflow-configuration pull origin master
RUN make -C workflow-configuration install
# add further workflow configurations
COPY *.mk $PREFIX/share/workflow-configuration/
# add model files for classify-formdata-layout
ENV MRCNNDATA=${VIRTUAL_ENV}/share/ocrd_segment
RUN mkdir -p $MRCNNDATA
COPY *.h5 $MRCNNDATA/
# configure writing to ocrd.log for profiling
COPY ocrd_logging.conf /etc
ENV DEBIAN_FRONTEND teletype
WORKDIR /data
VOLUME /data
# entrypoint is OCR-D workflow server webservice (1 fixed MDL model per server)
# use with `docker run -e MDL=Techem` to override
ENV MDL=Brunata
CMD make -I /usr/share/workflow-configuration -f /usr/share/workflow-configuration/conreform-sparse-tesseract-deu.mk server MDL=$MDL PORT=7001 HOST=0.0.0.0
# start with `docker run -p N:7001` and query with `ocrd workflow client -p N`
EXPOSE 7001
TAGNAME = bertsky/conreform
build:
	docker build -t $(TAGNAME) .
run: DATA ?= $(CURDIR)
run: MDL ?= Brunata
run: PORT ?= 7001
run:
	docker run -e MDL=$(MDL) -p $(PORT):7001 -v $(DATA):/data $(TAGNAME)
halt: PORT ?= 7001
halt:
	ocrd workflow client -p $(PORT) shutdown
.PHONY: build run halt
# conreform
AI backend for SmartHEC project: OCR extraction of relevant information from scanned forms via context recognition
Defines a Docker service that runs an [OCR-D](https://ocr-d.de) [workflow](https://ocr-d.de/en/spec/glossary#ocr-d-workflow) for text extraction of predefined form fields (visual object classes) from scanned/photographed forms on given [OCR-D workspaces](https://ocr-d.de/en/spec/glossary#workspace). The workspace is assumed to contain nothing but a fileGrp `OCR-D-IMG` with the raw images, and will be annotated up to a final fileGrp `OCR-D-OCR-TESS-deu-SEG-tesseract-sparse-FORM-OCR` with [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) representing the final result.
## Installation
Install [Docker CE](https://docs.docker.com/install/) and GNU make. Copy `*.h5` (HDF5-formatted) model files trained by [Mask-RCNN for formdata](https://github.com/OCR-D/ocrd_segment/blob/maskrcnn-cli/maskrcnn-cli/formdata.py) into the CWD. Then:

    make build
## Usage
1. Start one workflow server for each model (i.e. billing company / `Messdienstleister`):

       make run DATA=$PWD MDL=Brunata PORT=7001 &
       make run DATA=$PWD MDL=Techem PORT=7002 &
       make run DATA=$PWD MDL=Ista PORT=7003 &

   > Note: In the future, MDL will be classified on-the-fly, so a single model and server will suffice.
2. Create a [workspace](https://ocr-d.de/en/user_guide#preparing-a-workspace) for each group of image files (belonging to one bill) you want to analyse:

       ocrd-import -P brunata-dir1
       ocrd-import -P brunata-dir2
       ocrd-import -P techem-dir

   > Note: A workspace could also be created from a [METS](http://www.loc.gov/standards/mets/) template file.
3. For each workspace, issue a processing request to the workflow server. You need to know a priori which model/server to use:

       ocrd workflow client -p 7001 process -m brunata-dir1/mets.xml
       ocrd workflow client -p 7001 process -m brunata-dir2/mets.xml
       ocrd workflow client -p 7002 process -m techem-dir/mets.xml

   (or equivalently:)

       curl -G -d mets=brunata-dir1/mets.xml http://127.0.0.1:7001/process
       curl -G -d mets=brunata-dir2/mets.xml http://127.0.0.1:7001/process
       curl -G -d mets=techem-dir/mets.xml http://127.0.0.1:7002/process

   > Note: It is best to use a path relative to the `DATA` directory that was bind-mounted when starting the workflow server.
4. To stop a running server, issue a shutdown request to the workflow server, or stop the respective docker container:

       ocrd workflow client -p 7001 shutdown
       ocrd workflow client -p 7002 shutdown

   (or equivalently:)

       make halt PORT=7001
       make halt PORT=7002

   (or equivalently:)

       curl -G http://127.0.0.1:7001/shutdown
       curl -G http://127.0.0.1:7002/shutdown
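
Because each model is bound to its own server and port, a client has to route every request itself. The following is a minimal Python sketch of that routing, assuming the three ports from the `make run` calls above. The `PORTS` table and the helpers `request_url` and `send` are hypothetical names for illustration, not part of OCR-D. The sketch builds the same GET URLs as the `curl -G` calls; `send` only works against a running workflow server.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical table: model name -> host port, mirroring the
# `make run MDL=... PORT=...` calls in step 1.
PORTS = {"Brunata": 7001, "Techem": 7002, "Ista": 7003}


def request_url(model, action, mets=None, host="127.0.0.1"):
    """Build the URL of a `process` or `shutdown` request for one model."""
    url = "http://%s:%d/%s" % (host, PORTS[model], action)
    if mets is not None:
        # safe="/" leaves path separators unescaped, like `curl -G -d`
        url += "?" + urlencode({"mets": mets}, safe="/")
    return url


def send(model, action, mets=None):
    """Send the GET request and return the response body as text."""
    with urlopen(request_url(model, action, mets)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; call send() instead when the servers are up.
    print(request_url("Brunata", "process", "brunata-dir1/mets.xml"))
    print(request_url("Techem", "shutdown"))
```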
To query the resulting PAGE-XML (with namespace prefix `pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"`) for
- text of a _context_ annotation for class `gebaeude_heizkosten_raumwaerme`:
```xpath
//*[contains(@custom,"subtype:context=gebaeude_heizkosten_raumwaerme")]/pc:TextEquiv[1]/pc:Unicode/text()
```
(`*` because it could be `pc:TextLine` or `pc:Word`)
- respective polygon outline (as white-space separated points, each a comma-separated x/y pair):
```xpath
//*[contains(@custom,"subtype:context=gebaeude_heizkosten_raumwaerme")]/pc:Coords/@points
```
(all coordinates relate to image under `/pc:PcGts/pc:Page/@imageFilename`)
- text of a _target_ annotation for class `gebaeude_heizkosten_raumwaerme`:
```xpath
//pc:TextLine[contains(@custom,"subtype:target=gebaeude_heizkosten_raumwaerme")]/pc:TextEquiv[1]/pc:Unicode/text()
```
- respective polygon outline:
```xpath
//pc:TextLine[contains(@custom,"subtype:target=gebaeude_heizkosten_raumwaerme")]/pc:Coords/@points
```
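
The queries above can also be run with Python's standard library only. `xml.etree.ElementTree` does not support the XPath function `contains()`, so this sketch filters on the `custom` attribute in Python instead. The `SAMPLE` document is made up for illustration (it is not real output of the workflow), and `find_annotations` is a hypothetical helper name. Like `contains()`, the substring test would also match a longer class name that starts with the given one.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up PAGE-XML fragment for illustration only.
SAMPLE = """\
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page1.png">
    <pc:TextRegion id="r1">
      <pc:Coords points="0,0 500,0 500,100 0,100"/>
      <pc:TextLine id="l1" custom="subtype:target=gebaeude_heizkosten_raumwaerme">
        <pc:Coords points="10,20 110,20 110,40 10,40"/>
        <pc:TextEquiv><pc:Unicode>1.234,56</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>
"""


def find_annotations(root, kind, cls):
    """Return (text, polygon) pairs for all elements annotated with
    `subtype:<kind>=<cls>`, where kind is "context" or "target"."""
    marker = "subtype:%s=%s" % (kind, cls)
    results = []
    for elem in root.iter():
        if marker not in elem.get("custom", ""):
            continue
        # text of the first TextEquiv/Unicode child, if any
        text = elem.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        coords = elem.find("pc:Coords", NS)
        points = []
        if coords is not None:
            # white-space separated points, each a comma-separated x/y pair
            points = [tuple(int(v) for v in pair.split(","))
                      for pair in coords.get("points").split()]
        results.append((text, points))
    return results


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for text, points in find_annotations(root, "target", "gebaeude_heizkosten_raumwaerme"):
        print(text, points)
```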
# Install by copying (or symlinking) makefiles into a directory
# where all OCR-D workspaces (unpacked BagIts) reside. Then
# chdir to that location.
# Call via:
# `make -f WORKFLOW-CONFIG.mk WORKSPACE-DIRS` or
# `make -f WORKFLOW-CONFIG.mk all` or just
# `make -f WORKFLOW-CONFIG.mk`
# To rebuild partially, you must pass -W to recursive make:
# `make -f WORKFLOW-CONFIG.mk EXTRA_MAKEFLAGS="-W FILEGRP"`
# To get help on available goals:
# `make help`
###
# From here on, custom configuration begins.
info:
	@echo "Read image and create PAGE-XML for it,"
	@echo "then crop, binarize, deskew and denoise pages,"
	@echo "then segment into sparse text regions/lines,"
	@echo "then recognize text with Tesseract model deu,"
	@echo "then classify lines into non/context for forms,"
	@echo "then segment into target regions/lines for forms,"
	@echo "and finally re-OCR new segments into numbers-only."
INPUT = OCR-D-IMG
BIN = $(INPUT)-BINPAGE-sauvola
$(BIN): $(INPUT)
$(BIN): TOOL = ocrd-olena-binarize
$(BIN): PARAMS = "impl": "sauvola-ms-split" # , "k": 0.34 # threshold (larger=thinner)
DEN = $(BIN)-DENOISE-ocropy
$(DEN): $(BIN)
$(DEN): TOOL = ocrd-cis-ocropy-denoise
$(DEN): PARAMS = "level-of-operation": "page", "dpi": 0, "noise_maxsize": 3.0 # max fg/bg noise in pt
FLIP = $(DEN)-DESKEW-tesserocr
$(FLIP): $(DEN)
$(FLIP): TOOL = ocrd-tesserocr-deskew
$(FLIP): PARAMS = "operation_level": "page", "dpi": 0, "min_orientation_confidence": 3.0
DESK = $(FLIP)-DESKEW-ocropy
$(DESK): $(FLIP)
$(DESK): TOOL = ocrd-cis-ocropy-deskew
$(DESK): PARAMS = "level-of-operation": "page", "maxskew": 10 # max angle in degrees (larger=slower)
OCR = OCR-D-OCR-TESS-deu-SEG-tesseract-sparse
$(OCR): $(DESK)
$(OCR): TOOL = ocrd-tesserocr-recognize
$(OCR): PARAMS = "dpi": 96, "padding": 5, "sparse_text": true, "segmentation_level": "region", "model": "deu"
TEXT = $(OCR)-FORM-TEXT
$(TEXT): $(OCR)
$(TEXT): TOOL = ocrd-segment-classify-formdata-text
$(TEXT): PARAMS = "threshold": 95
LAYOUT = $(OCR)-FORM-LAYOUT
$(LAYOUT): $(TEXT)
$(LAYOUT): GPU = 1
$(LAYOUT): TOOL = ocrd-segment-classify-formdata-layout
$(LAYOUT): PARAMS = "model": "$(MDL).h5"
RECOGNIZED = $(OCR)-FORM-OCR
$(RECOGNIZED): $(LAYOUT)
$(RECOGNIZED): TOOL = ocrd-tesserocr-recognize
$(RECOGNIZED): PARAMS = "model": "deu", \
	"overwrite_segments": false, \
	"overwrite_text": false, \
	"char_whitelist": "0123456789,.-:"
.DEFAULT_GOAL = $(RECOGNIZED)
# Down here, custom configuration ends.
###
include Makefile
# This is a template configuration file to demonstrate
# formats and destinations of log messages with OCR-D.
# It's meant as an example, and should be customized.
# To take effect, a copy (under the same name) must be placed
# in your CWD, HOME or /etc. These directories are searched
# in that order, and the first match wins. When no config file
# is found, the default logging configuration applies (cf. ocrd_logging.py).
#
# mandatory loggers section
# configure loggers with corresponding keys "root",""
# each logger requires a corresponding configuration section below
#
[loggers]
keys=root,ocrd_profile,ocrd_tensorflow,ocrd_shapely_geos,ocrd_PIL
#
# mandatory handlers section
# handle output for logging "channel"
# i.e. console, file, smtp, syslog, http, ...
# each handler requires a corresponding handler configuration section below
#
[handlers]
keys=consoleHandler,fileHandler
#
# optional formatters section
# format message records, to be used differently by logging handlers
# each formatter requires a corresponding formatter section below
#
[formatters]
keys=defaultFormatter,detailedFormatter
#
# default logger "root" configured to use only consoleHandler
#
[logger_root]
level=INFO
handlers=consoleHandler
#
# additional logger configurations can be added
# as separate configuration sections like below
#
# example logger "ocrd_workspace" uses fileHandler and overrides
# default log level "INFO" with custom level "DEBUG"
# "qualname" must match the logger label used in the corresponding
# OCR-D module
# (look it up in the module of interest)
#
# example configuration entry
#
# logger ocrd.workspace
#
#[logger_ocrd_workspace]
#level=DEBUG
#handlers=fileHandler
#qualname=ocrd.workspace
[logger_ocrd_profile]
level=INFO
handlers=fileHandler
qualname=ocrd.process.profile
#
# logger tensorflow
#
[logger_ocrd_tensorflow]
level=ERROR
handlers=consoleHandler
qualname=tensorflow
#
# logger shapely.geos
#
[logger_ocrd_shapely_geos]
level=ERROR
handlers=consoleHandler
qualname=shapely.geos
#
# logger PIL
#
[logger_ocrd_PIL]
level=INFO
handlers=consoleHandler
qualname=PIL
#
# handle stdout output
#
[handler_consoleHandler]
class=StreamHandler
formatter=defaultFormatter
args=(sys.stdout,)
#
# example logfile handler
# handle output with logfile
#
[handler_fileHandler]
class=FileHandler
formatter=detailedFormatter
args=('ocrd.log','a+')
#
# default log format conforming to OCR-D (https://ocr-d.de/en/spec/cli#logging)
#
[formatter_defaultFormatter]
format=%(asctime)s.%(msecs)03d %(levelname)s %(name)s - %(message)s
datefmt=%H:%M:%S
#
# store more logging context information
#
[formatter_detailedFormatter]
format=%(asctime)s.%(msecs)03d %(levelname)-8s (%(name)s)[%(filename)s:%(lineno)d] - %(message)s
datefmt=%H:%M:%S