- Dec 15, 2018
Robert Sachunsky authored
Robert Sachunsky authored
amend f25d826d7: in processor, when splitting input into symbols and flags, fix a new crash with decomposed/combining characters
Robert Sachunsky authored
Robert Sachunsky authored
replace slow and memory/stack-devouring alignment.sequencealigner by fast difflib.SequenceMatcher in evaluation too
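As a rough illustration of the switch mentioned above, a minimal sketch (not the project's code; the function name and CER formula are illustrative only) of counting character edits with difflib.SequenceMatcher:

    from difflib import SequenceMatcher

    def character_error_rate(ocr_line, gt_line):
        """Approximate CER of one line pair: edit operations derived from
        SequenceMatcher opcodes, divided by the ground-truth length."""
        matcher = SequenceMatcher(None, ocr_line, gt_line, autojunk=False)
        edits = 0
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != 'equal':
                # count replacements/insertions/deletions on character level
                edits += max(i2 - i1, j2 - j1)
        return edits / max(len(gt_line), 1)

    print(character_error_rate('Tbe qnick fox', 'The quick fox'))  # 2/13 ≈ 0.154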
- Dec 14, 2018
Robert Sachunsky authored
allow training error transducer on CSV file instead of directory, plus Python 2 compatibility and some Pylint cosmetics
- Nov 21, 2018
Robert Sachunsky authored
Robert Sachunsky authored
Robert Sachunsky authored
with REJECTION_WEIGHT set to -1, behave differently:
- pyComposition (if enabled) does no backoff_result()
- union with input (and input weight mapping) is instead done in compose_and_search afterwards, but with a long vector of sensible thresholds, each applied alternatively to the same window result (i.e. without having to rerun everything)
- so all results up the call graph now have to be vectorized as well: compose_and_search, create_result_transducer, window_size_1_2, main/process; in the normal case, there is only 1 value in that vector
Robert Sachunsky authored
- catch and show exceptions among pool workers
- exit with failure in the end if exceptions occurred
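A minimal sketch of the worker error handling described above, assuming a multiprocessing.Pool setup; safe_worker and do_process are hypothetical names:

    import logging
    import multiprocessing as mp
    import sys
    import traceback

    def do_process(task):
        return task.upper()                      # placeholder for the real per-task work

    def safe_worker(task):
        """Run one task; report exceptions as strings instead of losing them."""
        try:
            return task, do_process(task), None
        except Exception:
            return task, None, traceback.format_exc()

    if __name__ == '__main__':
        failed = False
        with mp.Pool(4) as pool:
            for task, result, error in pool.imap_unordered(safe_worker, ['a', 'b']):
                if error:
                    logging.error('task %s failed:\n%s', task, error)
                    failed = True
        sys.exit(1 if failed else 0)             # exit with failure if any worker raised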
Robert Sachunsky authored
- normalize Unicode strings to a normal form, and respect remaining decomposed characters when splitting input into symbols and flags (fixes crashes; see the sketch below)
- amend 4ebb55e5: without OpenFST, as_transducer can still be True already
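A sketch of the kind of symbol/flag splitting referred to in the first point; the '@...@' flag pattern and the function name are assumptions, not the module's actual API:

    import re
    import unicodedata

    FLAG = re.compile(r'@[^@]+@')    # assumed HFST-style flag diacritic pattern

    def split_symbols(text, form='NFC'):
        """Normalize, then split into symbols, keeping flag diacritics whole
        and attaching any remaining combining characters to their base."""
        text = unicodedata.normalize(form, text)
        symbols = []
        i = 0
        while i < len(text):
            match = FLAG.match(text, i)
            if match:                             # a whole flag is one symbol
                symbols.append(match.group())
                i = match.end()
                continue
            j = i + 1
            while j < len(text) and unicodedata.combining(text[j]):
                j += 1                            # keep combining marks attached
            symbols.append(text[i:j])
            i = j
        return symbols

    # the combining small e (U+0364) has no precomposed form and stays one symbol:
    print(split_symbols('Bu\u0364cher@N.A@'))     # ['B', 'uͤ', 'c', 'h', 'e', 'r', '@N.A@']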
Robert Sachunsky authored
- Nov 16, 2018
Robert Sachunsky authored
during lexicon extraction, also add '/' as infix and suffix to Spacy's tokenizer
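A sketch of that tokenizer change, assuming a spaCy pipeline such as de_core_news_sm (the commit does not name the model):

    import spacy
    from spacy.util import compile_infix_regex, compile_suffix_regex

    nlp = spacy.load('de_core_news_sm')

    infixes = tuple(nlp.Defaults.infixes) + (r'/',)
    suffixes = tuple(nlp.Defaults.suffixes) + (r'/',)
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

    print([t.text for t in nlp('Wort/Form und Ende/')])
    # expected: ['Wort', '/', 'Form', 'und', 'Ende', '/']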
Robert Sachunsky authored
fix transducer definitions:
- when repeating lexicon transducer according to words_per_window, the last token takes a space character as well
- further repair inter-word/LM lexicon model:
  - last token also needs a flag acceptor (and a space)
  - edits deleting a space should delete the corresponding flag in this model too
Robert Sachunsky authored
- allow (large) input files with more than 1 line
- use generators (strip lines and split at newline)
- prune lexicon with combined absolute (<=3) and relative (<1e-5) frequency threshold (see the sketch below)
- extend number normalization for numerals with decimal point and thousands separators
- normalize umlauts to always use decomposed form with diacritical combining e
- speed up by disabling parser and NER in Spacy
- add '—' as infix to Spacy's tokenizer
- add CLI, make available as parameters: dictionary path, GT suffix
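A sketch of the pruning step; how the absolute and relative thresholds combine is an assumption here (a form is dropped only when it is rare by both measures), and all names are illustrative:

    from collections import Counter

    def prune_lexicon(freqs, max_abs=3, max_rel=1e-5):
        """Drop word forms that are rare both absolutely and relatively."""
        total = sum(freqs.values())
        return {form: count for form, count in freqs.items()
                if not (count <= max_abs and count / total < max_rel)}

    freqs = Counter({'und': 900000, 'Haus': 120, 'Hvus': 2})
    print(prune_lexicon(freqs))    # the OCR artefact 'Hvus' is pruned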
Robert Sachunsky authored
- when extending lexicon transducer according to composition_depth, do not ignore upper/lower case completely, but ensure that non-first words are downcased (with infix/zero connection) or only upper case (with hyphen connection), and that first words are upcased or already upper case
- when extending lexicon transducer with morphology, compose *after* compounds were added
- when using lexicon transducer, make sure to allow both precomposed umlauts and decomposed ones (with diacritical combining e); also, ensure the final lexicon ends up as just an acceptor
- when repeating lexicon transducer according to words_per_window, use 1 to N instead of 0 to N (optionalized lexicon), but make sure the last (1) token has no space
- repair the previously defunct inter-word/LM lexicon model:
  - by stripping initial space from the loaded punctuation_right_transducer
  - by correctly synchronizing on flags
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
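A sketch of that pattern with Python's tempfile module: a recognizable prefix/suffix instead of fixed names in the CWD, and an unconditional unlink; the prefix and the backend call are placeholders:

    import os
    import tempfile

    def process_fst_file(path):
        return os.path.getsize(path)               # placeholder for the OpenFST call

    def with_temporary_fst(data, prefix='cor-window-', suffix='.fst'):
        """Write data to a uniquely named temporary file, hand it to the
        backend, and always unlink it afterwards."""
        fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
        try:
            with os.fdopen(fd, 'wb') as f:
                f.write(data)
            return process_fst_file(path)
        finally:
            os.unlink(path)                         # never leave the file behind

    print(with_temporary_fst(b'\x00' * 16))         # 16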
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
Robert Sachunsky authored
Robert Sachunsky authored
when combining windows, search for the next existing flag instead of blindly assuming next counting flag always remains in some path even after word merge (fixes crash)
Robert Sachunsky authored
Robert Sachunsky authored
Python-side create_input_transducer is still needed when composing without OpenFST backend (amend d22110f4)
Robert Sachunsky authored
in sliding_window's main, never prune away (all) results (backout from d22110f4 here) – hfst's prune has a documentation bug!
Robert Sachunsky authored
- Nov 14, 2018
Robert Sachunsky authored
- for the file-based interface between hfst and openfst, instead of fixed filenames in CWD (partially using the input string in the filename), use proper temporary filenames
- allow preparing input transducers both in Python and in C++, controlled via prepare_input(as_transducer)
- after loading the error transducer, ensure its alphabet / symbol table contains all special flags
- properly tokenize input strings using the special flags (both in Python, prepare_input / create_input_transducer, and in C++, create_input_transducer) -- fixes crashes due to edited flags
- for C++ create_input_transducer, use a SYMBOL StringCompiler with the current symbol table, override fst_field_separator to only use newlines; join on newlines before the call in Python
- delegate reweighting the input transducer with the rejection weight, disjoining that with the result transducer, and determinization to the C++ part if using the OpenFST backend
- always make the result transducer coaccessible/connected to ensure determinize will terminate (fixes crashes)
- make get_flag_states never repeat states (fixes crashes during merge)
- make merge_states faster and more robust
- add more German compound infixes for lexicon composition depth
- improve the C++ part:
  - rename functions sensibly
  - transparently encode/decode Python strings to/from UTF-8 byte strings
  - avoid dragging along parameters for transducers and files
  - use (safe) pointers for FSTs instead of copy-by-value, re-use mutable objects as often as possible
  - use static storage for buffers
  - use member variables for converters (StringCompiler, WeightConvertMapper)
  - use logging (verbosity can be controlled via loglevels on the CLI)
  - make logging and output more consistent, no input strings as filenames
WARNING: quality will be slightly better, but extremely slow; reason still unknown
Robert Sachunsky authored
- go back to strings instead of lists for n-grams and FST definition
- replace 0 as gap element by non-breaking space, ensure that this does not occur anywhere in input
- avoid slow and suboptimal FST definition from dictionary, instead iteratively disjoin n-gram pair tuples (SPV)
- also embed special flags for sliding window construction into the error transducer's alphabet / symbol list
Robert Sachunsky authored
- fix loop adding contexts
- use repeat_n_minus instead of optionalize+repeat_n
Robert Sachunsky authored
- Nov 06, 2018
Robert Sachunsky authored
- replace gap element 'ε' by unambiguous 0 throughout (requires using lists instead of strings between aligner and FST definition)
- to count edits, use only the 1-best alignment (ignore suboptimal ones), include well-aligned lines (up to 100% identical) but exclude pathological cases (less than 5% match), and also ignore empty lines (see the sketch below)
- replace slow and memory/stack-devouring alignment.sequencealigner by fast difflib.SequenceMatcher
- avoid string and file conversion (write+read) for the edit dictionary (just optionally write a human-readable file)
- fix and simplify the no-punctuation filter
- add CLI, also make available as arguments:
  - maximum context size per edit
  - maximum number of errors per window
  - whether or not to preserve punctuation
WARNING: the error models created by this appear to perform worse than the original ones still registered in the repo
commented out because counter-effective:
- combine simple error transducers by also disjoining full error transducer N with full error transducer N-1 (where N=0 is the acceptor)
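A simplified sketch of that edit counting (no context window, no punctuation handling); the function name and thresholds are illustrative:

    from collections import Counter
    from difflib import SequenceMatcher

    def collect_edits(line_pairs, min_ratio=0.05):
        """Count character confusions from the 1-best difflib alignment,
        skipping empty lines and pathological pairs below min_ratio."""
        edits = Counter()
        for ocr, gt in line_pairs:
            if not ocr.strip() or not gt.strip():
                continue                                  # ignore empty lines
            matcher = SequenceMatcher(None, ocr, gt, autojunk=False)
            if matcher.ratio() < min_ratio:
                continue                                  # pathological alignment
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op != 'equal':
                    edits[(ocr[i1:i2], gt[j1:j2])] += 1   # confusion pair
        return edits

    print(collect_edits([('Tbe Hause', 'The Hause'), ('', 'x')]))
    # Counter({('b', 'h'): 1})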
Robert Sachunsky authored
- pass the value of the CLI option for the rejection threshold on to set_transition_weights (via a global constant, in lieu of better encapsulation)
- resurrect the option to apply a language model for rescoring (slows down significantly and deteriorates results for now), add a CLI option
Robert Sachunsky authored
Robert Sachunsky authored
- resurrect inter-word and preserve model here as well
- resurrect application of LM
- improve CLI, make available as arguments:
  - punctuation model (bracket vs inter-word/LM vs preserve)
  - words per window
  - number of result paths per window
  - lexicon composition depth
- reduce transition weights for input transducer (joined to result transducer), acting as rejection threshold: significantly reduces overcorrection
Robert Sachunsky authored
- Nov 05, 2018
Robert Sachunsky authored
- Nov 04, 2018
Robert Sachunsky authored
Robert Sachunsky authored
Robert Sachunsky authored
- add CLI, make available as arguments: directory path, OCR and correction suffix, window size, composition depth, result paths (see the sketch below)
- give user full control of the prefix (do not assume .txt)
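A sketch of such a CLI with argparse; whether this script actually uses argparse, and the exact option names and defaults, are assumptions:

    import argparse

    parser = argparse.ArgumentParser(description='window-based OCR post-correction')
    parser.add_argument('directory', help='path containing OCR and correction files')
    parser.add_argument('-I', '--input-suffix', default='ocr', help='suffix of OCR files')
    parser.add_argument('-O', '--output-suffix', default='cor', help='suffix of correction files')
    parser.add_argument('-W', '--window-size', type=int, default=2, help='words per window')
    parser.add_argument('-D', '--composition-depth', type=int, default=1, help='lexicon composition depth')
    parser.add_argument('-R', '--result-paths', type=int, default=10, help='result paths per window')

    print(parser.parse_args(['some/dir', '-W', '3']))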
Robert Sachunsky authored
- use zero as gap element (to avoid confusion with true ε)
- use StrictGlobalSequenceAligner (to make sure the complete sequence gets used)
- make extra metric for umlauts (confused with diacritical combining e) actually work (proper state changes, no length adjustment)
- simplify micro-averaged CER calculation (edit and length counts; see the sketch below)
- add CLI, make available as arguments: directory path, OCR and correction suffix
- give user full control of the prefix (do not assume .txt)
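On the micro-averaged CER mentioned above: the rate is computed from summed edit and length counts over all lines rather than as a mean of per-line rates. A small, aligner-agnostic sketch (names illustrative):

    def micro_cer(per_line_counts):
        """per_line_counts: iterable of (edit_count, reference_length) pairs."""
        per_line_counts = list(per_line_counts)
        total_edits = sum(edits for edits, _ in per_line_counts)
        total_length = sum(length for _, length in per_line_counts)
        return total_edits / total_length if total_length else 0.0

    print(micro_cer([(2, 13), (0, 7), (5, 40)]))    # 7/60 ≈ 0.117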
Robert Sachunsky authored
- Nov 01, 2018
Lena Schiffer authored
- Oct 01, 2018
Lena Schiffer authored