- Feb 05, 2019
Maciej Sumalvico authored
Changes in get_precision_recall():
- Refactoring: separated the functionality of merging alignments from scoring.
- Changed the definition of true/false and positive/negative. Characters that are originally wrong and wrongly corrected are now false positives (previously: false negatives). This also changes the evaluation results quite significantly!
- Bugfix: consider only one-to-one or one-to-zero alignments, but not one-to-many. This also changes the results.
- Code cleanup.
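A minimal, hedged sketch of the new classification over aligned character triples — assuming one-to-one (OCR, corrected, GT) alignments and taking the definitions above at face value; the actual get_precision_recall() may classify edge cases differently:

```python
from collections import Counter

def classify(ocr, cor, gt):
    """Classify one aligned (OCR, corrected, GT) character triple."""
    if cor == gt:
        return 'TP' if ocr != gt else 'TN'  # error fixed / correct char untouched
    if cor == ocr:
        return 'FN'  # error left uncorrected
    return 'FP'      # wrongly corrected (including originally wrong characters)

def precision_recall(triples):
    c = Counter(classify(*t) for t in triples)
    precision = c['TP'] / (c['TP'] + c['FP']) if c['TP'] + c['FP'] else 0.0
    recall = c['TP'] / (c['TP'] + c['FN']) if c['TP'] + c['FN'] else 0.0
    return precision, recall

# one fixed error, one missed error, one wrong fix:
print(precision_recall([('a', 'b', 'b'), ('x', 'x', 'y'), ('c', 'd', 'e')]))  # (0.5, 0.5)
```

Under this reading, a character changed to the wrong thing counts against precision, while an error left untouched counts against recall.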
-
- Feb 01, 2019
Maciej Sumalvico authored
`process` was renamed to `correct_string` in 2fc4eb20, but not in line 235. This commit fixes it.
-
- Jan 30, 2019
Maciej Sumalvico authored
- moved command-line argument parsing into a separate function
- moved computing each evaluation metric over all lines into a separate function, so that the logic of main() reduces to a simple three-way `if`
- increased the spacing between top-level declarations to two blank lines (PEP 8)
- added the -G parameter for providing the ground truth suffix
-
Maciej Sumalvico authored
Isolated some activities into subfunctions for better structure. Changed the logging level for showing input/output strings from "info" to "debug".
-
Maciej Sumalvico authored
- grouped the globals into two dictionaries: `gl_config` and `model`
- renamed `process` to `correct_string`
- renamed `load_model` to `build_model` (it does other things apart from loading)
- isolated some functionality from the `main` function:
  - parallel processing of strings -> `parallel_process`
  - printing results -> `print_results`
  - building the transducer composition, the flag encoder, and loading the LM transducers -> `build_model`
- minimized the availability of globals to increase readability and avoid bugs:
  - globals are only visible in `main()` and `correct_string()`, but not in any subfunctions that `main()` calls
  - instead of passing `args` (the argument parser) as a global, the dictionary `gl_config` is used, which contains only the values used by `correct_string()`
-
- Jan 29, 2019
Maciej Sumalvico authored
The model-building functions in sliding_window.py (load_transducers_*()) previously contained three kinds of functionality:
- loading transducers
- variant-specific combining of transducers into a single token acceptor
- variant-independent functionality, which was copy-pasted in all three functions (adding flags, converting a single token acceptor to a window, etc.)

This commit isolates the variant-independent functionality into smaller functions build_single_token_acceptor_*(), combines the variant-independent parts for all three variants in the function build_model(), and moves the loading of transducers outside of the `sliding_window` module. Furthermore:
- renamed process_test_data.load_transducers() to load_model()
-
Maciej Sumalvico authored
- isolated loading transducers into a separate function
- isolated preparing the composition of lexicon and model into a separate function
- moved process() before main()
- cleaned up commented-out code, old file names, etc.
-
- Jan 22, 2019
Maciej Sumalvico authored
-
- Jan 21, 2019
Maciej Sumalvico authored
- divided the body into smaller functions
- fixed (in line 140) a bug causing types that are identical in capitalized and uncapitalized form (like '—') to be counted twice
-
- Jan 18, 2019
Maciej Sumalvico authored
- refactored the reading of training data
-
Maciej Sumalvico authored
- isolated functionality into separate functions:
  - creating a single error transducer
  - combining error transducers
- fixed a bug causing only context = 3 to be considered (line 467 pre-commit, previously line 41 in error_transducer_complete.py)
- simplified transducer creation
-
Maciej Sumalvico authored
Merged `error_transducer_complete.py` into `error_transducer.py`, so that one module is responsible for training an error model.
-
Maciej Sumalvico authored
- isolated parse_arguments() as a separate function
- added a -G parameter (gt_suffix) instead of a fixed suffix
- removed some unnecessary comments
-
Maciej Sumalvico authored
process_dta_data.py was merged into lexicon_transducer.py so that only a single module is responsible for building the lexicon. The lexica are no longer saved as plaintext; this information can easily be obtained with `hfst-fst2strings -w`.

helper.py:
- the logarithm of frequencies is computed in the normalizing function, rather than during writing to file
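The frequency-normalization change might look like the following sketch (function and variable names are illustrative, not the actual helper.py API; the negative sign is an assumption, being the usual tropical-semiring convention for FST weights):

```python
import math

def normalize_frequencies(counts):
    """Turn raw counts into negative log-probabilities (tropical weights)."""
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}

weights = normalize_frequencies({'und': 900, 'vnd': 100})
print(round(weights['vnd'], 4))  # -log(0.1) = 2.3026
```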
-
- Jan 17, 2019
Maciej Sumalvico authored
- Model files are moved to a separate repository ('cor-asv-fst-models'). As a temporary solution, the directory 'hfst/fst' has to be linked to the location of the model repository so that the hard-coded paths to model files work.
- Changed the hard-coded model file names in process_test_data.py to match the names of the files created by the training scripts.
-
Maciej Sumalvico authored
- using os.path.join() instead of string concatenation
- removed (useless) trailing slashes from directory names

Minor changes:
- rename: x -> filename in helper.generate_content()
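The point of the change: os.path.join() inserts separators only where needed, so trailing slashes in configured directory names stop mattering. A tiny illustration:

```python
import os.path

name = 'lexicon.hfst'
# String concatenation depends on the caller remembering the slash:
print('hfst/fst' + name)                # 'hfst/fstlexicon.hfst' (broken)
# os.path.join() handles both variants:
print(os.path.join('hfst/fst', name))   # 'hfst/fst/lexicon.hfst'
print(os.path.join('hfst/fst/', name))  # 'hfst/fst/lexicon.hfst'
```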
-
Maciej Sumalvico authored
The character U+0364 (combining latin small letter e):
- is invisible in some terminal fonts,
- breaks syntax highlighting in Vim.
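The character can be inspected with the standard unicodedata module:

```python
import unicodedata

ch = '\u0364'
print(unicodedata.name(ch))           # COMBINING LATIN SMALL LETTER E
print(unicodedata.combining(ch) > 0)  # True: it attaches to the preceding character
print('u' + ch)                       # 'u' with a small e above, font permitting
```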
-
Maciej Sumalvico authored
- setup_spacy() and parse_arguments() as separate functions - more readable formatting
-
- Jan 16, 2019
Maciej Sumalvico authored
- Isolated parsing command-line arguments as a separate function.
- More readable formatting.
-
Maciej Sumalvico authored
- Refactoring: structured the body of merge_states() into smaller subfunctions.
- Crashes were caused by deleting transitions from the transducer while iterating over it. In the updated version, the transitions to delete are first identified and then deleted.
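The underlying pattern of the fix, sketched on a plain Python list rather than the actual HFST transition API: never delete from a container while iterating over it; collect the victims first, then delete.

```python
# Buggy variant: mutating while iterating skips elements or crashes in many APIs.
#   for t in transitions:
#       if is_bad(t): transitions.remove(t)

def prune(transitions, is_bad):
    """Two-phase removal: identify first, then delete."""
    to_delete = [t for t in transitions if is_bad(t)]
    for t in to_delete:
        transitions.remove(t)
    return transitions

print(prune([1, 2, 3, 4], lambda t: t % 2 == 0))  # [1, 3]
```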
-
Maciej Sumalvico authored
reason: easier debugging
-
- Dec 15, 2018
Robert Sachunsky authored
-
Robert Sachunsky authored
amend f25d826d7: in processor, when splitting input into symbols and flags, fix a new crash with decomposed/combining characters
-
Robert Sachunsky authored
-
Robert Sachunsky authored
replace the slow and memory/stack-devouring alignment.sequencealigner with the fast difflib.SequenceMatcher in evaluation too
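difflib.SequenceMatcher ships with the standard library and provides similar character alignments far more cheaply. A minimal illustration on two strings:

```python
from difflib import SequenceMatcher

ocr, gt = 'Vnd die Sonne', 'Und die Sonne'
matcher = SequenceMatcher(None, ocr, gt, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, repr(ocr[i1:i2]), '->', repr(gt[j1:j2]))
# replace 'V' -> 'U'
# equal 'nd die Sonne' -> 'nd die Sonne'
```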
-
- Dec 14, 2018
Robert Sachunsky authored
allow training error transducer on CSV file instead of directory, plus Python 2 compatibility and some Pylint cosmetics
-
- Nov 21, 2018
Robert Sachunsky authored
-
Robert Sachunsky authored
-
Robert Sachunsky authored
With REJECTION_WEIGHT set to -1, behave differently:
- pyComposition (if enabled) does no backoff_result()
- the union with the input (and the input weight mapping) is instead done afterwards in compose_and_search, but with a long vector of sensible thresholds, each applied alternatively to the same window result (i.e. without having to rerun everything)
- so all results up the call graph now have to be vectorized as well: compose_and_search, create_result_transducer, window_size_1_2, main/process; in the normal case, there is only one value in that vector
-
Robert Sachunsky authored
- catch and show exceptions among pool workers
- at the end, exit with failure if exceptions occurred
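One common way to surface worker exceptions from a process pool (a sketch of the pattern, not necessarily how this repository does it): let each worker return either its result or the caught exception, then fail at the end if any exception was seen.

```python
import sys

def safe_worker(func, arg):
    """Run func(arg), returning (result, None) or (None, exception)."""
    try:
        return func(arg), None
    except Exception as e:  # show, don't swallow
        return None, e

def run_all(func, args):
    # with a real pool: pool.starmap(safe_worker, [(func, a) for a in args])
    outcomes = [safe_worker(func, a) for a in args]
    errors = [e for _, e in outcomes if e is not None]
    for e in errors:
        print('worker failed:', e, file=sys.stderr)
    if errors:
        sys.exit(1)  # exit with failure if any worker raised
    return [r for r, _ in outcomes]

print(run_all(lambda x: 10 // x, [1, 2, 5]))  # [10, 5, 2]
```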
-
Robert Sachunsky authored
- normalize Unicode strings to normal form, and respect remaining decomposed characters when splitting input into symbols and flags (fixes crashes)
- amend 4ebb55e5: without OpenFST, as_transducer can still be True already
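The standard-library core of such a fix might look like this sketch (the processor's actual symbol/flag splitting is more involved): normalize first, then keep any residual combining marks glued to their base character instead of treating them as symbols of their own.

```python
import unicodedata

def split_symbols(text):
    """Split text into symbols, keeping combining marks with their base."""
    text = unicodedata.normalize('NFC', text)  # recompose where possible
    symbols = []
    for ch in text:
        if symbols and unicodedata.combining(ch):
            symbols[-1] += ch  # e.g. U+0364 stays glued to the letter before it
        else:
            symbols.append(ch)
    return symbols

print(split_symbols('a\u0308nder'))  # NFC recomposes: ['ä', 'n', 'd', 'e', 'r']
print(split_symbols('u\u0364ber'))   # no precomposed form: ['u\u0364', 'b', 'e', 'r']
```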
-
Robert Sachunsky authored
-
- Nov 16, 2018
Robert Sachunsky authored
during lexicon extraction, also add '/' as infix and suffix to Spacy's tokenizer
-
Robert Sachunsky authored
Fix transducer definitions:
- when repeating the lexicon transducer according to words_per_window, the last token takes a space character as well
- further repair the inter-word/LM lexicon model:
  - the last token also needs a flag acceptor (and a space)
  - edits deleting a space should delete the corresponding flag in this model too
-
Robert Sachunsky authored
- allow (large) input files with more than 1 line
- use generators (strip lines and split at newlines)
- prune the lexicon with a combined absolute (<= 3) and relative (< 1e-5) frequency threshold
- extend number normalization to numerals with decimal points and thousands separators
- normalize umlauts to always use the decomposed form with the diacritical combining e
- speed up by disabling the parser and NER in Spacy
- add '—' as an infix to Spacy's tokenizer
- add a CLI; make available as parameters: dictionary path, GT suffix
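The combined pruning threshold can be sketched like this (the function name is illustrative, and how exactly the two conditions combine is an assumption; this version keeps an entry only if it passes both):

```python
def prune_lexicon(freq, min_count=3, min_relative=1e-5):
    """Drop entries failing the absolute (count <= 3) or relative (< 1e-5) threshold."""
    total = sum(freq.values())
    return {word: count for word, count in freq.items()
            if count > min_count and count / total >= min_relative}

freq = {'und': 999000, 'der': 990, 'tpyo': 2, 'seltn': 8}
print(prune_lexicon(freq))  # {'und': 999000, 'der': 990}
```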
-
Robert Sachunsky authored
- when extending the lexicon transducer according to composition_depth, do not ignore upper/lower case completely, but ensure that non-first words are downcased (with infix/zero connection) or all upper case (with hyphen connection), and that first words are upcased or already upper case
- when extending the lexicon transducer with morphology, compose *after* compounds have been added
- when using the lexicon transducer, make sure to allow both precomposed umlauts and decomposed ones (with the diacritical combining e); also, ensure the final lexicon becomes just an acceptor
- when repeating the lexicon transducer according to words_per_window, use 1 to N instead of 0 to N (optionalized lexicon), but make sure the last (1) token has no space
- repair the previously defunct inter-word/LM lexicon model:
  - by stripping the initial space from the loaded punctuation_right_transducer
  - by correctly synchronizing on flags
-
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
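With the standard tempfile module, recognizable filename patterns plus guaranteed cleanup look roughly like this (a generic sketch of the pattern; the prefix and payload are made up, not the repository's actual wrapper):

```python
import os
import tempfile

fd, path = tempfile.mkstemp(prefix='cor-asv-fst-', suffix='.fst')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'...serialized transducer...')
    # hand `path` to the external OpenFST call here
finally:
    os.unlink(path)  # do not forget to clean up afterwards

print(os.path.exists(path))  # False
```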
-
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
-
Robert Sachunsky authored
-
Robert Sachunsky authored
when combining windows, search for the next existing flag instead of blindly assuming the next counting flag always remains in some path even after word merge (fixes crash)
-