Commits · 1d20a16b1e6751dda7a275c39282c39b5aa6e2e4 · ocr-d / cor-asv-fst

Feb 15, 2019
- minor: changed the hard-coded lexicon filename · 1d20a16b
  Maciej Sumalvico authored 6 years ago
  
  1d20a16b
Feb 14, 2019
- use the new sliding window algorithm in process_test_data.py · 997196f6
  Maciej Sumalvico authored 6 years ago
  
  997196f6
- minor: removed an obsolete comment · 35ced29d
  Maciej Sumalvico authored 6 years ago
  
  35ced29d
- Fixed a bug causing weird output symbols · d09066d7
  Maciej Sumalvico authored 6 years ago
  
  sliding_window_no_flags.process_window_with_openfst() now works properly
  d09066d7
Feb 13, 2019

A new implementation of the sliding window algorithm · ca94c7d6

- much cleaner and smaller code
- without flag diacritics and state merging
- the windows are recombined by adding transitions between consecutive
  windows (see the PDF documentation for details)
- test_sliding_window_no_flags.py is a temporary testing script (intended to be
  removed after the module is integrated into main processing)
- known issues:
  - process_window_with_openfst() doesn't work - the composition returns a
    transducer accepting garbage paths
  - process_window_with_hfst() is very slow
  - not yet integrated into process_test_data.py

ca94c7d6

Feb 11, 2019
- Removed sliding_window.write_fst() · f70e845a
  Maciej Sumalvico authored 6 years ago
  
  (duplicate of helper.save_transducer())
  f70e845a
- updated the comments in lexicon_transducer.main() · 435fbfbd
  Maciej Sumalvico authored 6 years ago
  
  435fbfbd
Feb 07, 2019

Changed the directory structure · beacfab5

Maciej Sumalvico authored 6 years ago

- hfst/ -> ./
- cython/ -> extensions/
- currently unused files moved to __DEPRECATED/
- added .gitignore

beacfab5

Removed non-code from the repo · 8cbc6135

Maciej Sumalvico authored 6 years ago

- hfst/
  - ASSE data removed
- open-fst/
  - transducers removed (most are duplicates of the ones stored in
    cor-asv-fst-models; others, like the ASSE lexicon, can be trained if
    needed; shouldn't clutter the repo)
  - the report about evaluation experiments moved to doc-fst/evaluation-openfst
  - gesamt_dta_spaces.syms removed

8cbc6135

Merge branch 'fst-refactoring-2' · 3ac78858
Maciej Sumalvico authored 6 years ago

3ac78858

Feb 05, 2019

Refactoring and bugfixes in the precision/recall metric · cd07bb2b

Maciej Sumalvico authored 6 years ago

Changes in get_precision_recall():
- Refactoring: separating the functionality of merging alignments from scoring.
- Changed the definition of true/false and positive/negative. The characters
  that are originally wrong and wrongly corrected are now false positives
  (previously: false negatives). This also changes the evaluation results quite
  significantly!
- Bugfix: consider only one-to-one or one-to-zero alignments, but not
  one-to-many. This also changes the results.
- Code cleaning.

cd07bb2b
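
The changed definition can be illustrated with a minimal sketch. It assumes the alignment has already been reduced to (OCR, corrected, ground truth) character triples; the function names and the data layout are hypothetical and not taken from get_precision_recall().

```python
def count_outcomes(triples):
    """Count outcomes over aligned (ocr, corrected, gt) character triples.

    An empty string can stand for "no character" (one-to-zero alignment).
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for ocr, cor, gt in triples:
        if cor == ocr:
            # The corrector left the character unchanged.
            counts["tn" if ocr == gt else "fn"] += 1
        elif cor == gt:
            # Changed, and the change matches the ground truth.
            counts["tp"] += 1
        else:
            # Changed to something wrong. This includes characters that were
            # already wrong in the OCR and were corrected wrongly.
            counts["fp"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); a zero denominator yields 0.0."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


triples = [
    ("a", "a", "a"),  # correct and untouched
    ("c", "o", "o"),  # wrong, corrected properly
    ("n", "m", "u"),  # wrong, corrected wrongly
    ("e", "c", "e"),  # correct, but changed
    ("r", "r", "n"),  # wrong and untouched
]
counts = count_outcomes(triples)
precision, recall = precision_recall(counts)
```

Under the previous definition, the third triple would have counted as a false negative, which lowers recall instead of precision.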

Feb 01, 2019

minor bugfix · 9fce122f

Maciej Sumalvico authored 6 years ago

`process` was renamed to `correct_string` in 2fc4eb20, but not in line 235.
This commit fixes it.

9fce122f

Jan 30, 2019

Refactored evaluate_correction.main() · 1a76c78c

Maciej Sumalvico authored 6 years ago

- command-line argument parsing moved into a separate function
- computing each evaluation metric over all lines moved into a separate
  function, so that the logic of main() reduces to a simple three-way `if`
- increased the spacing between top-level declarations to
  two blank lines (PEP 8)
- added the -G parameter for providing the ground truth suffix

1a76c78c

Refactoring in process_test_data.correct_string() · 1a2a7ea0

Maciej Sumalvico authored 6 years ago

Isolated some activities into subfunctions for better structuring.

Changed the logging level of showing input/output strings from "info" to
"debug".

1a2a7ea0

Refactoring in process_test_data.py · d8fbb535

Maciej Sumalvico authored 6 years ago

- grouped the globals into two dictionaries: `gl_config` and `model`
- renamed `process` to `correct_string`
- renamed `load_model` to `build_model` (does other things apart from loading)
- isolated some functionality from the `main` function
  - parallel processing of strings -> `parallel_process`
  - printing results -> `print_results`
  - building transducer composition, flag encoder and loading LM transducers ->
    `build_model`
- minimized the availability of globals to increase readability and avoid bugs
  - globals are only visible in `main()` and `correct_string()`, but not in any
    subfunctions that `main()` calls
  - instead of passing `args` (the argument parser) as a global, the dictionary
    `gl_config` is used, which contains only the values used by
    `correct_string()`

d8fbb535

Jan 29, 2019

Refactored the model building functions · 7f42b9c4

Maciej Sumalvico authored 6 years ago

The model-building functions in sliding_window.py (load_transducers_*())
previously contained three kinds of functionalities:
- loading transducers
- variant-specific combining of transducers to a single token acceptor
- variant-independent functionality, which is copy-pasted in all three
  functions (adding flags, converting a single token acceptor to a window etc.)

This commit isolates the variant-specific functionality into smaller
functions build_single_token_acceptor_*(), combines the variant-independent
parts for all three variants in the function build_model() and puts the loading
of transducers outside of the `sliding_window` module.

Furthermore:
- renamed process_test_data.load_transducers() to load_model

7f42b9c4

Refactoring of process_test_data.py · 94f3d9ef

Maciej Sumalvico authored 6 years ago

- isolated loading transducers into a separate function
- isolated preparing the composition of lexicon and model into a separate
  function
- moved process() before main()
- cleaned up commented-out code, old file names etc.

94f3d9ef

Jan 22, 2019
- Replaced '@_EPSILON_SYMBOL_@' with hfst.EPSILON. · e28502b3
  Maciej Sumalvico authored 6 years ago
  
  e28502b3
Jan 21, 2019

Refactored lexicon_transducer.create_lexicon() · 631a3627

Maciej Sumalvico authored 6 years ago

- divided the body into smaller functions
- fixed (in line 140) a bug causing types that are identical in capitalized and
  uncapitalized form (like '—') to be counted double

631a3627
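
A minimal sketch of how such double counting can arise and be avoided. It assumes that the lexicon stores every type together with its capitalized variant; the actual logic of create_lexicon() is not shown in this log, and the function name below is hypothetical.

```python
from collections import Counter


def add_with_capitalized(freqs):
    """Return a lexicon containing each type and its capitalized variant.

    A type whose capitalized form is identical to itself is added only
    once, so its frequency is not doubled.
    """
    lexicon = Counter()
    for word, freq in freqs.items():
        # A set keeps one copy when both variants are the same string.
        for variant in {word, word.capitalize()}:
            lexicon[variant] += freq
    return lexicon


# "\u2014" is a dash, which has no distinct capitalized form.
lexicon = add_with_capitalized({"haus": 3, "\u2014": 2})
```

Looping over the list `[word, word.capitalize()]` instead of the set would add the frequency of the dash twice.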

Jan 18, 2019

Finished refactoring of error_transducer.main() · e446f4d1
Maciej Sumalvico authored 6 years ago

- refactored the reading of training data

e446f4d1

Refactored error_transducer.main() · 03951838

Maciej Sumalvico authored 6 years ago

- functionalities isolated into separate functions:
  - creating a single error transducer
  - combining error transducers
- fixed a bug causing only context = 3 to be considered (line 467 pre-commit,
  previously line 41 in error_transducer_complete.py)
- simplified transducer creation

03951838

Merged error transducer creating scripts · d25d8840

Maciej Sumalvico authored 6 years ago

Merged `error_transducer_complete.py` into `error_transducer.py`, so that one
module is responsible for training an error model.

d25d8840

Some refactoring in error_transducer.py · b9778286

Maciej Sumalvico authored 6 years ago

- isolated parse_arguments() as a separate function
- added a -G parameter (gt_suffix) instead of a fixed suffix
- removed some unnecessary comments

b9778286
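
A minimal sketch of such a function, using the standard argparse module. Only the name parse_arguments(), the -G flag and the gt_suffix destination come from the commit message; the long option name, the default value and the description are assumptions made for illustration.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the command line; with argv=None, sys.argv[1:] is used."""
    parser = argparse.ArgumentParser(description="Train an error model.")
    parser.add_argument(
        "-G", "--gt-suffix", dest="gt_suffix",
        default="gt.txt",  # hypothetical default
        help="suffix of the ground truth files")
    return parser.parse_args(argv)


args = parse_arguments(["-G", "gt.txt"])
```

Accepting an optional argument list makes the function easy to call from tests without touching sys.argv.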

Merged lexicon building scripts · a36d24fc

Maciej Sumalvico authored 6 years ago

process_dta_data.py was merged into lexicon_transducer.py so that only a single
module is responsible for building the lexicon.

The lexica are no longer saved as plaintext. This information can be easily
obtained with `hfst-fst2strings -w`.

helper.py:
- the logarithm of frequencies is computed in the normalizing function, rather
  than during writing to file

a36d24fc
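
The helper.py change can be sketched as follows. The sketch assumes that normalizing means converting counts to relative frequencies and that the weight is the negative natural logarithm, as is common for weighted transducers; neither detail is confirmed by the commit message, and the function name is hypothetical.

```python
import math


def normalize_frequencies(counts):
    """Map raw counts to negative log relative frequencies."""
    total = sum(counts.values())
    # The logarithm is taken here, so code that writes the lexicon can use
    # the values as they are.
    return {word: -math.log(count / total) for word, count in counts.items()}


weights = normalize_frequencies({"und": 3, "vnd": 1})
```

With this convention, more frequent types receive smaller weights.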

Jan 17, 2019

Removed model files and changed hard-coded names. · e073c652

Maciej Sumalvico authored 6 years ago

- Model files are moved to a separate repository ('cor-asv-fst-models').
  As a temporary solution, the directory 'hfst/fst' has to be linked to the
  location of the model repository so that the hard-coded paths to model files
  work.
- Changed the hard-coded model file names in process_test_data.py to match the
  names of files created by the training scripts.

e073c652

Better path handling · d1de0d99

Maciej Sumalvico authored 6 years ago

- using os.path.join() instead of string concatenation
- removed (useless) trailing slashes from directory names

Minor changes:
- rename: x -> filename in helper.generate_content()

d1de0d99
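
A short illustration of the difference, with hypothetical directory and file names:

```python
import os.path


def model_path(directory, filename):
    """Build a path without relying on a trailing slash in `directory`."""
    return os.path.join(directory, filename)


# String concatenation only works if the directory ends with a separator:
broken = "models" + "lexicon.hfst"
# os.path.join() inserts the separator itself:
path = model_path("models", "lexicon.hfst")
```

Because os.path.join() supplies the separator, directory names can be stored without trailing slashes.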

Substituted '\u0364' for U+0364. · 8dd7cded

Maciej Sumalvico authored 6 years ago

The character U+0364 (combining latin small letter e):
- is invisible in some terminal fonts,
- breaks syntax highlighting in Vim.

8dd7cded
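
In Python source, the escape sequence denotes the same one-character string as the literal character, so the substitution does not change behavior. A small check using the standard unicodedata module:

```python
import unicodedata

combining_e = "\u0364"  # escape sequence instead of the literal character

# The escape yields a single code point.
length = len(combining_e)
# The official Unicode character name.
name = unicodedata.name(combining_e)
# A nonzero combining class marks a combining character.
is_combining = unicodedata.combining(combining_e) != 0
```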

Refactored process_dta_data.py · dabf0d2f

Maciej Sumalvico authored 6 years ago

- setup_spacy() and parse_arguments() as separate functions
- more readable formatting

dabf0d2f

Jan 16, 2019

Refactored parsing command-line arguments. · 7375f047
Maciej Sumalvico authored 6 years ago

- Isolated parsing command-line arguments as a separate function.
- More readable formatting.

7375f047

Refactored sliding_window.merge_states() and fixed a bug causing crashes. · ff832931

Maciej Sumalvico authored 6 years ago

- Refactoring: structured the body of merge_states() into smaller subfunctions.
- Crashes were caused by deleting transitions from the transducer while
  iterating over it. In the updated version, the transitions to delete are
  first identified and then deleted.

ff832931
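
The fix follows a general two-pass pattern, sketched here on a plain dictionary of transitions. The data structure and the function name are assumptions made for illustration; this is not the HFST API and not the code of merge_states().

```python
def remove_transitions(transitions, should_delete):
    """Delete matching transitions without mutating during iteration.

    `transitions` maps a source state to a list of (symbol, target) pairs.
    """
    # Pass 1: only identify what has to go.
    to_delete = [
        (state, arc)
        for state, arcs in transitions.items()
        for arc in arcs
        if should_delete(state, arc)
    ]
    # Pass 2: delete, now that no iteration over `transitions` is active.
    for state, arc in to_delete:
        transitions[state].remove(arc)
    return transitions


fst = {0: [("a", 1), ("b", 2)], 1: [("c", 2)]}
remove_transitions(fst, lambda state, arc: arc[1] == 2)
```

Removing elements inside the first loop would change the lists while they are being traversed, which is the kind of error the commit describes.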

Do not start the multiprocessing machinery for one process. · 48c4b77e
Maciej Sumalvico authored 6 years ago

reason: easier debugging

48c4b77e
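
A minimal sketch of this shortcut, assuming a helper that receives the number of processes as a parameter. The signature is hypothetical; only the idea of bypassing multiprocessing for a single process comes from the commit.

```python
import multiprocessing


def parallel_process(function, items, processes=1):
    """Apply `function` to every item, using a pool only when needed."""
    if processes == 1:
        # Plain loop in the current process: breakpoints and tracebacks
        # behave as usual.
        return [function(item) for item in items]
    with multiprocessing.Pool(processes) as pool:
        return pool.map(function, items)


results = parallel_process(str.upper, ["ab", "cd"])
```

With more than one process, `function` has to be picklable, for example a function defined at module level.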

Jan 08, 2019
- added Apache 2.0 (required by OCR-D) · de5f0f6a
  Robert Sachunsky authored 6 years ago
  
  de5f0f6a
Dec 15, 2018
- renamed pre-built lexicon transducers for dta19-reduced, added same for dta19 · 137a2e50
  Robert Sachunsky authored 6 years ago
  
  137a2e50
- amend f25d826d7: in processor, when splitting input into symbols and flags,... · af52ba8e
  Robert Sachunsky authored 6 years ago
  
  amend f25d826d7: in processor, when splitting input into symbols and flags, fix a new crash with decomposed/combining characters
  af52ba8e
- force GPU for spaCy if requested, add comments on further problematic cases · a8c94e90
  Robert Sachunsky authored 6 years ago
  
  a8c94e90
- replace slow and memory/stack-devouring alignment.sequencealigner by fast... · 29aa54ae
  Robert Sachunsky authored 6 years ago
  
  replace slow and memory/stack-devouring alignment.sequencealigner by fast difflib.SequenceMatcher in evaluation too
  29aa54ae
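
  A minimal sketch of a character alignment with difflib.SequenceMatcher from the standard library. The function name and the choice to return only the differing spans are assumptions for illustration, not the code used in the evaluation.

```python
from difflib import SequenceMatcher


def align(ocr, gt):
    """Return (ocr_part, gt_part) pairs for the spans that differ."""
    # autojunk=False disables the heuristic that treats frequent
    # characters as junk in longer sequences.
    matcher = SequenceMatcher(None, ocr, gt, autojunk=False)
    return [
        (ocr[i1:i2], gt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


differences = align("hcuse", "house")
```

  The opcodes "replace", "delete" and "insert" cover substitutions, spurious characters and missing characters, respectively.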
Dec 14, 2018
- allow training error transducer on CSV file instead of directory, plus Python... · 5b01d251
  Robert Sachunsky authored 6 years ago
  
  allow training error transducer on CSV file instead of directory, plus Python 2 compatibility and some Pylint cosmetics
  5b01d251
Nov 21, 2018

add option to use GPU for spaCy · 2095714e
Robert Sachunsky authored 6 years ago

2095714e
add precision-recall evaluation metric · e094bb2f
Robert Sachunsky authored 6 years ago

e094bb2f

allow calling with rejection weight -1 for efficient ROC measurements · 68a367b5

Robert Sachunsky authored 6 years ago

with REJECTION_WEIGHT set to -1, behave differently:
- pyComposition (if enabled) does no backoff_result()
- union with input (and input weight mapping) is instead
  done in compose_and_search afterwards, but with a long
  vector of sensible thresholds, each applied in turn to
  the same window result (i.e. without having to rerun
  everything)
- so all results up the call graph now have to be vectorized
  as well: compose_and_search, create_result_transducer,
  window_size_1_2, main/process;
  in the normal case, there is only 1 value in that vector

68a367b5