Commits · 2428cca3e14b6ee28a25ba96049f29981f408d5f · ocr-d / cor-asv-fst

Aug 22, 2020
- 0.2.1 · 2428cca3
  Robert Sachunsky authored 4 years ago
  
  2428cca3
- Merge pull request #5 from kba/fileids-and-such · de4ca24b
  Robert Sachunsky authored 4 years ago
  
  Fileids and such
  
  de4ca24b
- Update requirements.txt · 693cb240
  Robert Sachunsky authored 4 years ago
  
  693cb240
- wrapper.decode: set pageId, use .xml suffix · 6ade1155
  Robert Sachunsky authored 4 years ago
  
  6ade1155
- make test: warn that this does not do much · 4d400838
  Robert Sachunsky authored 4 years ago
  
  4d400838
Aug 21, 2020
- FSTCorrection: use make_file_id and assert_file_grp_cardinality · 0c610561
  Konstantin Baierer authored 4 years ago
  
  0c610561
- make test · bc56e7ee
  Konstantin Baierer authored 4 years ago
  
  bc56e7ee
Jan 08, 2020
- add deps-ubuntu, build OpenFST locally to satisfy pynini requirements · bc6d84c1
  Robert Schubert authored 5 years ago and Robert Sachunsky committed 5 years ago
  
  bc6d84c1
- cannot trust pynini from PyPI · e71f00e0
  Robert Schubert authored 5 years ago
  
  e71f00e0
Nov 17, 2019

fix parameters and requirements: · a0bc5f9f

- add pynini dependency
- restrict parameter textequiv_level to word
  (which is the only one currently supported),
  remove default (glyph)
- add the parameter rejection_weight, which was
  referenced but did not exist
- rename beam_width in FST part to pruning_weight
  (parallel to rejection_weight), add description
- replace the reference to the parameter lm_beam_width
  (which did not exist) with beam_width
- add FIXMEs for things that are apparently broken now
- add description of the OCR-D processor's behaviour
  to its docstring, improve README

a0bc5f9f

fix OCR-D interfaces: · 85cc3c95

Robert Sachunsky authored 5 years ago

- remove exception when calling --help or -J
- use page_from_file properly
- use correct attributes for MetadataItem

85cc3c95

Nov 16, 2019
- add cor-asv-fst-models subrepo · 08b3f176
  Robert Sachunsky authored 5 years ago
  
  08b3f176
- add symlink to tool json · 147bc5fc
  Robert Sachunsky authored 5 years ago
  
  147bc5fc
Jul 23, 2019

Integrated the ST error model with the current code · 63c2ec3e

Maciej Sumalvico authored 5 years ago

- reimplemented the FST compilation for Pynini
- plugged the training facility of the ST error model to the CLI
  (`cor-asv-train -T st`)
- removed deprecated code

63c2ec3e

Jul 22, 2019

Training lexicon from a corpus or wordlist · 36429e9b

Maciej Sumalvico authored 5 years ago

Implemented the CLI parameters `-c` and `-w` for `cor-asv-fst-train`, which pass
a corpus and a list of words with frequencies, respectively, to be included in
the lexicon.
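
As a rough illustration of what collecting lexicon entries from these two sources could look like, here is a minimal sketch. It is not the code from `scripts.train`: the function names are hypothetical, and the tab-separated `word<TAB>frequency` layout of the wordlist is an assumption.

```python
from collections import Counter


def counts_from_corpus(lines):
    """Count whitespace-separated tokens in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def counts_from_wordlist(lines, sep="\t"):
    """Read 'word<sep>frequency' entries (this layout is an assumption)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        word, freq = line.rsplit(sep, 1)
        counts[word] += int(freq)
    return counts


corpus = ["der Hund und der Baum", "der Baum"]
wordlist = ["Hund\t10", "Katze\t3"]

# Entries from both sources are merged by adding up their frequencies.
lexicon_counts = counts_from_corpus(corpus) + counts_from_wordlist(wordlist)
```

Merging by addition means a word that occurs in both sources simply gets a higher frequency.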

36429e9b

mention ocrd_keraslm in the readme · 013f2866
Maciej Sumalvico authored 5 years ago

013f2866

Added docstrings and some cleaning · bc6dbbb7

Maciej Sumalvico authored 5 years ago

- removed deprecated functions
- fixed docstring formatting at some places
- described the pynini dependency in README

bc6dbbb7

Jul 19, 2019

CLI parameter --unweighted-lexicon · 468d2232

Maciej Sumalvico authored 5 years ago

Passing this parameter on training creates an unweighted lexicon FST. This
should be done when combining the FST model with a language model (like
`keraslm`), because a weighted lexicon is itself a (unigram) language model.
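
The difference between the two lexicon variants can be sketched in plain Python. The helper below is hypothetical, and the use of negative log relative frequencies as weights is an assumption made for illustration, not something confirmed by this commit.

```python
import math


def lexicon_weights(counts, weighted=True):
    """Map each word to a path weight (lower means more likely).

    Assumption: a weighted lexicon uses negative log relative frequencies;
    an unweighted one assigns the same weight (0.0) to every word.
    """
    if not weighted:
        return {word: 0.0 for word in counts}
    total = sum(counts.values())
    return {word: -math.log(n / total) for word, n in counts.items()}


counts = {"und": 6, "Hund": 3, "Hand": 1}
weighted = lexicon_weights(counts)
unweighted = lexicon_weights(counts, weighted=False)
```

With weights, frequent words are preferred regardless of context, which is exactly what a unigram language model does. Without weights, the lexicon only decides which words exist, and the ranking is left to the separate language model.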

468d2232

Jul 18, 2019
- rename the PageXML processor class and restore its functionality · 1cf74440
  Maciej Sumalvico authored 5 years ago
  
  - rename: `wrapper.FSTCorrection` -> `wrapper.PageXMLProcessor`
  - use `lib.FSTLatticeGenerator` instead of a tuple of FSTs
  1cf74440
- moved all the content of `lib.sliding_window` to `lib.latticegen` · 7ae5c065
  Maciej Sumalvico authored 5 years ago
  
  7ae5c065
Jul 17, 2019
- integrate the ocrd_keraslm language model into plaintext processing · 640f57c3
  Maciej Sumalvico authored 5 years ago
  
  640f57c3
- class `FSTLatticeGenerator` for generating the hypotheses · 45f9106c
  Maciej Sumalvico authored 5 years ago
  
  45f9106c
- Introduced the class `scripts.process.PlaintextProcessor` · fd801a0c
  Maciej Sumalvico authored 5 years ago
  
  - all globals are contained in the PlaintextProcessor object
  - no need for separating `gl_config` and `model` and passing the FST model around as a tuple of transducers
  fd801a0c
Jul 16, 2019

`wrapper.decode.FSTCorrection` rewritten · 81dd2c0c

Maciej Sumalvico authored 5 years ago

- use the current library (FST-based decoding) for generating the hypotheses
  graph
- update to `ocrd` v1.0.0b5
- refactoring

81dd2c0c

Apr 24, 2019

clean up some deprecated parameters · 8e1cb64e

Maciej Sumalvico authored 5 years ago

- remove the deprecated CLI parameters:
  - `apply_lm` - unused for a long time
  - `num_results` - replaced with `beam_width` after the switch to pynini
- removed passing some unnecessary parameters to
  `scripts.process.prepare_model()`

8e1cb64e

Apr 17, 2019
- made the beam width configurable · 85b226f6
  Maciej Sumalvico authored 5 years ago
  
  85b226f6
Apr 12, 2019
- Removed the script for running comparisons · 046e1962
  Maciej Sumalvico authored 5 years ago
  
  (no longer compatible with the current CLI)
  046e1962
Apr 11, 2019

Removed the NLTK dependency · 997492f2
Maciej Sumalvico authored 5 years ago

NLTK was only used for computing character n-grams from strings.

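
Computing character n-grams does not need a library. A minimal sketch of a standard-library replacement follows; the function name is hypothetical and this is not necessarily how the project implemented it.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("Wort", 2))  # ['Wo', 'or', 'rt']
```
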
997492f2

Removed the `alignment` dependency and related code · bb45e3a4

Maciej Sumalvico authored 5 years ago

The related code had been commented out for a long time anyway. Some
neighboring commented-out code was removed as well.

bb45e3a4

removed unnecessary imports · 317e1b36
Maciej Sumalvico authored 5 years ago

317e1b36

Removed the HFST dependency and HFST-related code · ff0186e7

Maciej Sumalvico authored 5 years ago

Also removed `helper.create_dict()` (doesn't use HFST, but was obsolete
anyway).

Furthermore, removed `scripts.process.prepare_composition()` (was no longer in
use, forgot to remove it with b5b1fd67).

ff0186e7

Remove the directory `lib.__DEPRECATED__` · 2a31d2ef

Maciej Sumalvico authored 5 years ago

The code there is HFST-dependent and so obsolete that it is no longer relevant
for further development.

2a31d2ef

Remove the CLIs from `lib.lexicon` and `lib.error_simp` · 2611c59a

Maciej Sumalvico authored 5 years ago

The CLIs were no longer used, since `scripts.train` is used for training. They
were becoming increasingly outdated (especially after switching away from
HFST).

2611c59a

Removed the C++ extension + version bump · bb27f36f

Maciej Sumalvico authored 5 years ago

The Cython extension for computing the FST composition is no longer needed.

The version number was increased to 0.2.0 as this is quite an important change.

bb27f36f

Implemented the FST processing using Pynini · bae760f9

Maciej Sumalvico authored 5 years ago

The back-end for processing FSTs was changed from HFST to Pynini. The
functionality implemented so far is:
- lexicon training
- simple error model training
- processing plain text
  - window recombination using `pynini.replace()`

Further related changes:
- as Pynini does not support the `n_best()` method, beam search will be used
  instead - the hypotheses are pruned to those within `beam_width` weight
  from the best one *after each composition*, i.e. first after the composition
  with the error model and then once again after the composition with the
  lexicon (in order to keep a manageable size of the hypotheses FST);
  currently, `beam_width` is hardcoded to `5`, but it should be made a
  parameter; lower values allow for faster execution times, but may miss some
  corrections
- removed the parameter `frequency_class` from
  `lib.error_simp.transducer_from_list()` (never used)
- the behavior of `rejection_weight` was implemented to mimic the one in the
  Cython extension - i.e. the rejection weight of a word is
  `rejection_weight*(len(word)+2)`. The `+2` originally comes from the "flag"
  transitions, but turned out to be useful by preventing the rejection of short
  words.
- added a test suite (to be extended later)
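
The pruning rule and the rejection weight formula described above can be illustrated without Pynini. In this sketch a plain dict from hypothesis string to weight stands in for the hypotheses FST; both function names are hypothetical, and only the formula `rejection_weight*(len(word)+2)` is taken from the text.

```python
def prune_to_beam(hypotheses, beam_width):
    """Keep hypotheses whose weight is within `beam_width` of the best one.

    `hypotheses` maps a string to its weight; lower weights are better.
    """
    if not hypotheses:
        return {}
    best = min(hypotheses.values())
    return {h: w for h, w in hypotheses.items() if w <= best + beam_width}


def word_rejection_weight(word, rejection_weight):
    """Per-word rejection weight, following the formula quoted above."""
    return rejection_weight * (len(word) + 2)


hypotheses = {"Hand": 1.5, "Hund": 2.0, "Band": 7.0}
kept = prune_to_beam(hypotheses, beam_width=5)  # drops "Band" (7.0 > 1.5 + 5)

short = word_rejection_weight("im", 1.5)          # 1.5 * (2 + 2)
long_ = word_rejection_weight("Handschrift", 1.5)  # 1.5 * (11 + 2)
```

A smaller `beam_width` leaves fewer hypotheses for the next composition, which is faster but can discard the correct reading. The `+2` term keeps the rejection weight of short words from becoming very small.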

No longer required:
- the Cython extension
- passing temporary files between the Python and the C++ part
- the HFST dependency (except for `error_st`, which is currently incompatible
  with the rest)

Remaining issues:
- switch the ST error model implementation to use pynini
- code cleaning: remove unused dependencies and deprecated code (esp. the
  HFST-related parts)
- restore some functionality that was temporarily removed to simplify the
  transition
  - special rules for digits and umlauts in the lexicon
  - compounds in the lexicon
- make `beam_width` a free parameter
- unit tests

bae760f9

Apr 08, 2019

removed the HFST import in `scripts.process` · 5a40df14

Maciej Sumalvico authored 5 years ago

- the functionality of finding the shortest path in the lattice was moved to
  `lib.sliding_window.lattice_shortest_path()`
- also removed some deprecated code from `scripts.process.correct_string()`
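
For readers unfamiliar with the idea, finding the shortest path in a lattice can be sketched in plain Python as follows. This is not the code of `lib.sliding_window.lattice_shortest_path()`; the adjacency-list representation and the function name are assumptions chosen for illustration.

```python
import heapq


def shortest_path(lattice, start, final):
    """Return (total_weight, labels) of the cheapest path from start to final.

    `lattice` maps a node to a list of (next_node, label, weight) arcs with
    non-negative weights. Returns None if `final` cannot be reached.
    """
    queue = [(0.0, start, [])]
    done = set()
    while queue:
        weight, node, labels = heapq.heappop(queue)
        if node == final:
            return weight, labels
        if node in done:
            continue
        done.add(node)
        for next_node, label, arc_weight in lattice.get(node, []):
            if next_node not in done:
                heapq.heappush(
                    queue, (weight + arc_weight, next_node, labels + [label])
                )
    return None


# Two competing readings per position; weights are made up for the example.
lattice = {
    0: [(1, "der", 0.5), (1, "den", 1.0)],
    1: [(2, "Hund", 0.75), (2, "Hand", 0.25)],
    2: [],
}
result = shortest_path(lattice, 0, 2)
```

The sketch uses Dijkstra's algorithm, which requires non-negative arc weights.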

5a40df14

Mar 29, 2019
- works around #1 · 4fd2a2bf
  Maciej Sumalvico authored 5 years ago
  
  4fd2a2bf
- moved the modules containing entry points from `lib` to `scripts` · b10acf81
  Maciej Sumalvico authored 5 years ago
  
  b10acf81
- Adopt the new implementation of sliding window (without flags) · f5d339bd
  Maciej Sumalvico authored 5 years ago
  
  - rename `sliding_window_no_flags` -> `sliding_window`
  - move the old `sliding_window` module (with flags) to `__DEPRECATED__`
  - remove the flag-related code from `error_simp` (do not add flags to the error transducer on training)
  f5d339bd
- removed `test_sliding_window_no_flags` · bb954a9a
  Maciej Sumalvico authored 5 years ago
  
  - obsolete and no longer needed
  - should be replaced with real unit tests
  bb954a9a