- Aug 22, 2020
-
-
Robert Sachunsky authored
-
Robert Sachunsky authored
Fileids and such
-
Robert Sachunsky authored
-
Robert Sachunsky authored
-
Robert Sachunsky authored
-
- Aug 21, 2020
-
-
Konstantin Baierer authored
-
Konstantin Baierer authored
-
- Jan 08, 2020
-
-
-
Robert Schubert authored
-
- Nov 17, 2019
-
-
Robert Sachunsky authored
  - add pynini dependency
  - restrict parameter textequiv_level to word (the only level currently supported), remove the default (glyph)
  - add parameter for reference to rejection_weight (which did not exist)
  - rename beam_width in the FST part to pruning_weight (parallel to rejection_weight), add description
  - replace the parameter reference to lm_beam_width (which did not exist) with beam_width
  - add FIXMEs for things that are apparently broken now
  - add a description of the OCR-D processor's behaviour to its docstring, improve README
-
Robert Sachunsky authored
  - remove exception when calling --help or -J
  - use page_from_file properly
  - use correct attributes for MetadataItem
-
- Nov 16, 2019
-
-
Robert Sachunsky authored
-
Robert Sachunsky authored
-
- Jul 23, 2019
-
-
Maciej Sumalvico authored
  - reimplemented the FST compilation for Pynini
  - plugged the training facility of the ST error model into the CLI (`cor-asv-train -T st`)
  - removed deprecated code
-
- Jul 22, 2019
-
-
Maciej Sumalvico authored
Implemented the CLI parameters `-c` and `-w` for `cor-asv-fst-train`, which pass a corpus and a list of words with frequencies to be included in the lexicon.
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
  - removed deprecated functions
  - fixed docstring formatting in some places
  - described the pynini dependency in README
-
- Jul 19, 2019
-
-
Maciej Sumalvico authored
Passing this parameter during training creates an unweighted lexicon FST. This should be done when combining the FST model with a language model (like `keraslm`), because a weighted lexicon is itself a (unigram) language model.
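The point about a weighted lexicon being a unigram language model can be illustrated with a minimal sketch. It assumes that lexicon weights are negative log relative frequencies (a common convention for tropical-semiring FSTs, not confirmed by this commit); the function name `lexicon_weights` and its `weighted` flag are hypothetical and do not mirror the actual CLI parameter.

```python
import math
from collections import Counter

def lexicon_weights(word_counts, weighted=True):
    """Map each word to a lexicon weight.

    Assumption: a weighted lexicon uses -log(relative frequency),
    which is exactly a unigram language model score.
    An unweighted lexicon assigns 0.0 to every word, so it only
    encodes membership and leaves scoring to a separate LM.
    """
    total = sum(word_counts.values())
    if weighted:
        return {w: -math.log(c / total) for w, c in word_counts.items()}
    return {w: 0.0 for w in word_counts}

# Hypothetical toy counts, for illustration only.
counts = Counter({"der": 3, "Haus": 1})
weighted = lexicon_weights(counts, weighted=True)
unweighted = lexicon_weights(counts, weighted=False)
```

With the weighted variant, frequent words are cheaper than rare ones, so combining it with an external language model would count word frequency twice; the unweighted variant avoids that.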
-
- Jul 18, 2019
-
-
Maciej Sumalvico authored
  - rename: `wrapper.FSTCorrection` -> `wrapper.PageXMLProcessor`
  - use `lib.FSTLatticeGenerator` instead of a tuple of FSTs
-
Maciej Sumalvico authored
-
- Jul 17, 2019
-
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
  - all globals are contained in the PlaintextProcessor object
  - no need for separating `gl_config` and `model` and passing the FST model around as a tuple of transducers
-
- Jul 16, 2019
-
-
Maciej Sumalvico authored
  - use the current library (FST-based decoding) for generating the hypotheses graph
  - update to `ocrd` v1.0.0b5
  - refactoring
-
- Apr 24, 2019
-
-
Maciej Sumalvico authored
  - removed the deprecated CLI parameters:
    - `apply_lm`: unused for a long time
    - `num_results`: replaced with `beam_width` after the switch to pynini
  - stopped passing some unnecessary parameters to `scripts.process.prepare_model()`
-
- Apr 17, 2019
-
-
Maciej Sumalvico authored
-
- Apr 12, 2019
-
-
Maciej Sumalvico authored
(no longer compatible with the current CLI)
-
- Apr 11, 2019
-
-
Maciej Sumalvico authored
NLTK was only used for computing character n-grams from strings.
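For illustration, character n-grams can be computed with plain string slicing, which is why a dependency like NLTK is not needed for this task. This is a sketch of one possible replacement, not the code actually used in the repository; the name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text` as a list of strings."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A string of length L has L - n + 1 n-grams (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = char_ngrams("Haus", 2)  # ['Ha', 'au', 'us']
```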
-
Maciej Sumalvico authored
The related code had been commented out for a long time anyway. Some neighboring commented-out code was removed as well.
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
Also removed `helper.create_dict()` (doesn't use HFST, but was obsolete anyway). Furthermore, removed `scripts.process.prepare_composition()` (was no longer in use, forgot to remove it with b5b1fd67).
-
Maciej Sumalvico authored
The code there is HFST-dependent and so obsolete that it is no longer relevant for further development.
-
Maciej Sumalvico authored
The CLIs were no longer used, since `scripts.train` is used for training. They had become increasingly outdated (especially after switching away from HFST).
-
Maciej Sumalvico authored
The Cython extension for computing the FST composition is no longer needed. The version number was increased to 0.2.0, as this is quite an important change.
-
Maciej Sumalvico authored
The back-end for processing FSTs was changed from HFST to Pynini.

The functionality implemented so far is:
  - lexicon training
  - simple error model training
  - processing plain text
  - window recombination using `pynini.replace()`

Further related changes:
  - as Pynini does not support the `n_best()` method, beam search will be used instead
  - the hypotheses are pruned to those within `beam_width` weight from the best one *after each composition*, i.e. first after the composition with the error model and then once again after the composition with the lexicon (in order to keep the hypotheses FST at a manageable size); currently, `beam_size` is hardcoded to `5`, but it should be made a parameter; lower values allow for faster execution, but may miss some corrections
  - removed the parameter `frequency_class` from `lib.error_simp.transducer_from_list()` (never used)
  - the behavior of `rejection_weight` was implemented to mimic the one in the Cython extension, i.e. the rejection weight of a word is `rejection_weight*(len(word)+2)`. The `+2` originally comes from the "flag" transitions, but turned out to be useful because it prevents the rejection of short words.
  - added a test suite (to be extended later)

No longer required:
  - the Cython extension
  - passing temporary files between the Python and the C++ part
  - the HFST dependency (except for `error_st`, which is currently incompatible with the rest)

Remaining issues:
  - switch the ST error model implementation to use pynini
  - code cleaning: remove unused dependencies and deprecated code (esp. the HFST-related parts)
  - restore some functionality that was temporarily removed to simplify the transition
  - special rules for digits and umlauts in the lexicon
  - compounds in the lexicon
  - make `beam_width` a free parameter
  - unit tests
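The rejection weight formula and the pruning rule described in this commit can be sketched in plain Python, without Pynini. This is an illustration under stated assumptions, not the repository's implementation: hypotheses are modelled as a dictionary from string to weight (lower is better), the pruning threshold is assumed to be inclusive, and the names `rejection_weight_for` and `prune` are hypothetical.

```python
def rejection_weight_for(word, rejection_weight):
    """Weight of rejecting `word`, following rejection_weight*(len(word)+2)."""
    return rejection_weight * (len(word) + 2)

def prune(hypotheses, beam_width):
    """Keep only hypotheses within `beam_width` of the best (lowest) weight.

    Assumption: `hypotheses` maps candidate strings to weights and the
    threshold is inclusive; the real code prunes an FST instead.
    """
    if not hypotheses:
        return {}
    best = min(hypotheses.values())
    return {h: w for h, w in hypotheses.items() if w <= best + beam_width}

# Hypothetical toy weights, for illustration only.
candidates = {"Haus": 1.0, "Hans": 3.5, "Maus": 7.2}
kept = prune(candidates, beam_width=5)        # "Maus" exceeds 1.0 + 5
short = rejection_weight_for("ab", 1.5)       # 1.5 * (2 + 2)
longer = rejection_weight_for("abcdef", 1.5)  # 1.5 * (6 + 2)
```

In the commit's description the pruning is applied twice, once after composing with the error model and once after composing with the lexicon; in this sketch that would correspond to calling `prune` after each step that expands the candidate set.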
-
- Apr 08, 2019
-
-
Maciej Sumalvico authored
  - the functionality of finding the shortest path in the lattice was moved to `lib.sliding_window.lattice_shortest_path()`
  - also removed some deprecated code from `scripts.process.correct_string()`
-
- Mar 29, 2019
-
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
  - rename `sliding_window_no_flags` -> `sliding_window`
  - move the old `sliding_window` module (with flags) to `__DEPRECATED__`
  - remove the flag-related code from `error_simp` (do not add flags to the error transducer on training)
-
Maciej Sumalvico authored
  - obsolete and no longer needed
  - should be replaced with real unit tests
-