  1. Aug 22, 2020
  2. Aug 21, 2020
  3. Jan 08, 2020
  4. Nov 17, 2019
    • fix parameters and requirements: · a0bc5f9f
      Robert Sachunsky authored
      - add pynini dependency
      - restrict parameter textequiv_level to word
        (which is the only one currently supported),
        remove default (glyph)
      - add the parameter rejection_weight
        (which was referenced but did not exist)
      - rename beam_width in FST part to pruning_weight
        (parallel to rejection_weight), add description
      - replace parameter reference to lm_beam_width
        (which did not exist) with beam_width
      - add FIXMEs for things that are apparently broken now
      - add description of the OCR-D processor's behaviour
        to its docstring, improve README
      a0bc5f9f
    • fix OCR-D interfaces: · 85cc3c95
      Robert Sachunsky authored
      - remove exception when calling --help or -J
      - use page_from_file properly
      - use correct attributes for MetadataItem
      85cc3c95
  5. Nov 16, 2019
  6. Jul 23, 2019
  7. Jul 22, 2019
  8. Jul 19, 2019
    • CLI parameter --unweighted-lexicon · 468d2232
      Maciej Sumalvico authored
      Passing this parameter during training creates an unweighted lexicon FST. This
      should be done when combining the FST model with a language model (like
      `keraslm`), because a weighted lexicon is itself a (unigram) language model.
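
The difference between the two lexicon variants can be illustrated with a minimal sketch. It is not the project's training code: the function name `lexicon_weights` and the toy token list are hypothetical, and it assumes that weights are negative log relative frequencies, as is common for weighted FSTs.

```python
import math
from collections import Counter


def lexicon_weights(tokens, unweighted=False):
    """Map each word type to a cost for a lexicon acceptor (hypothetical helper).

    Weighted: cost = -log(relative frequency), which is a unigram language model.
    Unweighted: every word costs 0, so ranking is left to a separate language model.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if unweighted:
        return {word: 0.0 for word in counts}
    return {word: -math.log(count / total) for word, count in counts.items()}


# Toy training data (assumption, for illustration only).
tokens = ["der", "der", "der", "Hund", "bellt"]
weighted = lexicon_weights(tokens)
unweighted = lexicon_weights(tokens, unweighted=True)
```

In the weighted variant, frequent words are cheaper than rare ones, so combining it with another language model would count word frequency twice.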
      468d2232
  9. Jul 18, 2019
  10. Jul 17, 2019
  11. Jul 16, 2019
  12. Apr 24, 2019
    • clean up some deprecated parameters · 8e1cb64e
      Maciej Sumalvico authored
      - remove the deprecated CLI parameters:
        - `apply_lm` - unused for a long time
        - `num_results` - replaced with `beam_width` after the switch to pynini
      - stop passing some unnecessary parameters to
        `scripts.process.prepare_model()`
      8e1cb64e
  13. Apr 17, 2019
  14. Apr 12, 2019
  15. Apr 11, 2019
    • Removed the NLTK dependency · 997492f2
      Maciej Sumalvico authored
      NLTK was only used for computing character n-grams from strings.
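
Computing character n-grams needs no external library. A minimal sketch follows; the function name `char_ngrams` is hypothetical and not necessarily what the project uses.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`, in order."""
    if n < 1:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("Haus", 2))  # ['Ha', 'au', 'us']
```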
      997492f2
    • Removed the `alignment` dependency and related code · bb45e3a4
      Maciej Sumalvico authored
      The related code had been commented out for a long time anyway. Some
      neighboring commented-out code was removed as well.
      bb45e3a4
    • removed unnecessary imports · 317e1b36
      Maciej Sumalvico authored
      317e1b36
    • Removed the HFST dependency and HFST-related code · ff0186e7
      Maciej Sumalvico authored
      Also removed `helper.create_dict()` (doesn't use HFST, but was obsolete
      anyway).
      
      Furthermore, removed `scripts.process.prepare_composition()` (was no longer in
      use, forgot to remove it with b5b1fd67).
      ff0186e7
    • Remove the directory `lib.__DEPRECATED__` · 2a31d2ef
      Maciej Sumalvico authored
      The code there is HFST-dependent and so obsolete that it is no longer relevant
      for further development.
      2a31d2ef
    • Remove the CLIs from `lib.lexicon` and `lib.error_simp` · 2611c59a
      Maciej Sumalvico authored
      The CLIs were no longer used, since `scripts.train` is used for training. They
      were becoming increasingly obsolete (especially after the switch away from
      HFST).
      2611c59a
    • Removed the C++ extension + version bump · bb27f36f
      Maciej Sumalvico authored
      The Cython extension for computing the FST composition is no longer needed.
      
      The version number was increased to 0.2.0, as this is quite an important change.
      bb27f36f
    • Implemented the FST processing using Pynini · bae760f9
      Maciej Sumalvico authored
      The back-end for processing FSTs was changed from HFST to Pynini. The
      functionality implemented so far is:
      - lexicon training
      - simple error model training
      - processing plain text
        - window recombination using `pynini.replace()`
      
      Further related changes:
      - as Pynini does not support the `n_best()` method, beam search is used
        instead: the hypotheses are pruned to those within `beam_width` weight
        of the best one *after each composition*, i.e. first after the composition
        with the error model and then again after the composition with the
        lexicon (in order to keep the hypotheses FST at a manageable size);
        currently, `beam_width` is hardcoded to `5`, but it should be made a
        parameter; lower values allow for faster execution, but may miss some
        corrections
      - removed the parameter `frequency_class` from
        `lib.error_simp.transducer_from_list()` (never used)
      - the behavior of `rejection_weight` was implemented to mimic the one in the
        Cython extension, i.e. the rejection weight of a word is
        `rejection_weight*(len(word)+2)`. The `+2` originally comes from the "flag"
        transitions, but turned out to be useful because it prevents short words
        from being rejected.
      - added a test suite (to be extended later)
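
The pruning rule and the rejection weight formula described above can be sketched in plain Python. This is only an illustration on a dictionary of hypotheses, not the Pynini-based implementation, which operates on FSTs; the function names `prune_to_beam` and `rejection_cost` and all example values are hypothetical.

```python
def prune_to_beam(hypotheses, beam_width):
    """Keep hypotheses whose weight is within `beam_width` of the best one.

    `hypotheses` maps a candidate string to its weight (lower is better).
    """
    if not hypotheses:
        return {}
    best = min(hypotheses.values())
    return {h: w for h, w in hypotheses.items() if w <= best + beam_width}


def rejection_cost(word, rejection_weight):
    """Cost of rejecting a word, following rejection_weight*(len(word)+2)."""
    return rejection_weight * (len(word) + 2)


# Toy hypotheses after one composition step (assumed values).
hypotheses = {"Haus": 1.0, "Hans": 3.5, "Hanf": 7.2}
kept = prune_to_beam(hypotheses, beam_width=5)

# The +2 raises the rejection cost of short words proportionally more.
short_cost = rejection_cost("in", 1.5)
long_cost = rejection_cost("Haus", 1.5)
```

With a beam width of 5, the candidate "Hanf" is dropped because its weight exceeds the best weight by more than 5. A smaller beam width keeps fewer candidates, which is faster but may discard the correct one.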
      
      No longer required:
      - the Cython extension
      - passing temporary files between the Python and the C++ part
      - the HFST dependency (except for `error_st`, which is currently incompatible
        with the rest)
      
      Remaining issues:
      - switch the ST error model implementation to use pynini
      - code cleaning: remove unused dependencies and deprecated code (esp. the
        HFST-related parts)
      - restore some functionality that was temporarily removed to simplify the
        transition
        - special rules for digits and umlauts in the lexicon
        - compounds in the lexicon
      - make `beam_width` a free parameter
      - unit tests
      bae760f9
  16. Apr 08, 2019
  17. Mar 29, 2019