  1. Feb 15, 2019
  2. Feb 14, 2019
  3. Feb 13, 2019
    • A new implementation of the sliding window algorithm · ca94c7d6
      Maciej Sumalvico authored
      - much cleaner and smaller code
      - without flag diacritics and state merging
      - the windows are recombined by adding transitions between consecutive
        windows (see the PDF documentation for details)
      - test_sliding_window_no_flags.py is a temporary testing script (intended to be
        removed after the module is integrated into main processing)
      - known issues:
        - process_window_with_openfst() doesn't work - the composition returns a
          transducer accepting garbage paths
        - process_window_with_hfst() is very slow
        - not yet integrated into process_test_data.py
      ca94c7d6
  4. Feb 11, 2019
  5. Feb 07, 2019
  6. Feb 05, 2019
    • Refactoring and bugfixes in the precision/recall metric · cd07bb2b
      Maciej Sumalvico authored
      Changes in get_precision_recall():
      - Refactoring: separated the functionality of merging alignments from scoring.
      - Changed the definition of true/false and positive/negative. The characters
        that are originally wrong and wrongly corrected are now false positives
        (previously: false negatives). This also changes the evaluation results quite
        significantly!
      - Bugfix: consider only one-to-one or one-to-zero alignments, but not
        one-to-many. This also changes the results.
      - Code cleaning.
      cd07bb2b
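The changed definition can be sketched as follows. This is a minimal illustration, not the code of get_precision_recall(): the function names, the triple representation (OCR character, corrected character, ground-truth character, with '' standing for a one-to-zero alignment) and the handling of empty denominators are assumptions. Only the last case of classify() is taken from the commit message.

```python
from collections import Counter


def classify(ocr, cor, gt):
    """Classify one aligned character triple (hypothetical helper)."""
    if ocr == gt:
        # Originally correct: leaving it alone is a true negative,
        # changing it is a false positive.
        return 'TN' if cor == gt else 'FP'
    if cor == gt:
        return 'TP'  # originally wrong, correctly fixed
    if cor == ocr:
        return 'FN'  # originally wrong, left untouched
    # Originally wrong and wrongly corrected: counted as a false
    # positive after this commit (previously a false negative).
    return 'FP'


def precision_recall(triples):
    counts = Counter(classify(*triple) for triple in triples)
    tp, fp, fn = counts['TP'], counts['FP'], counts['FN']
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Moving the last case from 'FN' to 'FP' lowers precision and raises recall for the same data, which is consistent with the note that the results change.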
  7. Feb 01, 2019
  8. Jan 30, 2019
    • Refactored evaluate_correction.main() · 1a76c78c
      Maciej Sumalvico authored
      - command-line argument parsing moved into a separate function
      - computing each evaluation metric over all lines moved into a separate
        function, so that the logic of main() reduces to a simple three-way `if`
      - increased the spacing between top-level declarations to
        two blank lines (PEP 8)
      - added the -G parameter for providing the ground truth suffix
      1a76c78c
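The resulting shape of main() might look roughly like the sketch below. Only the -G option and the three-way `if` come from the commit message. The long option name, the default suffix, the -M option, the metric names and the compute_*() stubs are hypothetical placeholders.

```python
import argparse


def parse_arguments(argv=None):
    parser = argparse.ArgumentParser(description='Evaluate corrected text.')
    # -G is from the commit message; long name and default are assumed.
    parser.add_argument('-G', '--gt-suffix', default='gt.txt')
    # Hypothetical metric selector, for illustration only.
    parser.add_argument('-M', '--metric', default='metric_a',
                        choices=['metric_a', 'metric_b', 'metric_c'])
    return parser.parse_args(argv)


# Placeholder stubs: each would compute one metric over all lines.
def compute_metric_a(args):
    return 'metric_a over *.' + args.gt_suffix


def compute_metric_b(args):
    return 'metric_b over *.' + args.gt_suffix


def compute_metric_c(args):
    return 'metric_c over *.' + args.gt_suffix


def main(argv=None):
    args = parse_arguments(argv)
    # The logic of main() reduces to a three-way `if`.
    if args.metric == 'metric_a':
        result = compute_metric_a(args)
    elif args.metric == 'metric_b':
        result = compute_metric_b(args)
    else:
        result = compute_metric_c(args)
    print(result)
    return result
```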
    • Refactoring in process_test_data.correct_string() · 1a2a7ea0
      Maciej Sumalvico authored
      Isolated some steps into subfunctions for better structure.
      
      Changed the logging level for input/output strings from "info" to
      "debug".
      1a2a7ea0
    • Refactoring in process_test_data.py · d8fbb535
      Maciej Sumalvico authored
      - grouped the globals into two dictionaries: `gl_config` and `model`
      - renamed `process` to `correct_string`
      - renamed `load_model` to `build_model` (does other things apart from loading)
      - isolated some functionality from the `main` function
        - parallel processing of strings -> `parallel_process`
        - printing results -> `print_results`
        - building transducer composition, flag encoder and loading LM transducers ->
          `build_model`
      - minimized the availability of globals to increase readability and avoid bugs
        - globals are only visible in `main()` and `correct_string()`, but not in any
          subfunctions that `main()` calls
        - instead of passing `args` (the argument parser) as a global, the dictionary
          `gl_config` is used, which contains only the values used by
          `correct_string()`
      d8fbb535
  9. Jan 29, 2019
    • Refactored the model building functions · 7f42b9c4
      Maciej Sumalvico authored
      The model-building functions in sliding_window.py (load_transducers_*())
      previously contained three kinds of functionality:
      - loading transducers
      - variant-specific combining of transducers into a single token acceptor
      - variant-independent functionality, which was copy-pasted in all three
        functions (adding flags, converting a single token acceptor to a window etc.)
      
      This commit isolates the variant-specific functionality into smaller
      functions build_single_token_acceptor_*(), combines the variant-independent
      parts for all three variants in the function build_model() and moves the loading
      of transducers outside of the `sliding_window` module.
      
      Furthermore:
      - renamed process_test_data.load_transducers() to load_model()
      7f42b9c4
    • Refactoring of process_test_data.py · 94f3d9ef
      Maciej Sumalvico authored
      - isolated loading transducers into a separate function
      - isolated preparing the composition of lexicon and model into a separate
        function
      - moved process() before main()
      - cleaned up commented-out code, old file names etc.
      94f3d9ef
  10. Jan 22, 2019
  11. Jan 21, 2019
  12. Jan 18, 2019
    • Finished refactoring of error_transducer.main() · e446f4d1
      Maciej Sumalvico authored
      - refactored the reading of training data
      e446f4d1
    • Refactored error_transducer.main() · 03951838
      Maciej Sumalvico authored
      - functionality isolated into separate functions:
        - creating a single error transducer
        - combining error transducers
      - fixed a bug causing only context = 3 to be considered (line 467 pre-commit,
        previously line 41 in error_transducer_complete.py)
      - simplified transducer creation
      03951838
    • Merged error transducer creating scripts · d25d8840
      Maciej Sumalvico authored
      Merged `error_transducer_complete.py` into `error_transducer.py`, so that one
      module is responsible for training an error model.
      d25d8840
    • Some refactoring in error_transducer.py · b9778286
      Maciej Sumalvico authored
      - isolated parse_arguments() as a separate function
      - added a -G parameter (gt_suffix) instead of a fixed suffix
      - removed some unnecessary comments
      b9778286
    • Merged lexicon building scripts · a36d24fc
      Maciej Sumalvico authored
      process_dta_data.py was merged into lexicon_transducer.py so that only a single
      module is responsible for building the lexicon.
      
      The lexica are no longer saved as plaintext. This information can easily be
      obtained with `hfst-fst2strings -w`.
      
      helper.py:
      - the logarithm of frequencies is computed in the normalizing function, rather
        than during writing to file
      a36d24fc
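The change in helper.py can be illustrated with the sketch below. The function names and the tab-separated output format are hypothetical, and the use of the negative natural logarithm of the relative frequency is an assumption. The commit message only says that the logarithm is now taken during normalization rather than during writing.

```python
import math


def normalize_frequencies(freqs):
    """Turn raw counts into log-scaled weights in one place.

    Assumed weighting: negative natural log of the relative frequency.
    """
    total = sum(freqs.values())
    return {word: -math.log(count / total) for word, count in freqs.items()}


def write_lexicon(weights, stream):
    # Writing is pure formatting: the weights are already logarithms.
    for word, weight in sorted(weights.items()):
        stream.write('{}\t{:.4f}\n'.format(word, weight))
```

With the logarithm computed during normalization, any consumer of the normalized values sees the same weights, whether or not they are ever written to a file.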
  13. Jan 17, 2019
    • Removed model files and changed hard-coded names. · e073c652
      Maciej Sumalvico authored
      - Model files are moved to a separate repository ('cor-asv-fst-models').
        As a temporary solution, the directory 'hfst/fst' has to be linked to the
        location of the model repository so that the hard-coded paths to model files
        work.
      - Changed the hard-coded model file names in process_test_data.py to match the
        names of files created by the training scripts.
      e073c652
    • Better path handling · d1de0d99
      Maciej Sumalvico authored
      - using os.path.join() instead of string concatenation
      - removed (useless) trailing slashes from directory names
      
      Minor changes:
      - rename: x -> filename in helper.generate_content()
      d1de0d99
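A minimal sketch of the path-handling change, with hypothetical directory and file names:

```python
import os.path

directory = 'models'        # hypothetical; no trailing slash needed
filename = 'lexicon.hfst'   # hypothetical

# Before: string concatenation, which silently depends on the
# directory name ending in a slash.
old_path = directory + '/' + filename

# After: os.path.join() inserts the separator only where needed.
new_path = os.path.join(directory, filename)

# A trailing slash on the directory does not produce a doubled separator.
same_path = os.path.join(directory + '/', filename)
```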
    • Substituted '\u0364' for U+0364. · 8dd7cded
      Maciej Sumalvico authored
      The character U+0364 (combining Latin small letter e):
      - is invisible in some terminal fonts,
      - breaks syntax highlighting in Vim.
      8dd7cded
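In Python source the escape sequence denotes exactly the same one-character string as the literal character, so the substitution does not change behaviour. A short check, using only the standard library:

```python
import unicodedata

escaped = '\u0364'  # written as an escape, visible in any font or editor

# The escape is a single code point, not six characters.
length = len(escaped)

# Its Unicode name is COMBINING LATIN SMALL LETTER E.
name = unicodedata.name(escaped)

# A nonzero combining class marks it as a combining character, which is
# why it attaches to the preceding letter when rendered.
is_combining = unicodedata.combining(escaped) > 0

# ascii() shows the escaped form again, e.g. for log output.
shown = ascii('a' + escaped)
```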
    • Refactored process_dta_data.py · dabf0d2f
      Maciej Sumalvico authored
      - setup_spacy() and parse_arguments() as separate functions
      - more readable formatting
      dabf0d2f
  14. Jan 16, 2019
  15. Jan 08, 2019
  16. Dec 15, 2018
  17. Dec 14, 2018
  18. Nov 21, 2018