  1. Feb 05, 2019
    • Refactoring and bugfixes in the precision/recall metric · cd07bb2b
      Maciej Sumalvico authored
      Changes in get_precision_recall():
      - Refactoring: separating the functionality of merging alignments from scoring.
      - Changed the definition of true/false and positive/negative. The characters
        that are originally wrong and wrongly corrected are now false positives
        (previously: false negatives). This also changes the evaluation results quite
        significantly!
      - Bugfix: consider only one-to-one or one-to-zero alignments, but not
        one-to-many. This also changes the results.
      - Code cleaning.
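      A minimal sketch of how the changed definitions could be counted, not the
      code of get_precision_recall() itself. It assumes each aligned position is
      an (input, output, ground truth) triple of single characters, with an empty
      string standing for a one-to-zero alignment. The names classify() and
      precision_recall() are hypothetical, and so is the handling of every case
      other than "originally wrong and wrongly corrected", which the commit
      message defines as a false positive.

```python
from collections import Counter


def classify(inp, out, gt):
    """Label one aligned character triple (assumed scheme, see lead-in)."""
    if inp == gt:
        # Input was already correct: leaving it alone is a true negative,
        # changing it is a false positive.
        return "TN" if out == gt else "FP"
    if out == gt:
        return "TP"  # wrong input, corrected properly
    if out == inp:
        return "FN"  # wrong input, left untouched
    return "FP"      # wrong input, wrongly corrected (previously counted as FN)


def precision_recall(triples):
    """Compute precision and recall over an iterable of aligned triples."""
    counts = Counter(classify(*triple) for triple in triples)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

      Moving the "wrongly corrected" case from FN to FP lowers precision and
      raises recall for the same data, which is consistent with the note that the
      evaluation results change significantly.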
  2. Feb 01, 2019
  3. Jan 30, 2019
    • Refactored evaluate_correction.main() · 1a76c78c
      Maciej Sumalvico authored
      - command-line argument parsing moved into a separate function
      - computing each evaluation metric over all lines moved into a separate
        function, so that the logic of main() reduces to a simple three-way `if`
      - increased the spacing between top-level declarations to
        two blank lines (PEP 8)
      - added the -G parameter for providing the ground truth suffix
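      A sketch of what moving argument parsing into its own function might look
      like. Only the -G option is taken from the commit message. The long option
      name, the default value and the positional `directory` argument are
      assumptions made for illustration.

```python
import argparse


def parse_arguments(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Evaluate corrected text against ground truth.")
    # Hypothetical positional argument, not confirmed by the commit message.
    parser.add_argument("directory", help="directory containing the files")
    # -G comes from the commit message; long name and default are assumed.
    parser.add_argument("-G", "--gt-suffix", default="gt.txt",
                        help="suffix of the ground truth files")
    return parser.parse_args(argv)
```

      Accepting an optional `argv` keeps the function testable without touching
      `sys.argv`.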
    • Refactoring in process_test_data.correct_string() · 1a2a7ea0
      Maciej Sumalvico authored
      Isolated some activities into subfunctions for better structuring.
      
      Changed the logging level of showing input/output strings from "info" to
      "debug".
    • Refactoring in process_test_data.py · d8fbb535
      Maciej Sumalvico authored
      - grouped the globals into two dictionaries: `gl_config` and `model`
      - renamed `process` to `correct_string`
      - renamed `load_model` to `build_model` (does other things apart from loading)
      - isolated some functionality from the `main` function
        - parallel processing of strings -> `parallel_process`
        - printing results -> `print_results`
        - building transducer composition, flag encoder and loading LM transducers ->
          `build_model`
      - minimized the availability of globals to increase readability and avoid bugs
        - globals are only visible in `main()` and `correct_string()`, but not in any
          subfunctions that `main()` calls
        - instead of passing `args` (the argument parser) as a global, the dictionary
          `gl_config` is used, which contains only the values used by
          `correct_string()`
  4. Jan 29, 2019
    • Refactored the model building functions · 7f42b9c4
      Maciej Sumalvico authored
      The model-building functions in sliding_window.py (load_transducers_*())
      previously contained three kinds of functionality:
      - loading transducers
      - variant-specific combining of transducers to a single token acceptor
      - variant-independent functionality, which is copy-pasted in all three
        functions (adding flags, converting a single token acceptor to a window etc.)
      
      This commit isolates the variant-specific functionality into smaller
      functions build_single_token_acceptor_*(), combines the variant-independent
      parts for all three variants in the function build_model() and moves the
      loading of transducers outside of the `sliding_window` module.
      
      Furthermore:
      - renamed process_test_data.load_transducers() to load_model
    • Refactoring of process_test_data.py · 94f3d9ef
      Maciej Sumalvico authored
      - isolated loading transducers into a separate function
      - isolated preparing the composition of lexicon and model into a separate
        function
      - moved process() before main()
      - cleaned up commented-out code, old file names etc.
  5. Jan 22, 2019
  6. Jan 21, 2019
  7. Jan 18, 2019
    • Finished refactoring of error_transducer.main() · e446f4d1
      Maciej Sumalvico authored
      - refactored the reading of training data
    • Refactored error_transducer.main() · 03951838
      Maciej Sumalvico authored
      - functionality isolated into separate functions:
        - creating a single error transducer
        - combining error transducers
      - fixed a bug causing only context = 3 to be considered (line 467 pre-commit,
        previously line 41 in error_transducer_complete.py)
      - simplified transducer creation
    • Merged error transducer creating scripts · d25d8840
      Maciej Sumalvico authored
      Merged `error_transducer_complete.py` into `error_transducer.py`, so that one
      module is responsible for training an error model.
    • Some refactoring in error_transducer.py · b9778286
      Maciej Sumalvico authored
      - isolated parse_arguments() as a separate function
      - added a -G parameter (gt_suffix) instead of a fixed suffix
      - removed some unnecessary comments
    • Merged lexicon building scripts · a36d24fc
      Maciej Sumalvico authored
      process_dta_data.py was merged into lexicon_transducer.py so that only a single
      module is responsible for building the lexicon.
      
      The lexica are no longer saved as plaintext. This information can easily be
      obtained with `hfst-fst2strings -w`.
      
      helper.py:
      - the logarithm of frequencies is computed in the normalizing function, rather
        than during writing to file
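      A sketch of a normalizing function that already returns logarithms, so that
      the code writing to file needs no arithmetic of its own. The function name,
      the dictionary input and the use of the negative natural logarithm as a
      weight are assumptions. The commit message only states that the logarithm
      is now taken during normalization.

```python
import math


def normalize_frequencies(counts):
    """Map absolute counts to negative log relative frequencies.

    Assumes `counts` maps words to positive integers; a more frequent
    word receives a smaller weight.
    """
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}
```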
  8. Jan 17, 2019
    • Removed model files and changed hard-coded names. · e073c652
      Maciej Sumalvico authored
      - Model files are moved to a separate repository ('cor-asv-fst-models').
        As a temporary solution, the directory 'hfst/fst' has to be linked to the
        location of the model repository so that the hard-coded paths to model files
        work.
      - Changed the hard-coded model file names in process_test_data.py to match the
        names of files created by the training scripts.
    • Better path handling · d1de0d99
      Maciej Sumalvico authored
      - using os.path.join() instead of string concatenation
      - removed (useless) trailing slashes from directory names
      
      Minor changes:
      - rename: x -> filename in helper.generate_content()
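      For illustration, the difference between the two styles. The helper name
      model_path() and the file name in the usage line are hypothetical.

```python
import os.path


def model_path(directory, filename):
    """Build a path without relying on a trailing slash in `directory`."""
    # String concatenation (directory + filename) only works if the caller
    # remembered the trailing separator; os.path.join() inserts it as needed.
    return os.path.join(directory, filename)
```

      With this, `model_path("hfst/fst", "lexicon.hfst")` needs no trailing slash
      on the directory name.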
    • Substituted '\u0364' for U+0364. · 8dd7cded
      Maciej Sumalvico authored
      The character U+0364 (combining latin small letter e):
      - is invisible in some terminal fonts,
      - breaks syntax highlighting in Vim.
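      A small illustration of the escaped form. The constant name COMBINING_E is
      an assumption. The escape denotes the same one-character string as the
      literal, but stays visible in any editor or terminal.

```python
import unicodedata

# Escaped form of the combining character; equivalent to typing it literally.
COMBINING_E = '\u0364'

# A decomposed umlaut is a base letter followed by the combining e.
decomposed_a = 'a' + COMBINING_E

# The Unicode database confirms which character the escape refers to.
character_name = unicodedata.name(COMBINING_E)
```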
    • Refactored process_dta_data.py · dabf0d2f
      Maciej Sumalvico authored
      - setup_spacy() and parse_arguments() as separate functions
      - more readable formatting
  9. Jan 16, 2019
  10. Dec 15, 2018
  11. Dec 14, 2018
  12. Nov 21, 2018
  13. Nov 16, 2018
    • addition for 13c50055 · 02f07b01
      Robert Sachunsky authored
      during lexicon extraction, also add '/' as infix and suffix to Spacy's tokenizer
    • correction for d3eab2d0ccc: · ceb750a6
      Robert Sachunsky authored
      fix transducer definitions:
      
      - when repeating lexicon transducer according to words_per_window,
        the last token takes a space character as well
      
      - further repair inter-word/lm lexicon model:
        - last token also needs a flag acceptor (and a space)
        - edits deleting a space should delete the corresponding flag
          in this model too
    • improve lexicon extraction: · 3d9fc6b5
      Robert Sachunsky authored
      - allow (large) input files with more than 1 line
      - use generators (strip lines and split at newline)
      - prune lexicon with combined absolute (<=3) and
        relative (<1e-5) frequency threshold
      - extend number normalization for numerals with
        decimal point and thousands separators
      - normalize umlauts to always use decomposed form
        with diacritical combining e
      - speed up by disabling parser and NER in Spacy
      - add '—' as infix to Spacy's tokenizer
      - add CLI, make available as parameters:
        dictionary path, GT suffix
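      Two of the steps above, sketched under stated assumptions. For pruning, the
      message does not say how the two thresholds are combined. This sketch
      assumes a word is dropped only when it falls below both (swap `and` for
      `or` for the stricter reading). For umlauts, it assumes that only the six
      German umlaut letters are affected and that the target form is base letter
      plus U+0364. Both function names are hypothetical.

```python
import re

COMBINING_E = '\u0364'          # combining latin small letter e
COMBINING_DIAERESIS = '\u0308'  # the usual decomposed umlaut mark

# Precomposed umlauts (a, o, u with diaeresis, lower and upper case).
PRECOMPOSED = dict(zip('\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc', 'aouAOU'))
DIAERESIS_FORM = re.compile('([aouAOU])' + COMBINING_DIAERESIS)


def prune_lexicon(counts, max_abs=3, min_rel=1e-5):
    """Drop words that are rare both absolutely and relatively (assumed rule)."""
    total = sum(counts.values())
    return {word: count for word, count in counts.items()
            if not (count <= max_abs and count / total < min_rel)}


def normalize_umlauts(text):
    """Rewrite umlauts as base letter followed by the combining e."""
    for precomposed, base in PRECOMPOSED.items():
        text = text.replace(precomposed, base + COMBINING_E)
    # Also catch umlauts already decomposed with a combining diaeresis.
    return DIAERESIS_FORM.sub(lambda match: match.group(1) + COMBINING_E, text)
```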
    • improve lexicon transducer definitions: · 95e92e6c
      Robert Sachunsky authored
      - when extending lexicon transducer according to composition_depth,
        do not ignore upper/lower case completely, but ensure that
        non-first words are downcased (with infix/zero connection) or
        only upper case (with hyphen connection), and that first words
        are upcased or already upper case
      
      - when extending lexicon transducer with morphology,
        compose *after* compounds were added
      
      - when using lexicon transducer, make sure to allow both precomposed umlauts
        and decomposed (with diacritical combining e);
        also, ensure the final lexicon is only an acceptor
      
      - when repeating lexicon transducer according to words_per_window,
        use 1 to N instead of 0 to N (optionalized lexicon), but make sure
        the last (1) token has no space
      
      - repair the previously defunct inter-word/lm lexicon model:
        - by stripping initial space from loaded punctuation_right_transducer
        - by correctly synchronizing on flags
    • with temporary files as OpenFST interface, use sensible filename patterns, and... · fc41831a
      Robert Sachunsky authored
      with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
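      One common way to get both properties, sketched with the standard library
      only. The helper name, the prefix and the suffix are assumptions, and the
      `process` callback stands in for whatever reads the file back in. No
      OpenFST or HFST call is shown here.

```python
import os
import tempfile


def with_temp_fst(data, process):
    """Write `data` to a recognizably named temp file, run `process(path)`,
    and remove the file afterwards even if `process` raises."""
    handle = tempfile.NamedTemporaryFile(
        prefix="cor-asv-fst-", suffix=".fst", delete=False)
    try:
        handle.write(data)
        handle.close()  # flush to disk so another library can open the path
        return process(handle.name)
    finally:
        handle.close()  # harmless if already closed
        os.unlink(handle.name)
```

      Using `delete=False` with an explicit unlink in `finally` keeps the path
      usable by external code while the file is closed.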
    • with temporary files as OpenFST interface, use sensible filename patterns, and... · 9174504a
      Robert Sachunsky authored
      with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
    • when combining windows, search for the next existing flag instead of blindly... · bca6846b
      Robert Sachunsky authored
      when combining windows, search for the next existing flag instead of blindly assuming next counting flag always remains in some path even after word merge (fixes crash)