- Feb 15, 2019
-
-
Maciej Sumalvico authored
-
- Feb 14, 2019
-
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
-
Maciej Sumalvico authored
sliding_window_no_flags.process_window_with_openfst() now works properly
-
- Feb 13, 2019
-
-
Maciej Sumalvico authored
- much cleaner and smaller code
  - without flag diacritics and state merging
  - the windows are recombined by adding transitions between consecutive windows (see the PDF documentation for details)
- test_sliding_window_no_flags.py is a temporary testing script (intended to be removed after the module is integrated into main processing)
- known issues:
  - process_window_with_openfst() doesn't work: the composition returns a transducer accepting garbage paths
  - process_window_with_hfst() is very slow
  - not yet integrated into process_test_data.py
-
- Feb 11, 2019
-
-
Maciej Sumalvico authored
(duplicate of helper.save_transducer())
-
Maciej Sumalvico authored
-
- Feb 07, 2019
-
-
Maciej Sumalvico authored
- hfst/ -> ./
- cython/ -> extensions/
- currently unused files moved to __DEPRECATED/
- added .gitignore
-
Maciej Sumalvico authored
- hfst/
  - ASSE data removed
- open-fst/
  - transducers removed (most are duplicates of the ones stored in cor-asv-fst-models; others, like the ASSE lexicon, can be trained if needed; shouldn't clutter the repo)
  - the report about evaluation experiments moved to doc-fst/evaluation-openfst
  - gesamt_dta_spaces.syms removed
-
Maciej Sumalvico authored
-
- Feb 05, 2019
-
-
Maciej Sumalvico authored
Changes in get_precision_recall():
- Refactoring: separated the functionality of merging alignments from scoring.
- Changed the definition of true/false and positive/negative. Characters that are originally wrong and wrongly corrected are now false positives (previously: false negatives). This also changes the evaluation results quite significantly!
- Bugfix: consider only one-to-one or one-to-zero alignments, but not one-to-many. This also changes the results.
- Code cleaning.
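The changed definition can be illustrated with a minimal sketch. Only the rule "originally wrong and wrongly corrected counts as a false positive" comes from the commit message; the other four cases, the function names `classify` and `precision_recall`, and the triple format are assumptions made for illustration, not the code of get_precision_recall().

```python
from collections import Counter


def classify(ocr_char, corrected_char, gt_char):
    """Classify one aligned character (hypothetical helper, not from the repo)."""
    if ocr_char == gt_char:
        # Input was already right: changing it is a false alarm (assumed).
        return "TN" if corrected_char == gt_char else "FP"
    if corrected_char == gt_char:
        return "TP"  # wrong input, successfully corrected (assumed)
    if corrected_char == ocr_char:
        return "FN"  # wrong input, left untouched (assumed)
    # Wrong input, changed to something that is still wrong:
    # a false positive according to the commit message (previously FN).
    return "FP"


def precision_recall(triples):
    """Compute precision and recall over (ocr, corrected, gt) character triples."""
    counts = Counter(classify(*t) for t in triples)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Moving the "wrongly corrected" case from FN to FP lowers precision and raises recall for the same output, which is consistent with the remark that the evaluation results change noticeably.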
-
- Feb 01, 2019
-
-
Maciej Sumalvico authored
`process` was renamed to `correct_string` in 2fc4eb20, but not in line 235. This commit fixes it.
-
- Jan 30, 2019
-
-
Maciej Sumalvico authored
- command-line argument parsing moved into a separate function
- computing each evaluation metric over all lines moved into a separate function, so that the logic of main() reduces to a simple three-way `if`
- increased the spacing between top-level declarations to two blank lines (PEP 8)
- added the -G parameter for providing the ground truth suffix
-
Maciej Sumalvico authored
Isolated some activities into subfunctions for better structuring. Changed the logging level of showing input/output strings from "info" to "debug".
-
Maciej Sumalvico authored
- grouped the globals into two dictionaries: `gl_config` and `model`
- renamed `process` to `correct_string`
- renamed `load_model` to `build_model` (does other things apart from loading)
- isolated some functionality from the `main` function
  - parallel processing of strings -> `parallel_process`
  - printing results -> `print_results`
  - building transducer composition, flag encoder and loading LM transducers -> `build_model`
- minimized the availability of globals to increase readability and avoid bugs
  - globals are only visible in `main()` and `correct_string()`, but not in any subfunctions that `main()` calls
  - instead of passing `args` (the argument parser) as a global, the dictionary `gl_config` is used, which contains only the values used by `correct_string()`
-
- Jan 29, 2019
-
-
Maciej Sumalvico authored
The model-building functions in sliding_window.py (load_transducers_*()) previously contained three kinds of functionality:
- loading transducers
- variant-specific combining of transducers to a single token acceptor
- variant-independent functionality, which is copy-pasted in all three functions (adding flags, converting a single token acceptor to a window etc.)

This commit isolates the variant-specific functionality into smaller functions build_single_token_acceptor_*(), combines the variant-independent parts for all three variants in the function build_model() and puts the loading of transducers outside of the `sliding_window` module.

Furthermore:
- renamed process_test_data.load_transducers() to load_model
-
Maciej Sumalvico authored
- isolated loading transducers into a separate function
- isolated preparing the composition of lexicon and model into a separate function
- moved process() before main()
- cleaned up commented-out code, old file names etc.
-
- Jan 22, 2019
-
-
Maciej Sumalvico authored
-
- Jan 21, 2019
-
-
Maciej Sumalvico authored
- divided the body into smaller functions
- fixed (in line 140) a bug causing types that are identical in capitalized and uncapitalized form (like '—') to be counted double
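The double-counting bug can be pictured with a small sketch. The code at line 140 is not shown in this log, so the function name `add_capitalized_forms` and the use of a `Counter` are assumptions; the sketch only demonstrates one way to count a type once when its capitalized form is identical to itself.

```python
from collections import Counter


def add_capitalized_forms(type_counts):
    """Add the capitalized variant of every type without double counting.

    Hypothetical helper: a set collapses the two forms into one whenever
    capitalizing does not change the string (e.g. punctuation like '—').
    """
    result = Counter()
    for word, freq in type_counts.items():
        for form in {word, word.capitalize()}:
            result[form] += freq
    return result
```

With a list `[word, word.capitalize()]` instead of the set, a type such as '—' would receive its frequency twice.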
-
- Jan 18, 2019
-
-
Maciej Sumalvico authored
- refactored the reading of training data
-
Maciej Sumalvico authored
- functionalities isolated into separate functions:
  - creating a single error transducer
  - combining error transducers
- fixed a bug causing only context = 3 to be considered (line 467 pre-commit, previously line 41 in error_transducer_complete.py)
- simplified transducer creation
-
Maciej Sumalvico authored
Merged `error_transducer_complete.py` into `error_transducer.py`, so that one module is responsible for training an error model.
-
Maciej Sumalvico authored
- isolated parse_arguments() as a separate function
- added a -G parameter (gt_suffix) instead of a fixed suffix
- removed some unnecessary comments
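A separate parse_arguments() with a -G option could look roughly like the sketch below. Only the function name, the `-G` flag and the `gt_suffix` destination come from the commit message; the long option name, the default value, the help text and the description are assumptions.

```python
import argparse


def parse_arguments(argv=None):
    """Parse command-line arguments (sketch; defaults are hypothetical)."""
    parser = argparse.ArgumentParser(description="illustrative parser")
    parser.add_argument(
        "-G", "--gt-suffix", dest="gt_suffix",
        default="gt.txt",  # assumed default, not taken from the repo
        help="suffix of the ground truth files")
    return parser.parse_args(argv)
```

Passing `argv` explicitly keeps the function testable without touching `sys.argv`, e.g. `parse_arguments(["-G", "gt.txt"]).gt_suffix`.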
-
Maciej Sumalvico authored
process_dta_data.py was merged into lexicon_transducer.py so that only a single module is responsible for building the lexicon. The lexica are no longer saved as plaintext. This information can be easily obtained with `hfst-fst2strings -w`.

helper.py:
- the logarithm of frequencies is computed in the normalizing function, rather than during writing to file
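Computing the logarithm inside the normalizing step might look like the sketch below. The function name `normalize_to_weights` and the choice of the negative natural logarithm of the relative frequency are assumptions for illustration; the commit message only states that the logarithm moved into the normalizing function.

```python
import math


def normalize_to_weights(counts):
    """Turn raw counts into weights (hypothetical helper).

    Assumed convention: weight = -log(relative frequency), so that
    frequent entries get small weights.
    """
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}
```

With this arrangement the code that writes entries to a file only has to emit the weights it is given.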
-
- Jan 17, 2019
-
-
Maciej Sumalvico authored
- Model files are moved to a separate repository ('cor-asv-fst-models'). As a temporary solution, the directory 'hfst/fst' has to be linked to the location of the model repository so that the hard-coded paths to model files work.
- Changed the hard-coded model file names in process_test_data.py to match the names of files created by the training scripts.
-
Maciej Sumalvico authored
- using os.path.join() instead of string concatenation
- removed (useless) trailing slashes from directory names

Minor changes:
- rename: x -> filename in helper.generate_content()
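For illustration, the difference between the two styles is shown below. The helper name `model_path` and the directory and file names in the usage note are made up for the example and do not come from the repository.

```python
import os


def model_path(model_dir, filename):
    """Build a path without relying on a trailing slash in model_dir."""
    # String concatenation (model_dir + filename) only works if model_dir
    # ends with a separator; os.path.join() inserts one when needed.
    return os.path.join(model_dir, filename)
```

For example, `model_path("fst", "lexicon.hfst")` yields the directory, the platform separator and the file name, so directory names no longer need a trailing slash.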
-
Maciej Sumalvico authored
The character U+0364 (combining latin small letter e):
- is invisible in some terminal fonts,
- breaks syntax highlighting in Vim.
-
Maciej Sumalvico authored
- setup_spacy() and parse_arguments() as separate functions
- more readable formatting
-
- Jan 16, 2019
-
-
Maciej Sumalvico authored
- Isolated parsing command-line arguments as a separate function.
- More readable formatting.
-
Maciej Sumalvico authored
- Refactoring: structured the body of merge_states() into smaller subfunctions.
- Crashes were caused by deleting transitions from the transducer while iterating over it. In the updated version, the transitions to delete are first identified and then deleted.
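The "identify first, delete afterwards" pattern is sketched below on a plain Python set of `(source, label, target)` tuples. This toy representation and the function name `remove_transitions` are assumptions; the sketch does not use the transducer API that merge_states() works with.

```python
def remove_transitions(transitions, should_delete):
    """Delete matching transitions in two passes (toy illustration).

    transitions: a mutable set of (source, label, target) tuples.
    should_delete: predicate deciding which transitions to drop.
    """
    # Pass 1: only read the container and remember what has to go.
    to_delete = [t for t in transitions if should_delete(t)]
    # Pass 2: mutate the container, now that iteration over it is finished.
    for t in to_delete:
        transitions.remove(t)
    return to_delete
```

Removing elements inside the first loop would raise `RuntimeError: Set changed size during iteration` for a Python set; other containers may instead skip elements silently.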
-
Maciej Sumalvico authored
reason: easier debugging
-
- Jan 08, 2019
-
-
Robert Sachunsky authored
-
- Dec 15, 2018
-
-
Robert Sachunsky authored
-
Robert Sachunsky authored
amend f25d826d7: in processor, when splitting input into symbols and flags, fix a new crash with decomposed/combining characters
-
Robert Sachunsky authored
-
Robert Sachunsky authored
replace slow and memory/stack-devouring alignment.sequencealigner by fast difflib.SequenceMatcher in evaluation too
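A character alignment based on `difflib.SequenceMatcher` can be sketched as follows. The function name `align` and the output format (a list of pairs) are assumptions; the evaluation code may post-process the opcodes differently.

```python
from difflib import SequenceMatcher


def align(source, target):
    """Align two strings using SequenceMatcher opcodes (illustrative sketch)."""
    matcher = SequenceMatcher(isjunk=None, a=source, b=target, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical stretches are aligned character by character.
            pairs.extend(zip(source[i1:i2], target[j1:j2]))
        else:
            # replace / delete / insert: keep the differing segments as one pair.
            pairs.append((source[i1:i2], target[j1:j2]))
    return pairs
```

Concatenating the first (or second) elements of the pairs reproduces the source (or target) string, which makes the result easy to sanity-check.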
-
- Dec 14, 2018
-
-
Robert Sachunsky authored
allow training error transducer on CSV file instead of directory, plus Python 2 compatibility and some Pylint cosmetics
-
- Nov 21, 2018
-
-
Robert Sachunsky authored
-
Robert Sachunsky authored
-
Robert Sachunsky authored
with REJECTION_WEIGHT set to -1, behave differently:
- pyComposition (if enabled) does no backoff_result()
- union with input (and input weight mapping) is instead done in compose_and_search afterwards, but with a long vector of sensible thresholds to each apply alternatively on the same window result (i.e. without having to rerun everything)
- so all results up the call graph now have to be vectorized as well: compose_and_search, create_result_transducer, window_size_1_2, main/process; in the normal case, there is only 1 value in that vector
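The general idea of reusing one expensive result for many thresholds can be sketched in plain Python. Everything here is an assumption made for illustration: the function name `decide_per_threshold`, the reduction of a window result to a single best candidate with one weight, and the rule that a candidate is accepted when its weight lies below the threshold. The actual code operates on transducers.

```python
def decide_per_threshold(input_string, candidate, candidate_weight, thresholds):
    """Return one output per threshold from a single precomputed candidate.

    The (expensive) search that produced `candidate` runs once; only the
    cheap accept/reject decision is repeated for every threshold.
    """
    return [
        candidate if candidate_weight < threshold else input_string
        for threshold in thresholds
    ]
```

With a single threshold the returned list has exactly one element, which corresponds to the "normal case" mentioned in the commit message.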
-