- Feb 05, 2019
Maciej Sumalvico authored
Changes in get_precision_recall():
- Refactoring: separated the functionality of merging alignments from scoring.
- Changed the definition of true/false and positive/negative. Characters that are originally wrong and wrongly corrected are now false positives (previously: false negatives). This also changes the evaluation results quite significantly!
- Bugfix: consider only one-to-one or one-to-zero alignments, but not one-to-many. This also changes the results.
- Code cleanup.
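A minimal, hedged sketch of the new classification over aligned character triples — assuming one-to-one (OCR, corrected, GT) alignments and taking the definitions above at face value; the actual get_precision_recall() may classify edge cases differently:

```python
from collections import Counter

def classify(ocr, cor, gt):
    """Classify one aligned (OCR, corrected, GT) character triple."""
    if cor == gt:
        return 'TP' if ocr != gt else 'TN'  # error fixed / correct char untouched
    if cor == ocr:
        return 'FN'  # error left uncorrected
    return 'FP'      # wrongly corrected (including originally wrong characters)

def precision_recall(triples):
    c = Counter(classify(*t) for t in triples)
    precision = c['TP'] / (c['TP'] + c['FP']) if c['TP'] + c['FP'] else 0.0
    recall = c['TP'] / (c['TP'] + c['FN']) if c['TP'] + c['FN'] else 0.0
    return precision, recall

# one fixed error, one missed error, one wrong fix:
print(precision_recall([('a', 'b', 'b'), ('x', 'x', 'y'), ('c', 'd', 'e')]))  # (0.5, 0.5)
```

Under this reading, a character changed to the wrong thing counts against precision, while an error left untouched counts against recall.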
-
- Feb 01, 2019
Maciej Sumalvico authored
`process` was renamed to `correct_string` in 2fc4eb20, but not in line 235. This commit fixes it.
-
- Jan 30, 2019
Maciej Sumalvico authored
- moved command-line argument parsing into a separate function
- moved computing each evaluation metric over all lines into a separate function, so that the logic of main() reduces to a simple three-way `if`
- increased the spacing between top-level declarations to two blank lines (PEP 8)
- added the -G parameter for providing the ground truth suffix
-
Maciej Sumalvico authored
Isolated some activities into subfunctions for better structure. Changed the logging level for showing input/output strings from "info" to "debug".
-
Maciej Sumalvico authored
- grouped the globals into two dictionaries: `gl_config` and `model`
- renamed `process` to `correct_string`
- renamed `load_model` to `build_model` (it does other things apart from loading)
- isolated some functionality from the `main` function:
  - parallel processing of strings -> `parallel_process`
  - printing results -> `print_results`
  - building the transducer composition, the flag encoder, and loading the LM transducers -> `build_model`
- minimized the availability of globals to increase readability and avoid bugs:
  - globals are only visible in `main()` and `correct_string()`, but not in any subfunctions that `main()` calls
  - instead of passing `args` (the argument parser) as a global, the dictionary `gl_config` is used, which contains only the values used by `correct_string()`
-
- Jan 29, 2019
Maciej Sumalvico authored
The model-building functions in sliding_window.py (load_transducers_*()) previously contained three kinds of functionality:
- loading transducers
- variant-specific combining of transducers into a single token acceptor
- variant-independent functionality, which was copy-pasted in all three functions (adding flags, converting a single token acceptor to a window, etc.)

This commit isolates the variant-independent functionality into smaller functions build_single_token_acceptor_*(), combines the variant-independent parts for all three variants in the function build_model(), and moves the loading of transducers outside of the `sliding_window` module. Furthermore:
- renamed process_test_data.load_transducers() to load_model()
-
Maciej Sumalvico authored
- isolated loading transducers into a separate function
- isolated preparing the composition of lexicon and model into a separate function
- moved process() before main()
- cleaned up commented-out code, old file names, etc.
-
- Jan 22, 2019
Maciej Sumalvico authored
-
- Jan 21, 2019
Maciej Sumalvico authored
- divided the body into smaller functions
- fixed (in line 140) a bug causing types that are identical in capitalized and uncapitalized form (like '—') to be counted twice
-
- Jan 18, 2019
Maciej Sumalvico authored
- refactored the reading of training data
-
Maciej Sumalvico authored
- isolated functionality into separate functions:
  - creating a single error transducer
  - combining error transducers
- fixed a bug causing only context = 3 to be considered (line 467 pre-commit, previously line 41 in error_transducer_complete.py)
- simplified transducer creation
-
Maciej Sumalvico authored
Merged `error_transducer_complete.py` into `error_transducer.py`, so that one module is responsible for training an error model.
-
Maciej Sumalvico authored
- isolated parse_arguments() as a separate function
- added a -G parameter (gt_suffix) instead of a fixed suffix
- removed some unnecessary comments
-
Maciej Sumalvico authored
process_dta_data.py was merged into lexicon_transducer.py so that only a single module is responsible for building the lexicon. The lexica are no longer saved as plaintext; this information can easily be obtained with `hfst-fst2strings -w`.

helper.py:
- the logarithm of frequencies is computed in the normalizing function, rather than during writing to file
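The frequency-normalization change might look like the following sketch (function and variable names are illustrative, not the actual helper.py API; the negative sign is an assumption, being the usual tropical-semiring convention for FST weights):

```python
import math

def normalize_frequencies(counts):
    """Turn raw counts into negative log-probabilities (tropical weights)."""
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}

weights = normalize_frequencies({'und': 900, 'vnd': 100})
print(round(weights['vnd'], 4))  # -log(0.1) = 2.3026
```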
-
- Jan 17, 2019
Maciej Sumalvico authored
- Model files are moved to a separate repository ('cor-asv-fst-models'). As a temporary solution, the directory 'hfst/fst' has to be linked to the location of the model repository so that the hard-coded paths to model files work.
- Changed the hard-coded model file names in process_test_data.py to match the names of the files created by the training scripts.
-
Maciej Sumalvico authored
- using os.path.join() instead of string concatenation
- removed (useless) trailing slashes from directory names

Minor changes:
- rename: x -> filename in helper.generate_content()
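The point of the change: os.path.join() inserts separators only where needed, so trailing slashes in configured directory names stop mattering. A tiny illustration:

```python
import os.path

name = 'lexicon.hfst'
# String concatenation depends on the caller remembering the slash:
print('hfst/fst' + name)                # 'hfst/fstlexicon.hfst' (broken)
# os.path.join() handles both variants:
print(os.path.join('hfst/fst', name))   # 'hfst/fst/lexicon.hfst'
print(os.path.join('hfst/fst/', name))  # 'hfst/fst/lexicon.hfst'
```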
-
Maciej Sumalvico authored
The character U+0364 (combining latin small letter e):
- is invisible in some terminal fonts,
- breaks syntax highlighting in Vim.
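The character can be inspected with the standard unicodedata module:

```python
import unicodedata

ch = '\u0364'
print(unicodedata.name(ch))           # COMBINING LATIN SMALL LETTER E
print(unicodedata.combining(ch) > 0)  # True: it attaches to the preceding character
print('u' + ch)                       # 'u' with a small e above, font permitting
```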
-
Maciej Sumalvico authored
- setup_spacy() and parse_arguments() as separate functions - more readable formatting
-
- Jan 16, 2019
Maciej Sumalvico authored
- Isolated parsing command-line arguments as a separate function.
- More readable formatting.
-
Maciej Sumalvico authored
- Refactoring: structured the body of merge_states() into smaller subfunctions.
- Crashes were caused by deleting transitions from the transducer while iterating over it. In the updated version, the transitions to delete are first identified and then deleted.
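The underlying pattern of the fix, sketched on a plain Python list rather than the actual HFST transition API: never delete from a container while iterating over it; collect the victims first, then delete.

```python
# Buggy variant: mutating while iterating skips elements or crashes in many APIs.
#   for t in transitions:
#       if is_bad(t): transitions.remove(t)

def prune(transitions, is_bad):
    """Two-phase removal: identify first, then delete."""
    to_delete = [t for t in transitions if is_bad(t)]
    for t in to_delete:
        transitions.remove(t)
    return transitions

print(prune([1, 2, 3, 4], lambda t: t % 2 == 0))  # [1, 3]
```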
-
Maciej Sumalvico authored
reason: easier debugging
-
- Dec 15, 2018
Robert Sachunsky authored
-
Robert Sachunsky authored
amend f25d826d7: in processor, when splitting input into symbols and flags, fix a new crash with decomposed/combining characters
-
Robert Sachunsky authored
-
Robert Sachunsky authored
replace the slow and memory/stack-devouring alignment.sequencealigner with the fast difflib.SequenceMatcher in evaluation too
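difflib.SequenceMatcher ships with the standard library and provides similar character alignments far more cheaply. A minimal illustration on two strings:

```python
from difflib import SequenceMatcher

ocr, gt = 'Vnd die Sonne', 'Und die Sonne'
matcher = SequenceMatcher(None, ocr, gt, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, repr(ocr[i1:i2]), '->', repr(gt[j1:j2]))
# replace 'V' -> 'U'
# equal 'nd die Sonne' -> 'nd die Sonne'
```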
-
- Dec 14, 2018
Robert Sachunsky authored
allow training error transducer on CSV file instead of directory, plus Python 2 compatibility and some Pylint cosmetics
-
- Nov 21, 2018
Robert Sachunsky authored
-
Robert Sachunsky authored
-
Robert Sachunsky authored
With REJECTION_WEIGHT set to -1, behave differently:
- pyComposition (if enabled) does no backoff_result()
- the union with the input (and the input weight mapping) is instead done afterwards in compose_and_search, but with a long vector of sensible thresholds, each applied alternatively to the same window result (i.e. without having to rerun everything)
- so all results up the call graph now have to be vectorized as well: compose_and_search, create_result_transducer, window_size_1_2, main/process; in the normal case, there is only one value in that vector
-
Robert Sachunsky authored
- catch and show exceptions among pool workers
- at the end, exit with failure if exceptions occurred
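One common way to surface worker exceptions from a process pool (a sketch of the pattern, not necessarily how this repository does it): let each worker return either its result or the caught exception, then fail at the end if any exception was seen.

```python
import sys

def safe_worker(func, arg):
    """Run func(arg), returning (result, None) or (None, exception)."""
    try:
        return func(arg), None
    except Exception as e:  # show, don't swallow
        return None, e

def run_all(func, args):
    # with a real pool: pool.starmap(safe_worker, [(func, a) for a in args])
    outcomes = [safe_worker(func, a) for a in args]
    errors = [e for _, e in outcomes if e is not None]
    for e in errors:
        print('worker failed:', e, file=sys.stderr)
    if errors:
        sys.exit(1)  # exit with failure if any worker raised
    return [r for r, _ in outcomes]

print(run_all(lambda x: 10 // x, [1, 2, 5]))  # [10, 5, 2]
```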
-
Robert Sachunsky authored
- normalize Unicode strings to normal form, and respect remaining decomposed characters when splitting input into symbols and flags (fixes crashes)
- amend 4ebb55e5: without OpenFST, as_transducer can still be True already
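The standard-library core of such a fix might look like this sketch (the processor's actual symbol/flag splitting is more involved): normalize first, then keep any residual combining marks glued to their base character instead of treating them as symbols of their own.

```python
import unicodedata

def split_symbols(text):
    """Split text into symbols, keeping combining marks with their base."""
    text = unicodedata.normalize('NFC', text)  # recompose where possible
    symbols = []
    for ch in text:
        if symbols and unicodedata.combining(ch):
            symbols[-1] += ch  # e.g. U+0364 stays glued to the letter before it
        else:
            symbols.append(ch)
    return symbols

print(split_symbols('a\u0308nder'))  # NFC recomposes: ['ä', 'n', 'd', 'e', 'r']
print(split_symbols('u\u0364ber'))   # no precomposed form: ['u\u0364', 'b', 'e', 'r']
```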
-
Robert Sachunsky authored
-
- Nov 16, 2018
Robert Sachunsky authored
during lexicon extraction, also add '/' as infix and suffix to Spacy's tokenizer
-
Robert Sachunsky authored
Fix transducer definitions:
- when repeating the lexicon transducer according to words_per_window, the last token takes a space character as well
- further repair the inter-word/LM lexicon model:
  - the last token also needs a flag acceptor (and a space)
  - edits deleting a space should delete the corresponding flag in this model too
-
Robert Sachunsky authored
- allow (large) input files with more than 1 line
- use generators (strip lines and split at newlines)
- prune the lexicon with a combined absolute (<= 3) and relative (< 1e-5) frequency threshold
- extend number normalization to numerals with decimal points and thousands separators
- normalize umlauts to always use the decomposed form with the diacritical combining e
- speed up by disabling the parser and NER in Spacy
- add '—' as an infix to Spacy's tokenizer
- add a CLI; make available as parameters: dictionary path, GT suffix
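The combined pruning threshold can be sketched like this (the function name is illustrative, and how exactly the two conditions combine is an assumption; this version keeps an entry only if it passes both):

```python
def prune_lexicon(freq, min_count=3, min_relative=1e-5):
    """Drop entries failing the absolute (count <= 3) or relative (< 1e-5) threshold."""
    total = sum(freq.values())
    return {word: count for word, count in freq.items()
            if count > min_count and count / total >= min_relative}

freq = {'und': 999000, 'der': 990, 'tpyo': 2, 'seltn': 8}
print(prune_lexicon(freq))  # {'und': 999000, 'der': 990}
```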
-
Robert Sachunsky authored
- when extending the lexicon transducer according to composition_depth, do not ignore upper/lower case completely, but ensure that non-first words are downcased (with infix/zero connection) or all upper case (with hyphen connection), and that first words are upcased or already upper case
- when extending the lexicon transducer with morphology, compose *after* compounds have been added
- when using the lexicon transducer, make sure to allow both precomposed umlauts and decomposed ones (with the diacritical combining e); also, ensure the final lexicon becomes just an acceptor
- when repeating the lexicon transducer according to words_per_window, use 1 to N instead of 0 to N (optionalized lexicon), but make sure the last (1) token has no space
- repair the previously defunct inter-word/LM lexicon model:
  - by stripping the initial space from the loaded punctuation_right_transducer
  - by correctly synchronizing on flags
-
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
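With the standard tempfile module, recognizable filename patterns plus guaranteed cleanup look roughly like this (a generic sketch of the pattern; the prefix and payload are made up, not the repository's actual wrapper):

```python
import os
import tempfile

fd, path = tempfile.mkstemp(prefix='cor-asv-fst-', suffix='.fst')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'...serialized transducer...')
    # hand `path` to the external OpenFST call here
finally:
    os.unlink(path)  # do not forget to clean up afterwards

print(os.path.exists(path))  # False
```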
-
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
-
Robert Sachunsky authored
-
Robert Sachunsky authored
when combining windows, search for the next existing flag instead of blindly assuming the next counting flag always remains in some path even after word merge (fixes crash)
-