- Dec 15, 2018
Robert Sachunsky authored
Robert Sachunsky authored
amend f25d826d7: in processor, when splitting input into symbols and flags, fix a new crash with decomposed/combining characters
Robert Sachunsky authored
Robert Sachunsky authored
replace slow and memory/stack-devouring alignment.sequencealigner by fast difflib.SequenceMatcher in evaluation too
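As a rough illustration of the switch mentioned above, a minimal sketch (not the project's code; the function name and CER formula are illustrative only) of counting character edits with difflib.SequenceMatcher:

    from difflib import SequenceMatcher

    def character_error_rate(ocr_line, gt_line):
        """Approximate CER of one line pair: edit operations derived from
        SequenceMatcher opcodes, divided by the ground-truth length."""
        matcher = SequenceMatcher(None, ocr_line, gt_line, autojunk=False)
        edits = 0
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != 'equal':
                # count replacements/insertions/deletions on character level
                edits += max(i2 - i1, j2 - j1)
        return edits / max(len(gt_line), 1)

    print(character_error_rate('Tbe qnick fox', 'The quick fox'))  # 2/13 ≈ 0.154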
- Dec 14, 2018
Robert Sachunsky authored
allow training error transducer on CSV file instead of directory, plus Python 2 compatibility and some Pylint cosmetics
- Nov 21, 2018
Robert Sachunsky authored
Robert Sachunsky authored
Robert Sachunsky authored
with REJECTION_WEIGHT set to -1, behave differently:
- pyComposition (if enabled) does no backoff_result()
- union with input (and input weight mapping) is instead done in compose_and_search afterwards, but with a long vector of sensible thresholds, each applied alternatively to the same window result (i.e. without having to rerun everything)
- so all results up the call graph now have to be vectorized as well: compose_and_search, create_result_transducer, window_size_1_2, main/process; in the normal case, there is only 1 value in that vector
Robert Sachunsky authored
- catch and show exceptions among pool workers
- exit with failure in the end if exceptions occurred
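A minimal sketch of the worker error handling described above, assuming a multiprocessing.Pool setup; safe_worker and do_process are hypothetical names:

    import logging
    import multiprocessing as mp
    import sys
    import traceback

    def do_process(task):
        return task.upper()                      # placeholder for the real per-task work

    def safe_worker(task):
        """Run one task; report exceptions as strings instead of losing them."""
        try:
            return task, do_process(task), None
        except Exception:
            return task, None, traceback.format_exc()

    if __name__ == '__main__':
        failed = False
        with mp.Pool(4) as pool:
            for task, result, error in pool.imap_unordered(safe_worker, ['a', 'b']):
                if error:
                    logging.error('task %s failed:\n%s', task, error)
                    failed = True
        sys.exit(1 if failed else 0)             # exit with failure if any worker raised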
Robert Sachunsky authored
- normalize Unicode strings to a normal form, and respect remaining decomposed characters when splitting input into symbols and flags (fixes crashes; see the sketch below)
- amend 4ebb55e5: without OpenFST, as_transducer can still be True already
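A sketch of the kind of symbol/flag splitting referred to in the first point; the '@...@' flag pattern and the function name are assumptions, not the module's actual API:

    import re
    import unicodedata

    FLAG = re.compile(r'@[^@]+@')    # assumed HFST-style flag diacritic pattern

    def split_symbols(text, form='NFC'):
        """Normalize, then split into symbols, keeping flag diacritics whole
        and attaching any remaining combining characters to their base."""
        text = unicodedata.normalize(form, text)
        symbols = []
        i = 0
        while i < len(text):
            match = FLAG.match(text, i)
            if match:                             # a whole flag is one symbol
                symbols.append(match.group())
                i = match.end()
                continue
            j = i + 1
            while j < len(text) and unicodedata.combining(text[j]):
                j += 1                            # keep combining marks attached
            symbols.append(text[i:j])
            i = j
        return symbols

    # the combining small e (U+0364) has no precomposed form and stays one symbol:
    print(split_symbols('Bu\u0364cher@N.A@'))     # ['B', 'uͤ', 'c', 'h', 'e', 'r', '@N.A@']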
Robert Sachunsky authored
- Nov 16, 2018
Robert Sachunsky authored
during lexicon extraction, also add '/' as infix and suffix to Spacy's tokenizer
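A sketch of that tokenizer change, assuming a spaCy pipeline such as de_core_news_sm (the commit does not name the model):

    import spacy
    from spacy.util import compile_infix_regex, compile_suffix_regex

    nlp = spacy.load('de_core_news_sm')

    infixes = tuple(nlp.Defaults.infixes) + (r'/',)
    suffixes = tuple(nlp.Defaults.suffixes) + (r'/',)
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

    print([t.text for t in nlp('Wort/Form und Ende/')])
    # expected: ['Wort', '/', 'Form', 'und', 'Ende', '/']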
Robert Sachunsky authored
fix transducer definitions:
- when repeating lexicon transducer according to words_per_window, the last token takes a space character as well
- further repair inter-word/LM lexicon model:
  - last token also needs a flag acceptor (and a space)
  - edits deleting a space should delete the corresponding flag in this model too
Robert Sachunsky authored
- allow (large) input files with more than 1 line
- use generators (strip lines and split at newline)
- prune lexicon with combined absolute (<=3) and relative (<1e-5) frequency threshold (see the sketch below)
- extend number normalization for numerals with decimal point and thousands separators
- normalize umlauts to always use decomposed form with diacritical combining e
- speed up by disabling parser and NER in Spacy
- add '—' as infix to Spacy's tokenizer
- add CLI, make available as parameters: dictionary path, GT suffix
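A sketch of the pruning step; how the absolute and relative thresholds combine is an assumption here (a form is dropped only when it is rare by both measures), and all names are illustrative:

    from collections import Counter

    def prune_lexicon(freqs, max_abs=3, max_rel=1e-5):
        """Drop word forms that are rare both absolutely and relatively."""
        total = sum(freqs.values())
        return {form: count for form, count in freqs.items()
                if not (count <= max_abs and count / total < max_rel)}

    freqs = Counter({'und': 900000, 'Haus': 120, 'Hvus': 2})
    print(prune_lexicon(freqs))    # the OCR artefact 'Hvus' is pruned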
Robert Sachunsky authored
- when extending lexicon transducer according to composition_depth, do not ignore upper/lower case completely, but ensure that non-first words are downcased (with infix/zero connection) or only upper case (with hyphen connection), and that first words are upcased or already upper case
- when extending lexicon transducer with morphology, compose *after* compounds were added
- when using lexicon transducer, make sure to allow both precomposed umlauts and decomposed ones (with diacritical combining e); also, ensure the final lexicon ends up as just an acceptor
- when repeating lexicon transducer according to words_per_window, use 1 to N instead of 0 to N (optionalized lexicon), but make sure the last (1) token has no space
- repair the previously defunct inter-word/LM lexicon model:
  - by stripping initial space from the loaded punctuation_right_transducer
  - by correctly synchronizing on flags
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
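A sketch of that pattern with Python's tempfile module: a recognizable prefix/suffix instead of fixed names in the CWD, and an unconditional unlink; the prefix and the backend call are placeholders:

    import os
    import tempfile

    def process_fst_file(path):
        return os.path.getsize(path)               # placeholder for the OpenFST call

    def with_temporary_fst(data, prefix='cor-window-', suffix='.fst'):
        """Write data to a uniquely named temporary file, hand it to the
        backend, and always unlink it afterwards."""
        fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
        try:
            with os.fdopen(fd, 'wb') as f:
                f.write(data)
            return process_fst_file(path)
        finally:
            os.unlink(path)                         # never leave the file behind

    print(with_temporary_fst(b'\x00' * 16))         # 16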
Robert Sachunsky authored
with temporary files as OpenFST interface, use sensible filename patterns, and do not forget to unlink afterwards
Robert Sachunsky authored
Robert Sachunsky authored
when combining windows, search for the next existing flag instead of blindly assuming next counting flag always remains in some path even after word merge (fixes crash)
Robert Sachunsky authored
Robert Sachunsky authored
Python-side create_input_transducer is still needed when composing without OpenFST backend (amend d22110f4)
Robert Sachunsky authored
in sliding_window's main, never prune away (all) results (backout from d22110f4 here) – hfst's prune has a documentation bug!
Robert Sachunsky authored
- Nov 14, 2018
Robert Sachunsky authored
- for the file-based interface between hfst and openfst, instead of fixed filenames in CWD (partially using the input string in the filename), use proper temporary filenames
- allow preparing input transducers both in Python and in C++, controlled via prepare_input(as_transducer)
- after loading the error transducer, ensure its alphabet / symbol table contains all special flags
- properly tokenize input strings using the special flags (both in Python, prepare_input / create_input_transducer, and in C++, create_input_transducer) -- fixes crashes due to edited flags
- for C++ create_input_transducer, use a SYMBOL StringCompiler with the current symbol table, override fst_field_separator to only use newlines; join on newlines before the call in Python
- delegate reweighting the input transducer with the rejection weight, disjoining that with the result transducer, and determinization to the C++ part if using the OpenFST backend
- always make the result transducer coaccessible/connected to ensure determinize will terminate (fixes crashes)
- make get_flag_states never repeat states (fixes crashes during merge)
- make merge_states faster and more robust
- add more German compound infixes for lexicon composition depth
- improve the C++ part:
  - rename functions sensibly
  - transparently encode/decode Python strings to/from UTF-8 byte strings
  - avoid dragging along parameters for transducers and files
  - use (safe) pointers for FSTs instead of copy-by-value, re-use mutable objects as often as possible
  - use static storage for buffers
  - use member variables for converters (StringCompiler, WeightConvertMapper)
  - use logging (verbosity can be controlled via loglevels on the CLI)
  - make logging and output more consistent, no input strings as filenames
WARNING: quality will be slightly better, but extremely slow; reason still unknown
Robert Sachunsky authored
- go back to strings instead of lists for n-grams and FST definition
- replace 0 as gap element by non-breaking space, ensure that this does not occur anywhere in input
- avoid slow and suboptimal FST definition from dictionary, instead iteratively disjoin n-gram pair tuples (SPV)
- also embed special flags for sliding window construction into the error transducer's alphabet / symbol list
Robert Sachunsky authored
- fix loop adding contexts
- use repeat_n_minus instead of optionalize+repeat_n
Robert Sachunsky authored
- Nov 06, 2018
Robert Sachunsky authored
- replace gap element 'ε' by unambiguous 0 throughout (requires using lists instead of strings between aligner and FST definition)
- to count edits, use only the 1-best alignment (ignore suboptimal ones), include well-aligned lines (up to 100% identical) but exclude pathological cases (less than 5% match), and also ignore empty lines (see the sketch below)
- replace slow and memory/stack-devouring alignment.sequencealigner by fast difflib.SequenceMatcher
- avoid string and file conversion (write+read) for the edit dictionary (just optionally write a human-readable file)
- fix and simplify the no-punctuation filter
- add CLI, also make available as arguments:
  - maximum context size per edit
  - maximum number of errors per window
  - whether or not to preserve punctuation
WARNING: the error models created by this appear to perform worse than the original ones still registered in the repo
commented out because counter-effective:
- combine simple error transducers by also disjoining full error transducer N with full error transducer N-1 (where N=0 is the acceptor)
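A simplified sketch of that edit counting (no context window, no punctuation handling); the function name and thresholds are illustrative:

    from collections import Counter
    from difflib import SequenceMatcher

    def collect_edits(line_pairs, min_ratio=0.05):
        """Count character confusions from the 1-best difflib alignment,
        skipping empty lines and pathological pairs below min_ratio."""
        edits = Counter()
        for ocr, gt in line_pairs:
            if not ocr.strip() or not gt.strip():
                continue                                  # ignore empty lines
            matcher = SequenceMatcher(None, ocr, gt, autojunk=False)
            if matcher.ratio() < min_ratio:
                continue                                  # pathological alignment
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op != 'equal':
                    edits[(ocr[i1:i2], gt[j1:j2])] += 1   # confusion pair
        return edits

    print(collect_edits([('Tbe Hause', 'The Hause'), ('', 'x')]))
    # Counter({('b', 'h'): 1})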
Robert Sachunsky authored
- pass the value of the CLI option for the rejection threshold on to set_transition_weights (via a global constant, in lieu of better encapsulation)
- resurrect the option to apply a language model for rescoring (slows down significantly and deteriorates results for now), add a CLI option
Robert Sachunsky authored
Robert Sachunsky authored
- resurrect inter-word and preserve model here as well
- resurrect application of LM
- improve CLI, make available as arguments:
  - punctuation model (bracket vs inter-word/LM vs preserve)
  - words per window
  - number of result paths per window
  - lexicon composition depth
- reduce transition weights for input transducer (joined to result transducer), acting as rejection threshold: significantly reduces overcorrection
Robert Sachunsky authored
- Nov 05, 2018
Robert Sachunsky authored
- Nov 04, 2018
Robert Sachunsky authored
Robert Sachunsky authored
Robert Sachunsky authored
- add CLI, make available as arguments: directory path, OCR and correction suffix, window size, composition depth, result paths (see the sketch below)
- give user full control of the prefix (do not assume .txt)
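A sketch of such a CLI with argparse; whether this script actually uses argparse, and the exact option names and defaults, are assumptions:

    import argparse

    parser = argparse.ArgumentParser(description='window-based OCR post-correction')
    parser.add_argument('directory', help='path containing OCR and correction files')
    parser.add_argument('-I', '--input-suffix', default='ocr', help='suffix of OCR files')
    parser.add_argument('-O', '--output-suffix', default='cor', help='suffix of correction files')
    parser.add_argument('-W', '--window-size', type=int, default=2, help='words per window')
    parser.add_argument('-D', '--composition-depth', type=int, default=1, help='lexicon composition depth')
    parser.add_argument('-R', '--result-paths', type=int, default=10, help='result paths per window')

    print(parser.parse_args(['some/dir', '-W', '3']))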
Robert Sachunsky authored
- use zero as gap element (to avoid confusion with true ε)
- use StrictGlobalSequenceAligner (to make sure the complete sequence gets used)
- make extra metric for umlauts (confused with diacritical combining e) actually work (proper state changes, no length adjustment)
- simplify micro-averaged CER calculation (edit and length counts; see the sketch below)
- add CLI, make available as arguments: directory path, OCR and correction suffix
- give user full control of the prefix (do not assume .txt)
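On the micro-averaged CER mentioned above: the rate is computed from summed edit and length counts over all lines rather than as a mean of per-line rates. A small, aligner-agnostic sketch (names illustrative):

    def micro_cer(per_line_counts):
        """per_line_counts: iterable of (edit_count, reference_length) pairs."""
        per_line_counts = list(per_line_counts)
        total_edits = sum(edits for edits, _ in per_line_counts)
        total_length = sum(length for _, length in per_line_counts)
        return total_edits / total_length if total_length else 0.0

    print(micro_cer([(2, 13), (0, 7), (5, 40)]))    # 7/60 ≈ 0.117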
Robert Sachunsky authored
- Nov 01, 2018
Lena Schiffer authored
- Oct 01, 2018
Lena Schiffer authored