  1. Dec 15, 2018
  2. Dec 14, 2018
  3. Nov 21, 2018
  4. Nov 16, 2018
  5. Nov 14, 2018
    • fixed and improved FST processor: · e94e2119
      Robert Sachunsky authored
      - for file-based interface between hfst and openfst,
        instead of fixed filenames in CWD (partially using
        the input string in the filename),
        use proper temporary filenames
      - allow preparing input transducers either in Python or in C++,
        control via prepare_input(as_transducer)
      - after loading error transducer, ensure its alphabet /
        symbol table contains all special flags
      - properly tokenize input strings using the special flags
        (both in Python, prepare_input / create_input_transducer,
         and in C++, create_input_transducer) --
        fixes crashes due to edited flags
      - for C++ create_input_transducer, use a SYMBOL StringCompiler
        with the current symbol table, override fst_field_separator
        to only use newlines; join on newlines before call in Python
      - delegate reweighting input transducer with rejection weight,
        disjoining that with result transducer, and determinization
        to C++ part if using OpenFST backend
      - always make result transducer coaccessible/connected to ensure
        determinize will terminate,
        fixes crashes
      - make get_flag_states never repeat states,
        fixes crashes during merge
      - make merge_states faster and more robust
      - add more German compound infixes for lexicon composition depth
      - improve C++ part:
        - rename functions sensibly
        - transparently encode/decode python strings to/from
          UTF-8 byte strings
        - avoid dragging along parameters for transducers and files
        - use (safe) pointers for FSTs instead of copy-by-value,
          re-use mutable objects as often as possible
        - use static storage for buffers
        - use member variables for converters (StringCompiler, WeightConvertMapper)
      - use logging (verbosity can be controlled via loglevels on CLI)
      - make logging and output more consistent, no input strings
        as filenames
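
The move from fixed filenames in the CWD to temporary filenames could be sketched with Python's standard `tempfile` module as below. This is only an illustration under assumptions: the helper name `write_temp_fst` and the placeholder bytes are hypothetical and do not come from the commit, which exchanges real serialized transducers between hfst and openfst.

```python
import os
import tempfile


def write_temp_fst(data: bytes, suffix: str = ".fst") -> str:
    """Write bytes to a uniquely named temporary file and return its path.

    Hypothetical helper: stands in for dumping a serialized transducer.
    """
    # delete=False keeps the file on disk after closing,
    # so that a second library can open it by path.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(data)
        return handle.name


# Placeholder payload, not a real transducer.
path = write_temp_fst(b"placeholder bytes, not a real FST")
try:
    # The consumer side reads the file back by its path.
    with open(path, "rb") as handle:
        payload = handle.read()
finally:
    # Clean up explicitly, since delete=False disables automatic removal.
    os.remove(path)
```

Because the name is generated by the operating system, it cannot collide with other runs and never depends on the input string.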
      
      WARNING: quality will be slightly better, but processing is extremely slow;
               the reason is still unknown
      e94e2119
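
Tokenizing input strings so that multi-character special flags stay intact, then joining the symbols on newlines, might look roughly like this in Python. The flag symbols in `FLAGS` and the function name `tokenize` are assumptions made for illustration; the actual flag inventory and the `prepare_input` / `create_input_transducer` code are not shown in this log.

```python
import re

# Hypothetical flag symbols; the real inventory lives in the repository.
FLAGS = ["@N.A@", "@N.B@"]


def tokenize(text, flags=FLAGS):
    """Split text into symbols, keeping each flag as a single symbol."""
    # Try longer flags first, so that a flag which is a prefix
    # of another flag cannot shadow it.
    ordered = sorted(flags, key=len, reverse=True)
    # Fall back to one arbitrary character when no flag matches.
    alternatives = [re.escape(flag) for flag in ordered] + ["."]
    pattern = re.compile("|".join(alternatives), re.DOTALL)
    return pattern.findall(text)


tokens = tokenize("ab@N.A@c")
# One symbol per line, as input for a newline-separated symbol compiler.
symbol_lines = "\n".join(tokens)
```

A plain character-by-character split would break `@N.A@` into five symbols, which is the kind of mismatch the symbol-based compilation is meant to avoid.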
    • accelerate creation of error transducers: · 200030c4
      Robert Sachunsky authored
      - go back to strings instead of lists for n-grams and
        FST definition
      - replace 0 as gap element by non-breaking space,
        ensure that this does not occur anywhere in input
      - avoid slow and suboptimal FST definition from dictionary,
        instead iteratively disjoin n-gram pair tuples (SPV)
      - also embed special flags for sliding window construction
        into the error transducers' alphabet / symbol list
      200030c4
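
Using strings with a non-breaking space as gap element, n-gram pairs can be cut from two aligned strings by plain slicing. The sketch below is an assumption about how that could look, not the code from the commit; the names `GAP`, `check_no_gap` and `ngram_pairs` are hypothetical.

```python
GAP = "\u00a0"  # non-breaking space as gap element


def check_no_gap(line):
    """Refuse raw input that already contains the gap element."""
    if GAP in line:
        raise ValueError("input must not contain the gap element")
    return line


def ngram_pairs(aligned_source, aligned_target, n):
    """Return all n-gram pairs of two aligned, equally long strings."""
    if len(aligned_source) != len(aligned_target):
        raise ValueError("aligned strings must have equal length")
    return [
        (aligned_source[i:i + n], aligned_target[i:i + n])
        for i in range(len(aligned_source) - n + 1)
    ]


# Aligned pair in which the target lacks the 'b' of the source.
source = check_no_gap("abc")
target = "a" + GAP + "c"
pairs = ngram_pairs(source, target, 2)
```

Since every gap is exactly one character wide, both sides of each pair keep the same length, which is what makes strings usable here instead of lists.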
    • repair creation of error transducers: · 276db049
      Robert Sachunsky authored
      - fix loop adding contexts
      - use repeat_n_minus instead of optionalize+repeat_n
      276db049
    • add dot-based graph drawer script · 110f4998
      Robert Sachunsky authored
      110f4998
  6. Nov 06, 2018
    • improve creation of error transducers: · d573abb2
      Robert Sachunsky authored
      - replace gap element 'ε' by unambiguous 0 throughout
        (requires using lists instead of strings between aligner and
         FST definition)
      - to count edits, use only 1-best alignment (ignore suboptimal ones),
        include well-aligned lines (up to 100% identical) but
        exclude pathological cases (less than 5% match), also
        ignore empty lines
      - replace slow and memory/stack-devouring alignment.sequencealigner
        by fast difflib.SequenceMatcher
      - avoid string and file conversion (write+read) for edit dictionary
        (just optionally write a human-readable file)
      - fix and simplify no-punctuation filter
      - add CLI, also make available as arguments:
        - maximum context size per edit,
        - maximum number of errors per window,
        - whether or not to preserve punctuation
      
      WARNING: the error models created by that appear to perform worse than the
      original ones still registered in the repo
      
      commented out because counterproductive:
      - combine simple error transducers by also disjoining full error transducer N
        with full error transducer N-1 (where N=0 is acceptor)
      d573abb2
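
A 1-best alignment with `difflib.SequenceMatcher`, using 0 as gap element and lists of pairs as output, could be sketched as follows. The function name `align`, its parameters, and the use of `ratio()` as the measure for the 5% cut-off are assumptions; the commit does not say how the match percentage is computed.

```python
from difflib import SequenceMatcher

GAP = 0  # unambiguous gap element, cannot be confused with a character


def align(source_line, target_line, min_ratio=0.05):
    """Return a list of (source, target) symbol pairs, or None if skipped."""
    # Ignore empty lines.
    if not source_line or not target_line:
        return None
    matcher = SequenceMatcher(None, source_line, target_line, autojunk=False)
    # Exclude pathological cases with almost no match.
    if matcher.ratio() < min_ratio:
        return None
    pairs = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = list(source_line[i1:i2])
        right = list(target_line[j1:j2])
        # Pad the shorter side with gaps, so both sides stay aligned.
        width = max(len(left), len(right))
        left += [GAP] * (width - len(left))
        right += [GAP] * (width - len(right))
        pairs.extend(zip(left, right))
    return pairs


aligned = align("abd", "abcd")
```

Identical lines are kept (every pair then maps a symbol to itself), while empty lines and lines below the threshold yield `None` and can be dropped before counting edits.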
    • make rejection threshold available on CLI, resurrect LM · fb4d33f3
      Robert Sachunsky authored
      - pass value of CLI option for rejection threshold on
        to set_transition_weights (via global constant, in lieu
        of better encapsulation)
      - resurrect option to apply language model for rescoring,
        (slows down significantly and deteriorates results for now),
        add CLI option
      fb4d33f3
    • improve one-shot CLI, improve rejection threshold · 58127ce0
      Robert Sachunsky authored
      - resurrect inter-word and preserve model here as well
      - resurrect application of LM
      - improve CLI, make available as arguments:
        - punctuation model (bracket vs inter-word/LM vs preserve)
        - words per window
        - number of result paths per window
        - lexicon composition depth
      - reduce transition weights for input transducer (joined to result transducer),
        acting as rejection threshold: significantly reduces overcorrection
      58127ce0
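
The effect of the input transducer's transition weights as a rejection threshold can be illustrated without any FST library. The sketch below assumes costs that add up along a path and a best path with minimal cost (as in the tropical semiring); `choose_output`, its arguments and the example costs are hypothetical and only mimic the union of input and result transducer.

```python
def choose_output(input_string, corrections, rejection_weight):
    """Pick the cheapest candidate among the input and its corrections.

    corrections: list of (string, cost) pairs, lower cost is better.
    rejection_weight: cost per transition of the unchanged input.
    """
    # The input path has one weighted transition per symbol.
    input_cost = rejection_weight * len(input_string)
    candidates = [(input_cost, input_string)]
    candidates += [(cost, string) for string, cost in corrections]
    # On equal cost the input wins, because it comes first.
    return min(candidates, key=lambda candidate: candidate[0])[1]


# High rejection weight: the correction is cheaper and gets accepted.
accepted = choose_output("teh", [("the", 2.0)], rejection_weight=1.0)
# Reduced rejection weight: the input is cheaper, the correction is rejected.
rejected = choose_output("teh", [("the", 2.0)], rejection_weight=0.5)
```

Lowering the per-transition weight makes keeping the input cheaper, so fewer corrections win, which matches the reduced overcorrection reported above.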
  7. Nov 05, 2018
  8. Nov 04, 2018
  9. Nov 01, 2018
  10. Oct 01, 2018