HFST - Helsinki Finite-State Transducer Technology - Python API
version 3.12.2
A transducer maps strings into strings. Strings are tokenized (i.e. divided) into symbols. Each transition in a transducer has an input symbol and an output symbol. If the input symbol of a transition matches a symbol of an input string, that symbol is consumed and the output symbol of the transition is produced.
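The matching rule above can be illustrated with a toy model in plain Python. This is only a sketch and not part of the HFST API: the function name apply_transducer and the tuple layout (source, target, input symbol, output symbol) are assumptions made for the illustration, and only ordinary symbols are handled (no epsilon, unknown or identity).

```python
from collections import defaultdict


def apply_transducer(transitions, finals, symbols, start=0):
    """Return every output symbol sequence a toy transducer produces.

    transitions: iterable of (source, target, input_symbol, output_symbol)
    finals:      set of final states
    symbols:     the already tokenized input string, as a list of symbols
    Hypothetical helper for illustration only; not an HFST function.
    """
    by_source = defaultdict(list)
    for source, target, in_sym, out_sym in transitions:
        by_source[source].append((target, in_sym, out_sym))

    results = []
    # Each agenda item is (state, position in the input, output so far).
    agenda = [(start, 0, [])]
    while agenda:
        state, pos, output = agenda.pop()
        if pos == len(symbols):
            # The whole input is consumed; accept only in a final state.
            if state in finals:
                results.append(output)
            continue
        for target, in_sym, out_sym in by_source[state]:
            # A transition applies when its input symbol matches the next
            # input symbol; the symbol is consumed and the output produced.
            if in_sym == symbols[pos]:
                agenda.append((target, pos + 1, output + [out_sym]))
    return sorted(results)


transitions = [(0, 1, "c", "d"), (1, 2, "a", "o"), (2, 3, "t", "g")]
print(apply_transducer(transitions, {3}, ["c", "a", "t"]))  # [['d', 'o', 'g']]
```

Note that the input is a list of symbols rather than a raw string, because a symbol may consist of several characters (e.g. "foo").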
There are some special symbols: the epsilon, the unknown and the identity. Epsilon on the input side consumes no symbol; epsilon on the output side produces no symbol. The unknown on the input side matches any symbol; the unknown on the output side produces any symbol. If the unknown is on both sides of a transition, it matches any symbol and produces any symbol other than the one that was matched on the input side. The identity matches any symbol and produces the same symbol; it must always occur on both sides of a transition. There is also a class of special symbols called flag diacritics. They are of the form
@[PNDRCU][.][A-Z]+([.][A-Z]+)?@
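As a small illustration, the pattern above can be used as-is with Python's standard re module to test whether a symbol has the shape of a flag diacritic. The helper name is_flag_diacritic is hypothetical and not an HFST function; the check follows only the pattern given here.

```python
import re

# The flag diacritic pattern exactly as given above.
FLAG_DIACRITIC = re.compile(r"@[PNDRCU][.][A-Z]+([.][A-Z]+)?@")


def is_flag_diacritic(symbol):
    """Return True if the whole symbol matches the flag diacritic pattern."""
    return FLAG_DIACRITIC.fullmatch(symbol) is not None


print(is_flag_diacritic("@P.CASE.NOM@"))        # True
print(is_flag_diacritic("@U.NUMBER@"))          # True
print(is_flag_diacritic("@_EPSILON_SYMBOL_@"))  # False
```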
The internal string representation for epsilon is "@_EPSILON_SYMBOL_@" (hfst.EPSILON), for unknown "@_UNKNOWN_SYMBOL_@" (hfst.UNKNOWN) and for identity "@_IDENTITY_SYMBOL_@" (hfst.IDENTITY). These strings are used when referring to those symbols in individual transitions, e.g.
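    import hfst

    fsm = hfst.HfstBasicTransducer()
    fsm.add_state(1)
    fsm.add_state(2)
    fsm.set_final_weight(2, 0.5)
    # epsilon on the input side, unknown on the output side
    fsm.add_transition(0, 1, hfst.EPSILON, hfst.UNKNOWN)
    # identity must occur on both sides
    fsm.add_transition(1, 2, hfst.IDENTITY, hfst.IDENTITY)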
or reading and printing transitions in AT&T format:
    0 1 @_EPSILON_SYMBOL_@ @_UNKNOWN_SYMBOL_@ 0.0
    1 2 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.0
    2 0.5
There is also a shorter string for epsilon in AT&T format, "@0@".
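To make the layout of the AT&T text concrete, here is a minimal plain-Python reader. It is a sketch, not HFST's own reader: the function name parse_att is hypothetical, fields are assumed to be separated by whitespace, symbols are assumed to contain no whitespace, and a missing weight is assumed to mean 0.0. Transition lines carry source state, target state, input symbol, output symbol and weight; a final-state line carries the state and its weight.

```python
EPSILON = "@_EPSILON_SYMBOL_@"  # internal string for epsilon, as given above


def parse_att(text):
    """Parse AT&T-style text into (transitions, finals).

    transitions: list of (source, target, input_symbol, output_symbol, weight)
    finals:      dict mapping a final state to its weight
    Hypothetical helper for illustration only; not an HFST function.
    """
    transitions = []
    finals = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) in (1, 2):
            # Final state line: state [weight]
            finals[int(fields[0])] = float(fields[1]) if len(fields) == 2 else 0.0
        elif len(fields) in (4, 5):
            # Transition line: source target input output [weight]
            source, target = int(fields[0]), int(fields[1])
            # "@0@" is the shorter spelling of epsilon.
            in_sym, out_sym = (EPSILON if s == "@0@" else s for s in fields[2:4])
            weight = float(fields[4]) if len(fields) == 5 else 0.0
            transitions.append((source, target, in_sym, out_sym, weight))
        else:
            raise ValueError("unexpected number of fields: %r" % line)
    return transitions, finals


ATT_TEXT = (
    "0 1 @0@ @_UNKNOWN_SYMBOL_@ 0.0\n"
    "1 2 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.0\n"
    "2 0.5\n"
)
transitions, finals = parse_att(ATT_TEXT)
print(finals)  # {2: 0.5}
```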
The syntax of regular expressions follows the Xerox formalism, which uses different symbols: "0" for epsilon, and "?" for both unknown and identity. On only one side of a transition, "?" means the unknown. As a single symbol, "?" means an identity-to-identity transition. On both sides of a transition, "?:?" means the combination of unknown-to-unknown AND identity-to-identity transitions. If only the unknown-to-unknown transition is needed, it can be given as the subtraction [?:? - ?]. Some examples:
    hfst.regex('0:foo')    # epsilon to "foo"
    hfst.regex('foo:0')    # "foo" to epsilon
    hfst.regex('?:foo')    # any symbol to "foo"
    hfst.regex('foo:?')    # "foo" to any symbol
    hfst.regex('?:?')      # any symbol to any symbol
    hfst.regex('?')        # any symbol to the same symbol
    hfst.regex('?:? - ?')  # any symbol to any other symbol
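The correspondence between the Xerox notation and the internal symbols can be summarized with a small plain-Python sketch. It is not how HFST compiles regular expressions: the function name xerox_label_to_pairs is hypothetical, only a single label such as 'a', 'a:b', '?' or '0:foo' is handled, quoting and escaping are ignored, and so is the later expansion of "?" with known symbols.

```python
EPSILON = "@_EPSILON_SYMBOL_@"
UNKNOWN = "@_UNKNOWN_SYMBOL_@"
IDENTITY = "@_IDENTITY_SYMBOL_@"


def xerox_label_to_pairs(label):
    """Map one Xerox-style label to a list of (input, output) symbol pairs.

    Hypothetical helper for illustration only; not an HFST function.
    """
    if ":" not in label:
        if label == "?":
            # A single "?" is an identity-to-identity transition.
            return [(IDENTITY, IDENTITY)]
        symbol = EPSILON if label == "0" else label
        return [(symbol, symbol)]

    left, right = label.split(":", 1)
    if left == "?" and right == "?":
        # "?:?" combines unknown-to-unknown and identity-to-identity.
        return [(UNKNOWN, UNKNOWN), (IDENTITY, IDENTITY)]

    def side(symbol):
        # On one side only, "?" is the unknown and "0" is epsilon.
        if symbol == "?":
            return UNKNOWN
        if symbol == "0":
            return EPSILON
        return symbol

    return [(side(left), side(right))]


print(xerox_label_to_pairs("0:foo"))
print(xerox_label_to_pairs("?:?"))
```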
Note that unknowns and identities are expanded with the symbols that the expression becomes aware of during its compilation:
    hfst.regex('?')            # equal to [?]
    hfst.regex('? foo')        # equal to [[?|foo] foo]
    hfst.regex('? foo bar:?')  # equal to [[?|foo|bar] foo [bar:?|bar:bar|bar:foo]]
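The expansion shown in these examples can be mimicked with a short plain-Python sketch. It only reproduces the pattern visible above and makes no claim about HFST's internals: the function name expand_label is hypothetical, the case '?:x' is assumed to behave symmetrically to 'x:?', and '?:?' is deliberately left out.

```python
def expand_label(label, known_symbols):
    """Expand "?" in a single label with the symbols known so far.

    Returns the alternatives as a list of label strings.
    Hypothetical helper for illustration only; not an HFST function.
    """
    known = sorted(known_symbols)
    if label == "?":
        # Identity: the original "?" plus each known symbol.
        return ["?"] + known
    if ":" in label:
        left, right = label.split(":", 1)
        if left == "?" and right == "?":
            raise ValueError("?:? is not covered by this sketch")
        if right == "?":
            return [label] + ["%s:%s" % (left, s) for s in known]
        if left == "?":
            # Assumption: symmetric to the 'x:?' case.
            return [label] + ["%s:%s" % (s, right) for s in known]
    return [label]


print(expand_label("?", {"foo", "bar"}))      # ['?', 'bar', 'foo']
print(expand_label("bar:?", {"foo", "bar"}))  # ['bar:?', 'bar:bar', 'bar:foo']
```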
In lookup, the epsilon is printed as an empty string, and unknowns and identities as the symbols that they are matched with:
    >>> tr = hfst.regex('foo:0 bar:? ?')
    >>> print(tr.lookup('foobara'))
    (('bara', 0.0), ('fooa', 0.0))
In extract_paths, epsilon, unknown and identity are printed as such:
    >>> tr = hfst.regex('foo:0 bar:? ?')
    >>> print(tr.extract_paths())
    {'foobar@_IDENTITY_SYMBOL_@': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@@_IDENTITY_SYMBOL_@', 0.0),
                                   ('@_EPSILON_SYMBOL_@bar@_IDENTITY_SYMBOL_@', 0.0),
                                   ('@_EPSILON_SYMBOL_@foo@_IDENTITY_SYMBOL_@', 0.0)],
     'foobarfoo': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@foo', 0.0),
                   ('@_EPSILON_SYMBOL_@barfoo', 0.0),
                   ('@_EPSILON_SYMBOL_@foofoo', 0.0)],
     'foobarbar': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@bar', 0.0),
                   ('@_EPSILON_SYMBOL_@barbar', 0.0),
                   ('@_EPSILON_SYMBOL_@foobar', 0.0)]}
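Because the internal strings make such output hard to read, a small post-processing step can help when inspecting paths by eye. The following plain-Python sketch assumes a result shaped like the dictionary above (input string mapped to a list of (output string, weight) tuples). The function name readable_paths is hypothetical, and rendering both unknown and identity as "?" is a display choice that loses the distinction between them.

```python
EPSILON = "@_EPSILON_SYMBOL_@"
UNKNOWN = "@_UNKNOWN_SYMBOL_@"
IDENTITY = "@_IDENTITY_SYMBOL_@"


def readable_paths(paths):
    """Return a copy of an extract_paths-style dict with special symbols
    shortened: epsilon becomes the empty string, unknown and identity "?".

    Hypothetical helper for illustration only; not an HFST function.
    """
    def clean(text):
        return (text.replace(EPSILON, "")
                    .replace(UNKNOWN, "?")
                    .replace(IDENTITY, "?"))

    result = {}
    for input_string, outputs in paths.items():
        # Different inputs may look the same after cleaning, so merge them.
        cleaned = result.setdefault(clean(input_string), [])
        cleaned.extend((clean(out), weight) for out, weight in outputs)
    return result


sample = {
    "foobarfoo": [
        ("@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@foo", 0.0),
        ("@_EPSILON_SYMBOL_@barfoo", 0.0),
    ],
}
print(readable_paths(sample))  # {'foobarfoo': [('?foo', 0.0), ('barfoo', 0.0)]}
```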