HFST - Helsinki Finite-State Transducer Technology - Python API
version 3.12.1
HFST API for Python.
Namespaces
exceptions | exceptions...
sfst_rules | rules...
xerox_rules | Xerox-type replace rules.
Classes
class | AttReader | A class for reading input in AT&T text format and converting it into transducer(s).
class | HfstBasicTransducer | A simple transducer class with tropical weights.
class | HfstBasicTransition | A transition class that consists of a target state, input and output symbols and a tropical weight.
class | HfstInputStream | A stream for reading HFST binary transducers.
class | HfstOutputStream | A stream for writing binary transducers.
class | HfstTokenizer | A tokenizer for creating transducers from UTF-8 strings.
class | HfstTransducer | A synchronous finite-state transducer.
class | ImplementationType | Back-end implementations.
class | MultiCharSymbolTrie | TODO: documentation
class | PmatchContainer | A class for performing pattern matching.
class | PrologReader | A class for reading input in Prolog text format and converting it into transducer(s).
class | XreCompiler | A regular expression compiler.
Functions
def | compile_lexc_file | Compile lexc file filename into a transducer.
def | compile_pmatch_expression | Compile a pmatch expression into a tuple of transducers.
def | compile_pmatch_file | Compile pmatch expressions as defined in filename and return a tuple of transducers.
def | compile_xfst_file | Compile (run) xfst file filename.
def | empty_fst | Get an empty transducer.
def | epsilon_fst | Get an epsilon transducer.
def | fst | Get a transducer that recognizes one or more paths.
def | fst_type_to_string | Get a string representation of transducer implementation type type.
def | get_default_fst_type | Get default transducer implementation type.
def | is_diacritic | Whether symbol symbol is a flag diacritic.
def | read_att_input | Read AT&T input from the user and return a transducer.
def | read_att_string | Read a multiline string att and return a transducer.
def | read_att_transducer | Read the next transducer from the AT&T file pointed to by f.
def | read_prolog_transducer |
def | regex | Get a transducer as defined by regular expression regexp.
def | set_default_fst_type | Set the default implementation type.
def | start_xfst | Start interactive xfst compiler.
def | tokenized_fst | Get a transducer that recognizes the concatenation of symbols or symbol pairs in arg.
Variables
string | EPSILON = '@_EPSILON_SYMBOL_@' | The string for epsilon symbol.
string | IDENTITY = '@_IDENTITY_SYMBOL_@' | The string for identity symbol.
string | UNKNOWN = '@_UNKNOWN_SYMBOL_@' | The string for unknown symbol.
def hfst.compile_lexc_file(filename, kwargs)
Compile lexc file filename into a transducer.
filename | The name of the lexc file. |
kwargs | Arguments recognized are: verbosity, with_flags, output. |
verbosity | The verbosity of the compiler, defaults to 0 (silent). Possible values are: 0, 1, 2. |
with_flags | Whether lexc flags are used when compiling, defaults to False. |
output | Where output is printed. Possible values are sys.stdout, sys.stderr and a StringIO; sys.stderr is presumably the default. |
def hfst.compile_pmatch_expression(expr)
Compile a pmatch expression into a tuple of transducers.
expr | A string defining how pmatch is done. |
def hfst.compile_pmatch_file(filename)
Compile pmatch expressions as defined in filename and return a tuple of transducers.
An example:
If we have a file named streets.txt that contains:
    define CapWord UppercaseAlpha Alpha* ;
    define StreetWordFr [{avenue} | {boulevard} | {rue}] ;
    define DeFr [ [{de} | {du} | {des} | {de la}] Whitespace ] | [{d'} | {l'}] ;
    define StreetFr StreetWordFr (Whitespace DeFr) CapWord+ ;
    regex StreetFr EndTag(FrenchStreetName) ;
we can run:
    import hfst

    defs = hfst.compile_pmatch_file('streets.txt')
    # The container is named cont so that the assertion below refers to it.
    cont = hfst.PmatchContainer(defs)
    assert cont.match("Je marche seul dans l'avenue des Ternes.") == "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
def hfst.compile_xfst_file(filename, kwargs)
Compile (run) xfst file filename.
filename | The name of the xfst file. |
kwargs | Arguments recognized are: verbosity, quit_on_fail, output, type. |
verbosity | The verbosity of the compiler, defaults to 0 (silent). Possible values are: 0, 1, 2. |
quit_on_fail | Whether the script is exited on any error, defaults to True. |
output | Where output is printed. Possible values are sys.stdout, sys.stderr and a StringIO; sys.stderr is presumably the default. |
type | Implementation type of the compiler, defaults to hfst.get_default_fst_type(). |
def hfst.empty_fst()
Get an empty transducer.
An empty transducer has one state that is not final, so it does not recognize any string.
def hfst.epsilon_fst(weight = 0)
Get an epsilon transducer.
weight | The weight of the final state. |
An epsilon transducer has one state that is final (with final weight weight), i.e. it recognizes the empty string.
def hfst.fst(arg)
Get a transducer that recognizes one or more paths.
arg | See example below |
Possible inputs:
One unweighted identity path:
    'foo' -> [f o o]
Weighted path: a tuple of string and number, e.g.
    ('foo',1.4)
    ('bar',-3)
    ('baz',0)
Several paths: a list or a tuple of paths and/or weighted paths, e.g.
    ['foo', 'bar']
    ('foo', ('bar',5.0))
    ('foo', ('bar',5.0), 'baz', 'Foo', ('Bar',2.4))
    [('foo',-1), ('bar',0), ('baz',3.5)]
A dictionary mapping strings to any of the above cases:
    {'foo':'foo', 'bar':('foo',1.4), 'baz':(('foo',-1),'BAZ')}
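The input shapes listed above can be read as a flat list of (input, output, weight) paths. The following is a minimal pure-Python sketch of that reading, not the library's implementation: expand_paths and its helpers are hypothetical names, and the weight 0.0 for unweighted paths is an assumption based on the tropical weights used elsewhere in this API.

```python
from numbers import Real


def _is_weighted(item):
    """True for a (string, number) tuple such as ('foo', 1.4)."""
    return (isinstance(item, tuple) and len(item) == 2
            and isinstance(item[0], str)
            and isinstance(item[1], Real)
            and not isinstance(item[1], bool))


def _outputs(value):
    """Expand one value into a list of (string, weight) pairs."""
    if isinstance(value, str):
        return [(value, 0.0)]          # assumption: unweighted means weight 0
    if _is_weighted(value):
        return [(value[0], float(value[1]))]
    if isinstance(value, (list, tuple)):
        result = []
        for item in value:             # several paths, possibly nested
            result.extend(_outputs(item))
        return result
    raise TypeError('unsupported path description: %r' % (value,))


def expand_paths(arg):
    """Return a list of (input, output, weight) triples (hypothetical helper)."""
    if isinstance(arg, dict):
        # Each key is the input side; its value describes the output side(s).
        return [(key, out, w)
                for key, value in arg.items()
                for out, w in _outputs(value)]
    # Without a dictionary, every path maps a string to itself.
    return [(s, s, w) for s, w in _outputs(arg)]
```

Under this reading, the dictionary case maps 'baz' to two outputs, 'foo' with weight -1 and 'BAZ' with no weight.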
def hfst.fst_type_to_string(type)
Get a string representation of transducer implementation type type.
type | An hfst.ImplementationType. |
def hfst.get_default_fst_type()
Get default transducer implementation type.
If the default type is not set, it defaults to hfst.ImplementationType.TROPICAL_OPENFST_TYPE.
def hfst.is_diacritic(symbol)
Whether symbol symbol is a flag diacritic.
Flag diacritics are of the form
@[PNDRCU][.][A-Z]+([.][A-Z]+)?@
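The pattern above can be tried out directly with Python's re module. This is only an illustration of the documented pattern, not how hfst.is_diacritic is implemented; looks_like_flag_diacritic is a hypothetical helper name.

```python
import re

# The pattern exactly as documented above.
FLAG_DIACRITIC = re.compile(r'@[PNDRCU][.][A-Z]+([.][A-Z]+)?@')


def looks_like_flag_diacritic(symbol):
    """Check whether the whole symbol matches the documented pattern."""
    return FLAG_DIACRITIC.fullmatch(symbol) is not None
```

With this pattern, '@P.CASE.NOM@' and '@U.NUMBER@' match, while 'foo' and '@X.FOO@' do not.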
def hfst.read_att_input()
Read AT&T input from the user and return a transducer.
def hfst.read_att_string(att)
Read a multiline string att and return a transducer.
att | A string in AT&T format that defines the transducer. |
def hfst.read_att_transducer(f, epsilonstr = hfst.EPSILON)
Read the next transducer from the AT&T file pointed to by f.
epsilonstr defines the symbol used for epsilon in the file.
f | A python file |
epsilonstr | How epsilon is represented in the file. By default, "@_EPSILON_SYMBOL_@" and "@0@" are both recognized. |
If the file contains several transducers, they must be separated by "--" lines. In AT&T format, the transition lines are of the form:
[0-9]+[\w]+[0-9]+[\w]+[^\w]+[\w]+[^\w]([\w]+(-)[0-9]+(\.[0-9]+))
and final state lines:
[0-9]+[\w]+([\w]+(-)[0-9]+(\.[0-9]+))
If the weight
([\w]+(-)[0-9]+(\.[0-9]+))
is missing, the transition or final state is given a zero weight.
NOTE: If transition symbols contain spaces, they must be escaped as '@_SPACE_@' because spaces are used as field separators. Both '@0@' and '@_EPSILON_SYMBOL_@' are always interpreted as epsilons.
An example:
    0 1 foo bar 0.3
    1 0.5
    --
    0 0.0
    --
    --
    0 0.0
    0 0 a <eps> 0.2
The example lists four transducers in AT&T format: one transducer accepting the string pair <'foo','bar'>, one epsilon transducer, one empty transducer and one transducer that accepts any number of 'a's and produces an empty string in all cases. The transducers can be read with the following commands (from a file named 'testfile.att'):
    import hfst

    transducers = []
    ifile = open('testfile.att', 'r')
    try:
        # Read transducers until the end of the file is reached.
        while (True):
            t = hfst.read_att_transducer(ifile, '<eps>')
            transducers.append(t)
            print("read one transducer")
    except hfst.exceptions.NotValidAttFormatException as e:
        print("Error reading transducer: not valid AT&T format.")
    except hfst.exceptions.EndOfStreamException as e:
        pass
    ifile.close()
    print("Read %i transducers in total" % len(transducers))
Epsilon will be represented as hfst.EPSILON in the resulting transducer. The argument epsilonstr only denotes how epsilons are represented in ifile.
Exceptions:
NotValidAttFormatException
StreamNotReadableException
StreamIsClosedException
EndOfStreamException
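To make the file layout described above concrete, here is a minimal pure-Python sketch that splits AT&T text on "--" lines and reads transition and final-state lines. It does not use hfst and is not the library's parser. The names split_att_blocks and parse_att_block are hypothetical, and telling line types apart by field count is an assumption made for this illustration.

```python
EPSILON = '@_EPSILON_SYMBOL_@'


def split_att_blocks(text):
    """Return one list of lines per transducer; '--' lines separate them."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == '--':
            blocks.append(current)
            current = []
        elif line.strip():
            current.append(line)
    blocks.append(current)
    return blocks


def parse_att_block(lines, epsilonstr=EPSILON):
    """Return (transitions, finals) for the lines of one transducer."""
    def symbol(s):
        # '@0@' and '@_EPSILON_SYMBOL_@' always denote epsilon.
        if s in (epsilonstr, '@0@', EPSILON):
            return EPSILON
        return s.replace('@_SPACE_@', ' ')

    transitions, finals = [], {}
    for line in lines:
        fields = line.split()
        if len(fields) in (1, 2):      # final state, optional weight
            finals[int(fields[0])] = float(fields[1]) if len(fields) == 2 else 0.0
        elif len(fields) in (4, 5):    # transition, optional weight
            weight = float(fields[4]) if len(fields) == 5 else 0.0
            transitions.append((int(fields[0]), int(fields[1]),
                                symbol(fields[2]), symbol(fields[3]), weight))
        else:
            raise ValueError('not valid AT&T format: %r' % line)
    return transitions, finals


example = """0 1 foo bar 0.3
1 0.5
--
0 0.0
--
--
0 0.0
0 0 a <eps> 0.2
"""
parsed = [parse_att_block(block, '<eps>') for block in split_att_blocks(example)]
```

For the four-transducer example, parsed has four entries, and the third has neither transitions nor final states, which corresponds to the empty transducer.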
def hfst.read_prolog_transducer(f)
def hfst.regex(regexp, kwargs)
Get a transducer as defined by regular expression regexp.
regexp | The regular expression defined with Xerox transducer notation. |
kwargs | Arguments recognized are: error. |
error | Where warnings and errors are printed. Possible values are sys.stdout, sys.stderr (the default), a StringIO or None, indicating a quiet mode. |
def hfst.set_default_fst_type(impl)
Set the default implementation type.
impl | An hfst.ImplementationType. |
Set the implementation type (SFST_TYPE, TROPICAL_OPENFST_TYPE, FOMA_TYPE) that is used by default by all operations that create transducers. The default value is TROPICAL_OPENFST_TYPE.
def hfst.start_xfst(kwargs)
Start interactive xfst compiler.
kwargs | Arguments recognized are: type, quit_on_fail. |
quit_on_fail | Whether the compiler exits on any error, defaults to False. |
type | Implementation type of the compiler, defaults to hfst.get_default_fst_type(). |
def hfst.tokenized_fst(arg, weight = 0)
Get a transducer that recognizes the concatenation of symbols or symbol pairs in arg.
arg | The symbols or symbol pairs that form the path to be recognized. |
Example
    import hfst

    tok = hfst.HfstTokenizer()
    tok.add_multichar_symbol('foo')
    tok.add_multichar_symbol('bar')
    tr = hfst.tokenized_fst(tok.tokenize('foobar', 'foobaz'))
will create the transducer [foo:foo bar:b 0:a 0:z]
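The result [foo:foo bar:b 0:a 0:z] can be reproduced with a small pure-Python sketch of the tokenization step. It is not HfstTokenizer itself. The names tokenize and tokenize_pair are hypothetical, and both the greedy longest-match rule for multicharacter symbols and the epsilon padding of the shorter side are assumptions chosen to agree with the documented result.

```python
from itertools import zip_longest

EPSILON = '@_EPSILON_SYMBOL_@'


def tokenize(text, multichar_symbols):
    """Split text into symbols, preferring the longest multichar symbol."""
    symbols = sorted(multichar_symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if sym and text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No multichar symbol starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def tokenize_pair(input_text, output_text, multichar_symbols):
    """Pair input and output symbols, padding the shorter side with epsilon."""
    ins = tokenize(input_text, multichar_symbols)
    outs = tokenize(output_text, multichar_symbols)
    return list(zip_longest(ins, outs, fillvalue=EPSILON))


pairs = tokenize_pair('foobar', 'foobaz', {'foo', 'bar'})
```

Here 'foobar' tokenizes to two symbols and 'foobaz' to four, so the last two pairs have epsilon on the input side.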
string EPSILON = '@_EPSILON_SYMBOL_@'
The string for epsilon symbol.
An example:
    fsm = hfst.HfstBasicTransducer()
    fsm.add_state(1)
    fsm.set_final_weight(1, 2.0)
    fsm.add_transition(0, 1, "foo", hfst.EPSILON)
    if not hfst.HfstTransducer(fsm).compare(hfst.regex('foo:0::2.0')):
        raise RuntimeError('')
string IDENTITY = '@_IDENTITY_SYMBOL_@'
The string for identity symbol.
An example:
    fsm = hfst.HfstBasicTransducer()
    fsm.add_state(1)
    fsm.set_final_weight(1, 1.5)
    fsm.add_transition(0, 1, hfst.IDENTITY, hfst.IDENTITY)
    if not hfst.HfstTransducer(fsm).compare(hfst.regex('?::1.5')):
        raise RuntimeError('')
string UNKNOWN = '@_UNKNOWN_SYMBOL_@'
The string for unknown symbol.
An example:
    fsm = hfst.HfstBasicTransducer()
    fsm.add_state(1)
    fsm.set_final_weight(1, -0.5)
    fsm.add_transition(0, 1, "foo", hfst.UNKNOWN)
    fsm.add_transition(0, 1, "foo", "foo")
    if not hfst.HfstTransducer(fsm).compare(hfst.regex('foo:?::-0.5')):
        raise RuntimeError('')