HFST - Helsinki Finite-State Transducer Technology - Python API  version 3.12.3 (under development)
hfst Namespace Reference

HFST API for Python.

Namespaces

 exceptions
 exceptions...
 
 sfst_rules
 rules...
 
 xerox_rules
 Xerox-type replace rules.
 

Classes

class  AttReader
 A class for reading input in AT&T text format and converting it into transducer(s).

class  HfstBasicTransducer
 A simple transducer class with tropical weights.

class  HfstBasicTransition
 A transition class that consists of a target state, input and output symbols and a tropical weight.

class  HfstInputStream
 A stream for reading HFST binary transducers.

class  HfstOutputStream
 A stream for writing binary transducers.

class  HfstTokenizer
 A tokenizer for creating transducers from UTF-8 strings.

class  HfstTransducer
 A synchronous finite-state transducer.

class  ImplementationType
 Back-end implementations.

class  MultiCharSymbolTrie
 TODO: documentation.

class  PmatchContainer
 A class for performing pattern matching.

class  PrologReader
 A class for reading input in prolog text format and converting it into transducer(s).

class  XreCompiler
 A regular expression compiler.
 

Functions

def compile_lexc_file (filename, kwargs)
 Compile lexc file filename into a transducer.

def compile_pmatch_expression (expr)
 Compile a pmatch expression into a tuple of transducers.

def compile_pmatch_file (filename)
 Compile pmatch expressions as defined in filename and return a tuple of transducers.

def compile_sfst_file (filename, kwargs)
 Compile sfst file filename into a transducer.

def compile_xfst_file (filename, kwargs)
 Compile (run) xfst file filename.

def empty_fst ()
 Get an empty transducer.

def epsilon_fst (weight=0)
 Get an epsilon transducer.

def fsa_to_fst (fsa, separator='')
 Get a transducer where each transition isymbolSosymbol:isymbolSosymbol of fsa is replaced with a transition isymbol:osymbol, if separator is S.

def fst (arg)
 Get a transducer that recognizes one or more paths.

def fst_to_fsa (fst, separator='')
 Get a transducer (automaton) where each transition symbol pair isymbol:osymbol of fst is replaced with a transition isymbolosymbol:isymbolosymbol, adding separator between isymbol and osymbol.

def fst_type_to_string (type)
 Get a string representation of transducer implementation type type.

def get_default_fst_type ()
 Get default transducer implementation type.

def is_diacritic (symbol)
 Whether symbol symbol is a flag diacritic.

def read_att_input ()
 Read AT&T input from the user and return a transducer.

def read_att_string (att)
 Read a multiline string att and return a transducer.

def read_att_transducer (f, epsilonstr=hfst.EPSILON)
 Read the next transducer from the AT&T file pointed to by f.

def read_prolog_transducer (f)

def regex (regexp, kwargs)
 Get a transducer as defined by regular expression regexp.

def set_default_fst_type (impl)
 Set the default implementation type.

def start_xfst (kwargs)
 Start interactive xfst compiler.

def tokenized_fst (arg, weight=0)
 Get a transducer that recognizes the concatenation of symbols or symbol pairs in arg.
 

Variables

string EPSILON = '@_EPSILON_SYMBOL_@'
 The string for epsilon symbol.

string IDENTITY = '@_IDENTITY_SYMBOL_@'
 The string for identity symbol.

string UNKNOWN = '@_UNKNOWN_SYMBOL_@'
 The string for unknown symbol.
 

Detailed Description

HFST API for Python.

Function Documentation

def hfst.compile_lexc_file (filename, kwargs)

Compile lexc file filename into a transducer.

Parameters
filename  The name of the lexc file.
kwargs  Arguments recognized are: verbosity, with_flags, output.
verbosity  The verbosity of the compiler, defaults to 0 (silent). Possible values are: 0, 1, 2.
with_flags  Whether lexc flags are used when compiling, defaults to False.
output  Where output is printed. Possible values are sys.stdout, sys.stderr, a StringIO, sys.stderr being the default?
Returns
On success the resulting transducer, else None.
def hfst.compile_pmatch_expression (expr)

Compile a pmatch expression into a tuple of transducers.

Parameters
expr  A string defining how pmatch is done.
See also
hfst.compile_pmatch_file
def hfst.compile_pmatch_file (filename)

Compile pmatch expressions as defined in filename and return a tuple of transducers.

An example:

If we have a file named streets.txt that contains:

 define CapWord UppercaseAlpha Alpha* ;
 define StreetWordFr [{avenue} | {boulevard} | {rue}] ;
 define DeFr [ [{de} | {du} | {des} | {de la}] Whitespace ] | [{d'} | {l'}] ;
 define StreetFr StreetWordFr (Whitespace DeFr) CapWord+ ;
 regex StreetFr EndTag(FrenchStreetName) ;

we can run:

 defs = hfst.compile_pmatch_file('streets.txt')
 cont = hfst.PmatchContainer(defs)
 assert cont.match("Je marche seul dans l'avenue des Ternes.") == "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."

def hfst.compile_sfst_file (filename, kwargs)

Compile sfst file filename into a transducer.

Parameters
filename  The name of the sfst file.
kwargs  Arguments recognized are: verbose, output.
verbose  Whether sfst file is processed in verbose mode, defaults to False.
output  Where output is printed. Possible values are sys.stdout, sys.stderr, a StringIO, sys.stderr being the default.
Returns
On success the resulting transducer, else None.
def hfst.compile_xfst_file (filename, kwargs)

Compile (run) xfst file filename.

Parameters
filename  The name of the xfst file.
kwargs  Arguments recognized are: verbosity, quit_on_fail, output, type.
verbosity  The verbosity of the compiler, defaults to 0 (silent). Possible values are: 0, 1, 2.
quit_on_fail  Whether the script is exited on any error, defaults to True.
output  Where output is printed. Possible values are sys.stdout, sys.stderr, a StringIO, sys.stderr being the default?
type  Implementation type of the compiler, defaults to hfst.get_default_fst_type().
Returns
On success 0, else an integer greater than 0.
def hfst.empty_fst ()

Get an empty transducer.

An empty transducer has one state that is not final, i.e. it does not recognize any string.

def hfst.epsilon_fst (weight=0)

Get an epsilon transducer.

Parameters
weight  The weight of the final state.

An epsilon transducer has one state that is final (with final weight weight), i.e. it recognizes the empty string.
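
To make the difference between hfst.empty_fst() and hfst.epsilon_fst() concrete, here is a small pure-Python model of a machine with one state and no transitions. It is only an illustration and does not call HFST; the function name recognition_weight and the dictionary representation are assumptions made for this sketch.

```python
def recognition_weight(final_weights, string, start_state=0):
    """Return the weight assigned to `string`, or None if it is rejected.

    The modelled machine has no transitions, so only the empty string can
    be accepted, and only if the start state is final.
    """
    if string == '' and start_state in final_weights:
        return final_weights[start_state]
    return None


# Model of an empty transducer: state 0 exists but is not final.
empty_machine = {}
# Model of an epsilon transducer: state 0 is final with weight 0.5.
epsilon_machine = {0: 0.5}

print(recognition_weight(empty_machine, ''))     # rejected
print(recognition_weight(epsilon_machine, ''))   # accepted with the final weight
print(recognition_weight(epsilon_machine, 'a'))  # rejected
```
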
def hfst.fsa_to_fst (fsa, separator='')

Get a transducer where each transition isymbolSosymbol:isymbolSosymbol of fsa is replaced with a transition isymbol:osymbol, if separator is S.

Parameters
fsa  The transducer. Must be an automaton, i.e. for each transition, the input and output symbols must be the same. Else, a TransducerIsNotAutomatonException is thrown.
separator  The symbol separating input and output symbol parts in fsa. If it is the empty string, length of each symbol in fsa (excluding special symbols of form "@...@") must be exactly 2. Else, a RuntimeError is thrown.
def hfst.fst (arg)

Get a transducer that recognizes one or more paths.

Parameters
arg  See the possible inputs below.

Possible inputs:

 One unweighted identity path:
 'foo'  ->  [f o o]
 Weighted path: a tuple of string and number, e.g.
 ('foo',1.4)
 ('bar',-3)
 ('baz',0)
 Several paths: a list or a tuple of paths and/or weighted paths, e.g.
 ['foo', 'bar']
 ('foo', ('bar',5.0))
 ('foo', ('bar',5.0), 'baz', 'Foo', ('Bar',2.4))
 [('foo',-1), ('bar',0), ('baz',3.5)]
 A dictionary mapping strings to any of the above cases:
 {'foo':'foo', 'bar':('foo',1.4), 'baz':(('foo',-1),'BAZ')}
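
The accepted argument shapes can be easier to follow when flattened into explicit paths. The sketch below is plain Python and does not call HFST; normalize_paths is a hypothetical helper, and treating an unweighted path as having weight 0 is an assumption. It turns any of the inputs listed above into (input, output, weight) triples.

```python
def normalize_paths(arg):
    """Flatten an hfst.fst-style argument into (input, output, weight) triples."""

    def outputs(value):
        # A plain string is one unweighted path (weight 0 assumed).
        if isinstance(value, str):
            return [(value, 0.0)]
        # A (string, number) tuple is one weighted path.
        if (isinstance(value, tuple) and len(value) == 2
                and isinstance(value[0], str)
                and isinstance(value[1], (int, float))):
            return [(value[0], float(value[1]))]
        # Anything else is a list or tuple of paths and/or weighted paths.
        result = []
        for item in value:
            result.extend(outputs(item))
        return result

    if isinstance(arg, dict):
        # Each key is an input string mapped to one or more outputs.
        return [(inp, out, weight)
                for inp, value in arg.items()
                for out, weight in outputs(value)]
    # Without a dictionary, every path is an identity path.
    return [(s, s, weight) for s, weight in outputs(arg)]


print(normalize_paths(('foo', ('bar', 5.0))))
print(normalize_paths({'bar': ('foo', 1.4), 'baz': (('foo', -1), 'BAZ')}))
```
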
def hfst.fst_to_fsa (fst, separator='')

Get a transducer (automaton) where each transition symbol pair isymbol:osymbol of fst is replaced with a transition isymbolosymbol:isymbolosymbol, adding separator between isymbol and osymbol.

Parameters
fst  The transducer.
separator  The separator symbol inserted between input and output symbols.

Examples

 import hfst
 foo2bar = hfst.fst({'foo':'bar'})

creates a transducer [f:b o:a o:r]. Calling

 foobar = hfst.fst_to_fsa(foo2bar)

will create the transducer [fb:fb oa:oa or:or] and

 foobar = hfst.fst_to_fsa(foo2bar, '^')

the transducer [f^b:f^b o^a:o^a o^r:o^r].

See also
hfst.fsa_to_fst
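
The symbol merging done by fst_to_fsa and undone by fsa_to_fst can be sketched on plain lists of symbol pairs. This is an illustration in ordinary Python, not HFST code: the two helper names are hypothetical, special symbols of the form "@...@" are not handled, and ValueError stands in for whatever HFST itself raises.

```python
def pairs_to_fsa_symbols(pairs, separator=''):
    """Merge each (isymbol, osymbol) pair into one automaton symbol."""
    return [isym + separator + osym for isym, osym in pairs]


def fsa_symbols_to_pairs(symbols, separator=''):
    """Split merged automaton symbols back into (isymbol, osymbol) pairs."""
    pairs = []
    for symbol in symbols:
        if separator:
            if separator not in symbol:
                raise ValueError('separator missing in %r' % symbol)
            # Split at the first occurrence of the separator.
            isym, _, osym = symbol.partition(separator)
        else:
            # Without a separator, each symbol must be exactly two characters.
            if len(symbol) != 2:
                raise ValueError('symbol %r is not of length 2' % symbol)
            isym, osym = symbol[0], symbol[1]
        pairs.append((isym, osym))
    return pairs


foo2bar = [('f', 'b'), ('o', 'a'), ('o', 'r')]
print(pairs_to_fsa_symbols(foo2bar))
print(pairs_to_fsa_symbols(foo2bar, '^'))
print(fsa_symbols_to_pairs(['f^b', 'o^a', 'o^r'], '^'))
```
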
def hfst.fst_type_to_string (type)

Get a string representation of transducer implementation type type.

Parameters
type  An hfst.ImplementationType.
def hfst.get_default_fst_type ()

Get default transducer implementation type.

If the default type is not set, it defaults to hfst.ImplementationType.TROPICAL_OPENFST_TYPE.

def hfst.is_diacritic (symbol)

Whether symbol symbol is a flag diacritic.

Flag diacritics are of the form

 @[PNDRCU][.][A-Z]+([.][A-Z]+)?@
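
The pattern above can be tried out directly with Python's re module. The sketch below only mirrors the pattern as written here; looks_like_flag_diacritic is a hypothetical name, and hfst.is_diacritic itself may accept symbols that this pattern rejects.

```python
import re

# Pattern copied from the description above.
FLAG_DIACRITIC = re.compile(r'@[PNDRCU][.][A-Z]+([.][A-Z]+)?@')


def looks_like_flag_diacritic(symbol):
    """Return True if the whole of `symbol` matches the pattern above."""
    return FLAG_DIACRITIC.fullmatch(symbol) is not None


for candidate in ('@U.CASE.NOM@', '@P.FOO@', 'foo', '@_EPSILON_SYMBOL_@'):
    print(candidate, looks_like_flag_diacritic(candidate))
```
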
def hfst.read_att_input ()

Read AT&T input from the user and return a transducer.

Returns
An HfstTransducer whose type is hfst.get_default_fst_type().

Read one AT&T line at a time from standard input and finally return an equivalent transducer. An empty line signals the end of input.
def hfst.read_att_string (att)

Read a multiline string att and return a transducer.

Parameters
att  A string in AT&T format that defines the transducer.
Returns
An HfstTransducer whose type is hfst.get_default_fst_type().

Read att and create a transducer as defined in it.
def hfst.read_att_transducer (f, epsilonstr=hfst.EPSILON)

Read the next transducer from the AT&T file pointed to by f.

epsilonstr defines the symbol used for epsilon in the file.

Parameters
f  A Python file.
epsilonstr  How epsilon is represented in the file. By default, "@_EPSILON_SYMBOL_@" and "@0@" are both recognized.

If the file contains several transducers, they must be separated by "--" lines. In AT&T format, the transition lines are of the form:

 [0-9]+[\w]+[0-9]+[\w]+[^\w]+[\w]+[^\w]([\w]+(-)[0-9]+(\.[0-9]+))

and final state lines:

 [0-9]+[\w]+([\w]+(-)[0-9]+(\.[0-9]+))

If the weight

 ([\w]+(-)[0-9]+(\.[0-9]+))

is missing, the transition or final state is given a zero weight.

NOTE: If transition symbols contain spaces, they must be escaped as '@_SPACE_@' because spaces are used as field separators. Both '@0@' and '@_EPSILON_SYMBOL_@' are always interpreted as epsilons.

An example:

 0      1      foo      bar      0.3
 1      0.5
 --
 0      0.0
 --
 --
 0      0.0
 0      0      a        <eps>    0.2

The example lists four transducers in AT&T format: one transducer accepting the string pair <'foo','bar'>, one epsilon transducer, one empty transducer and one transducer that accepts any number of 'a's and produces an empty string in all cases. The transducers can be read with the following commands (from a file named 'testfile.att'):

 import hfst

 transducers = []
 with open('testfile.att', 'r') as ifile:
     try:
         while True:
             # Each call reads the next transducer; '<eps>' is the epsilon symbol used in the file.
             t = hfst.read_att_transducer(ifile, '<eps>')
             transducers.append(t)
             print("read one transducer")
     except hfst.exceptions.NotValidAttFormatException:
         print("Error reading transducer: not valid AT&T format.")
     except hfst.exceptions.EndOfStreamException:
         # Raised when there is nothing left to read.
         pass
 print("Read %i transducers in total" % len(transducers))

Epsilon will be represented as hfst.EPSILON in the resulting transducer. The argument epsilonstr only denotes how epsilons are represented in ifile.
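
To show how the format rules above fit together (separator lines, optional weights, epsilon and space escapes), here is a minimal pure-Python reader for the same example. It is a sketch and not the HFST implementation: parse_att and its dictionary output are assumptions made for illustration, fields are split on any whitespace, and blank lines are simply skipped.

```python
EPSILON = '@_EPSILON_SYMBOL_@'


def parse_att(text, epsilonstr=EPSILON):
    """Parse AT&T text into a list of {'transitions': [...], 'finals': {...}} dicts."""
    epsilons = {epsilonstr, '@0@', EPSILON}

    def symbol(field):
        # Map every epsilon spelling to EPSILON and unescape spaces.
        if field in epsilons:
            return EPSILON
        return field.replace('@_SPACE_@', ' ')

    machines = [{'transitions': [], 'finals': {}}]
    for line in text.splitlines():
        if line.strip() == '--':
            # A line of two hyphens starts the next transducer.
            machines.append({'transitions': [], 'finals': {}})
            continue
        fields = line.split()
        if not fields:
            continue
        if len(fields) in (1, 2):
            # Final state line: state [weight]; a missing weight means zero.
            weight = float(fields[1]) if len(fields) == 2 else 0.0
            machines[-1]['finals'][int(fields[0])] = weight
        elif len(fields) in (4, 5):
            # Transition line: source target input output [weight].
            weight = float(fields[4]) if len(fields) == 5 else 0.0
            machines[-1]['transitions'].append(
                (int(fields[0]), int(fields[1]),
                 symbol(fields[2]), symbol(fields[3]), weight))
        else:
            raise ValueError('not valid AT&T format: %r' % line)
    return machines


example = """\
0      1      foo      bar      0.3
1      0.5
--
0      0.0
--
--
0      0.0
0      0      a        <eps>    0.2
"""

machines = parse_att(example, '<eps>')
print("Read %i transducers in total" % len(machines))
```
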

Bug:
Empty transducers are in theory represented as empty strings in AT&T format. However, this sometimes results in them getting interpreted as end-of-file. To avoid this, use an empty line instead, i.e. a single newline character.
Exceptions
NotValidAttFormatException
StreamNotReadableException
StreamIsClosedException
EndOfStreamException
See also
hfst.HfstTransducer.write_att
def hfst.read_prolog_transducer (f)

def hfst.regex (regexp, kwargs)

Get a transducer as defined by regular expression regexp.

Parameters
regexp  The regular expression defined with Xerox transducer notation.
kwargs  Arguments recognized are: error.
error  Where warnings and errors are printed. Possible values are sys.stdout, sys.stderr (the default), a StringIO or None, indicating a quiet mode.

Regular expression operators:

 ~   complement
 \   term complement
 &   intersection
 -   minus

 $.  contains once
 $?  contains optionally
 $   contains once or more
 ( ) optionality

 +   Kleene plus
 *   Kleene star

 ./. ignore internally (not yet implemented)
 /   ignoring

 |   union

 <>  shuffle
 <   before
 >   after

 .o.   composition
 .O.   lenient composition
 .m>.  merge right
 .<m.  merge left
 .x.   cross product
 .P.   input priority union
 .p.   output priority union
 .-u.  input minus
 .-l.  output minus
 `[ ]  substitute

 ^n,k  catenate from n to k times, inclusive
 ^>n   catenate more than n times
 ^<n   catenate less than n times
 ^n    catenate n times

 .r   reverse
 .i   invert
 .u   input side
 .l   output side

 \\\  left quotient

 Two-level rules:

  \<=   left restriction
  <=>   left and right arrow
  <=    left arrow
  =>    right arrow

 Replace rules:

  ->    replace right
  (->)  optionally replace right
  <-    replace left
  (<-)  optionally replace left
  <->   replace left and right
  (<->) optionally replace left and right
  @->   left-to-right longest match
  @>    left-to-right shortest match
  ->@   right-to-left longest match
  >@    right-to-left shortest match

 Rule contexts, markers and separators:

  ||   match contexts on input sides
  //   match left context on output side and right context on input side
  \\   match left context on input side and right context on output side
  \/   match contexts on output sides
  _    center marker
  ...  markup marker
  ,,   rule separator in parallel rules
  ,    context separator
  [. .]  match epsilons only once

 Read from file:

  @bin" "  read binary transducer
  @txt" "  read transducer in att text format
  @stxt" " read spaced text
  @pl" "   read transducer in prolog text format
  @re" "   read regular expression

 Symbols:

  .#.  word boundary symbol in replacements, restrictions
  0    the epsilon
  ?    any token
  %    escape character
  { }  concatenate symbols
  " "  quote symbol

 :    pair separator
 ::   weight

 ;   end of expression
 !   starts a comment until end of line
 #   starts a comment until end of line
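
As a usage sketch for the error keyword described above, the helper below collects compiler messages in a StringIO instead of printing them. compile_quietly is a hypothetical name, not part of HFST. The code assumes the hfst module may or may not be installed and returns a placeholder if it is missing; what hfst.regex returns for an invalid expression is not stated here, so the sketch does not rely on it.

```python
import io

try:
    import hfst  # the HFST Python bindings; may not be installed
except ImportError:
    hfst = None


def compile_quietly(expression):
    """Compile `expression` and return (transducer, captured_messages)."""
    if hfst is None:
        return None, 'hfst is not installed'
    messages = io.StringIO()
    # Warnings and errors are written to `messages` rather than sys.stderr.
    transducer = hfst.regex(expression, error=messages)
    return transducer, messages.getvalue()


# "foo:bar" is a single symbol pair; ".o." is composition (see the operator list).
transducer, messages = compile_quietly('foo:bar .o. bar:baz')
print(messages)
```
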
def hfst.set_default_fst_type (impl)

Set the default implementation type.

Parameters
impl  An hfst.ImplementationType.

Set the implementation type (SFST_TYPE, TROPICAL_OPENFST_TYPE, FOMA_TYPE) that is used by default by all operations that create transducers. The default value is TROPICAL_OPENFST_TYPE.
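
Because the default type affects every operation that creates transducers, it can be convenient to change it only temporarily. The sketch below uses only get_default_fst_type, set_default_fst_type and fst_type_to_string as documented on this page; the helper names are hypothetical, and the code assumes the hfst module may be absent, in which case it does nothing.

```python
try:
    import hfst  # the HFST Python bindings; may not be installed
except ImportError:
    hfst = None


def describe_default_type():
    """Return the current default type as a string, or None without hfst."""
    if hfst is None:
        return None
    return hfst.fst_type_to_string(hfst.get_default_fst_type())


def run_with_default_type(impl, action):
    """Call action() with `impl` as the default type, then restore the old default."""
    previous = hfst.get_default_fst_type()
    hfst.set_default_fst_type(impl)
    try:
        return action()
    finally:
        # Restore the previous default even if action() raises.
        hfst.set_default_fst_type(previous)


if hfst is not None:
    # Build one transducer with FOMA_TYPE as the default, assuming that back end is available.
    t = run_with_default_type(hfst.ImplementationType.FOMA_TYPE,
                              lambda: hfst.regex('foo:bar'))
    print(describe_default_type())
```
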

def hfst.start_xfst (kwargs)

Start interactive xfst compiler.

Parameters
kwargs  Arguments recognized are: type, quit_on_fail.
quit_on_fail  Whether the compiler exits on any error, defaults to False.
type  Implementation type of the compiler, defaults to hfst.get_default_fst_type().
def hfst.tokenized_fst (arg, weight=0)

Get a transducer that recognizes the concatenation of symbols or symbol pairs in arg.

Parameters
arg  The symbols or symbol pairs that form the path to be recognized.

Example

 import hfst
 tok = hfst.HfstTokenizer()
 tok.add_multichar_symbol('foo')
 tok.add_multichar_symbol('bar')
 tr = hfst.tokenized_fst(tok.tokenize('foobar', 'foobaz'))

will create the transducer [foo:foo bar:b 0:a 0:z]
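
The pairing in this example can be reproduced without HFST to see where the epsilons come from. The sketch below is plain Python: tokenize and tokenize_pair are hypothetical helpers, and greedy longest-match tokenization is an assumption about how HfstTokenizer behaves, chosen because it yields the result shown above.

```python
from itertools import zip_longest

EPSILON = '@_EPSILON_SYMBOL_@'


def tokenize(text, multichar_symbols):
    """Split text left to right, preferring the longest known multichar symbol."""
    ordered = sorted((s for s in multichar_symbols if s), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for symbol in ordered:
            if text.startswith(symbol, i):
                tokens.append(symbol)
                i += len(symbol)
                break
        else:
            # No multichar symbol starts here: take a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def tokenize_pair(input_text, output_text, multichar_symbols):
    """Pair input and output tokens, padding the shorter side with epsilon."""
    return list(zip_longest(tokenize(input_text, multichar_symbols),
                            tokenize(output_text, multichar_symbols),
                            fillvalue=EPSILON))


print(tokenize_pair('foobar', 'foobaz', {'foo', 'bar'}))
```

The input side tokenizes to foo, bar and the output side to foo, b, a, z, so the two extra output symbols are paired with epsilon.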

Variable Documentation

string EPSILON = '@_EPSILON_SYMBOL_@'

The string for epsilon symbol.

An example:

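 import hfst

 # Build a two-state transducer that maps "foo" to epsilon, with final weight 2.0.
 fsm = hfst.HfstBasicTransducer()
 fsm.add_state(1)
 fsm.set_final_weight(1, 2.0)
 fsm.add_transition(0, 1, "foo", hfst.EPSILON)
 # The same transducer written as a regular expression.
 if not hfst.HfstTransducer(fsm).compare(hfst.regex('foo:0::2.0')):
     raise RuntimeError('transducers differ')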
Note
In regular expressions, "0" is used for the epsilon.
See also
Symbols
string IDENTITY = '@_IDENTITY_SYMBOL_@'

The string for identity symbol.

An example:

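 import hfst

 # Build a two-state transducer with one identity transition and final weight 1.5.
 fsm = hfst.HfstBasicTransducer()
 fsm.add_state(1)
 fsm.set_final_weight(1, 1.5)
 fsm.add_transition(0, 1, hfst.IDENTITY, hfst.IDENTITY)
 # The same transducer written as a regular expression.
 if not hfst.HfstTransducer(fsm).compare(hfst.regex('?::1.5')):
     raise RuntimeError('transducers differ')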
Note
In regular expressions, a single "?" is used for the identity symbol.
See also
Symbols
string UNKNOWN = '@_UNKNOWN_SYMBOL_@'

The string for unknown symbol.

An example:

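 import hfst

 # Build a transducer that maps "foo" to an unknown symbol or to "foo" itself.
 fsm = hfst.HfstBasicTransducer()
 fsm.add_state(1)
 fsm.set_final_weight(1, -0.5)
 fsm.add_transition(0, 1, "foo", hfst.UNKNOWN)
 fsm.add_transition(0, 1, "foo", "foo")
 # The same transducer written as a regular expression.
 if not hfst.HfstTransducer(fsm).compare(hfst.regex('foo:?::-0.5')):
     raise RuntimeError('transducers differ')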
Note
In regular expressions, "?" on either or both sides of a transition is used for the unknown symbol.
See also
Symbols