HFST - Helsinki Finite-State Transducer Technology - C++ API  version 3.9.1
HfstTokenizer Class Reference

A tokenizer for creating transducers from UTF-8 strings.

#include <HfstTokenizer.h>

Public Member Functions

HFSTDLL void add_multichar_symbol (const std::string &symbol)
 Add a multicharacter symbol symbol to this tokenizer.

HFSTDLL void add_skip_symbol (const std::string &symbol)
 Add a symbol to be skipped to this tokenizer.

HFSTDLL HfstTokenizer ()
 Create a tokenizer that recognizes UTF-8 symbols.

HFSTDLL StringPairVector tokenize (const std::string &input_string) const
 Tokenize the string input_string.

HFSTDLL StringPairVector tokenize (const std::string &input_string, const std::string &output_string) const
 Tokenize the string pair input_string : output_string.

HFSTDLL StringVector tokenize_one_level (const std::string &input_string) const
 Tokenize the string input_string.


Static Public Member Functions

static HFSTDLL void check_utf8_correctness (const std::string &input_string)
 If input_string is not valid UTF-8, throw an IncorrectUtf8CodingException.
 

Detailed Description

A tokenizer for creating transducers from UTF-8 strings.

Strings are tokenized from left to right using longest-match tokenization. For example, if the tokenizer contains a multicharacter symbol "foo" and a skip symbol "fo", the string "foo" is tokenized as "foo:foo". If the tokenizer contains a multicharacter symbol "fo" and a skip symbol "foo", the whole string "foo" is skipped and the result is empty.
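
The longest-match rule described above can be sketched without HFST itself. The following is a minimal, self-contained illustration and not the library's implementation: SketchTokenizer, its members and its helper functions are hypothetical names, the input is assumed to be valid UTF-8 already, and letting a skip symbol win a tie against an equally long multicharacter symbol is an assumption that this documentation does not settle.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration only; not the HFST implementation.
struct SketchTokenizer {
    std::set<std::string> multichar_symbols;
    std::set<std::string> skip_symbols;

    // Byte length of the UTF-8 character whose initial byte is c.
    // Assumes the input has already been validated.
    static std::size_t utf8_length(unsigned char c) {
        if (c < 0x80) return 1;          // 0xxxxxxx
        if ((c >> 5) == 0x06) return 2;  // 110xxxxx
        if ((c >> 4) == 0x0E) return 3;  // 1110xxxx
        if ((c >> 3) == 0x1E) return 4;  // 11110xxx
        return 1;                        // fallback for unexpected bytes
    }

    // Length of the longest symbol that s contains at position pos (0 if none).
    static std::size_t longest_prefix(const std::set<std::string>& symbols,
                                      const std::string& s, std::size_t pos) {
        std::size_t best = 0;
        for (const std::string& sym : symbols) {
            if (sym.size() > best && s.compare(pos, sym.size(), sym) == 0)
                best = sym.size();
        }
        return best;
    }

    // Left-to-right, longest-match tokenization of one string.
    std::vector<std::string> tokenize_one_level(const std::string& s) const {
        std::vector<std::string> tokens;
        std::size_t pos = 0;
        while (pos < s.size()) {
            std::size_t skip_len = longest_prefix(skip_symbols, s, pos);
            std::size_t multi_len = longest_prefix(multichar_symbols, s, pos);
            if (skip_len > 0 && skip_len >= multi_len) {
                pos += skip_len;  // assumption: a skip symbol wins ties
            } else if (multi_len > 0) {
                tokens.push_back(s.substr(pos, multi_len));
                pos += multi_len;
            } else {
                // No symbol matches here: emit one UTF-8 character.
                std::size_t len = utf8_length(static_cast<unsigned char>(s[pos]));
                tokens.push_back(s.substr(pos, len));
                pos += len;
            }
        }
        return tokens;
    }
};
```

Under these assumptions, a SketchTokenizer with multicharacter symbol "foo" and skip symbol "fo" returns the single token "foo" for the input "foo", and with the two roles swapped it returns no tokens, matching the two cases described above.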

An example:

      HfstTokenizer TOK;
      TOK.add_multichar_symbol("<br />");
      TOK.add_skip_symbol("<p>");
      TOK.add_skip_symbol("</p>");
      StringPairVector spv = TOK.tokenize("<p>A<br />paragraph!</p>");
      // spv now contains
      //    A:A <br />:<br /> p:p a:a r:r a:a g:g r:r a:a p:p h:h !:!

Note: The tokenizer only tokenizes UTF-8 strings. Special symbols (see String) are not included in the tokenizer unless added to it.

See also: hfst::HfstTransducer::HfstTransducer(const std::string&, const HfstTokenizer&, ImplementationType type)

Constructor & Destructor Documentation

HfstTokenizer ( )

Create a tokenizer that recognizes UTF-8 symbols.

Member Function Documentation

void add_multichar_symbol ( const std::string &  symbol)

Add a multicharacter symbol symbol to this tokenizer.

If a skip symbol occurs inside a multicharacter symbol in the input, the surrounding characters are not treated as that multicharacter symbol. For example, if the tokenizer has a multicharacter symbol "foo" and a skip symbol "bar", the string "fobaro" is tokenized as "f" "o" "o", not as "foo".

void add_skip_symbol ( const std::string &  symbol)

Add a symbol to be skipped to this tokenizer.

After a symbol is skipped, tokenization always starts afresh from the position that follows it. For example, if the tokenizer has a multicharacter symbol "foo" and a skip symbol "bar", the string "fobaro" is tokenized as "f" "o" "o", not as "foo".

static void check_utf8_correctness ( const std::string &  input_string)

If input_string is not valid UTF-8, throw an IncorrectUtf8CodingException.

A string is invalid if:

  • It contains any of the unsigned bytes 192, 193, 245, 246 or 247.
  • It is not made up of sequences of one initial byte (0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx) followed by an appropriate number of continuation bytes (10xxxxxx):
    1. Initial bytes 0xxxxxxx represent ASCII characters and may not be followed by a continuation byte.
    2. Initial bytes 110xxxxx are followed by exactly one continuation byte.
    3. Initial bytes 1110xxxx are followed by exactly two continuation bytes.
    4. Initial bytes 11110xxx are followed by exactly three continuation bytes.

(For reference: http://en.wikipedia.org/wiki/UTF-8)

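
The rules listed above can be turned into a short checker. This is only a sketch of those rules as stated, not the HFST source: check_utf8_sketch is a hypothetical name, and std::invalid_argument stands in for IncorrectUtf8CodingException so that the block compiles without HFST. It does not attempt the further checks of a full UTF-8 validator, such as rejecting overlong encodings.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch only: throws std::invalid_argument where HFST would throw
// IncorrectUtf8CodingException. The function name is hypothetical.
void check_utf8_sketch(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);

        // Rule 1: these byte values never occur in valid input.
        if (b == 192 || b == 193 || b == 245 || b == 246 || b == 247)
            throw std::invalid_argument("forbidden byte");

        // Rule 2: classify the initial byte and find out how many
        // continuation bytes must follow it.
        std::size_t continuation = 0;
        if ((b & 0x80) == 0x00)      continuation = 0;  // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) continuation = 1;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) continuation = 2;  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) continuation = 3;  // 11110xxx
        else
            // A continuation byte where an initial byte is expected
            // (for example right after an ASCII character), or 11111xxx.
            throw std::invalid_argument("not an initial byte");

        ++i;
        for (std::size_t k = 0; k < continuation; ++k, ++i) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                throw std::invalid_argument("missing continuation byte");
        }
    }
}
```

A caller would wrap the call in a try block, for instance before handing user input to a tokenizer, and treat the exception as a signal that the input must be rejected or re-encoded.
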
StringPairVector tokenize ( const std::string &  input_string) const

Tokenize the string input_string. Each token is paired with itself, as in the A:A pairs of the example in the detailed description.

StringPairVector tokenize ( const std::string &  input_string,
const std::string &  output_string 
) const

Tokenize the string pair input_string : output_string.

If one string has more tokens than the other, epsilons are appended to the end of the tokenized string with fewer tokens so that both tokenized strings have the same number of tokens.
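
The padding step can be pictured as follows. This is a hedged sketch rather than the library's code: pad_and_pair and kEpsilon are hypothetical names, the two inputs are assumed to be already tokenized, and the string "@_EPSILON_SYMBOL_@" is assumed here to be the spelling of the epsilon symbol.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumed spelling of the epsilon symbol; a hypothetical constant for this sketch.
const std::string kEpsilon = "@_EPSILON_SYMBOL_@";

// Pair up two token sequences position by position, padding the shorter
// one at its end with epsilons so that both have the same length.
std::vector<std::pair<std::string, std::string>>
pad_and_pair(const std::vector<std::string>& input_tokens,
             const std::vector<std::string>& output_tokens) {
    std::size_t n = std::max(input_tokens.size(), output_tokens.size());
    std::vector<std::pair<std::string, std::string>> result;
    result.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        result.emplace_back(
            i < input_tokens.size() ? input_tokens[i] : kEpsilon,
            i < output_tokens.size() ? output_tokens[i] : kEpsilon);
    }
    return result;
}
```

For the token sequences c a t s and c a t, the sketch yields four pairs, the last of which pairs "s" with the epsilon symbol.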

StringVector tokenize_one_level ( const std::string &  input_string) const

Tokenize the string input_string into a StringVector of plain symbols rather than a StringPairVector of symbol pairs.

