HFST - Helsinki Finite-State Transducer Technology - C++ API
version 3.9.1
A tokenizer for creating transducers from UTF-8 strings.
#include <HfstTokenizer.h>
Public Member Functions
HFSTDLL void add_multichar_symbol (const std::string &symbol)
Add a multicharacter symbol symbol to this tokenizer.
HFSTDLL void add_skip_symbol (const std::string &symbol)
Add a symbol to be skipped to this tokenizer.
HFSTDLL HfstTokenizer ()
Create a tokenizer that recognizes utf-8 symbols.
HFSTDLL StringPairVector tokenize (const std::string &input_string) const
Tokenize the string input_string.
HFSTDLL StringPairVector tokenize (const std::string &input_string, const std::string &output_string) const
Tokenize the string pair input_string : output_string.
HFSTDLL StringVector tokenize_one_level (const std::string &input_string) const
Tokenize the string input_string.
Static Public Member Functions
static HFSTDLL void check_utf8_correctness (const std::string &input_string)
If input_string is not valid UTF-8, throw an IncorrectUtf8CodingException.
A tokenizer for creating transducers from UTF-8 strings.
Strings are tokenized from left to right using longest-match tokenization. For example, if the tokenizer contains a multicharacter symbol "foo" and a skip symbol "fo", the string "foo" is tokenized as "foo:foo". If the tokenizer instead contains a multicharacter symbol "fo" and a skip symbol "foo", the whole string "foo" is skipped and the result is an empty tokenization.
An example:
HfstTokenizer TOK;
TOK.add_multichar_symbol("<br />");
TOK.add_skip_symbol("<p>");
TOK.add_skip_symbol("</p>");
StringPairVector spv = TOK.tokenize("<p>A<br />paragraph!</p>");
// spv now contains
// A:A <br />:<br /> p:p a:a r:r a:a g:g r:r a:a p:p h:h !:!
Note: The tokenizer only tokenizes UTF-8 strings. Special symbols (see String) are not included in the tokenizer unless added to it.
See also: hfst::HfstTransducer::HfstTransducer(const std::string&, const HfstTokenizer&, ImplementationType type)
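The longest-match and skip-symbol rules described above can be modeled with a small standalone sketch. It neither uses nor reproduces the HFST implementation: `sketch_tokenize` and its helpers are hypothetical names, symbols are matched bytewise, the input is assumed to be valid UTF-8, and a skip symbol is assumed to win when it ties in length with a multicharacter symbol.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: byte length of the longest entry in `symbols` that
// matches `s` starting at byte `pos`, or 0 if no entry matches there.
static std::size_t longest_match(const std::vector<std::string> &symbols,
                                 const std::string &s, std::size_t pos) {
    std::size_t best = 0;
    for (const std::string &sym : symbols) {
        if (sym.size() > best && s.compare(pos, sym.size(), sym) == 0) {
            best = sym.size();
        }
    }
    return best;
}

// Hypothetical helper: byte length of the UTF-8 character starting at
// s[pos], judged from its lead byte (assumes valid UTF-8 input).
static std::size_t utf8_length(const std::string &s, std::size_t pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 1;
    if ((lead & 0xE0) == 0xC0) len = 2;
    else if ((lead & 0xF0) == 0xE0) len = 3;
    else if ((lead & 0xF8) == 0xF0) len = 4;
    return std::min(len, s.size() - pos);
}

// Simplified model of left-to-right, longest-match tokenization.
// Not the HFST implementation; one-level tokens only.
std::vector<std::string> sketch_tokenize(
        const std::string &input,
        const std::vector<std::string> &multichar_symbols,
        const std::vector<std::string> &skip_symbols) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::size_t skip_len = longest_match(skip_symbols, input, pos);
        const std::size_t multi_len = longest_match(multichar_symbols, input, pos);
        if (skip_len > 0 && skip_len >= multi_len) {
            pos += skip_len;  // drop the skip symbol, then match afresh
            continue;
        }
        // Fall back to a single UTF-8 character when no multicharacter
        // symbol starts here.
        const std::size_t len = multi_len > 0 ? multi_len : utf8_length(input, pos);
        assert(len > 0);  // guarantees the loop makes progress
        tokens.push_back(input.substr(pos, len));
        pos += len;
    }
    return tokens;
}
```

Under these assumptions, `sketch_tokenize("fobaro", {"foo"}, {"bar"})` yields "f" "o" "o": the skip symbol interrupts the match, so "foo" is never recognized as a single symbol.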
HfstTokenizer ()
Create a tokenizer that recognizes utf-8 symbols.
void add_multichar_symbol (const std::string &symbol)
Add a multicharacter symbol symbol to this tokenizer.
If a skip symbol occurs in the input inside what would otherwise be a multicharacter symbol, that span is not recognized as the multicharacter symbol. For example, with a multicharacter symbol "foo" and a skip symbol "bar", the string "fobaro" is tokenized as "f" "o" "o", not "foo".
void add_skip_symbol (const std::string &symbol)
Add a symbol to be skipped to this tokenizer.
After a symbol is skipped, tokenization always starts afresh at the position following it. For example, with a multicharacter symbol "foo" and a skip symbol "bar", the string "fobaro" is tokenized as "f" "o" "o", not "foo".
static void check_utf8_correctness (const std::string &input_string)
If input_string is not valid UTF-8, throw an IncorrectUtf8CodingException.
A string is non-valid if:
StringPairVector tokenize (const std::string &input_string) const
Tokenize the string input_string. Each token is paired with itself, as in the class example above (A:A).
StringPairVector tokenize (const std::string &input_string, const std::string &output_string) const
Tokenize the string pair input_string : output_string.
If one string has more tokens than the other, epsilons are appended to the tokenized string with fewer tokens so that both tokenized strings have the same number of tokens.
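The padding rule can be illustrated with a short standalone sketch. It is a model of the described behavior, not HFST code: `pair_with_padding` and `TokenPair` are hypothetical names, and the epsilon string is passed in by the caller because the library's actual epsilon symbol is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an input:output token pair.
using TokenPair = std::pair<std::string, std::string>;

// Pair two already-tokenized strings position by position, appending
// `epsilon` to the shorter one so both sides have equal length.
std::vector<TokenPair> pair_with_padding(std::vector<std::string> input,
                                         std::vector<std::string> output,
                                         const std::string &epsilon) {
    const std::size_t n = std::max(input.size(), output.size());
    input.resize(n, epsilon);   // no-op if already the longer side
    output.resize(n, epsilon);
    assert(input.size() == output.size());

    std::vector<TokenPair> pairs;
    pairs.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        pairs.emplace_back(input[i], output[i]);
    }
    return pairs;
}
```

With the placeholder epsilon "<eps>", pairing the tokens of "cat" with those of "cats" gives c:c a:a t:t <eps>:s. The placeholder is chosen only for readability in this sketch.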
StringVector tokenize_one_level (const std::string &input_string) const
Tokenize the string input_string into a StringVector of single tokens rather than a vector of token pairs.