Interface Tokenizer

All Known Implementing Classes:
SimpleTokenizer

public interface Tokenizer
Language-sensitive tokenization of a text string.
Author:
Mathias Mølster Lidal
  • Method Details

    • tokenize

      @Deprecated default Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
      Deprecated.
      Use tokenize(String, LinguisticsParameters) instead.
      Tokenizes the given input string.
      Parameters:
      input - the string to tokenize. May be arbitrarily large.
      language - the language of the input string.
      stemMode - the stem mode applied to the returned tokens.
      removeAccents - whether to normalize accents and similar diacritical marks.
      Returns:
      the tokens of the input String
      Throws:
      ProcessingException - If the underlying library throws an Exception.
    • tokenize

      default Iterable<Token> tokenize(String input, LinguisticsParameters parameters)
      Tokenizes the given input string. This default implementation calls the deprecated tokenize(input, language, stemMode, removeAccents).
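
The delegation described above, where the parameter-object overload forwards to the four-argument method, can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code. Language, StemMode, Token and LinguisticsParameters are simplified, hypothetical stand-ins for the real types, their fields and enum constants are assumed, and WhitespaceTokenizer and tokenTexts are invented for the example. The four-argument method is left abstract here for brevity, whereas the interface documented above declares it as a default method.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerDelegationSketch {

    // Hypothetical stand-ins: the real Language, StemMode, Token and
    // LinguisticsParameters types are richer than these.
    enum Language { ENGLISH, NORWEGIAN_BOKMAL }
    enum StemMode { NONE, ALL }

    static final class Token {
        final String text;
        Token(String text) { this.text = text; }
    }

    static final class LinguisticsParameters {
        final Language language;
        final StemMode stemMode;
        final boolean removeAccents;

        LinguisticsParameters(Language language, StemMode stemMode, boolean removeAccents) {
            this.language = language;
            this.stemMode = stemMode;
            this.removeAccents = removeAccents;
        }
    }

    interface Tokenizer {
        // Older entry point. Abstract in this sketch (an assumption made for brevity).
        @Deprecated
        Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents);

        // Newer entry point: unpacks the parameter object and delegates,
        // so implementations of the older method keep working unchanged.
        default Iterable<Token> tokenize(String input, LinguisticsParameters parameters) {
            return tokenize(input, parameters.language, parameters.stemMode, parameters.removeAccents);
        }
    }

    // Illustrative implementation that only overrides the older method.
    // It splits on whitespace and ignores language, stemMode and removeAccents.
    static final class WhitespaceTokenizer implements Tokenizer {
        @Override
        public Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents) {
            List<Token> tokens = new ArrayList<>();
            for (String part : input.trim().split("\\s+")) {
                if (!part.isEmpty()) tokens.add(new Token(part));
            }
            return tokens;
        }
    }

    // Calls the newer overload and collects the token texts.
    static List<String> tokenTexts(String input) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        LinguisticsParameters parameters =
                new LinguisticsParameters(Language.ENGLISH, StemMode.NONE, false);
        List<String> texts = new ArrayList<>();
        for (Token token : tokenizer.tokenize(input, parameters)) {
            texts.add(token.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(tokenTexts("Language sensitive tokenization"));
    }
}
```

Callers in this sketch use only the parameter-object overload. An implementation written against the older signature still serves them through the default method, which is the usual reason for introducing a new overload this way.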