All Classes and Interfaces
Class
Description
Determines the class of a given character.
CharacterUtils provides a unified interface to Character-related
operations to implement backwards compatible character operations.
A simple IO buffer to use with CharacterUtils.fill(CharacterBuffer, Reader).
A chunker splits a text string into multiple smaller strings (chunks).
Default implementation of SignificanceModelRegistry.
Exception that is thrown when detection fails.
Abstract superclass of all Detectors used for language and encoding detection.
An embedder converts a text string to a tensor.
Runtime that is injectable through Embedder constructor.
Generates field values given an input text.
A chunker which splits a text into chunks at the first double non-letter/digit character after a given
target chunk length measured in characters (or precisely at that length, for CJK languages).
A class which splits consecutive word character sequences into overlapping character n-grams.
An immutable start index and length pair.
A hint that can be given to a Detector.
Context of an invocation of a component carrying out a processing task.
A stemmer implementing the Kstem algorithm by Bob Krovetz.
Factory of linguistic processors.
This class provides a case normalization operation to be used e.g. when
document search should be case-insensitive.
Parameters to a linguistics operation.
This interface provides NFKC normalization of Strings through the underlying linguistics library.
A StringBuilder that allows one to access the array.
Exception class indicating that a fatal error occurred during linguistic processing.
A segmenter splits a string into separate segments (such as words) without applying any further
processing (such as stemming) on each segment.
A chunker which splits a text into sentences.
Includes functionality for determining the langCode from a sample or from the encoding.
Factory of simple linguistic processor implementations.
A tokenizer which splits on whitespace, normalizes and transforms using the given implementations
and stems using the Kstem algorithm.
Converts all accented characters into their de-accented counterparts followed by their combining diacritics, then
strips off the diacritics using a regex.
Immutable named lists of "special tokens" - strings which should override the normal tokenizer semantics
and be tokenized into a single token.
An immutable list of special tokens - strings which should override the normal tokenizer semantics
and be tokenized into a single token.
An immutable special token.
A list of strings which does not allow for duplicate elements.
Interface providing stemming of single words.
An enum of the stemming modes which can be requested.
A single token produced by the tokenizer.
Language-sensitive tokenization of a text string.
List of token scripts (e.g. Latin, Japanese, Chinese) which may warrant different
linguistics treatment.
An enumeration of token types.
Interface for providers of text transformations such as accent removal.
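
Two of the descriptions above name concrete techniques: splitting word character sequences into overlapping character n-grams, and removing accents by decomposing characters and stripping the diacritics with a regex. The following is a minimal sketch of both using only the JDK, not the library's own implementation. The class name `LinguisticsSketch` and the methods `stripAccents` and `gramsOf` are hypothetical, and two behaviors are assumptions made for illustration: a word shorter than the gram size is kept whole, and a "word character" is taken to be anything `Character.isLetterOrDigit` accepts.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the documented API.
public class LinguisticsSketch {

    /** Decomposes accented characters (NFD), then strips the combining diacritics with a regex. */
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    /**
     * Splits each run of letters/digits into overlapping n-grams.
     * Assumption: a run shorter than or equal to n is emitted whole.
     * Works on UTF-16 chars, so surrogate pairs are not handled.
     */
    public static List<String> gramsOf(String text, int n) {
        if (n < 1) throw new IllegalArgumentException("n must be positive");
        List<String> grams = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators between word character sequences
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
            String word = text.substring(start, i);
            if (word.length() <= n) {
                grams.add(word);
            } else {
                // slide a window of width n one character at a time
                for (int j = 0; j + n <= word.length(); j++) {
                    grams.add(word.substring(j, j + n));
                }
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("cr\u00e8me br\u00fbl\u00e9e")); // creme brulee
        System.out.println(gramsOf("hello, world", 3)); // [hel, ell, llo, wor, orl, rld]
    }
}
```

Decomposing first matters: the regex only matches standalone combining marks, so precomposed characters such as U+00E9 would pass through unchanged without the NFD step.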