Package ai.vespa.language.chunker
Class FixedLengthChunker
java.lang.Object
ai.vespa.language.chunker.FixedLengthChunker
- All Implemented Interfaces:
Chunker
A chunker which splits a text into chunks at the first double non-letter/digit character after a given
target chunk length measured in characters (or precisely at that length, for CJK languages).
If there are no double non-letter/digit characters within 5% above the target length,
the chunk split will be at the first single non-letter/digit character.
If there are no non-letter/digit characters at all within 10% above the target length,
the chunk split will be at that position, so the absolute max chunk length will be 10% above the target
length.
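The split rule described above can be sketched roughly as follows. This is an illustrative reading of the description, not the actual FixedLengthChunker source: the class SplitPointSketch and the methods isSeparator and findSplit are hypothetical names, the split is assumed to fall just after the separator, and CJK handling and surrogate pairs are ignored.

```java
class SplitPointSketch {

    /** True if index i is inside the text and holds a non-letter/digit character. */
    static boolean isSeparator(String text, int i) {
        return i < text.length() && !Character.isLetterOrDigit(text.charAt(i));
    }

    /** Returns the exclusive end index of the chunk starting at 'start' (assumed rule). */
    static int findSplit(String text, int start, int target) {
        int from = start + target;
        if (from >= text.length()) return text.length(); // the rest fits in one chunk

        int softMax = Math.min(text.length(), from + target / 20); // 5% above target
        int hardMax = Math.min(text.length(), from + target / 10); // 10% above target

        // 1. Prefer a double separator within 5% above the target.
        for (int i = from; i < softMax; i++)
            if (isSeparator(text, i) && isSeparator(text, i + 1)) return i + 2;

        // 2. Otherwise take the first single separator within 10% above the target.
        for (int i = from; i < hardMax; i++)
            if (isSeparator(text, i)) return i + 1;

        // 3. Otherwise cut hard at 10% above the target.
        return hardMax;
    }

    public static void main(String[] args) {
        String text = "a".repeat(102) + ". " + "a".repeat(96);
        System.out.println(findSplit(text, 0, 100)); // prints 104
    }
}
```

In this sketch no chunk exceeds the target length by more than 10%, which matches the stated absolute maximum.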
The given target chunk length is adjusted down to give a more even chunk length distribution, since the
text length will typically not be an integer multiple of the target chunk length.
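One plausible way to make such an adjustment is sketched below. The formula is an assumption chosen for illustration; the source does not state how FixedLengthChunker computes the adjusted length, and EvenTargetSketch and adjustedTarget are hypothetical names.

```java
class EvenTargetSketch {

    /**
     * Assumed adjustment: keep the number of chunks the original target would give,
     * then spread the text evenly over that many chunks. Requires target > 0.
     */
    static int adjustedTarget(int textLength, int target) {
        if (textLength <= target) return target;          // a single chunk, nothing to even out
        int chunks = (textLength + target - 1) / target;  // ceil(textLength / target)
        return (textLength + chunks - 1) / chunks;        // ceil(textLength / chunks)
    }

    public static void main(String[] args) {
        // 1050 characters with target 500: chunks of about 350 instead of 500, 500, 50.
        System.out.println(adjustedTarget(1050, 500)); // prints 350
    }
}
```

With this formula the adjusted value never exceeds the given target, so the adjustment only ever moves the target down.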
- Author:
- bratseth
-
Nested Class Summary
Nested classes/interfaces inherited from interface com.yahoo.language.process.Chunker
Chunker.Chunk, Chunker.Context, Chunker.FailingChunker -
Field Summary
Fields inherited from interface com.yahoo.language.process.Chunker
defaultChunkerId, throwsOnUse -
Constructor Summary
Constructors -
Method Summary
Method: chunk(String inputText, Chunker.Context context)
Description: Splits a text into multiple chunks.
-
Constructor Details
-
FixedLengthChunker
public FixedLengthChunker()
-
Method Details
-
chunk
Description copied from interface: Chunker
Splits a text into multiple chunks. The chunks should preferably contain all the content of the original text, and can be overlapping.
-