Class FixedLengthChunker

java.lang.Object
ai.vespa.language.chunker.FixedLengthChunker
All Implemented Interfaces:
Chunker

public class FixedLengthChunker extends Object implements Chunker
A chunker that splits a text into chunks at the first double non-letter/digit character after a given target chunk length, measured in characters. For CJK languages the split is made precisely at the target length. If there is no double non-letter/digit character within 5% above the target length, the chunk is split at the first single non-letter/digit character instead. If no such split point is found within 10% above the target length, the chunk is split at that position, so the absolute maximum chunk length is 10% above the target length. Because the text length will typically not be an integer multiple of the target chunk length, the given target chunk length is adjusted down to give a more even distribution of chunk lengths.
Author:
bratseth
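The splitting rule described above can be sketched in plain Java. This is an illustrative sketch only, not the Vespa implementation: the class FixedLengthSplitSketch and its methods adjustedTarget, isBreak, chunkEnd and split are hypothetical names. It assumes that "double" means two consecutive non-letter/digit characters, that a single non-letter/digit character is accepted only once the chunk is at least 5% over the target, that a chunk ends right after the first separator character, and that the target is lowered by dividing the text evenly over the minimum number of chunks. CJK handling is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the documented splitting rule; not Vespa code.
final class FixedLengthSplitSketch {

    /** Assumed adjustment: spread the text evenly over the minimum number of chunks. */
    static int adjustedTarget(int textLength, int targetLength) {
        int chunkCount = Math.max(1, (textLength + targetLength - 1) / targetLength);
        return Math.max(1, (textLength + chunkCount - 1) / chunkCount);
    }

    /** True if position i exists and holds a character that is neither a letter nor a digit. */
    static boolean isBreak(String text, int i) {
        return i < text.length() && !Character.isLetterOrDigit(text.charAt(i));
    }

    /** Returns the exclusive end index of the chunk that starts at 'start'. */
    static int chunkEnd(String text, int start, int target) {
        if (start + target >= text.length()) return text.length(); // last chunk
        int softLimit = start + target + target * 5 / 100;  // 5% above target
        int hardLimit = Math.min(text.length(), start + target + target * 10 / 100); // 10% above
        for (int i = start + target; i < hardLimit; i++) {
            if (!isBreak(text, i)) continue;
            boolean isDouble = isBreak(text, i + 1);
            // A double separator is accepted anywhere past the target,
            // a single one only once the 5% limit has been reached.
            if (isDouble || i >= softLimit) return i + 1;
        }
        return hardLimit; // no usable separator: hard split at 10% above target
    }

    /** Splits the whole text; the chunks are contiguous and do not overlap. */
    static List<String> split(String text, int targetLength) {
        List<String> chunks = new ArrayList<>();
        int target = adjustedTarget(text.length(), targetLength);
        int start = 0;
        while (start < text.length()) {
            int end = chunkEnd(text, start, target);
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String chunk : split("a".repeat(1000), 300))
            System.out.println(chunk.length());
    }
}
```

Under these assumptions, a text of 1000 characters with a requested target of 300 is split using an adjusted target of 250, since four chunks are needed in any case and 250 spreads the text more evenly than 300, 300, 300, 100.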
  • Constructor Details

    • FixedLengthChunker

      public FixedLengthChunker()
  • Method Details

    • chunk

      public List<Chunker.Chunk> chunk(String inputText, Chunker.Context context)
      Description copied from interface: Chunker
      Splits a text into multiple chunks. The chunks should preferably contain all the content of the original text, and may overlap.
      Specified by:
      chunk in interface Chunker
      Parameters:
      inputText - the text to split into chunks
      context - the context which may influence a chunker's behavior
      Returns:
      the resulting chunks