public interface Encoding

| Modifier and Type | Method and Description |
|---|---|
| int | countTokens(java.lang.String text): Encodes the given text into a list of token ids and returns the number of tokens. |
| int | countTokensOrdinary(java.lang.String text): Encodes the given text into a list of token ids, treating special tokens as ordinary text, and returns the number of tokens. |
| java.lang.String | decode(java.util.List<java.lang.Integer> tokens): Decodes the given list of token ids into a text. |
| byte[] | decodeBytes(java.util.List<java.lang.Integer> tokens): Decodes the given list of token ids into a byte array. |
| java.util.List<java.lang.Integer> | encode(java.lang.String text): Encodes the given text into a list of token ids. |
| java.util.List<java.lang.Integer> | encodeOrdinary(java.lang.String text): Encodes the given text into a list of token ids, ignoring special tokens. |
| java.lang.String | getName(): Returns the name of this encoding. |

java.util.List<java.lang.Integer> encode(java.lang.String text)
Encodes the given text into a list of token ids.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Parsing special tokens in a text is not currently supported, so if the text contains special tokens, this method throws an UnsupportedOperationException.
To encode special tokens as ordinary text, use encodeOrdinary(String).
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.encode("hello world");
// returns [15339, 1917]
encoding.encode("hello <|endoftext|> world");
// raises an UnsupportedOperationException
Parameters:
text - the text to encode
Throws:
java.lang.UnsupportedOperationException - if the text contains special tokens, which are not yet supported

java.util.List<java.lang.Integer> encodeOrdinary(java.lang.String text)
Encodes the given text into a list of token ids, ignoring special tokens.
This method does not throw an exception if the text contains special tokens; it encodes them as if they were ordinary text.
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.encodeOrdinary("hello world");
// returns [15339, 1917]
encoding.encodeOrdinary("hello <|endoftext|> world");
// returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
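The difference between the two encode methods suggests a simple fallback pattern: try the strict method first and fall back to the ordinary one only when special tokens are rejected. The following is a minimal, self-contained sketch of that pattern. `TextEncoder` and `ToyEncoder` are hypothetical stand-ins written for this example (the toy encoder emits one id per code point and is not a real tokenizer); only the two method names and the exception type are taken from the documentation above.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SpecialTokenFallback {

    /** Hypothetical stand-in exposing the two encode methods described above. */
    interface TextEncoder {
        List<Integer> encode(String text);
        List<Integer> encodeOrdinary(String text);
    }

    /** Toy encoder for illustration only: one token id per Unicode code point. */
    static final class ToyEncoder implements TextEncoder {
        private static final String SPECIAL = "<|endoftext|>";

        @Override
        public List<Integer> encode(String text) {
            // Mirror the documented contract: reject text containing a special token.
            if (text.contains(SPECIAL)) {
                throw new UnsupportedOperationException("special tokens are not supported");
            }
            return encodeOrdinary(text);
        }

        @Override
        public List<Integer> encodeOrdinary(String text) {
            // Treat every character, including those of a special token, as ordinary text.
            return text.codePoints().boxed().collect(Collectors.toList());
        }
    }

    /** Tries encode first; falls back to encodeOrdinary if special tokens are rejected. */
    static List<Integer> encodeLenient(TextEncoder encoder, String text) {
        try {
            return encoder.encode(text);
        } catch (UnsupportedOperationException e) {
            return encoder.encodeOrdinary(text);
        }
    }

    public static void main(String[] args) {
        TextEncoder encoder = new ToyEncoder();
        System.out.println(encodeLenient(encoder, "hi"));               // prints [104, 105]
        System.out.println(encodeLenient(encoder, "hi <|endoftext|>")); // falls back, no exception
    }
}
```

Whether such a fallback is appropriate depends on the input: it is only reasonable when treating special tokens in the text as ordinary characters is acceptable.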
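Parameters:
text - the text to encode

int countTokens(java.lang.String text)
Encodes the given text into a list of token ids and returns the number of tokens.
This is a convenience method for encode(String), if all you want is to know the number of tokens. It is not more performant than encode(String), so prefer encode(String) if you actually need the tokens.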
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.countTokens("hello world");
// returns 2
encoding.countTokens("hello <|endoftext|> world");
// raises an UnsupportedOperationException
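A typical reason to want only the count is checking text against a token limit. The sketch below shows one way this might look. `TokenBudget`, `fitsBudget`, and the whitespace-based counter are hypothetical and exist only to keep the example runnable without the library. Given the signature documented here, a method reference such as `encoding::countTokens` would fit the `ToIntFunction<String>` parameter.

```java
import java.util.function.ToIntFunction;

public class TokenBudget {

    /** Returns true if the text has at most maxTokens tokens according to the given counter. */
    static boolean fitsBudget(ToIntFunction<String> tokenCounter, String text, int maxTokens) {
        return tokenCounter.applyAsInt(text) <= maxTokens;
    }

    /** Stand-in counter for illustration only: counts whitespace-separated words, not real tokens. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        ToIntFunction<String> counter = TokenBudget::countWords;
        System.out.println(fitsBudget(counter, "hello world", 2)); // prints true
        System.out.println(fitsBudget(counter, "hello world", 1)); // prints false
    }
}
```

Note that a strict counter can throw for text containing special tokens, as in the example above, so a caller may want to handle that case or use the ordinary variant.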
Parameters:
text - the text to count tokens for
Throws:
java.lang.UnsupportedOperationException - if the text contains special tokens, which are not yet supported

int countTokensOrdinary(java.lang.String text)
Encodes the given text into a list of token ids, treating special tokens as ordinary text, and returns the number of tokens.
This is a convenience method for encodeOrdinary(String), if all you want is to know the number of tokens. It is not more performant than encodeOrdinary(String), so prefer encodeOrdinary(String) if you actually need the tokens.
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.countTokensOrdinary("hello world");
// returns 2
encoding.countTokensOrdinary("hello <|endoftext|> world");
// returns 8
Parameters:
text - the text to count tokens for
Throws:
java.lang.UnsupportedOperationException - if the text contains special tokens, which are not yet supported

java.lang.String decode(java.util.List<java.lang.Integer> tokens)
Decodes the given list of token ids into a text.
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.decode(List.of(15339, 1917));
// returns "hello world"
encoding.decode(List.of(15339, 1917, Integer.MAX_VALUE));
// raises an IllegalArgumentException
Parameters:
tokens - the list of token ids
Throws:
java.lang.IllegalArgumentException - if the list contains invalid token ids

byte[] decodeBytes(java.util.List<java.lang.Integer> tokens)
Decodes the given list of token ids into a byte array.
Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
encoding.decodeBytes(List.of(15339, 1917));
// returns [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
encoding.decodeBytes(List.of(15339, 1917, Integer.MAX_VALUE));
// raises an IllegalArgumentException
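The byte values in the example above are the UTF-8 bytes of "hello world". The following JDK-only sketch (the class and method names are hypothetical, and UTF-8 is an assumption that is consistent with the example values) converts such a byte array to text. It also illustrates one plausible reason to work with raw bytes: if a byte sequence is cut in the middle of a multi-byte character, converting it to a String yields a replacement character instead of the original text.

```java
import java.nio.charset.StandardCharsets;

public class DecodedBytesAsText {

    /** Interprets the given bytes as UTF-8 text; malformed input becomes U+FFFD. */
    static String toText(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The byte values listed in the decodeBytes example.
        byte[] hello = {104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100};
        System.out.println(toText(hello)); // prints "hello world"

        // "é" takes two bytes in UTF-8; keeping only the first one is malformed input.
        byte[] full = "é".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = {full[0]};
        System.out.println(toText(truncated).equals("\uFFFD")); // prints true
    }
}
```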
Parameters:
tokens - the list of token ids
Throws:
java.lang.IllegalArgumentException - if the list contains invalid token ids

java.lang.String getName()
Returns the name of this encoding. This name identifies the encoding in the EncodingRegistry.