Package com.yahoo.text
Class Utf8
java.lang.Object
com.yahoo.text.Utf8
Utility class with functions for handling UTF-8
- Author:
- arnej27959, Steinar Knutsen, baldersheim
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
static int byteCount(CharSequence str): Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array.
static int byteCount(CharSequence str, int offset, int length): Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array.
static int[] calculateBytePositions(value): Returns an integer array of the same length as the input string plus one.
static int[] calculateStringPositions(byte[] utf8): Returns an array of the same length as the input array plus one.
static int codePointAsUtf8Length(int codepoint): Return the number of octets needed to encode a valid Unicode codepoint as UTF-8.
static byte[] encode(int codepoint): Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a newly allocated array.
static int encode(int codepoint, byte[] destination, int offset): Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an array.
static int encode(int codepoint, OutputStream destination): Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an OutputStream.
static void encode(int codepoint, ByteBuffer destination): Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a ByteBuffer.
static Charset getCharset(): Returns the Charset instance for UTF-8.
static CharsetEncoder getNewEncoder(): Create a new UTF-8 encoder.
static byte[] toAsciiBytes(boolean v)
static byte[] toAsciiBytes(long l): Encode a long as its decimal representation, i.e. toAsciiBytes(15L) will return "15" encoded as UTF-8.
static byte[] toBytes(string): Encode a string as UTF-8.
static byte[] toBytes(str, offset, length): Utility method as toBytes(String).
static int toBytes(str, srcOffset, srcLen, dst, dstOffset): Direct encoding of a String into an array.
static void toBytes(String src, int srcOffset, int srcLen, ByteBuffer dst, CharsetEncoder encoder): Encode a string directly into a ByteBuffer instance.
static byte[] toBytesStd(String str): Uses String.getBytes directly.
static String toString(byte[] utf8): Decode a UTF-8 string.
static String toString(byte[] data, int offset, int length): Utility method as toString(byte[]).
static String toString(ByteBuffer data): Fetch a string from a ByteBuffer instance.
static String toStringStd(byte[] data): To be used instead of String.String(byte[] bytes).
static int totalBytes(byte firstByte): Inspects a byte assumed to be the first byte of a UTF-8 encoded character to check how many bytes in total the sequence will use.
static int unitCount(byte firstByte): Calculate the number of Unicode code units ("UTF-16 characters") needed to represent a given UTF-8 encoded code point.
static int unitCount(byte[] utf8): Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters.
static int unitCount(byte[] utf8, int offset, int length): Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters.
-
Constructor Details
-
Utf8
public Utf8()
-
-
Method Details
-
getCharset
Returns the Charset instance for UTF-8.
-
toStringStd
To be used instead of String.String(byte[] bytes).
-
toString
Utility method as toString(byte[]).
- Parameters:
data- bytes to decode
offset- index of first byte to decode
length- number of bytes to decode
- Returns:
- String decoded from UTF-8
-
toString
Fetch a string from a ByteBuffer instance. ByteBuffer instances are stateful, so it is assumed that the caller adjusts the instance's limit if the entire buffer is not a string.
- Parameters:
data- the UTF-8 data source
- Returns:
- a decoded String
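Because the buffer's position and limit decide what gets decoded, a caller holding a larger buffer would typically narrow a view before decoding. The following is a minimal sketch using only the JDK decoder, not Utf8.toString(ByteBuffer) itself; the names BufferDecodeSketch and decodeNext are hypothetical, and whether the real method advances the caller's position is not stated here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class BufferDecodeSketch {
    /**
     * Decodes byteLength bytes starting at the buffer's current position.
     * Works on a duplicate, so the caller's position and limit are untouched
     * (an assumption of this sketch, not documented behaviour of Utf8).
     */
    static String decodeNext(ByteBuffer data, int byteLength) {
        ByteBuffer view = data.duplicate();
        view.limit(view.position() + byteLength); // narrow the view to one string
        return StandardCharsets.UTF_8.decode(view).toString();
    }
}
```

Decoding through a duplicate keeps the shared buffer's state predictable when several strings are packed back to back.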
-
toBytesStd
Uses String.getBytes directly.
-
toAsciiBytes
public static byte[] toAsciiBytes(long l)
Encode a long as its decimal representation, i.e. toAsciiBytes(15L) will return "15" encoded as UTF-8. In other words, it is an optimized version of String.valueOf() followed by UTF-8 encoding, which avoids going through a String in order to get a simple UTF-8 sequence.
- Parameters:
l- value to represent as a decimal number encoded as UTF-8
- Returns:
- byte array
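One way to produce the decimal digits without an intermediate String is to fill a byte array from the right. This is a sketch under that assumption only; AsciiLongSketch is a hypothetical name and the library's actual implementation may differ.

```java
final class AsciiLongSketch {
    /** Returns the decimal representation of value as ASCII (and thus UTF-8) bytes. */
    static byte[] toAsciiBytes(long value) {
        boolean negative = value < 0;
        // Work with a non-positive number so that Long.MIN_VALUE needs no special case.
        long rest = negative ? value : -value;
        int digits = 1;
        for (long t = rest; t <= -10; t /= 10) {
            digits++;
        }
        byte[] out = new byte[digits + (negative ? 1 : 0)];
        int pos = out.length;
        do {
            // rest % 10 is in -9..0 here, so subtracting it yields '0'..'9'.
            out[--pos] = (byte) ('0' - (rest % 10));
            rest /= 10;
        } while (rest != 0);
        if (negative) {
            out[0] = '-';
        }
        return out;
    }
}
```

The result should match Long.toString(value) encoded as ASCII, which is a convenient way to check the sketch.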
-
toAsciiBytes
public static byte[] toAsciiBytes(boolean v)
-
toBytes
Encode a string as UTF-8.
- Parameters:
string- the string to encode
- Returns:
- Utf8 encoded array
-
toString
Decode a UTF-8 string.
- Parameters:
utf8- the bytes to decode
- Returns:
- the decoded String
-
toBytes
Utility method as toBytes(String).
- Parameters:
str- String to encode
offset- index of first character to encode
length- number of characters to encode
- Returns:
- substring encoded as UTF-8
-
toBytes
Direct encoding of a String into an array.
- Parameters:
str- string to encode
srcOffset- index of first character in string to encode
srcLen- number of characters in string to encode
dst- destination for encoded data
dstOffset- index of first position to write data
- Returns:
- the number of bytes written to the array.
-
toBytes
public static void toBytes(String src, int srcOffset, int srcLen, ByteBuffer dst, CharsetEncoder encoder)
Encode a string directly into a ByteBuffer instance.
This method is somewhat more cumbersome than the rest of the helper methods in this library, as it is intended for use cases in the following style, where extraneous copying is highly undesirable:
    String[] a = {"abc", "def", "ghiè"};
    // Lengths in UTF-8 bytes: "ghiè" is 4 characters but 5 bytes.
    int[] aLens = {3, 3, 5};
    CharsetEncoder ce = Utf8.getNewEncoder();
    // someNumber is a capacity chosen by the caller.
    ByteBuffer forWire = ByteBuffer.allocate(someNumber);
    for (int i = 0; i < a.length; i++) {
        forWire.putInt(aLens[i]);
        Utf8.toBytes(a[i], 0, a[i].length(), forWire, ce);
    }
- Parameters:
src- the string to encode
srcOffset- index of first character to encode
srcLen- number of characters to encode
dst- the destination ByteBuffer
encoder- the character encoder to use
- See Also:
-
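For the ByteBuffer variant of toBytes, the same effect can be approximated with the JDK alone by wrapping the substring in a CharBuffer and encoding it straight into the destination. This is only a sketch of the idea; EncodeIntoBufferSketch and encodeInto are hypothetical names, and the error handling shown is an assumption rather than the library's documented behaviour.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

final class EncodeIntoBufferSketch {
    /** Encodes srcLen chars of src, starting at srcOffset, directly into dst. */
    static void encodeInto(String src, int srcOffset, int srcLen,
                           ByteBuffer dst, CharsetEncoder encoder) {
        // A read-only view of the substring: no characters are copied.
        CharBuffer in = CharBuffer.wrap(src, srcOffset, srcOffset + srcLen);
        encoder.reset(); // the encoder is reused between calls
        try {
            CoderResult result = encoder.encode(in, dst, true);
            if (!result.isUnderflow()) {
                result.throwException(); // overflow or malformed input
            }
            result = encoder.flush(dst);
            if (!result.isUnderflow()) {
                result.throwException();
            }
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reusing one CharsetEncoder across calls is what avoids per-string allocation; the encoder is stateful, so it should not be shared between threads.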
getNewEncoder
Create a new UTF-8 encoder.
-
byteCount
Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array. This method is written to be cheap to invoke. Note: It is strongly assumed that the character sequence is valid.
-
byteCount
Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array. This method is written to be cheap to invoke. Note: It is strongly assumed that the character sequence is valid.
-
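Counting UTF-8 bytes for a char sequence needs no encoding pass: each char falls into a size class, and a surrogate pair contributes four bytes in total. The sketch below assumes well-formed input, as the note above does; ByteCountSketch is a hypothetical name and not the library's code.

```java
final class ByteCountSketch {
    /** Counts the UTF-8 bytes needed for length chars of str starting at offset. */
    static int byteCount(CharSequence str, int offset, int length) {
        int count = 0;
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            char c = str.charAt(i);
            if (c < 0x80) {
                count += 1;                 // ASCII
            } else if (c < 0x800) {
                count += 2;                 // two-byte range
            } else if (Character.isHighSurrogate(c)) {
                count += 4;                 // surrogate pair: one 4-byte sequence
                i++;                        // skip the low surrogate (input assumed valid)
            } else {
                count += 3;                 // rest of the Basic Multilingual Plane
            }
        }
        return count;
    }
}
```

For valid input the result should equal the length of the array returned by String.getBytes(StandardCharsets.UTF_8) for the same substring.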
unitCount
public static int unitCount(byte[] utf8)
Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters. This method is written to be cheap to invoke. Note: It is strongly assumed the sequence is valid.
-
unitCount
public static int unitCount(byte[] utf8, int offset, int length)
Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters. This method is written to be cheap to invoke. Note: It is strongly assumed the sequence is valid.
- Parameters:
utf8- raw data
offset- index of first byte of UTF-8 sequence to check
length- number of bytes in the UTF-8 sequence to check
-
unitCount
public static int unitCount(byte firstByte)
Calculate the number of Unicode code units ("UTF-16 characters") needed to represent a given UTF-8 encoded code point.
- Parameters:
firstByte- the first byte of a character encoded as UTF-8
- Returns:
- the number of UTF-16 code units needed to represent the given code point
-
totalBytes
public static int totalBytes(byte firstByte)
Inspects a byte assumed to be the first byte of a UTF-8 encoded character to check how many bytes in total the sequence will use.
- Parameters:
firstByte- the first byte of a UTF-8 encoded character
- Returns:
- the number of bytes used to encode the character
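Both totalBytes and the unitCount variants can be derived from the high bits of the lead byte alone. The sketch below illustrates that idea for sequences of up to four bytes; LeadByteSketch is a hypothetical name, and throwing IllegalArgumentException for a byte that is not a lead byte is an assumption of this sketch, not documented behaviour.

```java
final class LeadByteSketch {
    /** Number of bytes in the UTF-8 sequence introduced by firstByte. */
    static int totalBytes(byte firstByte) {
        int b = firstByte & 0xFF;
        if (b < 0x80) return 1;            // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }

    /** UTF-16 code units for one code point: 2 for a 4-byte sequence, else 1. */
    static int unitCount(byte firstByte) {
        return totalBytes(firstByte) == 4 ? 2 : 1;
    }

    /** UTF-16 code units for length bytes of valid UTF-8 starting at offset. */
    static int unitCount(byte[] utf8, int offset, int length) {
        int units = 0;
        int end = offset + length;
        for (int i = offset; i < end; ) {
            units += unitCount(utf8[i]);
            i += totalBytes(utf8[i]);      // jump to the next lead byte
        }
        return units;
    }
}
```

Only four-byte sequences encode code points above U+FFFF, which is why they are the only ones that need a surrogate pair.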
-
calculateBytePositions
Returns an integer array of the same length as the input string plus one. For every index in the array, the corresponding value gives the index into the UTF-8 byte sequence that can be created from the input.
- Parameters:
value- a String to generate UTF-8 byte indexes from
- Returns:
- an array containing corresponding UTF-8 byte indexes
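The mapping from char index to byte index can be built in a single pass by accumulating the encoded size of each char. In this sketch both halves of a surrogate pair map to the start of the same four-byte sequence; that choice, and the name BytePositionsSketch, are assumptions and may not match the library.

```java
final class BytePositionsSketch {
    /** result[i] is the UTF-8 byte index at which char i of value starts. */
    static int[] bytePositions(CharSequence value) {
        int[] positions = new int[value.length() + 1];
        int bytePos = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            positions[i] = bytePos;
            if (c < 0x80) {
                bytePos += 1;
            } else if (c < 0x800) {
                bytePos += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < value.length()) {
                positions[++i] = bytePos;   // low surrogate shares the same start
                bytePos += 4;
            } else {
                bytePos += 3;
            }
        }
        positions[value.length()] = bytePos; // final entry: total byte length
        return positions;
    }
}
```

The extra final entry makes it possible to translate an exclusive end index as well as a start index.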
-
calculateStringPositions
public static int[] calculateStringPositions(byte[] utf8)
Returns an array of the same length as the input array plus one. For every index in the array, the corresponding value gives the index into the Java string (UTF-16 sequence) that can be created from the input.
- Parameters:
utf8- a byte array containing a string encoded as UTF-8. Note: It is strongly assumed that this sequence is correct.
- Returns:
- an array containing corresponding UTF-16 character indexes. If the input array is empty, returns an array containing a single zero.
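The reverse mapping, from byte index to UTF-16 index, can be sketched the same way by walking lead bytes. Here every byte of a multi-byte sequence maps to the index of the char it belongs to; that convention and the name StringPositionsSketch are assumptions, and the input is taken to be valid UTF-8 of at most four bytes per code point.

```java
final class StringPositionsSketch {
    /** result[i] is the UTF-16 index of the character that byte i belongs to. */
    static int[] stringPositions(byte[] utf8) {
        int[] positions = new int[utf8.length + 1];
        int charPos = 0;
        int i = 0;
        while (i < utf8.length) {
            int b = utf8[i] & 0xFF;
            // Sequence length from the lead byte (valid input assumed).
            int n = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            for (int k = 0; k < n && i < utf8.length; k++) {
                positions[i++] = charPos;
            }
            charPos += (n == 4) ? 2 : 1;    // 4-byte sequences become surrogate pairs
        }
        positions[utf8.length] = charPos;   // final entry: total UTF-16 length
        return positions;
    }
}
```

An empty input yields a single zero, in line with the return description above.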
-
encode
public static byte[] encode(int codepoint)
Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a newly allocated array.
- Parameters:
codepoint- Unicode codepoint to encode
- Returns:
- a new array containing the UTF-8 encoding of the codepoint
- Throws:
IndexOutOfBoundsException- if there is insufficient room for the encoded data in the given array
-
encode
public static int encode(int codepoint, byte[] destination, int offset)
Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an array.
- Parameters:
codepoint- Unicode codepoint to encode
destination- array to write into
offset- index of first byte written
- Returns:
- index of the first byte after the last byte written (i.e. offset plus number of bytes written)
- Throws:
IndexOutOfBoundsException- if there is insufficient room for the encoded data in the given array
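The contract above (write at offset, return the index just past the last byte written) can be illustrated with the standard UTF-8 bit layout. This is a sketch for valid code points only; CodePointEncodeSketch is a hypothetical name and not the library's implementation.

```java
final class CodePointEncodeSketch {
    /** Writes codepoint as UTF-8 into destination at offset; returns the next free index. */
    static int encode(int codepoint, byte[] destination, int offset) {
        if (codepoint < 0x80) {
            destination[offset++] = (byte) codepoint;
        } else if (codepoint < 0x800) {
            destination[offset++] = (byte) (0xC0 | (codepoint >> 6));
            destination[offset++] = (byte) (0x80 | (codepoint & 0x3F));
        } else if (codepoint < 0x10000) {
            destination[offset++] = (byte) (0xE0 | (codepoint >> 12));
            destination[offset++] = (byte) (0x80 | ((codepoint >> 6) & 0x3F));
            destination[offset++] = (byte) (0x80 | (codepoint & 0x3F));
        } else {
            destination[offset++] = (byte) (0xF0 | (codepoint >> 18));
            destination[offset++] = (byte) (0x80 | ((codepoint >> 12) & 0x3F));
            destination[offset++] = (byte) (0x80 | ((codepoint >> 6) & 0x3F));
            destination[offset++] = (byte) (0x80 | (codepoint & 0x3F));
        }
        return offset;
    }
}
```

Returning the next index rather than a byte count lets consecutive calls be chained, for example `pos = encode(cp, buf, pos)`. In this sketch an undersized array surfaces as the JVM's own ArrayIndexOutOfBoundsException, possibly after some bytes have been written.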
-
encode
Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a ByteBuffer.
- Parameters:
codepoint- Unicode codepoint to encode
destination- buffer to write into
- Throws:
BufferOverflowException- if the buffer's limit is met while writing (propagated from the ByteBuffer)
ReadOnlyBufferException- if the buffer is read only (propagated from the ByteBuffer)
-
encode
Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an OutputStream.
- Parameters:
codepoint- Unicode codepoint to encode
destination- stream to write into
- Returns:
- number of bytes written
- Throws:
IOException- propagated from stream
-
codePointAsUtf8Length
public static int codePointAsUtf8Length(int codepoint)
Return the number of octets needed to encode a valid Unicode codepoint as UTF-8.
- Parameters:
codepoint- the Unicode codepoint to inspect
- Returns:
- the number of bytes needed for UTF-8 representation
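The length follows directly from the UTF-8 range boundaries. A minimal sketch for valid code points, with the hypothetical name Utf8LengthSketch:

```java
final class Utf8LengthSketch {
    /** Number of UTF-8 octets for a valid Unicode code point. */
    static int utf8Length(int codepoint) {
        if (codepoint < 0x80) return 1;     // U+0000..U+007F
        if (codepoint < 0x800) return 2;    // U+0080..U+07FF
        if (codepoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }
}
```

Such a function is useful for sizing a destination array before encoding into it.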
-