Class Utf8

java.lang.Object
com.google.protobuf.Utf8

final class Utf8 extends Object
A set of low-level, high-performance static utility methods related to the UTF-8 character encoding. This class has no dependencies outside of the core JDK libraries.

There are several variants of UTF-8. The one implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1, which mandates the rejection of "overlong" byte sequences as well as rejection of 3-byte surrogate codepoint byte sequences. Note that the UTF-8 decoder included in Oracle's JDK has been modified to also reject "overlong" byte sequences, but (as of 2011) still accepts 3-byte surrogate codepoint byte sequences.

The byte sequences considered valid by this class are exactly those that can be roundtrip converted to Strings and back to bytes using the UTF-8 charset, without loss:


 Arrays.equals(bytes, new String(bytes, Internal.UTF_8).getBytes(Internal.UTF_8))
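
The round-trip property above can be tried with JDK classes alone. The following is a minimal sketch, not this class's implementation: it assumes `StandardCharsets.UTF_8` as a stand-in for `Internal.UTF_8`, and the class and method names (`RoundTripCheck`, `survivesRoundTrip`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripCheck {
    // True when decoding to a String and re-encoding reproduces the input exactly.
    // Malformed input is replaced by U+FFFD during decoding, so it fails the comparison.
    static boolean survivesRoundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] overlong = {(byte) 0xC0, (byte) 0x80}; // overlong encoding of U+0000
        System.out.println("valid:    " + survivesRoundTrip(valid));
        System.out.println("overlong: " + survivesRoundTrip(overlong));
    }
}
```

This check allocates a String and a second byte array, which is the cost the dedicated validation methods avoid.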
 

See the Unicode Standard,
Table 3-6. UTF-8 Bit Distribution,
Table 3-7. Well Formed UTF-8 Byte Sequences.

This class supports decoding of partial byte sequences, so that the bytes of a complete UTF-8 byte sequence can be stored in multiple segments. Methods typically return MALFORMED if the partial byte sequence is definitely not well-formed, or COMPLETE if it is well-formed in the absence of additional input. If the byte sequence apparently terminated in the middle of a character, they instead return an opaque integer "state" value containing enough information to decode the character when passed to a subsequent invocation of a partial decoding method.
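
The state-passing contract described above can be illustrated with a small, self-contained validator. This is a sketch under stated assumptions, not the library's code: the class name `ChunkedUtf8Check`, the method `feed`, the constant values, and the way the state integer is packed are all hypothetical. The byte ranges follow Table 3-7 of the Unicode Standard.

```java
final class ChunkedUtf8Check {
    static final int COMPLETE = 0;   // assumed value, for this sketch only
    static final int MALFORMED = -1; // assumed value, for this sketch only

    // Hypothetical state layout: bits 0-1 = continuation bytes still needed,
    // bits 8-15 = lowest allowed next byte, bits 16-23 = highest allowed next byte.
    private static int pack(int need, int lo, int hi) {
        return need | (lo << 8) | (hi << 16);
    }

    // Validates bytes[index, limit), resuming from the state of the previous chunk.
    static int feed(int state, byte[] bytes, int index, int limit) {
        if (state == MALFORMED) {
            return MALFORMED;
        }
        int need = state & 3;
        int lo = (state >>> 8) & 0xFF;
        int hi = (state >>> 16) & 0xFF;
        for (int i = index; i < limit; i++) {
            int b = bytes[i] & 0xFF;
            if (need == 0) {
                if (b < 0x80) {
                    continue; // ASCII
                } else if (b >= 0xC2 && b <= 0xDF) {
                    need = 1; lo = 0x80; hi = 0xBF;
                } else if (b == 0xE0) {
                    need = 2; lo = 0xA0; hi = 0xBF; // rejects overlong 3-byte forms
                } else if (b == 0xED) {
                    need = 2; lo = 0x80; hi = 0x9F; // rejects surrogates
                } else if (b >= 0xE1 && b <= 0xEF) {
                    need = 2; lo = 0x80; hi = 0xBF;
                } else if (b == 0xF0) {
                    need = 3; lo = 0x90; hi = 0xBF; // rejects overlong 4-byte forms
                } else if (b >= 0xF1 && b <= 0xF3) {
                    need = 3; lo = 0x80; hi = 0xBF;
                } else if (b == 0xF4) {
                    need = 3; lo = 0x80; hi = 0x8F; // rejects values above U+10FFFF
                } else {
                    return MALFORMED; // 0x80..0xC1 and 0xF5..0xFF cannot start a character
                }
            } else {
                if (b < lo || b > hi) {
                    return MALFORMED;
                }
                need--;
                lo = 0x80;
                hi = 0xBF;
            }
        }
        return need == 0 ? COMPLETE : pack(need, lo, hi);
    }

    public static void main(String[] args) {
        // U+20AC (E2 82 AC) split across two segments.
        int state = feed(COMPLETE, new byte[] {(byte) 0xE2, (byte) 0x82}, 0, 2);
        System.out.println("incomplete: " + (state != COMPLETE && state != MALFORMED));
        state = feed(state, new byte[] {(byte) 0xAC}, 0, 1);
        System.out.println("complete:   " + (state == COMPLETE));
    }
}
```

A caller would treat any final state other than COMPLETE as a failure once the last segment has been processed.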

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    private static class 
    Utility methods for decoding bytes into String.
    (package private) static class 
    A processor of UTF-8 strings, providing methods for checking validity and encoding.
    (package private) static final class 
    Utf8.Processor implementation that does not use any sun.misc.Unsafe methods.
    (package private) static class 
     
    (package private) static final class 
    Utf8.Processor that uses sun.misc.Unsafe where possible to improve performance.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private static final long
    ASCII_MASK_LONG
    A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
    static final int
    COMPLETE
    State value indicating that the byte sequence is well-formed and complete (no further bytes are needed to complete a character).
    static final int
    MALFORMED
    State value indicating that the byte sequence is definitely not well-formed.
    (package private) static final int
    MAX_BYTES_PER_CHAR
    Maximum number of bytes per Java UTF-16 char in UTF-8.
    private static final Utf8.Processor
    processor
    UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform.
    private static final int
    UNSAFE_COUNT_ASCII_THRESHOLD
    Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters.
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    private
    Utf8()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    (package private) static String
    decodeUtf8(byte[] bytes, int index, int size)
    Decodes the given UTF-8 encoded byte array slice into a String.
    (package private) static String
    decodeUtf8(ByteBuffer buffer, int index, int size)
    Decodes the given UTF-8 portion of the ByteBuffer into a String.
    (package private) static int
    encode(CharSequence in, byte[] out, int offset, int length)
     
    (package private) static int
    encodedLength(CharSequence sequence)
    Returns the number of bytes in the UTF-8-encoded form of sequence.
    private static int
    encodedLengthGeneral(CharSequence sequence, int start)
     
    (package private) static void
    encodeUtf8(CharSequence in, ByteBuffer out)
    Encodes the given characters to the target ByteBuffer using UTF-8 encoding.
    private static int
    estimateConsecutiveAscii(ByteBuffer buffer, int index, int limit)
    Counts (approximately) the number of consecutive ASCII characters in the given buffer.
    private static int
    incompleteStateFor(byte[] bytes, int index, int limit)
     
    private static int
    incompleteStateFor(int byte1)
     
    private static int
    incompleteStateFor(int byte1, int byte2)
     
    private static int
    incompleteStateFor(int byte1, int byte2, int byte3)
     
    private static int
    incompleteStateFor(ByteBuffer buffer, int byte1, int index, int remaining)
     
    static boolean
    isValidUtf8(byte[] bytes)
    Returns true if the given byte array is a well-formed UTF-8 byte sequence.
    static boolean
    isValidUtf8(byte[] bytes, int index, int limit)
    Returns true if the given byte array slice is a well-formed UTF-8 byte sequence.
    (package private) static boolean
    isValidUtf8(ByteBuffer buffer)
    Determines if the given ByteBuffer is a valid UTF-8 string.
    static int
    partialIsValidUtf8(int state, byte[] bytes, int index, int limit)
    Tells whether the given byte array slice is a well-formed, malformed, or incomplete UTF-8 byte sequence.
    (package private) static int
    partialIsValidUtf8(int state, ByteBuffer buffer, int index, int limit)
    Determines if the given ByteBuffer is a partially valid UTF-8 string.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • processor

      private static final Utf8.Processor processor
      UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform. The processor is the platform-optimized delegate to which all methods delegate directly.
    • ASCII_MASK_LONG

      private static final long ASCII_MASK_LONG
      A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
    • MAX_BYTES_PER_CHAR

      static final int MAX_BYTES_PER_CHAR
      Maximum number of bytes per Java UTF-16 char in UTF-8.
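
A bound like this is typically used to size an output buffer before encoding. The sketch below assumes the bound is 3, which holds for UTF-8 in general: a BMP char needs at most 3 bytes, and a surrogate pair needs 4 bytes for 2 chars. The class `WorstCaseSizing` and its method are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class WorstCaseSizing {
    // Assumed value for this sketch: one UTF-16 code unit never needs more than 3 UTF-8 bytes.
    static final int MAX_BYTES_PER_CHAR = 3;

    // Upper bound on the encoded size; throws ArithmeticException on int overflow.
    static int worstCaseBytes(CharSequence s) {
        return Math.multiplyExact(s.length(), MAX_BYTES_PER_CHAR);
    }

    public static void main(String[] args) {
        String[] samples = {"abc", "\u20ac", "\ud83d\ude00"};
        for (String s : samples) {
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(actual + " <= " + worstCaseBytes(s));
        }
    }
}
```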
    • COMPLETE

      public static final int COMPLETE
      State value indicating that the byte sequence is well-formed and complete (no further bytes are needed to complete a character).
    • MALFORMED

      public static final int MALFORMED
      State value indicating that the byte sequence is definitely not well-formed.
    • UNSAFE_COUNT_ASCII_THRESHOLD

      private static final int UNSAFE_COUNT_ASCII_THRESHOLD
      Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters. The reason for this threshold is that for small strings, the optimization may not be beneficial or may even negatively impact performance since it requires additional logic to avoid unaligned reads (when calling Unsafe.getLong). This threshold guarantees that, even if the initial offset is unaligned, at least one call to Unsafe.getLong() is made, which provides a performance improvement that entirely subsumes the cost of the additional logic.
  • Constructor Details

    • Utf8

      private Utf8()
  • Method Details

    • isValidUtf8

      public static boolean isValidUtf8(byte[] bytes)
      Returns true if the given byte array is a well-formed UTF-8 byte sequence.

      This is a convenience method, equivalent to a call to isValidUtf8(bytes, 0, bytes.length).

    • isValidUtf8

      public static boolean isValidUtf8(byte[] bytes, int index, int limit)
      Returns true if the given byte array slice is a well-formed UTF-8 byte sequence. The range of bytes to be checked extends from index index, inclusive, to limit, exclusive.

      This is a convenience method, equivalent to partialIsValidUtf8(Utf8.COMPLETE, bytes, index, limit) == Utf8.COMPLETE.

    • partialIsValidUtf8

      public static int partialIsValidUtf8(int state, byte[] bytes, int index, int limit)
      Tells whether the given byte array slice is a well-formed, malformed, or incomplete UTF-8 byte sequence. The range of bytes to be checked extends from index index, inclusive, to limit, exclusive.
      Parameters:
      state - either COMPLETE (if this is the initial decoding operation) or the value returned from a call to a partial decoding method for the previous bytes
      Returns:
      MALFORMED if the partial byte sequence is definitely not well-formed, or COMPLETE if it is well-formed (no additional input needed). If the byte sequence is "incomplete", i.e. apparently terminated in the middle of a character, returns an opaque integer "state" value containing enough information to decode the character when passed to a subsequent invocation of a partial decoding method.
    • incompleteStateFor

      private static int incompleteStateFor(int byte1)
    • incompleteStateFor

      private static int incompleteStateFor(int byte1, int byte2)
    • incompleteStateFor

      private static int incompleteStateFor(int byte1, int byte2, int byte3)
    • incompleteStateFor

      private static int incompleteStateFor(byte[] bytes, int index, int limit)
    • incompleteStateFor

      private static int incompleteStateFor(ByteBuffer buffer, int byte1, int index, int remaining)
    • encodedLength

      static int encodedLength(CharSequence sequence)
      Returns the number of bytes in the UTF-8-encoded form of sequence. For a string, this method is equivalent to string.getBytes(UTF_8).length, but is more efficient in both time and space.
      Throws:
      IllegalArgumentException - if sequence contains ill-formed UTF-16 (unpaired surrogates)
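
The length can be computed from the UTF-16 code units alone, without building the encoded bytes. The following is a minimal sketch of that idea, assuming nothing about how this class implements it; `Utf8LengthSketch` is a hypothetical name, and unlike a production version it does not guard against int overflow for very long inputs.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthSketch {
    // Counts UTF-8 bytes per UTF-16 code unit; a valid surrogate pair counts as 4 bytes.
    static int encodedLength(CharSequence sequence) {
        int bytes = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (c < 0x80) {
                bytes += 1;
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isSurrogate(c)) {
                boolean paired = Character.isHighSurrogate(c)
                        && i + 1 < sequence.length()
                        && Character.isLowSurrogate(sequence.charAt(i + 1));
                if (!paired) {
                    throw new IllegalArgumentException("Unpaired surrogate at index " + i);
                }
                bytes += 4;
                i++; // skip the low surrogate
            } else {
                bytes += 3;
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9\u20ac\ud83d\ude00";
        System.out.println(encodedLength(s) == s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```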
    • encodedLengthGeneral

      private static int encodedLengthGeneral(CharSequence sequence, int start)
    • encode

      static int encode(CharSequence in, byte[] out, int offset, int length)
    • isValidUtf8

      static boolean isValidUtf8(ByteBuffer buffer)
      Determines if the given ByteBuffer is a valid UTF-8 string.

      Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.

      Parameters:
      buffer - the buffer to check.
    • partialIsValidUtf8

      static int partialIsValidUtf8(int state, ByteBuffer buffer, int index, int limit)
      Determines if the given ByteBuffer is a partially valid UTF-8 string.

      Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.

      Parameters:
      buffer - the buffer to check.
    • decodeUtf8

      static String decodeUtf8(ByteBuffer buffer, int index, int size) throws InvalidProtocolBufferException
      Decodes the given UTF-8 portion of the ByteBuffer into a String.
      Throws:
      InvalidProtocolBufferException - if the input is not valid UTF-8.
    • decodeUtf8

      static String decodeUtf8(byte[] bytes, int index, int size) throws InvalidProtocolBufferException
      Decodes the given UTF-8 encoded byte array slice into a String.
      Throws:
      InvalidProtocolBufferException - if the input is not valid UTF-8.
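
Strict decoding of a slice, failing on invalid input instead of substituting U+FFFD, can be approximated with the JDK's `CharsetDecoder`. This sketch is a stand-in, not this class's decoder: `StrictDecodeSketch` and `decodeStrict` are hypothetical names, and `CharacterCodingException` takes the place of `InvalidProtocolBufferException`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeSketch {
    // Decodes bytes[index, index + size) and throws if the slice is not valid UTF-8.
    static String decodeStrict(byte[] bytes, int index, int size)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes, index, size))
                .toString();
    }

    public static void main(String[] args) {
        byte[] data = {'a', 'b', (byte) 0xC3, (byte) 0xA9, 'c'};
        try {
            System.out.println(decodeStrict(data, 1, 3));
            decodeStrict(new byte[] {(byte) 0xC0, (byte) 0x80}, 0, 2);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```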
    • encodeUtf8

      static void encodeUtf8(CharSequence in, ByteBuffer out)
      Encodes the given characters to the target ByteBuffer using UTF-8 encoding.

      Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.

      Parameters:
      in - the source string to be encoded
      out - the target buffer to receive the encoded string.
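
For comparison, the same operation can be written against the JDK's `CharsetEncoder`, which works for both heap and direct buffers. This is only a sketch: `EncodeIntoBufferSketch` is a hypothetical name, and the exceptions thrown here are choices made for the example, not the documented behavior of encodeUtf8.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class EncodeIntoBufferSketch {
    // Writes the UTF-8 form of `in` at the buffer's current position.
    static void encodeInto(CharSequence in, ByteBuffer out) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CoderResult result = encoder.encode(CharBuffer.wrap(in), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (result.isOverflow()) {
            throw new BufferOverflowException(); // target buffer too small
        }
        if (result.isError()) {
            throw new IllegalArgumentException("Ill-formed UTF-16 input");
        }
    }

    public static void main(String[] args) {
        ByteBuffer out = ByteBuffer.allocateDirect(16);
        encodeInto("h\u00e9", out);
        out.flip();
        System.out.println("bytes written: " + out.remaining());
    }
}
```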
    • estimateConsecutiveAscii

      private static int estimateConsecutiveAscii(ByteBuffer buffer, int index, int limit)
      Counts (approximately) the number of consecutive ASCII characters in the given buffer. The byte order of the ByteBuffer does not matter, so performance can be improved if native byte order is used (i.e. no byte-swapping in ByteBuffer.getLong(int)).
      Parameters:
      buffer - the buffer to be scanned for ASCII chars
      index - the starting index of the scan
      limit - the limit within buffer for the scan
      Returns:
      the number of ASCII characters found. The stopping position will be at or before the first non-ASCII byte.
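
The approximate count can be understood as a scan in 8-byte blocks: each block is read as a long and tested against a mask with the high bit of every byte set. The sketch below assumes a mask value of 0x8080808080808080L (the document describes the mask but does not state its value) and uses the hypothetical class name `AsciiRunSketch`. The result is a multiple of 8, which is why it is an estimate: trailing ASCII bytes in a partial or mixed block are not counted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class AsciiRunSketch {
    // Assumed mask: the high bit of each of the 8 bytes in a long.
    static final long ASCII_MASK_LONG = 0x8080808080808080L;

    // Returns a lower bound on the ASCII run starting at index, in whole 8-byte blocks.
    static int estimateConsecutiveAscii(ByteBuffer buffer, int index, int limit) {
        int i = index;
        int lastBlockStart = limit - 7; // exclusive bound for a full 8-byte read
        // The mask test does not depend on byte order, since every byte is checked alike.
        while (i < lastBlockStart && (buffer.getLong(i) & ASCII_MASK_LONG) == 0) {
            i += 8;
        }
        return i - index;
    }

    public static void main(String[] args) {
        byte[] data = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buffer = ByteBuffer.wrap(data);
        System.out.println(estimateConsecutiveAscii(buffer, 0, data.length));
    }
}
```

A caller would continue with a byte-by-byte check from the returned position.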