Class UTF8


  • class UTF8
    extends java.lang.Object
    Partial Java port of ICU4C unicode/utf8.h and ustr_imp.h.
    • Constructor Summary

      Constructors 
      Constructor Description
      UTF8()  
    • Method Summary

      Modifier and Type Method Description
      (package private) static int countBytes(byte leadByte)
      Counts the bytes of any whole valid sequence for a UTF-8 lead byte.
      (package private) static int countTrailBytes(byte leadByte)
      Counts the trail bytes for a UTF-8 lead byte.
      (package private) static boolean isLead(byte c)
      Is this code unit (byte) a UTF-8 lead byte?
      (package private) static boolean isSingle(byte c)
      Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
      (package private) static boolean isTrail(byte c)
      Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
      (package private) static boolean isValidLead3AndT1(int lead, byte t1)
      Internal 3-byte UTF-8 validity check.
      (package private) static boolean isValidLead4AndT1(int lead, byte t1)
      Internal 4-byte UTF-8 validity check.
      (package private) static boolean isValidTrail(int prev, byte t, int i, int length)
      Is t a valid UTF-8 trail byte?
      (package private) static int length(int c)
      How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • U8_LEAD3_T1_BITS

        private static final int[] U8_LEAD3_T1_BITS
        Internal bit vector for the 3-byte UTF-8 validity check, for use in isValidLead3AndT1(int, byte). Each bit indicates whether one pair of lead byte and first trail byte starts a valid sequence. Bits 3..0 of the lead byte (E0..EF) select the int in the array; bits 7..5 of the first trail byte select the bit within that int.
        See Also:
        isValidLead3AndT1(int, byte)
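
        The lookup described above can be sketched as follows. The contents of the private array are not shown on this page, so the table below is an assumption: it is rebuilt from the well-formed UTF-8 byte ranges of the Unicode Standard (E0 allows A0..BF, ED allows 80..9F, every other lead in E1..EF allows 80..BF). The class name Lead3BitsSketch is hypothetical.

```java
final class Lead3BitsSketch {
    // Assumed table: index = lead & 0xF, bit number = top three bits of the first trail byte.
    // 0x20 = bit 5 (A0..BF), 0x10 = bit 4 (80..9F), 0x30 = bits 4 and 5 (80..BF).
    private static final int[] LEAD3_T1_BITS = {
        0x20, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30,
        0x30, 0x30, 0x30, 0x30, 0x30, 0x10, 0x30, 0x30
    };

    static boolean isValidLead3AndT1(int lead, byte t1) {
        int row = LEAD3_T1_BITS[lead & 0xF];
        int bit = (t1 & 0xFF) >> 5; // mask first: Java bytes are signed
        return (row & (1 << bit)) != 0;
    }

    public static void main(String[] args) {
        // ED A0 would start an encoded surrogate, E0 80 an overlong form.
        System.out.println(isValidLead3AndT1(0xED, (byte) 0xA0));
        System.out.println(isValidLead3AndT1(0xE0, (byte) 0x80));
    }
}
```

        A single table lookup and bit test replaces the per-lead range comparisons, which is presumably why the bit vector exists.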
      • U8_LEAD4_T1_BITS

        private static final int[] U8_LEAD4_T1_BITS
        Internal bit vector for the 4-byte UTF-8 validity check, for use in isValidLead4AndT1(int, byte). Each bit indicates whether one pair of lead byte and first trail byte starts a valid sequence. Bits 2..0 of the lead byte (F0..F4) select the int in the array; bits 7..4 of the first trail byte select the bit within that int.
        See Also:
        isValidLead4AndT1(int, byte)
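
        A minimal sketch of a table laid out as described above. The real array contents are not shown on this page; the values here are an assumption derived from the well-formed UTF-8 byte ranges of the Unicode Standard (F0 allows 90..BF, F1..F3 allow 80..BF, F4 allows 80..8F). The class name Lead4BitsSketch is hypothetical.

```java
final class Lead4BitsSketch {
    // Assumed table: index = lead & 7, bit number = high nibble of the first trail byte.
    // 0x0E00 = nibbles 9..B, 0x0F00 = nibbles 8..B, 0x0100 = nibble 8 only.
    private static final int[] LEAD4_T1_BITS = {
        0x0E00, 0x0F00, 0x0F00, 0x0F00, 0x0100, 0, 0, 0
    };

    static boolean isValidLead4AndT1(int lead, byte t1) {
        int row = LEAD4_T1_BITS[lead & 7];
        int bit = (t1 & 0xFF) >> 4; // mask first: Java bytes are signed
        return (row & (1 << bit)) != 0;
    }

    public static void main(String[] args) {
        // F4 90 would encode a value above U+10FFFF, F0 80 an overlong form.
        System.out.println(isValidLead4AndT1(0xF4, (byte) 0x90));
        System.out.println(isValidLead4AndT1(0xF0, (byte) 0x80));
    }
}
```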
      • MAX_LENGTH

        static int MAX_LENGTH
        Value 4: the maximum number of UTF-8 code units (bytes) per Unicode code point (U+0000..U+10FFFF).
    • Constructor Detail

      • UTF8

        UTF8()
    • Method Detail

      • countTrailBytes

        static int countTrailBytes(byte leadByte)
        Counts the trail bytes for a UTF-8 lead byte. Returns 0 for 0..0xc1 as well as for 0xf5..0xff.
        Parameters:
        leadByte - The first byte of a UTF-8 sequence, interpreted as an unsigned value 0..0xff.
        Returns:
        0..3
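
        The documented return values can be reproduced with plain range comparisons. This is a sketch of the contract only, not necessarily how the method is implemented; the class name CountTrailBytesSketch is hypothetical.

```java
final class CountTrailBytesSketch {
    static int countTrailBytes(byte leadByte) {
        int b = leadByte & 0xFF;            // view the signed Java byte as 0..0xFF
        if (b < 0xC2 || b > 0xF4) return 0; // ASCII, trail bytes, C0/C1, F5..FF
        if (b < 0xE0) return 1;             // C2..DF
        if (b < 0xF0) return 2;             // E0..EF
        return 3;                           // F0..F4
    }

    public static void main(String[] args) {
        System.out.println(countTrailBytes((byte) 0xE2));
    }
}
```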
      • countBytes

        static int countBytes(byte leadByte)
        Counts the bytes of any whole valid sequence for a UTF-8 lead byte. Returns 1 for ASCII 0..0x7f. Returns 0 for 0x80..0xc1 as well as for 0xf5..0xff.
        Parameters:
        leadByte - The first byte of a UTF-8 sequence, interpreted as an unsigned value 0..0xff.
        Returns:
        0..4
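
        One way such a count might be used is to step through a byte array one sequence at a time. The sketch below re-creates the documented return values and adds a hypothetical helper, countCodePoints, that is not part of this class; it advances by one byte whenever the count is 0 and does not validate trail bytes.

```java
import java.nio.charset.StandardCharsets;

final class CountBytesSketch {
    static int countBytes(byte leadByte) {
        int b = leadByte & 0xFF;            // view the signed Java byte as 0..0xFF
        if (b < 0x80) return 1;             // ASCII
        if (b < 0xC2 || b > 0xF4) return 0; // cannot start a valid sequence
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }

    // Hypothetical helper: counts sequences by lead byte only.
    static int countCodePoints(byte[] utf8) {
        int n = 0;
        int i = 0;
        while (i < utf8.length) {
            int len = countBytes(utf8[i]);
            i += Math.max(len, 1); // skip one byte if it cannot start a sequence
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u00E9\u20AC".getBytes(StandardCharsets.UTF_8);
        System.out.println(countCodePoints(bytes));
    }
}
```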
      • isValidLead3AndT1

        static boolean isValidLead3AndT1(int lead,
                                         byte t1)
        Internal 3-byte UTF-8 validity check.
        Parameters:
        lead - E0..EF
        t1 - 00..FF
        Returns:
        true if lead byte E0..EF and first trail byte 00..FF start a valid sequence.
      • isValidLead4AndT1

        static boolean isValidLead4AndT1(int lead,
                                         byte t1)
        Internal 4-byte UTF-8 validity check.
        Parameters:
        lead - F0..F4
        t1 - 00..FF
        Returns:
        true if lead byte F0..F4 and first trail byte 00..FF start a valid sequence.
      • isSingle

        static boolean isSingle(byte c)
        Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
        Parameters:
        c - 8-bit code unit (byte)
        Returns:
        true if c is an ASCII byte
      • isLead

        static boolean isLead(byte c)
        Is this code unit (byte) a UTF-8 lead byte?
        Parameters:
        c - 8-bit code unit (byte)
        Returns:
        true if c is a lead byte
      • isTrail

        static boolean isTrail(byte c)
        Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
        Parameters:
        c - 8-bit code unit (byte)
        Returns:
        true if c is a trail byte
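
        The three byte-classification predicates (isSingle, isLead, isTrail) can be sketched as below. The ranges for isSingle and isTrail come from the descriptions above. This page does not state the lead-byte range, so 0xC2..0xF4 is an assumption, chosen to match the lead bytes for which countBytes returns a non-zero multi-byte length. The class name ByteClassSketch is hypothetical.

```java
final class ByteClassSketch {
    // 0x00..0x7F are exactly the non-negative values of a signed Java byte.
    static boolean isSingle(byte c) {
        return c >= 0;
    }

    // Assumed range: 0xC2..0xF4.
    static boolean isLead(byte c) {
        int b = c & 0xFF;
        return 0xC2 <= b && b <= 0xF4;
    }

    // Trail bytes have the bit pattern 10xxxxxx, that is 0x80..0xBF.
    static boolean isTrail(byte c) {
        return (c & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xA9;
        System.out.println(isSingle(b) + " " + isLead(b) + " " + isTrail(b));
    }
}
```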
      • length

        static int length(int c)
        How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?
        Parameters:
        c - 32-bit code point
        Returns:
        1..4, or 0 if c is a surrogate or not a Unicode code point
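
        The documented result follows directly from the UTF-8 code point ranges. A minimal sketch of that mapping, with a hypothetical class name LengthSketch; treating negative input as "not a Unicode code point" is an assumption.

```java
final class LengthSketch {
    static int length(int c) {
        if (c < 0) return 0;                       // assumed: negative is not a code point
        if (c <= 0x7F) return 1;                   // U+0000..U+007F
        if (c <= 0x7FF) return 2;                  // U+0080..U+07FF
        if (c <= 0xD7FF) return 3;                 // U+0800..U+D7FF
        if (c <= 0xDFFF || c > 0x10FFFF) return 0; // surrogate or out of range
        if (c <= 0xFFFF) return 3;                 // U+E000..U+FFFF
        return 4;                                  // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        System.out.println(length(0x20AC));
    }
}
```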
      • isValidTrail

        static boolean isValidTrail(int prev,
                                    byte t,
                                    int i,
                                    int length)
        Is t a valid UTF-8 trail byte?
        Parameters:
        prev - Must be the preceding lead byte if i==1 and length>=3; otherwise ignored.
        t - The i-th byte following the lead byte.
        i - The index (1..3) of byte t in the byte sequence. 0 < i < length
        length - The length (2..4) of the byte sequence according to the lead byte.
        Returns:
        true if t is a valid trail byte in this context.
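
        The parameter contract suggests that only the first trail byte of a 3- or 4-byte sequence depends on the lead byte; every other trail byte just has to lie in 0x80..0xBF. The sketch below illustrates that reading with explicit range comparisons instead of the bit vectors, so it is an assumption about behavior and not the class's own code. The class name ValidTrailSketch is hypothetical.

```java
final class ValidTrailSketch {
    static boolean isTrail(byte t) {
        return (t & 0xC0) == 0x80; // 0x80..0xBF
    }

    static boolean isValidTrail(int prev, byte t, int i, int length) {
        // prev is ignored unless t is the first trail byte of a 3- or 4-byte sequence.
        if (length <= 2 || i > 1) {
            return isTrail(t);
        }
        int b = t & 0xFF;
        switch (prev & 0xFF) {
            case 0xE0: return 0xA0 <= b && b <= 0xBF; // excludes overlong 3-byte forms
            case 0xED: return 0x80 <= b && b <= 0x9F; // excludes surrogates
            case 0xF0: return 0x90 <= b && b <= 0xBF; // excludes overlong 4-byte forms
            case 0xF4: return 0x80 <= b && b <= 0x8F; // excludes values above U+10FFFF
            default:   return isTrail(t);
        }
    }

    public static void main(String[] args) {
        // Second byte of E2 82 AC (U+20AC).
        System.out.println(isValidTrail(0xE2, (byte) 0x82, 1, 3));
    }
}
```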