Package com.ibm.icu.charset
Class UTF8
- java.lang.Object
-
- com.ibm.icu.charset.UTF8
-
class UTF8 extends java.lang.Object
Partial Java port of ICU4C unicode/utf8.h and ustr_imp.h.
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
(package private) static int MAX_LENGTH
    4: The maximum number of UTF-8 code units (bytes) per Unicode code point (U+0000..U+10ffff).
private static int[] U8_LEAD3_T1_BITS
    Internal bit vector for 3-byte UTF-8 validity check, for use in isValidLead3AndT1(int, byte).
private static int[] U8_LEAD4_T1_BITS
    Internal bit vector for 4-byte UTF-8 validity check, for use in isValidLead4AndT1(int, byte).
-
Constructor Summary
Constructors (Constructor, Description)
UTF8()
-
Method Summary
Methods (Modifier and Type, Method, Description)
(package private) static int countBytes(byte leadByte)
    Counts the bytes of any whole valid sequence for a UTF-8 lead byte.
(package private) static int countTrailBytes(byte leadByte)
    Counts the trail bytes for a UTF-8 lead byte.
(package private) static boolean isLead(byte c)
    Is this code unit (byte) a UTF-8 lead byte?
(package private) static boolean isSingle(byte c)
    Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
(package private) static boolean isTrail(byte c)
    Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
(package private) static boolean isValidLead3AndT1(int lead, byte t1)
    Internal 3-byte UTF-8 validity check.
(package private) static boolean isValidLead4AndT1(int lead, byte t1)
    Internal 4-byte UTF-8 validity check.
(package private) static boolean isValidTrail(int prev, byte t, int i, int length)
    Is t a valid UTF-8 trail byte?
(package private) static int length(int c)
    How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?
-
-
-
Field Detail
-
U8_LEAD3_T1_BITS
private static final int[] U8_LEAD3_T1_BITS
Internal bit vector for 3-byte UTF-8 validity check, for use in isValidLead3AndT1(int, byte). Each bit indicates whether one lead byte + first trail byte pair starts a valid sequence. Lead byte E0..EF bits 3..0 are used as data int index, first trail byte bits 7..5 are used as bit index into that int.
See Also:
isValidLead3AndT1(int, byte)
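The indexing scheme described for U8_LEAD3_T1_BITS can be modeled with a small stand-alone sketch. The class name Lead3BitsSketch is hypothetical, and the table contents are derived here from the UTF-8 well-formedness rules (E0 requires a first trail byte of A0..BF, ED requires 80..9F, all other 3-byte leads accept 80..BF) rather than copied from the ICU4J source, so treat the values as an assumption.

```java
// Sketch only: a stand-alone model of the lookup described above, not the ICU4J source.
final class Lead3BitsSketch {
    // One int per lead byte E0..EF, indexed by the lead byte's bits 3..0.
    // Bit n is set when a first trail byte whose bits 7..5 equal n is allowed.
    static final int[] LEAD3_T1_BITS = new int[16];

    static {
        for (int lead = 0xE0; lead <= 0xEF; lead++) {
            int lo = (lead == 0xE0) ? 0xA0 : 0x80; // E0 80..9F would be an overlong form
            int hi = (lead == 0xED) ? 0x9F : 0xBF; // ED A0..BF would encode a surrogate
            for (int t1 = lo; t1 <= hi; t1++) {
                LEAD3_T1_BITS[lead & 0xF] |= 1 << (t1 >> 5);
            }
        }
    }

    static boolean isValidLead3AndT1(int lead, byte t1) {
        // Java bytes are signed, so mask to 0..255 before shifting.
        int bitIndex = (t1 & 0xFF) >> 5;
        return (LEAD3_T1_BITS[lead & 0xF] & (1 << bitIndex)) != 0;
    }
}
```

Three bits of the trail byte are enough here because every allowed range (80..9F, A0..BF, or both) is aligned to a block of 32 byte values.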
-
U8_LEAD4_T1_BITS
private static final int[] U8_LEAD4_T1_BITS
Internal bit vector for 4-byte UTF-8 validity check, for use in isValidLead4AndT1(int, byte). Each bit indicates whether one lead byte + first trail byte pair starts a valid sequence. Lead byte F0..F4 bits 2..0 are used as data int index, first trail byte bits 7..4 are used as bit index into that int.
See Also:
isValidLead4AndT1(int, byte)
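A comparable sketch for the 4-byte case follows the index layout exactly as this page words it (lead byte bits 2..0 select the int, first trail byte bits 7..4 select the bit). The class name Lead4BitsSketch is hypothetical, the table values are derived from the UTF-8 well-formedness rules (F0 requires 90..BF, F4 requires 80..8F, F1..F3 accept 80..BF), and the actual table layout in the library source may differ.

```java
// Sketch only: models the lookup as documented on this page, not the ICU4J source.
final class Lead4BitsSketch {
    // One int per lead byte F0..F4, indexed by the lead byte's bits 2..0.
    // Bit n is set when a first trail byte whose bits 7..4 equal n is allowed.
    static final int[] LEAD4_T1_BITS = new int[8];

    static {
        for (int lead = 0xF0; lead <= 0xF4; lead++) {
            int lo = (lead == 0xF0) ? 0x90 : 0x80; // F0 80..8F would be an overlong form
            int hi = (lead == 0xF4) ? 0x8F : 0xBF; // F4 90..BF would exceed U+10FFFF
            for (int t1 = lo; t1 <= hi; t1++) {
                LEAD4_T1_BITS[lead & 7] |= 1 << (t1 >> 4);
            }
        }
    }

    static boolean isValidLead4AndT1(int lead, byte t1) {
        // Java bytes are signed, so mask to 0..255 before shifting.
        int bitIndex = (t1 & 0xFF) >> 4;
        return (LEAD4_T1_BITS[lead & 7] & (1 << bitIndex)) != 0;
    }
}
```

Four trail-byte bits are needed here because the restricted ranges (80..8F and 90..BF) split at a 16-value boundary.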
-
MAX_LENGTH
static int MAX_LENGTH
4: The maximum number of UTF-8 code units (bytes) per Unicode code point (U+0000..U+10ffff).
-
-
Method Detail
-
countTrailBytes
static int countTrailBytes(byte leadByte)
Counts the trail bytes for a UTF-8 lead byte. Returns 0 for 0..0xc1 as well as for 0xf5..0xff.
Parameters:
leadByte - The first byte of a UTF-8 sequence. Must be 0..0xff.
Returns:
0..3
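The documented return values can be illustrated with plain range comparisons. This is a minimal sketch under a hypothetical class name (TrailCountSketch); the real method may well use a table or bit tricks instead. The byte is masked first because Java bytes are signed, while the documented range 0..0xff is unsigned.

```java
// Sketch only: reproduces the documented results with range checks.
final class TrailCountSketch {
    static int countTrailBytes(byte leadByte) {
        int b = leadByte & 0xFF;              // view the signed byte as 0..255
        if (b < 0xC2 || b > 0xF4) return 0;   // ASCII, trail bytes, C0/C1, F5..FF
        if (b < 0xE0) return 1;               // C2..DF start a 2-byte sequence
        if (b < 0xF0) return 2;               // E0..EF start a 3-byte sequence
        return 3;                             // F0..F4 start a 4-byte sequence
    }
}
```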
-
countBytes
static int countBytes(byte leadByte)
Counts the bytes of any whole valid sequence for a UTF-8 lead byte. Returns 1 for ASCII 0..0x7f. Returns 0 for 0x80..0xc1 as well as for 0xf5..0xff.
Parameters:
leadByte - The first byte of a UTF-8 sequence. Must be 0..0xff.
Returns:
0..4
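One way such a count can be used is to step through well-formed UTF-8 from lead byte to lead byte. The sketch below is an illustration only: CountBytesSketch and countSequences are hypothetical names, the range checks stand in for whatever the real method does, and the loop assumes the input is well-formed apart from the lead-byte check.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: a stand-in for the documented behavior plus a small usage demo.
final class CountBytesSketch {
    static int countBytes(byte leadByte) {
        int b = leadByte & 0xFF;              // view the signed byte as 0..255
        if (b <= 0x7F) return 1;              // ASCII encodes itself
        if (b < 0xC2 || b > 0xF4) return 0;   // cannot start a valid sequence
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }

    // Counts sequences by jumping over each one; returns -1 if a byte in
    // lead position cannot start a sequence. Trail bytes are not validated.
    static int countSequences(byte[] utf8) {
        int n = 0;
        for (int i = 0; i < utf8.length; n++) {
            int len = countBytes(utf8[i]);
            if (len == 0) return -1;
            i += len;
        }
        return n;
    }

    public static void main(String[] args) {
        // 'h' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
        byte[] bytes = "h\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(countSequences(bytes)); // prints 4
    }
}
```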
-
isValidLead3AndT1
static boolean isValidLead3AndT1(int lead, byte t1)
Internal 3-byte UTF-8 validity check.
Parameters:
lead - E0..EF
t1 - 00..FF
Returns:
true if lead byte E0..EF and first trail byte 00..FF start a valid sequence.
-
isValidLead4AndT1
static boolean isValidLead4AndT1(int lead, byte t1)
Internal 4-byte UTF-8 validity check.
Parameters:
lead - F0..F4
t1 - 00..FF
Returns:
true if lead byte F0..F4 and first trail byte 00..FF start a valid sequence.
-
isSingle
static boolean isSingle(byte c)
Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
Parameters:
c - 8-bit code unit (byte)
Returns:
true if c is an ASCII byte
-
isLead
static boolean isLead(byte c)
Is this code unit (byte) a UTF-8 lead byte?
Parameters:
c - 8-bit code unit (byte)
Returns:
true if c is a lead byte
-
isTrail
static boolean isTrail(byte c)
Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
Parameters:
c - 8-bit code unit (byte)
Returns:
true if c is a trail byte
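The three byte classifiers (isSingle, isLead, isTrail) can be sketched together. The class name ByteClassSketch is hypothetical. The lead-byte range C2..F4 is an assumption: this page does not state it, but it matches the range for which countTrailBytes is documented to return a nonzero count.

```java
// Sketch only: byte classification by range, not the ICU4J source.
final class ByteClassSketch {
    // 0..0x7F are exactly the non-negative values of a signed Java byte.
    static boolean isSingle(byte c) {
        return c >= 0;
    }

    // Assumption: lead bytes are C2..F4 (C0, C1, and F5..FF never start a valid sequence).
    static boolean isLead(byte c) {
        int b = c & 0xFF;
        return b >= 0xC2 && b <= 0xF4;
    }

    // Trail bytes have the bit pattern 10xxxxxx, i.e. 0x80..0xBF.
    static boolean isTrail(byte c) {
        return (c & 0xC0) == 0x80;
    }
}
```

Under this assumption the three predicates are mutually exclusive, and the bytes C0, C1, and F5..FF satisfy none of them.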
-
length
static int length(int c)
How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?
Parameters:
c - 32-bit code point
Returns:
1..4, or 0 if c is a surrogate or not a Unicode code point
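The documented result can be reproduced with the standard UTF-8 length boundaries. This is a minimal sketch under a hypothetical class name (LengthSketch); treating negative values as "not a Unicode code point" is an assumption about how a signed Java int should be handled.

```java
// Sketch only: UTF-8 encoded length of a code point, by range.
final class LengthSketch {
    static int length(int c) {
        if (c < 0) return 0;                            // assumption: negative is not a code point
        if (c <= 0x7F) return 1;                        // U+0000..U+007F
        if (c <= 0x7FF) return 2;                       // U+0080..U+07FF
        if (c <= 0xD7FF) return 3;                      // U+0800..U+D7FF
        if (c <= 0xDFFF || c > 0x10FFFF) return 0;      // surrogates, or beyond Unicode
        if (c <= 0xFFFF) return 3;                      // U+E000..U+FFFF
        return 4;                                       // U+10000..U+10FFFF
    }
}
```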
-
isValidTrail
static boolean isValidTrail(int prev, byte t, int i, int length)
Is t a valid UTF-8 trail byte?
Parameters:
prev - Must be the preceding lead byte if i==1 and length>=3; otherwise ignored.
t - The i-th byte following the lead byte.
i - The index (1..3) of byte t in the byte sequence. 0 < i < length
length - The length (2..4) of the byte sequence according to the lead byte.
Returns:
true if t is a valid trail byte in this context.
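The role of prev can be illustrated with a sketch: only the first trail byte of a 3- or 4-byte sequence depends on the lead byte, and every other position needs just the generic 0x80..0xBF check. The class name ValidTrailSketch is hypothetical, and explicit range comparisons stand in here for the bit-vector lookups the class documents. The sketch assumes the caller passes a lead byte consistent with length.

```java
// Sketch only: context-sensitive trail byte check using range comparisons.
final class ValidTrailSketch {
    static boolean isTrail(byte t) {
        return (t & 0xC0) == 0x80;            // 10xxxxxx, i.e. 0x80..0xBF
    }

    static boolean isValidTrail(int prev, byte t, int i, int length) {
        if (i != 1 || length < 3) {
            return isTrail(t);                // prev is ignored in this case
        }
        int b = t & 0xFF;                     // view the signed byte as 0..255
        switch (prev & 0xFF) {
            case 0xE0: return b >= 0xA0 && b <= 0xBF;  // excludes overlong 3-byte forms
            case 0xED: return b >= 0x80 && b <= 0x9F;  // excludes surrogates
            case 0xF0: return b >= 0x90 && b <= 0xBF;  // excludes overlong 4-byte forms
            case 0xF4: return b >= 0x80 && b <= 0x8F;  // excludes values above U+10FFFF
            default:   return isTrail(t);              // other leads accept any trail byte
        }
    }
}
```

For example, with lead byte 0xE0 the byte 0x80 is rejected at i == 1 but accepted at i == 2, because the restriction applies only to the first trail byte.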
-
-