Class Trie2

  • All Implemented Interfaces:
    java.lang.Iterable<Trie2.Range>
    Direct Known Subclasses:
    Trie2_16, Trie2_32, Trie2Writable

    public abstract class Trie2
    extends java.lang.Object
    implements java.lang.Iterable<Trie2.Range>
    This is the interface and common implementation of a Unicode Trie2. It is a kind of compressed table that maps from Unicode code points (0..0x10ffff) to 16- or 32-bit integer values. It works best when there are ranges of characters with the same value, which is generally the case with Unicode character properties. This is the second common version of a Unicode trie (hence the name Trie2).
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      class  Trie2.CharSequenceIterator
      An iterator that operates over an input CharSequence, and for each Unicode code point in the input returns the associated value from the Trie2.
      static class  Trie2.CharSequenceValues
      Struct-like class for holding the results returned by a UTrie2 CharSequence iterator.
      static class  Trie2.Range
      When iterating over the contents of a Trie2, Elements of this type are produced.
      (package private) class  Trie2.Trie2Iterator
      Implementation class for an iterator over a Trie2.
      (package private) static class  Trie2.UTrie2Header
      Trie2 data structure in serialized form: UTrie2Header header; uint16_t index[header.index2Length]; uint16_t data[header.shiftedDataLength<<2]; -- or uint32_t data[...] For Java, this is read from the stream into an instance of UTrie2Header.
      static interface  Trie2.ValueMapper
      When iterating over the contents of a Trie2, an instance of TrieValueMapper may be used to remap the values from the Trie2.
      (package private) static class  Trie2.ValueWidth
      Selectors for the width of a UTrie2 data value.
    • Constructor Summary

      Constructors 
      Constructor Description
      Trie2()  
    • Method Summary

      All Methods Static Methods Instance Methods Abstract Methods Concrete Methods 
      Modifier and Type Method Description
      Trie2.CharSequenceIterator charSequenceIterator​(java.lang.CharSequence text, int index)
      Create an iterator that will produce the values from the Trie2 for the sequence of code points in an input text.
      static Trie2 createFromSerialized​(java.nio.ByteBuffer bytes)
      Create a Trie2 from its serialized form.
      boolean equals​(java.lang.Object other)
      Equals function.
      abstract int get​(int codePoint)
      Get the value for a code point as stored in the Trie2.
      abstract int getFromU16SingleLead​(char c)
      Get the trie value for a UTF-16 code unit.
      static int getVersion​(java.io.InputStream is, boolean littleEndianOk)
      Get the UTrie version from an InputStream containing the serialized form of either a Trie (version 1) or a Trie2 (version 2).
      private static int hashByte​(int h, int b)  
      int hashCode()  
      private static int hashInt​(int h, int i)  
      private static int hashUChar32​(int h, int c)  
      private static int initHash()  
      java.util.Iterator<Trie2.Range> iterator()
      Create an iterator over the value ranges in this Trie2.
      java.util.Iterator<Trie2.Range> iterator​(Trie2.ValueMapper mapper)
      Create an iterator over the value ranges from this Trie2.
      java.util.Iterator<Trie2.Range> iteratorForLeadSurrogate​(char lead)
      Create an iterator over the Trie2 values for the 1024=0x400 code points corresponding to a given lead surrogate.
      java.util.Iterator<Trie2.Range> iteratorForLeadSurrogate​(char lead, Trie2.ValueMapper mapper)
      Create an iterator over the Trie2 values for the 1024=0x400 code points corresponding to a given lead surrogate.
      (package private) int rangeEnd​(int start, int limitp, int val)
      Find the last character in a contiguous range of characters with the same Trie2 value as the input character.
      protected int serializeHeader​(java.io.DataOutputStream dos)
      Serialize a trie2 Header and Index onto an OutputStream.
      • Methods inherited from class java.lang.Object

        clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface java.lang.Iterable

        forEach, spliterator
    • Field Detail

      • index

        char[] index
      • data16

        int data16
      • data32

        int[] data32
      • indexLength

        int indexLength
      • dataLength

        int dataLength
      • index2NullOffset

        int index2NullOffset
      • initialValue

        int initialValue
      • errorValue

        int errorValue
        Value returned for out-of-range code points and illegal UTF-8.
      • highStart

        int highStart
      • highValueIndex

        int highValueIndex
      • dataNullOffset

        int dataNullOffset
      • fHash

        int fHash
      • UTRIE2_OPTIONS_VALUE_BITS_MASK

        static final int UTRIE2_OPTIONS_VALUE_BITS_MASK
        Trie2 constants, defining shift widths, index array lengths, etc. These are needed for the runtime macros but users can treat these as implementation details and skip to the actual public API further below.
        See Also:
        Constant Field Values
      • UTRIE2_SHIFT_1

        static final int UTRIE2_SHIFT_1
        Shift size for getting the index-1 table offset.
        See Also:
        Constant Field Values
      • UTRIE2_SHIFT_2

        static final int UTRIE2_SHIFT_2
        Shift size for getting the index-2 table offset.
        See Also:
        Constant Field Values
      • UTRIE2_SHIFT_1_2

        static final int UTRIE2_SHIFT_1_2
        Difference between the two shift sizes, for getting an index-1 offset from an index-2 offset. 6=11-5
        See Also:
        Constant Field Values
      • UTRIE2_OMITTED_BMP_INDEX_1_LENGTH

        static final int UTRIE2_OMITTED_BMP_INDEX_1_LENGTH
        Number of index-1 entries for the BMP. 32=0x20 This part of the index-1 table is omitted from the serialized form.
        See Also:
        Constant Field Values
      • UTRIE2_CP_PER_INDEX_1_ENTRY

        static final int UTRIE2_CP_PER_INDEX_1_ENTRY
        Number of code points per index-1 table entry. 2048=0x800
        See Also:
        Constant Field Values
      • UTRIE2_INDEX_2_BLOCK_LENGTH

        static final int UTRIE2_INDEX_2_BLOCK_LENGTH
        Number of entries in an index-2 block. 64=0x40
        See Also:
        Constant Field Values
      • UTRIE2_INDEX_2_MASK

        static final int UTRIE2_INDEX_2_MASK
        Mask for getting the lower bits for the in-index-2-block offset.
        See Also:
        Constant Field Values
      • UTRIE2_DATA_BLOCK_LENGTH

        static final int UTRIE2_DATA_BLOCK_LENGTH
        Number of entries in a data block. 32=0x20
        See Also:
        Constant Field Values
      • UTRIE2_DATA_MASK

        static final int UTRIE2_DATA_MASK
        Mask for getting the lower bits for the in-data-block offset.
        See Also:
        Constant Field Values
      • UTRIE2_INDEX_SHIFT

        static final int UTRIE2_INDEX_SHIFT
        Shift size for shifting left the index array values. Increases possible data size with 16-bit index values at the cost of compactability. This requires data blocks to be aligned by UTRIE2_DATA_GRANULARITY.
        See Also:
        Constant Field Values
      • UTRIE2_DATA_GRANULARITY

        static final int UTRIE2_DATA_GRANULARITY
        The alignment size of a data block. Also the granularity for compaction.
        See Also:
        Constant Field Values
      • UTRIE2_INDEX_2_OFFSET

        static final int UTRIE2_INDEX_2_OFFSET
        The BMP part of the index-2 table is fixed and linear and starts at offset 0. Length=2048=0x800=0x10000>>UTRIE2_SHIFT_2.
        See Also:
        Constant Field Values
      • UTRIE2_LSCP_INDEX_2_OFFSET

        static final int UTRIE2_LSCP_INDEX_2_OFFSET
        The part of the index-2 table for U+D800..U+DBFF stores values for lead surrogate code _units_ not code _points_. Values for lead surrogate code _points_ are indexed with this portion of the table. Length=32=0x20=0x400>>UTRIE2_SHIFT_2. (There are 1024=0x400 lead surrogates.)
        See Also:
        Constant Field Values
      • UTRIE2_LSCP_INDEX_2_LENGTH

        static final int UTRIE2_LSCP_INDEX_2_LENGTH
        See Also:
        Constant Field Values
      • UTRIE2_INDEX_2_BMP_LENGTH

        static final int UTRIE2_INDEX_2_BMP_LENGTH
        Count the lengths of both BMP pieces. 2080=0x820
        See Also:
        Constant Field Values
      • UTRIE2_UTF8_2B_INDEX_2_OFFSET

        static final int UTRIE2_UTF8_2B_INDEX_2_OFFSET
        The 2-byte UTF-8 version of the index-2 table follows at offset 2080=0x820. Length 32=0x20 for lead bytes C0..DF, regardless of UTRIE2_SHIFT_2.
        See Also:
        Constant Field Values
      • UTRIE2_UTF8_2B_INDEX_2_LENGTH

        static final int UTRIE2_UTF8_2B_INDEX_2_LENGTH
        See Also:
        Constant Field Values
      • UTRIE2_INDEX_1_OFFSET

        static final int UTRIE2_INDEX_1_OFFSET
        The index-1 table, only used for supplementary code points, at offset 2112=0x840. Variable length, for code points up to highStart, where the last single-value range starts. Maximum length 512=0x200=0x100000>>UTRIE2_SHIFT_1. (For 0x100000 supplementary code points U+10000..U+10ffff.) The part of the index-2 table for supplementary code points starts after this index-1 table. Both the index-1 table and the following part of the index-2 table are omitted completely if there is only BMP data.
        See Also:
        Constant Field Values
      • UTRIE2_BAD_UTF8_DATA_OFFSET

        static final int UTRIE2_BAD_UTF8_DATA_OFFSET
        The illegal-UTF-8 data block follows the ASCII block, at offset 128=0x80. Used with linear access for single bytes 0..0xbf for simple error handling. Length 64=0x40, not UTRIE2_DATA_BLOCK_LENGTH.
        See Also:
        Constant Field Values
      • UTRIE2_DATA_START_OFFSET

        static final int UTRIE2_DATA_START_OFFSET
        The start of non-linear-ASCII data blocks, at offset 192=0xc0.
        See Also:
        Constant Field Values
      • UNEWTRIE2_INDEX_GAP_OFFSET

        static final int UNEWTRIE2_INDEX_GAP_OFFSET
        See Also:
        Constant Field Values
      • UNEWTRIE2_INDEX_GAP_LENGTH

        static final int UNEWTRIE2_INDEX_GAP_LENGTH
        See Also:
        Constant Field Values
      • UNEWTRIE2_MAX_INDEX_2_LENGTH

        static final int UNEWTRIE2_MAX_INDEX_2_LENGTH
        Maximum length of the build-time index-2 array. Maximum number of Unicode code points (0x110000) shifted right by UTRIE2_SHIFT_2, plus the part of the index-2 table for lead surrogate code points, plus the build-time index gap, plus the null index-2 block.
        See Also:
        Constant Field Values
      • UNEWTRIE2_MAX_DATA_LENGTH

        static final int UNEWTRIE2_MAX_DATA_LENGTH
        Maximum length of the build-time data array. One entry per 0x110000 code points, plus the illegal-UTF-8 block and the null block, plus values for the 0x400 surrogate code units.
        See Also:
        Constant Field Values
    • Constructor Detail

      • Trie2

        public Trie2()
    • Method Detail

      • createFromSerialized

        public static Trie2 createFromSerialized​(java.nio.ByteBuffer bytes)
                                          throws java.io.IOException
        Create a Trie2 from its serialized form. Inverse of utrie2_serialize(). Reads from the current position and leaves the buffer after the end of the trie. The serialized format is identical between ICU4C and ICU4J, so this function will work with serialized Trie2s from either. The actual type of the returned Trie2 will be either Trie2_16 or Trie2_32, depending on the width of the data. To obtain the width of the Trie2, check the actual class type of the returned Trie2. Or use the createFromSerialized() function of Trie2_16 or Trie2_32, which will return only Tries of their specific type/size. The serialized Trie2 on the stream may be in either little or big endian byte order. This allows using serialized Tries from ICU4C without needing to consider the byte order of the system that created them.
        Parameters:
        bytes - a byte buffer to the serialized form of a UTrie2.
        Returns:
        An unserialized Trie2, ready for use.
        Throws:
        java.lang.IllegalArgumentException - if the stream does not contain a serialized Trie2.
        java.io.IOException - if a read error occurs in the buffer.
      • getVersion

        public static int getVersion​(java.io.InputStream is,
                                     boolean littleEndianOk)
                              throws java.io.IOException
        Get the UTrie version from an InputStream containing the serialized form of either a Trie (version 1) or a Trie2 (version 2).
        Parameters:
        is - an InputStream containing the serialized form of a UTrie, version 1 or 2. The stream must support mark() and reset(). The position of the input stream will be left unchanged.
        littleEndianOk - If false, only big-endian (Java native) serialized forms are recognized. If true, little-endian serialized forms are recognized as well.
        Returns:
        the Trie version of the serialized form, or 0 if it is not recognized as a serialized UTrie
        Throws:
        java.io.IOException - on errors in reading from the input stream.
      • get

        public abstract int get​(int codePoint)
        Get the value for a code point as stored in the Trie2.
        Parameters:
        codePoint - the code point
        Returns:
        the value
      • getFromU16SingleLead

        public abstract int getFromU16SingleLead​(char c)
        Get the trie value for a UTF-16 code unit. A Trie2 stores two distinct values for input in the lead surrogate range, one for lead surrogates, which is the value that will be returned by this function, and a second value that is returned by Trie2.get(). For code units outside of the lead surrogate range, this function returns the same result as Trie2.get(). This function, together with the alternate value for lead surrogates, makes possible very efficient processing of UTF-16 strings without first converting surrogate pairs to their corresponding 32 bit code point values. At build-time, enumerate the contents of the Trie2 to see if there is non-trivial (non-initialValue) data for any of the supplementary code points associated with a lead surrogate. If so, then set a special (application-specific) value for the lead surrogate code _unit_, with Trie2Writable.setForLeadSurrogateCodeUnit(). At runtime, use Trie2.getFromU16SingleLead(). If there is non-trivial data and the code unit is a lead surrogate, then check if a trail surrogate follows. If so, assemble the supplementary code point and look up its value with Trie2.get(); otherwise reset the lead surrogate's value or do a code point lookup for it. If there is only trivial data for lead and trail surrogates, then processing can often skip them. For example, in normalization or case mapping all characters that do not have any mappings are simply copied as is.
        Parameters:
        c - the code point or lead surrogate value.
        Returns:
        the value
      • equals

        public final boolean equals​(java.lang.Object other)
        Equals function. Two Tries are equal if their contents are equal. The type need not be the same, so a Trie2Writable will be equal to (read-only) Trie2_16 or Trie2_32 so long as they are storing the same values.
        Overrides:
        equals in class java.lang.Object
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class java.lang.Object
      • iterator

        public java.util.Iterator<Trie2.Range> iterator()
        Create an iterator over the value ranges in this Trie2. Values from the Trie2 are not remapped or filtered, but are returned as they are stored in the Trie2.
        Specified by:
        iterator in interface java.lang.Iterable<Trie2.Range>
        Returns:
        an Iterator
      • iterator

        public java.util.Iterator<Trie2.Range> iterator​(Trie2.ValueMapper mapper)
        Create an iterator over the value ranges from this Trie2. Values from the Trie2 are passed through a caller-supplied remapping function, and it is the remapped values that determine the ranges that will be produced by the iterator.
        Parameters:
        mapper - provides a function to remap values obtained from the Trie2.
        Returns:
        an Iterator
      • iteratorForLeadSurrogate

        public java.util.Iterator<Trie2.Range> iteratorForLeadSurrogate​(char lead,
                                                                        Trie2.ValueMapper mapper)
        Create an iterator over the Trie2 values for the 1024=0x400 code points corresponding to a given lead surrogate. For example, for the lead surrogate U+D87E it will enumerate the values for [U+2F800..U+2FC00[. Used by data builder code that sets special lead surrogate code unit values for optimized UTF-16 string processing. Do not modify the Trie2 during the iteration. Except for the limited code point range, this functions just like Trie2.iterator().
      • iteratorForLeadSurrogate

        public java.util.Iterator<Trie2.Range> iteratorForLeadSurrogate​(char lead)
        Create an iterator over the Trie2 values for the 1024=0x400 code points corresponding to a given lead surrogate. For example, for the lead surrogate U+D87E it will enumerate the values for [U+2F800..U+2FC00[. Used by data builder code that sets special lead surrogate code unit values for optimized UTF-16 string processing. Do not modify the Trie2 during the iteration. Except for the limited code point range, this functions just like Trie2.iterator().
      • serializeHeader

        protected int serializeHeader​(java.io.DataOutputStream dos)
                               throws java.io.IOException
        Serialize a trie2 Header and Index onto an OutputStream. This is common code used for both the Trie2_16 and Trie2_32 serialize functions.
        Parameters:
        dos - the stream to which the serialized Trie2 data will be written.
        Returns:
        the number of bytes written.
        Throws:
        java.io.IOException
      • charSequenceIterator

        public Trie2.CharSequenceIterator charSequenceIterator​(java.lang.CharSequence text,
                                                               int index)
        Create an iterator that will produce the values from the Trie2 for the sequence of code points in an input text.
        Parameters:
        text - A text string to be iterated over.
        index - The starting iteration position within the input text.
        Returns:
        the CharSequenceIterator
      • rangeEnd

        int rangeEnd​(int start,
                     int limitp,
                     int val)
        Find the last character in a contiguous range of characters with the same Trie2 value as the input character.
        Parameters:
        c - The character to begin with.
        Returns:
        The last contiguous character with the same value.
      • initHash

        private static int initHash()
      • hashByte

        private static int hashByte​(int h,
                                    int b)
      • hashUChar32

        private static int hashUChar32​(int h,
                                       int c)
      • hashInt

        private static int hashInt​(int h,
                                   int i)