Class Trie

  • Direct Known Subclasses:
    CharTrie, IntTrie

    public abstract class Trie
    extends java.lang.Object

    A trie is a kind of compressed, serializable table of values associated with Unicode code points (0..0x10ffff).

    This class defines the basic structure of a trie and provides methods to retrieve the offsets to the actual data.

    Data will be the form of an array of basic types, char or int.

    The actual data format will have to be specified by the user in the inner static interface com.ibm.icu.impl.Trie.DataManipulate.

    This trie implementation is optimized for getting offset while walking forward through a UTF-16 string. Therefore, the simplest and fastest access macros are the fromLead() and fromOffsetTrail() methods. The fromBMP() method are a little more complicated; they get offsets even for lead surrogate codepoints, while the fromLead() method get special "folded" offsets for lead surrogate code units if there is relevant data associated with them. From such a folded offsets, an offset needs to be extracted to supply to the fromOffsetTrail() methods. To handle such supplementary codepoints, some offset information are kept in the data.

    Methods in com.ibm.icu.impl.Trie.DataManipulate are called to retrieve that offset from the folded value for the lead surrogate unit.

    For examples of use, see com.ibm.icu.impl.CharTrie or com.ibm.icu.impl.IntTrie.

    Since:
    release 2.1, Jan 01 2002
    See Also:
    CharTrie, IntTrie
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static interface  Trie.DataManipulate
      Character data in com.ibm.impl.Trie have different user-specified format for different purposes.
      private static class  Trie.DefaultGetFoldingOffset  
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected Trie​(char[] index, int options, Trie.DataManipulate dataManipulate)
      Trie constructor
      protected Trie​(java.nio.ByteBuffer bytes, Trie.DataManipulate dataManipulate)
      Trie constructor for CharTrie use.
    • Method Summary

      All Methods Instance Methods Abstract Methods Concrete Methods 
      Modifier and Type Method Description
      private boolean checkHeader​(int signature)
      Authenticates raw data header.
      boolean equals​(java.lang.Object other)
      Checks if the argument Trie has the same data as this Trie.
      protected int getBMPOffset​(char ch)
      Gets the offset to data which the BMP character points to Treats a lead surrogate as a normal code point.
      protected int getCodePointOffset​(int ch)
      Internal trie getter from a code point.
      protected abstract int getInitialValue()
      Gets the default initial value
      protected int getLeadOffset​(char ch)
      Gets the offset to the data which this lead surrogate character points to.
      protected int getRawOffset​(int offset, char ch)
      Gets the offset to the data which the index ch after variable offset points to.
      int getSerializedDataSize()
      Gets the serialized data file size of the Trie.
      protected abstract int getSurrogateOffset​(char lead, char trail)
      Gets the offset to the data which the surrogate pair points to.
      protected abstract int getValue​(int index)
      Gets the value at the argument index
      int hashCode()  
      protected boolean isCharTrie()
      Determines if this is a 16 bit trie
      protected boolean isIntTrie()
      Determines if this is a 32 bit trie
      boolean isLatin1Linear()
      Determines if this trie has a linear latin 1 array
      protected void unserialize​(java.nio.ByteBuffer bytes)
      Parses the byte buffer and creates the trie index with it.
      • Methods inherited from class java.lang.Object

        clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LEAD_INDEX_OFFSET_

        protected static final int LEAD_INDEX_OFFSET_
        Lead surrogate code points' index displacement in the index array. 0x10000-0xd800=0x2800 0x2800 >> INDEX_STAGE_1_SHIFT_
        See Also:
        Constant Field Values
      • INDEX_STAGE_1_SHIFT_

        protected static final int INDEX_STAGE_1_SHIFT_
        Shift size for shifting right the input index. 1..9
        See Also:
        Constant Field Values
      • INDEX_STAGE_2_SHIFT_

        protected static final int INDEX_STAGE_2_SHIFT_
        Shift size for shifting left the index array values. Increases possible data size with 16-bit index values at the cost of compactability. This requires blocks of stage 2 data to be aligned by DATA_GRANULARITY. 0..INDEX_STAGE_1_SHIFT
        See Also:
        Constant Field Values
      • DATA_BLOCK_LENGTH

        protected static final int DATA_BLOCK_LENGTH
        Number of data values in a stage 2 (data array) block.
        See Also:
        Constant Field Values
      • INDEX_STAGE_3_MASK_

        protected static final int INDEX_STAGE_3_MASK_
        Mask for getting the lower bits from the input index. DATA_BLOCK_LENGTH - 1.
        See Also:
        Constant Field Values
      • SURROGATE_BLOCK_BITS

        protected static final int SURROGATE_BLOCK_BITS
        Number of bits of a trail surrogate that are used in index table lookups.
        See Also:
        Constant Field Values
      • SURROGATE_BLOCK_COUNT

        protected static final int SURROGATE_BLOCK_COUNT
        Number of index (stage 1) entries per lead surrogate. Same as number of index entries for 1024 trail surrogates, ==0x400>>INDEX_STAGE_1_SHIFT_
        See Also:
        Constant Field Values
      • BMP_INDEX_LENGTH

        protected static final int BMP_INDEX_LENGTH
        Length of the BMP portion of the index (stage 1) array.
        See Also:
        Constant Field Values
      • SURROGATE_MASK_

        protected static final int SURROGATE_MASK_
        Surrogate mask to use when shifting offset to retrieve supplementary values
        See Also:
        Constant Field Values
      • m_index_

        protected char[] m_index_
        Index or UTF16 characters
      • m_dataManipulate_

        protected Trie.DataManipulate m_dataManipulate_
        Internal TrieValue which handles the parsing of the data value. This class is to be implemented by the user
      • m_dataOffset_

        protected int m_dataOffset_
        Start index of the data portion of the trie. CharTrie combines index and data into a char array, so this is used to indicate the initial offset to the data portion. Note this index always points to the initial value.
      • m_dataLength_

        protected int m_dataLength_
        Length of the data array
      • HEADER_LENGTH_

        protected static final int HEADER_LENGTH_
        Size of Trie header in bytes
        See Also:
        Constant Field Values
      • HEADER_OPTIONS_LATIN1_IS_LINEAR_MASK_

        protected static final int HEADER_OPTIONS_LATIN1_IS_LINEAR_MASK_
        Latin 1 option mask
        See Also:
        Constant Field Values
      • HEADER_SIGNATURE_

        protected static final int HEADER_SIGNATURE_
        Constant number to authenticate the byte block
        See Also:
        Constant Field Values
      • HEADER_OPTIONS_SHIFT_MASK_

        private static final int HEADER_OPTIONS_SHIFT_MASK_
        Header option formatting
        See Also:
        Constant Field Values
      • HEADER_OPTIONS_INDEX_SHIFT_

        protected static final int HEADER_OPTIONS_INDEX_SHIFT_
        See Also:
        Constant Field Values
      • HEADER_OPTIONS_DATA_IS_32_BIT_

        protected static final int HEADER_OPTIONS_DATA_IS_32_BIT_
        See Also:
        Constant Field Values
      • m_isLatin1Linear_

        private boolean m_isLatin1Linear_
        Flag indicator for Latin quick access data block
      • m_options_

        private int m_options_

        Trie options field.

        options bit field:
        9 1 = Latin-1 data is stored linearly at data + DATA_BLOCK_LENGTH
        8 0 = 16-bit data, 1=32-bit data
        7..4 INDEX_STAGE_1_SHIFT // 0..INDEX_STAGE_2_SHIFT
        3..0 INDEX_STAGE_2_SHIFT // 1..9

    • Constructor Detail

      • Trie

        protected Trie​(java.nio.ByteBuffer bytes,
                       Trie.DataManipulate dataManipulate)
        Trie constructor for CharTrie use.
        Parameters:
        bytes - data of an ICU data file, containing the trie
        dataManipulate - object containing the information to parse the trie data
      • Trie

        protected Trie​(char[] index,
                       int options,
                       Trie.DataManipulate dataManipulate)
        Trie constructor
        Parameters:
        index - array to be used for index
        options - used by the trie
        dataManipulate - object containing the information to parse the trie data
    • Method Detail

      • isLatin1Linear

        public final boolean isLatin1Linear()
        Determines if this trie has a linear latin 1 array
        Returns:
        true if this trie has a linear latin 1 array, false otherwise
      • equals

        public boolean equals​(java.lang.Object other)
        Checks if the argument Trie has the same data as this Trie. Attributes are checked but not the index data.
        Overrides:
        equals in class java.lang.Object
        Parameters:
        other - Trie to check
        Returns:
        true if the argument Trie has the same data as this Trie, false otherwise
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class java.lang.Object
      • getSerializedDataSize

        public int getSerializedDataSize()
        Gets the serialized data file size of the Trie. This is used during trie data reading for size checking purposes.
        Returns:
        size size of serialized trie data file in terms of the number of bytes
      • getSurrogateOffset

        protected abstract int getSurrogateOffset​(char lead,
                                                  char trail)
        Gets the offset to the data which the surrogate pair points to.
        Parameters:
        lead - lead surrogate
        trail - trailing surrogate
        Returns:
        offset to data
      • getValue

        protected abstract int getValue​(int index)
        Gets the value at the argument index
        Parameters:
        index - value at index will be retrieved
        Returns:
        32 bit value
      • getInitialValue

        protected abstract int getInitialValue()
        Gets the default initial value
        Returns:
        32 bit value
      • getRawOffset

        protected final int getRawOffset​(int offset,
                                         char ch)
        Gets the offset to the data which the index ch after variable offset points to. Note for locating a non-supplementary character data offset, calling

        getRawOffset(0, ch);

        will do. Otherwise if it is a supplementary character formed by surrogates lead and trail. Then we would have to call getRawOffset() with getFoldingIndexOffset(). See getSurrogateOffset().
        Parameters:
        offset - index offset which ch is to start from
        ch - index to be used after offset
        Returns:
        offset to the data
      • getBMPOffset

        protected final int getBMPOffset​(char ch)
        Gets the offset to data which the BMP character points to Treats a lead surrogate as a normal code point.
        Parameters:
        ch - BMP character
        Returns:
        offset to data
      • getLeadOffset

        protected final int getLeadOffset​(char ch)
        Gets the offset to the data which this lead surrogate character points to. Data at the returned offset may contain folding offset information for the next trailing surrogate character.
        Parameters:
        ch - lead surrogate character
        Returns:
        offset to data
      • getCodePointOffset

        protected final int getCodePointOffset​(int ch)
        Internal trie getter from a code point. Could be faster(?) but longer with if((c32)<=0xd7ff) { (result)=_TRIE_GET_RAW(trie, data, 0, c32); } Gets the offset to data which the codepoint points to
        Parameters:
        ch - codepoint
        Returns:
        offset to data
      • unserialize

        protected void unserialize​(java.nio.ByteBuffer bytes)

        Parses the byte buffer and creates the trie index with it.

        The position of the input ByteBuffer must be right after the trie header.

        This is overwritten by the child classes.

        Parameters:
        bytes - buffer containing trie data
      • isIntTrie

        protected final boolean isIntTrie()
        Determines if this is a 32 bit trie
        Returns:
        true if options specifies this is a 32 bit trie
      • isCharTrie

        protected final boolean isCharTrie()
        Determines if this is a 16 bit trie
        Returns:
        true if this is a 16 bit trie
      • checkHeader

        private final boolean checkHeader​(int signature)
        Authenticates raw data header. Checking the header information, signature and options.
        Parameters:
        signature - This contains the options and type of a Trie
        Returns:
        true if the header is authenticated valid