Class Collation


  • public final class Collation
    extends java.lang.Object
    Basic definitions and static helper functions for collation v2. Data structures other than expansion tables store 32-bit collation elements (CE32s), which are either specials (see the tags below) or compact forms of 64-bit CEs.
    • Field Detail

      • SENTINEL_CP

        public static final int SENTINEL_CP
        Sentinel code point, the counterpart of U_SENTINEL for UChar32. TODO: Create a common, public constant?
        See Also:
        Constant Field Values
      • BEFORE_WEIGHT16

        static final int BEFORE_WEIGHT16
        The secondary/tertiary lower limit for tailoring before any root elements.
        See Also:
        Constant Field Values
      • MERGE_SEPARATOR_BYTE

        public static final int MERGE_SEPARATOR_BYTE
        Merge-sort-key separator. Same as the unique primary and identical-level weights of U+FFFE. Must not be used as primary compression low terminator. Otherwise usable.
        See Also:
        Constant Field Values
      • MERGE_SEPARATOR_PRIMARY

        public static final long MERGE_SEPARATOR_PRIMARY
        Primary weight of U+FFFE, whose lead byte is MERGE_SEPARATOR_BYTE.
        See Also:
        Constant Field Values
      • PRIMARY_COMPRESSION_LOW_BYTE

        public static final int PRIMARY_COMPRESSION_LOW_BYTE
        Primary compression low terminator, must be greater than MERGE_SEPARATOR_BYTE. Reserved value in primary second byte if the lead byte is compressible. Otherwise usable in all CE weight bytes.
        See Also:
        Constant Field Values
      • PRIMARY_COMPRESSION_HIGH_BYTE

        public static final int PRIMARY_COMPRESSION_HIGH_BYTE
        Primary compression high terminator. Reserved value in primary second byte if the lead byte is compressible. Otherwise usable in all CE weight bytes.
        See Also:
        Constant Field Values
      • COMMON_BYTE

        static final int COMMON_BYTE
        Default secondary/tertiary weight lead byte.
        See Also:
        Constant Field Values
      • COMMON_SECONDARY_CE

        static final int COMMON_SECONDARY_CE
        Middle 16 bits of a CE with a common secondary weight.
        See Also:
        Constant Field Values
      • COMMON_TERTIARY_CE

        static final int COMMON_TERTIARY_CE
        Lower 16 bits of a CE with a common tertiary weight.
        See Also:
        Constant Field Values
      • COMMON_SEC_AND_TER_CE

        public static final int COMMON_SEC_AND_TER_CE
        Lower 32 bits of a CE with common secondary and tertiary weights.
        See Also:
        Constant Field Values
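        As an illustration of how the three common-weight constants above relate, the following sketch derives them from the common lead byte. The value 0x05 for COMMON_BYTE is an assumption, taken from the [pp, 05, tt] notation used under LATIN_EXPANSION_TAG; the class name CommonWeightsSketch is hypothetical. Consult the Constant Field Values page for the actual values.

```java
final class CommonWeightsSketch {
    // Assumed value of the default secondary/tertiary lead byte.
    static final int COMMON_BYTE = 0x05;
    // Common secondary weight in bits 31..16 of the lower 32 bits of a CE.
    static final int COMMON_SECONDARY_CE = COMMON_BYTE << 24;
    // Common tertiary weight in bits 15..0 of a CE.
    static final int COMMON_TERTIARY_CE = COMMON_BYTE << 8;
    // Lower 32 bits of a CE with both common weights.
    static final int COMMON_SEC_AND_TER_CE = COMMON_SECONDARY_CE | COMMON_TERTIARY_CE;

    public static void main(String[] args) {
        System.out.printf("%08x%n", COMMON_SEC_AND_TER_CE); // prints 05000500
    }
}
```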
      • ONLY_TERTIARY_MASK

        public static final int ONLY_TERTIARY_MASK
        Only the 2*6 bits for the pure tertiary weight: the low six bits of each of the two tertiary bytes, without the case and quaternary bits.
        See Also:
        Constant Field Values
      • ONLY_SEC_TER_MASK

        static final int ONLY_SEC_TER_MASK
        Only the secondary & tertiary bits; no case, no quaternary.
        See Also:
        Constant Field Values
      • CASE_AND_TERTIARY_MASK

        static final int CASE_AND_TERTIARY_MASK
        Case bits and tertiary bits.
        See Also:
        Constant Field Values
      • CASE_AND_QUATERNARY_MASK

        public static final int CASE_AND_QUATERNARY_MASK
        Case bits and quaternary bits.
        See Also:
        Constant Field Values
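        The masks above can be pictured as selecting bit ranges from the lower 32 bits of a CE. The sketch below is a minimal illustration; the concrete mask values (and the helper names SECONDARY_MASK, CASE_MASK, QUATERNARY_MASK, pureTertiary, caseBits) are assumptions consistent with the "2*6 bits" description, not values confirmed on this page.

```java
final class WeightMasksSketch {
    static final int SECONDARY_MASK = 0xffff0000;  // assumed: bits 31..16
    static final int CASE_MASK = 0xc000;           // assumed: bits 15..14
    static final int ONLY_TERTIARY_MASK = 0x3f3f;  // assumed: low 6 bits of each tertiary byte
    static final int QUATERNARY_MASK = 0xc0;       // assumed: bits 7..6

    static final int ONLY_SEC_TER_MASK = SECONDARY_MASK | ONLY_TERTIARY_MASK;
    static final int CASE_AND_TERTIARY_MASK = CASE_MASK | ONLY_TERTIARY_MASK;
    static final int CASE_AND_QUATERNARY_MASK = CASE_MASK | QUATERNARY_MASK;

    // Tertiary weight without case and quaternary bits.
    static int pureTertiary(long ce) {
        return (int) ce & ONLY_TERTIARY_MASK;
    }

    // Case bits only.
    static int caseBits(long ce) {
        return (int) ce & CASE_MASK;
    }

    public static void main(String[] args) {
        long ce = 0x2a3b4c0005008540L; // hypothetical CE with case and quaternary bits set
        System.out.printf("%04x %04x%n", pureTertiary(ce), caseBits(ce));
    }
}
```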
      • FIRST_UNASSIGNED_PRIMARY

        static final long FIRST_UNASSIGNED_PRIMARY
        First unassigned: AlphabeticIndex overflow boundary. We want a 3-byte primary so that it fits into the root elements table. This 3-byte primary will not collide with any unassigned-implicit 4-byte primaries because the first few hundred Unicode code points all have real mappings.
        See Also:
        Constant Field Values
      • SPECIAL_CE32_LOW_BYTE

        static final int SPECIAL_CE32_LOW_BYTE
        A CE32 is special if its low byte is this value or greater. The otherwise impossible case-bits combination 11 marks special CE32s. This value itself indicates a fallback to the base collator.
        See Also:
        Constant Field Values
      • LONG_PRIMARY_CE32_LOW_BYTE

        static final int LONG_PRIMARY_CE32_LOW_BYTE
        Low byte of a long-primary special CE32.
        See Also:
        Constant Field Values
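        The special-CE32 test described above can be sketched as a comparison on the low byte. The value 0xc0 for SPECIAL_CE32_LOW_BYTE (case bits 11 in the top of the low byte) and the use of the low four bits as the tag are assumptions for illustration; SpecialCe32Sketch is a hypothetical class.

```java
final class SpecialCe32Sketch {
    // Assumed: case bits 11 in the two top bits of the low byte.
    static final int SPECIAL_CE32_LOW_BYTE = 0xc0;

    // A CE32 is special if its low byte is SPECIAL_CE32_LOW_BYTE or greater.
    static boolean isSpecialCE32(int ce32) {
        return (ce32 & 0xff) >= SPECIAL_CE32_LOW_BYTE;
    }

    // Assumed: the tag occupies the low four bits of a special CE32.
    static int tagFromCE32(int ce32) {
        return ce32 & 0xf;
    }

    public static void main(String[] args) {
        int ce32 = 0x123456c1; // hypothetical long-primary CE32
        System.out.println(isSpecialCE32(ce32) + " tag=" + tagFromCE32(ce32));
    }
}
```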
      • NO_CE_PRIMARY

        static final long NO_CE_PRIMARY
        No CE: End of input. Only used in runtime code, not stored in data.
        See Also:
        Constant Field Values
      • NO_LEVEL_FLAG

        static final int NO_LEVEL_FLAG
        Sort key level flags: xx_FLAG = 1 << xx_LEVEL. In Java, use enum Level with flag() getters, or use EnumSet rather than hand-made bit sets.
        See Also:
        Constant Field Values
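        The note above recommends an enum with flag() getters or an EnumSet over hand-made bit sets. A minimal sketch of that idea follows; the enum name Level, its constants, and their order are hypothetical and chosen only so that flag() equals 1 << level.

```java
import java.util.EnumSet;

final class LevelFlagsSketch {
    // Hypothetical level enum; names and order are assumptions for illustration.
    enum Level {
        NO, PRIMARY, SECONDARY, CASE, TERTIARY, QUATERNARY, IDENTICAL;

        // xx_FLAG = 1 << xx_LEVEL, using the ordinal as the level number.
        int flag() {
            return 1 << ordinal();
        }
    }

    // Converts a type-safe set of levels into the equivalent bit set.
    static int toBits(EnumSet<Level> levels) {
        int bits = 0;
        for (Level level : levels) {
            bits |= level.flag();
        }
        return bits;
    }

    public static void main(String[] args) {
        EnumSet<Level> levels = EnumSet.of(Level.PRIMARY, Level.TERTIARY);
        System.out.println(Integer.toHexString(toBits(levels)));
    }
}
```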
      • FALLBACK_TAG

        static final int FALLBACK_TAG
        Fall back to the base collator. This is the tag value in SPECIAL_CE32_LOW_BYTE and FALLBACK_CE32. Bits 31..8: Unused, 0.
        See Also:
        Constant Field Values
      • LONG_PRIMARY_TAG

        static final int LONG_PRIMARY_TAG
        Long-primary CE with COMMON_SEC_AND_TER_CE. Bits 31..8: Three-byte primary.
        See Also:
        Constant Field Values
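        The long-primary layout above (three-byte primary in bits 31..8, marker in the low byte) could be packed and unpacked roughly as follows. The values 0xc1 and 0x05000500 are assumptions; the method names mirror the helpers listed under Method Detail, but the bodies are an illustrative sketch rather than the library's code.

```java
final class LongPrimarySketch {
    static final int LONG_PRIMARY_CE32_LOW_BYTE = 0xc1;   // assumed: special marker plus tag 1
    static final int COMMON_SEC_AND_TER_CE = 0x05000500;  // assumed common weights

    // p is a three-byte primary in the form pppppp00.
    static int makeLongPrimaryCE32(long p) {
        return (int) (p | LONG_PRIMARY_CE32_LOW_BYTE);
    }

    // Recovers the primary weight pppppp00 by clearing the low byte.
    static long primaryFromLongPrimaryCE32(int ce32) {
        return ce32 & 0xffffff00L;
    }

    // Builds the 64-bit CE: primary in the upper 32 bits, common weights below.
    static long ceFromLongPrimaryCE32(int ce32) {
        return ((ce32 & 0xffffff00L) << 32) | COMMON_SEC_AND_TER_CE;
    }

    public static void main(String[] args) {
        int ce32 = makeLongPrimaryCE32(0x2a3b4c00L);
        System.out.printf("%08x -> %016x%n", ce32, ceFromLongPrimaryCE32(ce32));
    }
}
```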
      • LONG_SECONDARY_TAG

        static final int LONG_SECONDARY_TAG
        Long-secondary CE with zero primary.
          Bits 31..16: Secondary weight.
          Bits 15..8: Tertiary weight.
        See Also:
        Constant Field Values
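        For the long-secondary layout above, a sketch of the packing and unpacking might look like this. The marker value 0xc0 and the tag value 2 are assumptions; the input lower32 is taken to be the lower 32 bits of a CE in the form sssstt00.

```java
final class LongSecondarySketch {
    static final int SPECIAL_CE32_LOW_BYTE = 0xc0; // assumed
    static final int LONG_SECONDARY_TAG = 2;       // assumed tag value

    // lower32 holds the secondary in bits 31..16 and a one-byte tertiary in bits 15..8.
    static int makeLongSecondaryCE32(int lower32) {
        return lower32 | SPECIAL_CE32_LOW_BYTE | LONG_SECONDARY_TAG;
    }

    // The primary is zero, so the CE is just the CE32 with its low byte cleared.
    static long ceFromLongSecondaryCE32(int ce32) {
        return ce32 & 0xffffff00L;
    }

    public static void main(String[] args) {
        int ce32 = makeLongSecondaryCE32(0x12340500);
        System.out.printf("%08x -> %016x%n", ce32, ceFromLongSecondaryCE32(ce32));
    }
}
```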
      • RESERVED_TAG_3

        static final int RESERVED_TAG_3
        Unused. May be used in the future for single-byte secondary CEs (SHORT_SECONDARY_TAG), storing the secondary in bits 31..24, the ccc in bits 23..16, and the tertiary in bits 15..8.
        See Also:
        Constant Field Values
      • LATIN_EXPANSION_TAG

        static final int LATIN_EXPANSION_TAG
        Latin mini expansions of two simple CEs [pp, 05, tt] [00, ss, 05].
          Bits 31..24: Single-byte primary weight pp of the first CE.
          Bits 23..16: Tertiary weight tt of the first CE.
          Bits 15..8: Secondary weight ss of the second CE.
        See Also:
        Constant Field Values
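        To make the Latin mini expansion layout concrete, the sketch below unpacks the two CEs [pp, 05, tt] and [00, ss, 05] from one CE32. The common-weight constants are assumed values, and the method bodies are an illustration of the bit layout described above, not a copy of the library's implementation.

```java
final class LatinExpansionSketch {
    static final int COMMON_SECONDARY_CE = 0x05000000; // assumed
    static final int COMMON_TERTIARY_CE = 0x0500;      // assumed

    // First CE: primary pp from bits 31..24, common secondary, tertiary tt from bits 23..16.
    static long latinCE0FromCE32(int ce32) {
        return ((ce32 & 0xff000000L) << 32) | COMMON_SECONDARY_CE | ((ce32 & 0xff0000) >> 8);
    }

    // Second CE: zero primary, secondary ss from bits 15..8, common tertiary.
    static long latinCE1FromCE32(int ce32) {
        return ((ce32 & 0xff00L) << 16) | COMMON_TERTIARY_CE;
    }

    public static void main(String[] args) {
        int ce32 = 0x2a0607c4; // hypothetical: pp=2a, tt=06, ss=07, low byte = marker plus tag
        System.out.printf("%016x %016x%n", latinCE0FromCE32(ce32), latinCE1FromCE32(ce32));
    }
}
```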
      • EXPANSION32_TAG

        static final int EXPANSION32_TAG
        Points to one or more simple/long-primary/long-secondary 32-bit CE32s.
          Bits 31..13: Index into int table.
          Bits 12..8: Length=1..31.
        See Also:
        Constant Field Values
      • EXPANSION_TAG

        static final int EXPANSION_TAG
        Points to one or more 64-bit CEs.
          Bits 31..13: Index into CE table.
          Bits 12..8: Length=1..31.
        See Also:
        Constant Field Values
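        Both expansion tags share the same index/length packing. A minimal sketch of packing and unpacking those fields follows; the marker value 0xc0 and the tag numbers 5 and 6 are assumptions (inferred from the order in which the tags are listed), and the bodies only illustrate the documented bit ranges.

```java
final class ExpansionCe32Sketch {
    static final int SPECIAL_CE32_LOW_BYTE = 0xc0; // assumed
    static final int EXPANSION32_TAG = 5;          // assumed tag value
    static final int EXPANSION_TAG = 6;            // assumed tag value

    // Index in bits 31..13, length in bits 12..8, marker and tag in the low byte.
    static int makeCE32FromTagIndexAndLength(int tag, int index, int length) {
        return (index << 13) | (length << 8) | SPECIAL_CE32_LOW_BYTE | tag;
    }

    // Unsigned shift so that large indexes do not turn negative.
    static int indexFromCE32(int ce32) {
        return ce32 >>> 13;
    }

    static int lengthFromCE32(int ce32) {
        return (ce32 >> 8) & 31;
    }

    public static void main(String[] args) {
        int ce32 = makeCE32FromTagIndexAndLength(EXPANSION32_TAG, 1000, 3);
        System.out.println(indexFromCE32(ce32) + " " + lengthFromCE32(ce32));
    }
}
```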
      • BUILDER_DATA_TAG

        static final int BUILDER_DATA_TAG
        Builder data, used only in the CollationDataBuilder, not in runtime data.
        If bit 8 is 0: Builder context, points to a list of context-sensitive mappings.
          Bits 31..13: Index to the builder's list of ConditionalCE32 for this character.
          Bits 12..9: Unused, 0.
        If bit 8 is 1 (IS_BUILDER_JAMO_CE32): Builder-only jamoCE32 value. The builder fetches the Jamo CE32 from the trie.
          Bits 31..13: Jamo code point.
          Bits 12..9: Unused, 0.
        See Also:
        Constant Field Values
      • PREFIX_TAG

        static final int PREFIX_TAG
        Points to prefix trie.
          Bits 31..13: Index into prefix/contraction data.
          Bits 12..8: Unused, 0.
        See Also:
        Constant Field Values
      • CONTRACTION_TAG

        static final int CONTRACTION_TAG
        Points to contraction data.
          Bits 31..13: Index into prefix/contraction data.
          Bits 12..11: Unused, 0.
          Bit 10: CONTRACT_TRAILING_CCC flag.
          Bit 9: CONTRACT_NEXT_CCC flag.
          Bit 8: CONTRACT_SINGLE_CP_NO_MATCH flag.
        See Also:
        Constant Field Values
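        The three contraction flags occupy single bits, so they can be tested with plain masks. The sketch below derives the flag values from the bit positions given above; the helper hasFlag and the sample CE32 (including the tag value 9 in its low byte) are hypothetical.

```java
final class ContractionFlagsSketch {
    // Flag values derived from the documented bit positions.
    static final int CONTRACT_SINGLE_CP_NO_MATCH = 1 << 8;
    static final int CONTRACT_NEXT_CCC = 1 << 9;
    static final int CONTRACT_TRAILING_CCC = 1 << 10;

    // Hypothetical helper: true if the given flag bit is set in ce32.
    static boolean hasFlag(int ce32, int flag) {
        return (ce32 & flag) != 0;
    }

    public static void main(String[] args) {
        // Hypothetical contraction CE32: index 7, NEXT_CCC set, assumed low byte 0xc9.
        int ce32 = (7 << 13) | CONTRACT_NEXT_CCC | 0xc9;
        System.out.println("index=" + (ce32 >>> 13)
                + " nextCcc=" + hasFlag(ce32, CONTRACT_NEXT_CCC)
                + " trailingCcc=" + hasFlag(ce32, CONTRACT_TRAILING_CCC));
    }
}
```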
      • DIGIT_TAG

        static final int DIGIT_TAG
        Decimal digit.
          Bits 31..13: Index into int table for non-numeric-collation CE32.
          Bit 12: Unused, 0.
          Bits 11..8: Digit value 0..9.
        See Also:
        Constant Field Values
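        Reading the digit value out of a DIGIT_TAG CE32 amounts to extracting bits 11..8. The following is a small sketch under that reading; the sample CE32 and its low byte 0xca (marker plus an assumed tag value of 10) are hypothetical.

```java
final class DigitCe32Sketch {
    // Digit value 0..9 from bits 11..8.
    static char digitFromCE32(int ce32) {
        return (char) ((ce32 >> 8) & 0xf);
    }

    // Index of the non-numeric-collation CE32 from bits 31..13.
    static int indexFromCE32(int ce32) {
        return ce32 >>> 13;
    }

    public static void main(String[] args) {
        int ce32 = (42 << 13) | (7 << 8) | 0xca; // hypothetical digit CE32
        System.out.println((int) digitFromCE32(ce32) + " " + indexFromCE32(ce32));
    }
}
```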
      • U0000_TAG

        static final int U0000_TAG
        Tag for U+0000, for moving the NUL-termination handling from the regular fastpath into specials-handling code. Bits 31..8: Unused, 0.
        See Also:
        Constant Field Values
      • HANGUL_TAG

        static final int HANGUL_TAG
        Tag for a Hangul syllable.
          Bits 31..9: Unused, 0.
          Bit 8: HANGUL_NO_SPECIAL_JAMO flag.
        See Also:
        Constant Field Values
      • LEAD_SURROGATE_TAG

        static final int LEAD_SURROGATE_TAG
        Tag for a lead surrogate code unit. Optional optimization for UTF-16 string processing.
          Bits 31..10: Unused, 0.
          Bits 9..8:
            0: All associated supplementary code points are unassigned-implicit.
            1: All associated supplementary code points fall back to the base data.
            Otherwise (normally 2): Look up the data for the supplementary code point.
        See Also:
        Constant Field Values
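        The three-way meaning of bits 9..8 can be sketched as a small dispatch. The enum LeadType and the method leadType are hypothetical names introduced for this illustration; the sample low byte 0xcd assumes a marker of 0xc0 and a tag value of 13.

```java
final class LeadSurrogateSketch {
    // Hypothetical classification of the supplementary code points behind a lead surrogate.
    enum LeadType { ALL_UNASSIGNED, ALL_FALLBACK, MIXED }

    static LeadType leadType(int ce32) {
        switch ((ce32 >> 8) & 3) {
            case 0:
                return LeadType.ALL_UNASSIGNED; // all unassigned-implicit
            case 1:
                return LeadType.ALL_FALLBACK;   // all fall back to the base data
            default:
                return LeadType.MIXED;          // look up each supplementary code point
        }
    }

    public static void main(String[] args) {
        System.out.println(leadType(0x2cd)); // prints MIXED
    }
}
```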
      • OFFSET_TAG

        static final int OFFSET_TAG
        Tag for CEs with primary weights in code point order.
          Bits 31..13: Index into CE table, for one data "CE".
          Bits 12..8: Unused, 0.
        This data "CE" has the following bit fields:
          Bits 63..32: Three-byte primary pppppp00.
          Bits 31..8: Start/base code point of the in-order range.
          Bit 7: Flag isCompressible primary.
          Bits 6..0: Per-code point primary-weight increment.
        See Also:
        Constant Field Values
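        The fields of the OFFSET_TAG data "CE" can be unpacked with shifts and masks, as in the sketch below. All method names are hypothetical. The offset formula (distance from the base code point times the increment) is an assumption about how the fields are meant to combine before the base primary is incremented; it is not stated on this page.

```java
final class OffsetDataSketch {
    // Bits 63..32: three-byte primary pppppp00.
    static long basePrimary(long dataCE) {
        return dataCE >>> 32;
    }

    // Bits 31..8: start/base code point of the in-order range.
    static int baseCodePoint(long dataCE) {
        return (int) dataCE >>> 8;
    }

    // Bit 7: isCompressible flag.
    static boolean isCompressible(long dataCE) {
        return ((int) dataCE & 0x80) != 0;
    }

    // Bits 6..0: per-code point primary-weight increment.
    static int increment(long dataCE) {
        return (int) dataCE & 0x7f;
    }

    // Assumed: primary offset of code point c relative to the base primary.
    static int offsetFor(int c, long dataCE) {
        return (c - baseCodePoint(dataCE)) * increment(dataCE);
    }

    public static void main(String[] args) {
        long dataCE = (0x2a3b4c00L << 32) | (0x4e00L << 8) | 0x80 | 3; // hypothetical data
        System.out.println(offsetFor(0x4e05, dataCE));
    }
}
```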
      • IMPLICIT_TAG

        static final int IMPLICIT_TAG
        Implicit CE tag. Compute an unassigned-implicit CE. All bits are set (UNASSIGNED_CE32=0xffffffff).
        See Also:
        Constant Field Values
      • MAX_EXPANSION_LENGTH

        static final int MAX_EXPANSION_LENGTH
        We limit the number of CEs in an expansion so that we can use a small number of length bits in the data structure, and so that an implementation can copy CEs at runtime without growing a destination buffer.
        See Also:
        Constant Field Values
      • CONTRACT_SINGLE_CP_NO_MATCH

        static final int CONTRACT_SINGLE_CP_NO_MATCH
        Set if there is no match for the single (no-suffix) character itself. This is only possible if there is a prefix. In this case, discontiguous contraction matching cannot add combining marks starting from an empty suffix. The default CE32 is used anyway if there is no suffix match.
        See Also:
        Constant Field Values
      • CONTRACT_NEXT_CCC

        static final int CONTRACT_NEXT_CCC
        Set if the first character of every contraction suffix has a non-zero lead canonical combining class (lccc!=0).
        See Also:
        Constant Field Values
      • CONTRACT_TRAILING_CCC

        static final int CONTRACT_TRAILING_CCC
        Set if any contraction suffix ends with lccc!=0.
        See Also:
        Constant Field Values
      • HANGUL_NO_SPECIAL_JAMO

        static final int HANGUL_NO_SPECIAL_JAMO
        For HANGUL_TAG: None of its Jamo CE32s is special, that is, isSpecialCE32() is false for each of them.
        See Also:
        Constant Field Values
    • Constructor Detail

      • Collation

        public Collation()
    • Method Detail

      • isAssignedCE32

        static boolean isAssignedCE32​(int ce32)
      • makeLongPrimaryCE32

        static int makeLongPrimaryCE32​(long p)
      • primaryFromLongPrimaryCE32

        static long primaryFromLongPrimaryCE32​(int ce32)
        Turns the long-primary CE32 into a primary weight pppppp00.
      • ceFromLongPrimaryCE32

        static long ceFromLongPrimaryCE32​(int ce32)
      • makeLongSecondaryCE32

        static int makeLongSecondaryCE32​(int lower32)
      • ceFromLongSecondaryCE32

        static long ceFromLongSecondaryCE32​(int ce32)
      • makeCE32FromTagIndexAndLength

        static int makeCE32FromTagIndexAndLength​(int tag,
                                                 int index,
                                                 int length)
        Makes a special CE32 with tag, index and length.
      • makeCE32FromTagAndIndex

        static int makeCE32FromTagAndIndex​(int tag,
                                           int index)
        Makes a special CE32 with only tag and index.
      • isSpecialCE32

        static boolean isSpecialCE32​(int ce32)
        Returns true if the low byte of ce32 is SPECIAL_CE32_LOW_BYTE or greater.
      • tagFromCE32

        static int tagFromCE32​(int ce32)
      • hasCE32Tag

        static boolean hasCE32Tag​(int ce32,
                                  int tag)
      • isLongPrimaryCE32

        static boolean isLongPrimaryCE32​(int ce32)
      • isSimpleOrLongCE32

        static boolean isSimpleOrLongCE32​(int ce32)
      • isSelfContainedCE32

        static boolean isSelfContainedCE32​(int ce32)
        Returns:
        true if the ce32 yields one or more CEs without further data lookups
      • isPrefixCE32

        static boolean isPrefixCE32​(int ce32)
      • isContractionCE32

        static boolean isContractionCE32​(int ce32)
      • ce32HasContext

        static boolean ce32HasContext​(int ce32)
      • latinCE0FromCE32

        static long latinCE0FromCE32​(int ce32)
        Get the first of the two Latin-expansion CEs encoded in ce32.
        See Also:
        LATIN_EXPANSION_TAG
      • latinCE1FromCE32

        static long latinCE1FromCE32​(int ce32)
        Get the second of the two Latin-expansion CEs encoded in ce32.
        See Also:
        LATIN_EXPANSION_TAG
      • indexFromCE32

        static int indexFromCE32​(int ce32)
        Returns the data index from a special CE32.
      • lengthFromCE32

        static int lengthFromCE32​(int ce32)
        Returns the data length from a ce32.
      • digitFromCE32

        static char digitFromCE32​(int ce32)
        Returns the digit value from a DIGIT_TAG ce32.
      • ceFromSimpleCE32

        static long ceFromSimpleCE32​(int ce32)
        Returns a 64-bit CE from a simple CE32 (not special).
      • ceFromCE32

        static long ceFromCE32​(int ce32)
        Returns a 64-bit CE from a simple/long-primary/long-secondary CE32.
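        As a rough illustration of how the three self-describing CE32 forms could expand to 64-bit CEs, consider the sketch below. The simple layout ppppsstt (16-bit primary, one-byte secondary, one-byte tertiary) and all constant values are assumptions; the method bodies are a sketch of the idea, not the library's code.

```java
final class CeFromCe32Sketch {
    static final int SPECIAL_CE32_LOW_BYTE = 0xc0;        // assumed
    static final int LONG_PRIMARY_TAG = 1;                // assumed tag value
    static final int COMMON_SEC_AND_TER_CE = 0x05000500;  // assumed

    // Assumed simple form: ppppsstt -> pppp0000 ss00tt00.
    static long ceFromSimpleCE32(int ce32) {
        return ((ce32 & 0xffff0000L) << 32) | ((ce32 & 0xff00L) << 16) | ((ce32 & 0xff) << 8);
    }

    // Handles simple, long-primary and long-secondary CE32s only.
    static long ceFromCE32(int ce32) {
        int lowByte = ce32 & 0xff;
        if (lowByte < SPECIAL_CE32_LOW_BYTE) {
            return ceFromSimpleCE32(ce32);
        }
        ce32 -= lowByte; // clear the marker and tag
        if ((lowByte & 0xf) == LONG_PRIMARY_TAG) {
            // Three-byte primary with common secondary and tertiary weights.
            return ((long) ce32 << 32) | COMMON_SEC_AND_TER_CE;
        }
        // Long-secondary: zero primary, the remaining bits are the lower 32 bits of the CE.
        return ce32 & 0xffffffffL;
    }

    public static void main(String[] args) {
        System.out.printf("%016x%n", ceFromCE32(0x2a3b0506));
    }
}
```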
      • makeCE

        public static long makeCE​(long p)
        Creates a CE from a primary weight.
      • makeCE

        static long makeCE​(long p,
                           int s,
                           int t,
                           int q)
        Creates a CE from a primary weight, 16-bit secondary/tertiary weights, and a 2-bit quaternary.
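        The two makeCE overloads can be read as assembling the fields of a 64-bit CE. In the sketch below, the primary sits in bits 63..32, the secondary in bits 31..16, the tertiary in bits 15..0, and the 2-bit quaternary in bits 7..6; the quaternary position and the common-weight constant are assumptions.

```java
final class MakeCeSketch {
    static final int COMMON_SEC_AND_TER_CE = 0x05000500; // assumed

    // Primary weight with common secondary and tertiary weights.
    static long makeCE(long p) {
        return (p << 32) | COMMON_SEC_AND_TER_CE;
    }

    // Primary, 16-bit secondary, 16-bit tertiary, and a 2-bit quaternary (assumed in bits 7..6).
    static long makeCE(long p, int s, int t, int q) {
        return (p << 32) | ((long) s << 16) | t | (q << 6);
    }

    public static void main(String[] args) {
        System.out.printf("%016x%n", makeCE(0x2a3b4c00L, 0x0500, 0x0500, 1));
    }
}
```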
      • incTwoBytePrimaryByOffset

        public static long incTwoBytePrimaryByOffset​(long basePrimary,
                                                     boolean isCompressible,
                                                     int offset)
        Increments a 2-byte primary by a code point offset.
      • incThreeBytePrimaryByOffset

        public static long incThreeBytePrimaryByOffset​(long basePrimary,
                                                       boolean isCompressible,
                                                       int offset)
        Increments a 3-byte primary by a code point offset.
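        One way to picture the two-byte increment is as mixed-radix arithmetic on the second primary byte, carrying into the lead byte. The sketch below assumes that second bytes 4..0xfe are usable when the lead byte is compressible (the compression terminators being 3 and 0xff) and 2..0xff otherwise, and that the lead byte never overflows. These ranges are assumptions for illustration, not values stated on this page.

```java
final class IncPrimarySketch {
    static long incTwoBytePrimaryByOffset(long basePrimary, boolean isCompressible, int offset) {
        // Assumed minimum usable second byte and number of usable values.
        int min = isCompressible ? 4 : 2;
        int count = isCompressible ? 251 : 254;
        // Second byte relative to the minimum, plus the offset.
        offset += ((int) (basePrimary >> 16) & 0xff) - min;
        // New second byte: remainder within the usable range.
        long second = (long) ((offset % count) + min) << 16;
        // Carry into the lead byte, assuming no further overflow.
        offset /= count;
        return second | ((basePrimary & 0xff000000L) + ((long) offset << 24));
    }

    public static void main(String[] args) {
        System.out.printf("%08x%n", incTwoBytePrimaryByOffset(0x2a040000L, true, 251));
    }
}
```

        With these assumptions, an offset of 251 from the lowest compressible second byte wraps around once and increments the lead byte by one.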
      • decTwoBytePrimaryByOneStep

        static long decTwoBytePrimaryByOneStep​(long basePrimary,
                                               boolean isCompressible,
                                               int step)
        Decrements a 2-byte primary by one range step (1..0x7f).
      • decThreeBytePrimaryByOneStep

        static long decThreeBytePrimaryByOneStep​(long basePrimary,
                                                 boolean isCompressible,
                                                 int step)
        Decrements a 3-byte primary by one range step (1..0x7f).
      • getThreeBytePrimaryForOffsetData

        static long getThreeBytePrimaryForOffsetData​(int c,
                                                     long dataCE)
        Computes a 3-byte primary for c's OFFSET_TAG data "CE".
      • unassignedPrimaryFromCodePoint

        static long unassignedPrimaryFromCodePoint​(int c)
        Returns the unassigned-character implicit primary weight for any valid code point c.
      • unassignedCEFromCodePoint

        static long unassignedCEFromCodePoint​(int c)