Class BOCSU


  • public class BOCSU
    extends java.lang.Object

    Binary Ordered Compression Scheme for Unicode

    Users are strongly encouraged to read the ICU paper on BOCU before attempting to use this class.

    BOCU is used to compress unicode text into a stream of unsigned bytes. For many kinds of text the compression compares favorably to UTF-8, and for some kinds of text (such as CJK) it does better. The resulting bytes will compare in the same order as the original code points. The byte stream does not contain the values 0, 1, or 2.

    One example of a use of BOCU is in com.ibm.icu.text.Collator#getCollationKey(String) for a RuleBasedCollator object with collation strength IDENTICAL. The result CollationKey will consist of the collation order of the source string followed by the BOCU result of the source string.

    Unlike a UTF encoding, BOCU-compressed text is not suitable for random access.

    Method: Slope Detection
    Remember the previous code point (initial 0). For each code point in the string, encode the difference with the previous one. Similar to a UTF, the length of the byte sequence is encoded in the lead bytes. Unlike a UTF, the trail byte values may overlap with lead/single byte values. The signedness of the difference must be encoded as the most significant part.

    We encode differences with few bytes if their absolute values are small. For correct ordering, we must treat the entire value range -10ffff..+10ffff in ascending order, which forbids encoding the sign and the absolute value separately. Instead, we split the lead byte range in the middle and encode non-negative values going up and negative values going down.

    For very small absolute values, the difference is added to a middle byte value for single-byte encoded differences. For somewhat larger absolute values, the difference is divided by the number of byte values available, the modulo is used for one trail byte, and the remainder is added to a lead byte avoiding the single-byte range. For large absolute values, the difference is similarly encoded in three bytes. (Syn Wee, I need examples here.)

    BOCU does not use byte values 0, 1, or 2, but uses all other byte values for lead and single bytes, so that the middle range of single bytes is as large as possible.

    Note that the lead byte ranges overlap some, but that the sequences as a whole are well ordered. I.e., even if the lead byte is the same for sequences of different lengths, the trail bytes establish correct order. It would be possible to encode slightly larger ranges for each length (>1) by subtracting the lower bound of the range. However, that would also slow down the calculation. (Syn Wee, need an example).

    For the actual string encoding, an optimization moves the previous code point value to the middle of its Unicode script block to minimize the differences in same-script text runs. (Syn Wee, need an example.)

    Since:
    release 2.2, May 3rd 2002
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private BOCSU()
      Constructor private to prevent initialization
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      private static void ensureAppendCapacity​(ByteArrayWrapper sink, int minCapacity, int desiredCapacity)  
      private static long getNegDivMod​(int number, int factor)
      Integer division and modulo with negative numerators yields negative modulo results and quotients that are one more than what we need here.
      private static int writeDiff​(int diff, byte[] buffer, int offset)
      Encode one difference value -0x10ffff..+0x10ffff in 1..4 bytes, preserving lexical order
      static int writeIdenticalLevelRun​(int prev, java.lang.CharSequence s, int i, int length, ByteArrayWrapper sink)
      Encode the code points of a string as a sequence of byte-encoded differences (slope detection), preserving lexical order.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • SLOPE_MIN_

        private static final int SLOPE_MIN_
        Do not use byte values 0, 1, 2 because they are separators in sort keys.
        See Also:
        Constant Field Values
      • SLOPE_SINGLE_

        private static final int SLOPE_SINGLE_
        Number of lead bytes: 1 middle byte for 0 2*80=160 single bytes for !=0 2*42=84 for double-byte values 2*3=6 for 3-byte values 2*1=2 for 4-byte values The sum must be <=SLOPE_TAIL_COUNT. Why these numbers? - There should be >=128 single-byte values to cover 128-blocks with small scripts. - There should be >=20902 single/double-byte values to cover Unihan. - It helps CJK Extension B some if there are 3-byte values that cover the distance between them and Unihan. This also helps to jump among distant places in the BMP. - Four-byte values are necessary to cover the rest of Unicode. Symmetrical lead byte counts are for convenience. With an equal distribution of even and odd differences there is also no advantage to asymmetrical lead byte counts.
        See Also:
        Constant Field Values
      • SLOPE_REACH_POS_1_

        private static final int SLOPE_REACH_POS_1_
        The difference value range for single-byters.
        See Also:
        Constant Field Values
      • SLOPE_REACH_POS_2_

        private static final int SLOPE_REACH_POS_2_
        The difference value range for double-byters.
        See Also:
        Constant Field Values
      • SLOPE_REACH_POS_3_

        private static final int SLOPE_REACH_POS_3_
        The difference value range for 3-byters.
        See Also:
        Constant Field Values
      • SLOPE_START_POS_2_

        private static final int SLOPE_START_POS_2_
        The lead byte start values.
        See Also:
        Constant Field Values
    • Constructor Detail

      • BOCSU

        private BOCSU()
        Constructor private to prevent initialization
    • Method Detail

      • writeIdenticalLevelRun

        public static int writeIdenticalLevelRun​(int prev,
                                                 java.lang.CharSequence s,
                                                 int i,
                                                 int length,
                                                 ByteArrayWrapper sink)
        Encode the code points of a string as a sequence of byte-encoded differences (slope detection), preserving lexical order.

        Optimize the difference-taking for runs of Unicode text within small scripts:

        Most small scripts are allocated within aligned 128-blocks of Unicode code points. Lexical order is preserved if "prev" is always moved into the middle of such a block.

        Additionally, "prev" is moved from anywhere in the Unihan area into the middle of that area. Note that the identical-level run in a sort key is generated from NFD text - there are never Hangul characters included.

      • ensureAppendCapacity

        private static void ensureAppendCapacity​(ByteArrayWrapper sink,
                                                 int minCapacity,
                                                 int desiredCapacity)
      • getNegDivMod

        private static final long getNegDivMod​(int number,
                                               int factor)
        Integer division and modulo with negative numerators yields negative modulo results and quotients that are one more than what we need here.
        Parameters:
        number - which operations are to be performed on
        factor - the factor to use for division
        Returns:
        (result of division) << 32 | modulo
      • writeDiff

        private static final int writeDiff​(int diff,
                                           byte[] buffer,
                                           int offset)
        Encode one difference value -0x10ffff..+0x10ffff in 1..4 bytes, preserving lexical order
        Parameters:
        diff -
        buffer - byte buffer to append to
        offset - to the byte buffer to start appending
        Returns:
        end offset where the appending stops