Class UnicodeCompressor

  • All Implemented Interfaces:
    SCSU

    public final class UnicodeCompressor
    extends java.lang.Object
    implements SCSU
    A compression engine implementing the Standard Compression Scheme for Unicode (SCSU) as outlined in Unicode Technical Report #6.

    SCSU works by using dynamically positioned windows, each covering 128 consecutive Unicode characters. During compression, characters within a window are encoded in the compressed stream as the bytes 0x80 - 0xFF. SCSU provides transparency for the characters (bytes) between U+0000 and U+00FF. SCSU approximates the storage size of traditional character sets, for example 1 byte per character for ASCII or Latin-1 text, and 2 bytes per character for CJK ideographs.
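
    The window arithmetic described above can be illustrated with a minimal sketch. The class ScsuWindowSketch and its methods are hypothetical names introduced here for illustration and are not part of UnicodeCompressor; the window offset 0x0400 is an assumed position chosen for the example.

```java
public final class ScsuWindowSketch {
    /** Number of consecutive code points covered by one window. */
    public static final int WINDOW_SIZE = 0x80;

    /** Returns true if c lies in the 128-character window starting at windowOffset. */
    public static boolean inWindow(int c, int windowOffset) {
        return c >= windowOffset && c < windowOffset + WINDOW_SIZE;
    }

    /** Maps a character inside the window to a byte value in 0x80..0xFF. */
    public static int toWindowByte(int c, int windowOffset) {
        if (!inWindow(c, windowOffset)) {
            throw new IllegalArgumentException("character is outside the window");
        }
        return (c - windowOffset) + 0x80;
    }

    public static void main(String[] args) {
        int windowOffset = 0x0400; // assumed window position (start of the Cyrillic block)
        int c = 0x0416;            // CYRILLIC CAPITAL LETTER ZHE
        // prints "U+0416 -> 0x96"
        System.out.printf("U+%04X -> 0x%02X%n", c, toWindowByte(c, windowOffset));
    }
}
```

    Because each character inside the active window costs a single byte, text drawn from one small alphabet compresses to roughly one byte per character.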

    USAGE

    The static methods on UnicodeCompressor may be used in a straightforward manner to compress simple strings:

      String s = ... ; // get string from somewhere
      byte [] compressed = UnicodeCompressor.compress(s);
     

    The static methods have a fairly large memory footprint. For finer-grained control over memory usage, UnicodeCompressor offers more powerful APIs allowing iterative compression:

      // Compress an array "chars" of length "len" using a buffer of 512 bytes
      // to the OutputStream "out"
    
      UnicodeCompressor myCompressor         = new UnicodeCompressor();
      final int         BUFSIZE              = 512;
      byte []           byteBuffer           = new byte [ BUFSIZE ];
      int               bytesWritten         = 0;
      int []            unicharsRead         = new int [1];
      int               totalCharsCompressed = 0;
      int               totalBytesWritten    = 0;
    
      do {
        // do the compression
        bytesWritten = myCompressor.compress(chars, totalCharsCompressed, 
                                             len, unicharsRead,
                                             byteBuffer, 0, BUFSIZE);
    
        // do something with the current set of bytes
        out.write(byteBuffer, 0, bytesWritten);
    
        // update the no. of characters compressed
        totalCharsCompressed += unicharsRead[0];
    
        // update the no. of bytes written
        totalBytesWritten += bytesWritten;
    
      } while(totalCharsCompressed < len);
    
      myCompressor.reset(); // reuse compressor
     
    See Also:
    UnicodeDecompressor
    • Field Detail

      • sSingleTagTable

        private static boolean[] sSingleTagTable
        For quick identification of a byte as a single-byte mode tag
      • sUnicodeTagTable

        private static boolean[] sUnicodeTagTable
        For quick identification of a byte as a Unicode-mode tag
      • fCurrentWindow

        private int fCurrentWindow
        Alias to current dynamic window
      • fOffsets

        private int[] fOffsets
        Dynamic compression window offsets
      • fMode

        private int fMode
        Current compression mode
      • fIndexCount

        private int[] fIndexCount
        Keeps count of times character indices are encountered
      • fTimeStamps

        private int[] fTimeStamps
        The time stamps indicate when a window was last defined
      • fTimeStamp

        private int fTimeStamp
        The current time stamp
    • Constructor Detail

      • UnicodeCompressor

        public UnicodeCompressor()
        Create a UnicodeCompressor. Sets all windows to their default values.
        See Also:
        reset()
    • Method Detail

      • compress

        public static byte[] compress​(java.lang.String buffer)
        Compress a string into a byte array.
        Parameters:
        buffer - The string to compress.
        Returns:
        A byte array containing the compressed characters.
        See Also:
        compress(char [], int, int)
      • compress

        public static byte[] compress​(char[] buffer,
                                      int start,
                                      int limit)
        Compress a Unicode character array into a byte array.
        Parameters:
        buffer - The character buffer to compress.
        start - The start of the character run to compress.
        limit - The limit of the character run to compress.
        Returns:
        A byte array containing the compressed characters.
        See Also:
        compress(String)
      • compress

        public int compress​(char[] charBuffer,
                            int charBufferStart,
                            int charBufferLimit,
                            int[] charsRead,
                            byte[] byteBuffer,
                            int byteBufferStart,
                            int byteBufferLimit)
        Compress a Unicode character array into a byte array. This function will only consume input that can be completely output.
        Parameters:
        charBuffer - The character buffer to compress.
        charBufferStart - The start of the character run to compress.
        charBufferLimit - The limit of the character run to compress.
        charsRead - A one-element array. If not null, on return its first element holds the number of characters read from charBuffer.
        byteBuffer - A buffer to receive the compressed data. This buffer must be at minimum four bytes in size.
        byteBufferStart - The starting offset to which to write compressed data.
        byteBufferLimit - The limiting offset for writing compressed data.
        Returns:
        The number of bytes written to byteBuffer.
      • reset

        public void reset()
        Reset the compressor to its initial state.
      • makeIndex

        private static int makeIndex​(int c)
        Create the index value for a character. For more information on this function, refer to table X-3 of UTR #6.
        Parameters:
        c - The character in question.
        Returns:
        An index for c
      • inDynamicWindow

        private boolean inDynamicWindow​(int c,
                                        int whichWindow)
        Determine if a character is in a dynamic window.
        Parameters:
        c - The character to test
        whichWindow - The dynamic window to test
        Returns:
        true if c will fit in whichWindow, false otherwise.
      • inStaticWindow

        private static boolean inStaticWindow​(int c,
                                              int whichWindow)
        Determine if a character is in a static window.
        Parameters:
        c - The character to test
        whichWindow - The static window to test
        Returns:
        true if c will fit in whichWindow, false otherwise.
      • isCompressible

        private static boolean isCompressible​(int c)
        Determine if a character is compressible.
        Parameters:
        c - The character to test.
        Returns:
        true if c is compressible, false otherwise.
      • findDynamicWindow

        private int findDynamicWindow​(int c)
        Determine whether a dynamic window containing a given character is defined.
        Parameters:
        c - The character in question
        Returns:
        The dynamic window containing c, or INVALIDWINDOW if not defined.
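
        A lookup of this kind could be sketched as follows. This is an illustration under stated assumptions, not the class's actual implementation: the class name DynamicWindowLookupSketch, the ascending search order, and the value -1 for INVALIDWINDOW are assumptions, and the offsets are the initial dynamic window positions given in the SCSU specification.

```java
public final class DynamicWindowLookupSketch {
    /** Assumed sentinel for "no window found"; the real constant's value is not shown in this page. */
    public static final int INVALIDWINDOW = -1;

    /** Initial dynamic window offsets from the SCSU specification. */
    private static final int[] OFFSETS = {
        0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
    };

    /** Returns true if c falls in the 128-character window starting at offset. */
    private static boolean inWindow(int c, int offset) {
        return c >= offset && c < offset + 0x80;
    }

    /** Returns the index of the first window containing c, or INVALIDWINDOW. */
    public static int findDynamicWindow(int c) {
        for (int i = 0; i < OFFSETS.length; i++) {
            if (inWindow(c, OFFSETS[i])) {
                return i;
            }
        }
        return INVALIDWINDOW;
    }

    public static void main(String[] args) {
        System.out.println(findDynamicWindow(0x0416)); // Cyrillic letter, prints 2
        System.out.println(findDynamicWindow(0x4E00)); // CJK ideograph, prints -1
    }
}
```

        Windows may overlap (the first two do), so the search order decides which window is reported for a character covered by more than one.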
      • findStaticWindow

        private static int findStaticWindow​(int c)
        Determine whether a static window containing a given character is defined.
        Parameters:
        c - The character in question
        Returns:
        The static window containing c, or INVALIDWINDOW if not defined.
      • getLRDefinedWindow

        private int getLRDefinedWindow()
        Find the least-recently defined window
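
        One plausible way to pick the least-recently defined window from an array of time stamps such as fTimeStamps is a linear scan for the smallest value. The sketch below is an assumption about the approach, not the documented implementation; the class name LeastRecentWindowSketch and the tie-breaking rule (lowest index wins) are hypothetical.

```java
public final class LeastRecentWindowSketch {
    /**
     * Returns the index of the window with the smallest time stamp,
     * i.e. the one defined longest ago. Ties go to the lowest index.
     * Returns -1 if the array is empty.
     */
    public static int leastRecentlyDefined(int[] timeStamps) {
        int oldest = Integer.MAX_VALUE;
        int whichWindow = -1;
        for (int i = 0; i < timeStamps.length; i++) {
            if (timeStamps[i] < oldest) {
                oldest = timeStamps[i];
                whichWindow = i;
            }
        }
        return whichWindow;
    }

    public static void main(String[] args) {
        // Hypothetical time stamps for eight dynamic windows.
        int[] timeStamps = {5, 3, 9, 3, 7, 8, 6, 4};
        System.out.println(leastRecentlyDefined(timeStamps)); // prints 1
    }
}
```

        Reusing the oldest window when a new one must be defined keeps recently positioned windows available for the text currently being compressed.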