Class Normalizer2Impl


  • public final class Normalizer2Impl
    extends java.lang.Object
    Low-level implementation of the Unicode Normalization Algorithm. For the data structure and details, see the documentation at the end of the C++ header normalizer2impl.h and the design doc at https://unicode-org.github.io/icu/design/normalization/custom.html
    • Field Detail

      • HAS_COMP_BOUNDARY_AFTER

        public static final int HAS_COMP_BOUNDARY_AFTER
        See Also:
        Constant Field Values
      • IX_MIN_COMP_NO_MAYBE_CP

        public static final int IX_MIN_COMP_NO_MAYBE_CP
        See Also:
        Constant Field Values
      • IX_MIN_YES_NO

        public static final int IX_MIN_YES_NO
        Mappings & compositions in [minYesNo..minYesNoMappingsOnly[.
        See Also:
        Constant Field Values
      • IX_MIN_NO_NO

        public static final int IX_MIN_NO_NO
        Mappings are comp-normalized.
        See Also:
        Constant Field Values
      • IX_MIN_YES_NO_MAPPINGS_ONLY

        public static final int IX_MIN_YES_NO_MAPPINGS_ONLY
        Mappings only in [minYesNoMappingsOnly..minNoNo[.
        See Also:
        Constant Field Values
      • IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE

        public static final int IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE
        Mappings are not comp-normalized but have a comp boundary before.
        See Also:
        Constant Field Values
      • IX_MIN_NO_NO_COMP_NO_MAYBE_CC

        public static final int IX_MIN_NO_NO_COMP_NO_MAYBE_CC
        Mappings do not have a comp boundary before.
        See Also:
        Constant Field Values
      • IX_MIN_NO_NO_EMPTY

        public static final int IX_MIN_NO_NO_EMPTY
        Mappings to the empty string.
        See Also:
        Constant Field Values
      • IX_MIN_MAYBE_NO

        public static final int IX_MIN_MAYBE_NO
        Two-way mappings; each starts with a character that combines backward.
        See Also:
        Constant Field Values
      • IX_MIN_MAYBE_NO_COMBINES_FWD

        public static final int IX_MIN_MAYBE_NO_COMBINES_FWD
        Two-way mappings & compositions.
        See Also:
        Constant Field Values
      • MAPPING_HAS_CCC_LCCC_WORD

        public static final int MAPPING_HAS_CCC_LCCC_WORD
        See Also:
        Constant Field Values
      • MAPPING_HAS_RAW_MAPPING

        public static final int MAPPING_HAS_RAW_MAPPING
        See Also:
        Constant Field Values
      • minDecompNoCP

        private int minDecompNoCP
      • minCompNoMaybeCP

        private int minCompNoMaybeCP
      • minLcccCP

        private int minLcccCP
      • minYesNo

        private int minYesNo
      • minYesNoMappingsOnly

        private int minYesNoMappingsOnly
      • minNoNo

        private int minNoNo
      • minNoNoCompBoundaryBefore

        private int minNoNoCompBoundaryBefore
      • minNoNoCompNoMaybeCC

        private int minNoNoCompNoMaybeCC
      • minNoNoEmpty

        private int minNoNoEmpty
      • limitNoNo

        private int limitNoNo
      • centerNoNoDelta

        private int centerNoNoDelta
      • minMaybeNo

        private int minMaybeNo
      • minMaybeNoCombinesFwd

        private int minMaybeNoCombinesFwd
      • minMaybeYes

        private int minMaybeYes
      • extraData

        private java.lang.String extraData
      • smallFCD

        private byte[] smallFCD
      • canonStartSets

        private java.util.ArrayList<UnicodeSet> canonStartSets
      • CANON_NOT_SEGMENT_STARTER

        private static final int CANON_NOT_SEGMENT_STARTER
        See Also:
        Constant Field Values
      • CANON_HAS_COMPOSITIONS

        private static final int CANON_HAS_COMPOSITIONS
        See Also:
        Constant Field Values
    • Constructor Detail

      • Normalizer2Impl

        public Normalizer2Impl()
    • Method Detail

      • addLcccChars

        public void addLcccChars​(UnicodeSet set)
      • addPropertyStarts

        public void addPropertyStarts​(UnicodeSet set)
      • addCanonIterPropertyStarts

        public void addCanonIterPropertyStarts​(UnicodeSet set)
      • getNorm16

        public int getNorm16​(int c)
      • getRawNorm16

        public int getRawNorm16​(int c)
      • getCompQuickCheck

        public int getCompQuickCheck​(int norm16)
      • isAlgorithmicNoNo

        public boolean isAlgorithmicNoNo​(int norm16)
      • isCompNo

        public boolean isCompNo​(int norm16)
      • isDecompYes

        public boolean isDecompYes​(int norm16)
      • getCC

        public int getCC​(int norm16)
      • getCCFromNormalYesOrMaybe

        public static int getCCFromNormalYesOrMaybe​(int norm16)
      • getCCFromYesOrMaybeYes

        public static int getCCFromYesOrMaybeYes​(int norm16)
      • getCCFromYesOrMaybeYesCP

        public int getCCFromYesOrMaybeYesCP​(int c)
      • getFCD16

        public int getFCD16​(int c)
        Returns the FCD data for code point c.
        Parameters:
        c - A Unicode code point.
        Returns:
        The lccc(c) in bits 15..8 and tccc(c) in bits 7..0.
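
        The packed return value described above can be split into its two fields with plain bit operations. The following is a minimal sketch; the class and method names (Fcd16Fields, leadCC, trailCC) are hypothetical helpers, not part of Normalizer2Impl, and the sample value is made up for illustration.

```java
final class Fcd16Fields {
    /** Lead combining class lccc: bits 15..8 of the FCD value. */
    static int leadCC(int fcd16) {
        return (fcd16 >> 8) & 0xff;
    }

    /** Trail combining class tccc: bits 7..0 of the FCD value. */
    static int trailCC(int fcd16) {
        return fcd16 & 0xff;
    }

    public static void main(String[] args) {
        // Hypothetical FCD value with lccc=230 and tccc=220 (not read from ICU data).
        int fcd16 = (230 << 8) | 220;
        System.out.println("lccc=" + leadCC(fcd16) + " tccc=" + trailCC(fcd16));
    }
}
```

        With a real instance, the argument would be the int returned by getFCD16(c).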
      • singleLeadMightHaveNonZeroFCD16

        public boolean singleLeadMightHaveNonZeroFCD16​(int lead)
        Returns true if the single-or-lead code unit lead might have non-zero FCD data.
      • getFCD16FromNormData

        public int getFCD16FromNormData​(int c)
        Gets the FCD value from the regular normalization data.
      • getFCD16FromMaybeOrNonZeroCC

        private int getFCD16FromMaybeOrNonZeroCC​(int norm16)
      • getDecomposition

        public java.lang.String getDecomposition​(int c)
        Gets the decomposition for one code point.
        Parameters:
        c - code point
        Returns:
        c's decomposition, if it has one; returns null if it does not have a decomposition
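
        The "decomposition or null" contract can be illustrated without ICU internals. The sketch below uses the JDK's java.text.Normalizer as a stand-in, which is not the implementation documented here; it assumes canonical (NFD) data, whereas a Normalizer2Impl instance may be loaded with other data. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class DecompositionSketch {
    /**
     * Returns the canonical (NFD) decomposition of code point c,
     * or null if NFD leaves c unchanged.
     */
    static String decompositionOrNull(int c) {
        String original = new StringBuilder().appendCodePoint(c).toString();
        String nfd = Normalizer.normalize(original, Normalizer.Form.NFD);
        return nfd.equals(original) ? null : nfd;
    }

    public static void main(String[] args) {
        // U+00E9 decomposes to U+0065 U+0301; U+0061 has no decomposition.
        System.out.println(decompositionOrNull(0x00E9) != null);
        System.out.println(decompositionOrNull(0x0061) == null);
    }
}
```

        Callers need a null check before using the result, because null rather than a one-character string signals "no decomposition".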
      • getRawDecomposition

        public java.lang.String getRawDecomposition​(int c)
        Gets the raw decomposition for one code point.
        Parameters:
        c - code point
        Returns:
        c's raw decomposition, if it has one; returns null if it does not have a decomposition
      • isCanonSegmentStarter

        public boolean isCanonSegmentStarter​(int c)
        Returns true if code point c starts a canonical-iterator string segment. ensureCanonIterData() must have been called before this method, or else this method will crash.
        Parameters:
        c - A Unicode code point.
        Returns:
        true if c starts a canonical-iterator string segment.
      • getCanonStartSet

        public boolean getCanonStartSet​(int c,
                                        UnicodeSet set)
        Returns true if there are characters whose decomposition starts with c. If so, then the set is cleared and then filled with those characters. ensureCanonIterData() must have been called before this method, or else this method will crash.
        Parameters:
        c - A Unicode code point.
        set - A UnicodeSet to receive the characters whose decompositions start with c, if there are any.
        Returns:
        true if there are characters whose decomposition starts with c.
      • decompose

        public java.lang.Appendable decompose​(java.lang.CharSequence s,
                                              java.lang.StringBuilder dest)
      • decompose

        public void decompose​(java.lang.CharSequence s,
                              int src,
                              int limit,
                              java.lang.StringBuilder dest,
                              int destLengthEstimate)
        Decomposes s[src, limit[ and writes the result to dest. (The C++ original allows a NULL limit when src is NUL-terminated; in Java, limit is an int index and must always be given.) destLengthEstimate is the initial dest buffer capacity and can be -1.
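
        The half-open range s[src, limit[ and the capacity hint can be sketched as follows. This is a stand-in built on java.text.Normalizer with NFD, not the ICU implementation. Two behaviors are assumptions made for the sketch: a negative destLengthEstimate falls back to limit - src, and dest is cleared before writing. The class name is hypothetical.

```java
import java.text.Normalizer;

final class DecomposeRangeSketch {
    /** Decomposes the half-open range s[src, limit[ into dest using NFD. */
    static void decompose(CharSequence s, int src, int limit,
                          StringBuilder dest, int destLengthEstimate) {
        if (destLengthEstimate < 0) {
            // Assumption: a negative estimate means "use the input length".
            destLengthEstimate = limit - src;
        }
        dest.setLength(0); // Assumption: previous contents of dest are discarded.
        dest.ensureCapacity(destLengthEstimate);
        dest.append(Normalizer.normalize(s.subSequence(src, limit), Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        StringBuilder dest = new StringBuilder();
        // Only the middle character (index 1, exclusive limit 2) is decomposed.
        decompose("x\u00E9y", 1, 2, dest, -1);
        System.out.println(dest.length());
    }
}
```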
      • compose

        public boolean compose​(java.lang.CharSequence s,
                               int src,
                               int limit,
                               boolean onlyContiguous,
                               boolean doCompose,
                               Normalizer2Impl.ReorderingBuffer buffer)
      • composeQuickCheck

        public int composeQuickCheck​(java.lang.CharSequence s,
                                     int src,
                                     int limit,
                                     boolean onlyContiguous,
                                     boolean doSpan)
        Very similar to compose(): make the same changes in both places if relevant.
        doSpan: spanQuickCheckYes (ignore bit 0 of the return value)
        !doSpan: quickCheck
        Returns:
        Bits 31..1: the spanQuickCheckYes length (==s.length() if the result is "yes"); bit 0: set if the result is "maybe". Otherwise, if the span length<s.length(), the quick check result is "no".
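
        The packed return value can be decoded as sketched below. The class and method names are hypothetical helpers, and the sample values are invented; only the bit layout comes from the description above.

```java
final class QuickCheckResult {
    /** Bits 31..1: length of the leading span that passed the quick check. */
    static int spanLength(int result) {
        return result >>> 1;
    }

    /** Bit 0: set if the quick check result is "maybe". */
    static boolean isMaybe(int result) {
        return (result & 1) != 0;
    }

    /** Maps a packed result for an input of the given length to yes/maybe/no. */
    static String classify(int result, int inputLength) {
        if (isMaybe(result)) {
            return "maybe";
        }
        return spanLength(result) == inputLength ? "yes" : "no";
    }

    public static void main(String[] args) {
        // Hypothetical packed value: span of 3 code units, "maybe" bit clear.
        System.out.println(classify(3 << 1, 5));
    }
}
```

        When doSpan is true, only spanLength() is meaningful, since bit 0 is to be ignored in that mode.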
      • composeAndAppend

        public void composeAndAppend​(java.lang.CharSequence s,
                                     boolean doCompose,
                                     boolean onlyContiguous,
                                     Normalizer2Impl.ReorderingBuffer buffer)
      • hasDecompBoundaryBefore

        public boolean hasDecompBoundaryBefore​(int c)
      • norm16HasDecompBoundaryBefore

        public boolean norm16HasDecompBoundaryBefore​(int norm16)
      • hasDecompBoundaryAfter

        public boolean hasDecompBoundaryAfter​(int c)
      • norm16HasDecompBoundaryAfter

        public boolean norm16HasDecompBoundaryAfter​(int norm16)
      • isDecompInert

        public boolean isDecompInert​(int c)
      • hasCompBoundaryBefore

        public boolean hasCompBoundaryBefore​(int c)
      • hasCompBoundaryAfter

        public boolean hasCompBoundaryAfter​(int c,
                                            boolean onlyContiguous)
      • isCompInert

        public boolean isCompInert​(int c,
                                   boolean onlyContiguous)
      • hasFCDBoundaryBefore

        public boolean hasFCDBoundaryBefore​(int c)
      • hasFCDBoundaryAfter

        public boolean hasFCDBoundaryAfter​(int c)
      • isFCDInert

        public boolean isFCDInert​(int c)
      • isMaybe

        private boolean isMaybe​(int norm16)
      • isMaybeYesOrNonZeroCC

        private boolean isMaybeYesOrNonZeroCC​(int norm16)
      • isInert

        private static boolean isInert​(int norm16)
      • isJamoL

        private static boolean isJamoL​(int norm16)
      • isJamoVT

        private static boolean isJamoVT​(int norm16)
      • hangulLVT

        private int hangulLVT()
      • isHangulLV

        private boolean isHangulLV​(int norm16)
      • isHangulLVT

        private boolean isHangulLVT​(int norm16)
      • isCompYesAndZeroCC

        private boolean isCompYesAndZeroCC​(int norm16)
      • isDecompYesAndZeroCC

        private boolean isDecompYesAndZeroCC​(int norm16)
      • isMostDecompYesAndZeroCC

        private boolean isMostDecompYesAndZeroCC​(int norm16)
        A little faster and simpler than isDecompYesAndZeroCC(), but does not include the MaybeYes characters that combine forward and have ccc=0.
      • isDecompNoAlgorithmic

        private boolean isDecompNoAlgorithmic​(int norm16)
        Since formatVersion 5: same as isAlgorithmicNoNo()
      • getCCFromNoNo

        private int getCCFromNoNo​(int norm16)
      • getTrailCCFromCompYesAndZeroCC

        int getTrailCCFromCompYesAndZeroCC​(int norm16)
      • mapAlgorithmic

        private int mapAlgorithmic​(int c,
                                   int norm16)
      • getDataForYesOrNo

        private int getDataForYesOrNo​(int norm16)
      • getDataForMaybe

        private int getDataForMaybe​(int norm16)
      • getData

        private int getData​(int norm16)
      • getCompositionsListForDecompYes

        private int getCompositionsListForDecompYes​(int norm16)
        Returns:
        index into extraData, or -1
      • getCompositionsListForComposite

        private int getCompositionsListForComposite​(int norm16)
        Returns:
        index into maybeYesCompositions
      • getCompositionsList

        private int getCompositionsList​(int norm16)
        Parameters:
        norm16 - the norm16 value of a code point, which must have compositions
        Returns:
        index into maybeYesCompositions
      • decomposeShort

        private int decomposeShort​(java.lang.CharSequence s,
                                   int src,
                                   int limit,
                                   boolean stopAtCompBoundary,
                                   boolean onlyContiguous,
                                   Normalizer2Impl.ReorderingBuffer buffer)
      • combine

        private int combine​(int list,
                            int trail)
        Finds the recomposition result for a forward-combining "lead" character, specified with a pointer to its compositions list, and a backward-combining "trail" character.

        If the lead and trail characters combine, then this function returns the following "compositeAndFwd" value:

         Bits 21..1  composite character
         Bit      0  set if the composite is a forward-combining starter
         
        otherwise it returns -1.

        The compositions list has (trail, compositeAndFwd) pair entries, encoded as either pairs or triples of 16-bit units. The last entry has the high bit of its first unit set.

        The list is sorted by ascending trail characters (there are no duplicates). A linear search is used.

        See normalizer2impl.h for a more detailed description of the compositions list format.
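
        The lookup and the "compositeAndFwd" layout can be sketched in simplified form. The example stores the list as plain int pairs rather than the real 16-bit pair/triple encoding, so it illustrates the search and the bit layout only. The class name is hypothetical, the list is a truncated sample, and the forward-combining flags in it are illustrative values, not taken from ICU data.

```java
final class CombineSketch {
    /**
     * Linear search in a simplified compositions list laid out as
     * {trail0, compositeAndFwd0, trail1, compositeAndFwd1, ...},
     * sorted by ascending trail. Returns compositeAndFwd, or -1.
     */
    static int combine(int[] pairs, int trail) {
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (pairs[i] == trail) {
                return pairs[i + 1];
            }
            if (pairs[i] > trail) {
                break; // Sorted list: no later entry can match.
            }
        }
        return -1;
    }

    /** Bits 21..1: the composite character. */
    static int composite(int compositeAndFwd) {
        return compositeAndFwd >> 1;
    }

    /** Bit 0: set if the composite is a forward-combining starter. */
    static boolean combinesForward(int compositeAndFwd) {
        return (compositeAndFwd & 1) != 0;
    }

    /** Sample list for lead U+0061 with three trail characters. */
    static final int[] SAMPLE = {
        0x0300, 0x00E0 << 1,       // a + grave      -> U+00E0
        0x0301, 0x00E1 << 1,       // a + acute      -> U+00E1
        0x0302, (0x00E2 << 1) | 1, // a + circumflex -> U+00E2, flag set for illustration
    };

    public static void main(String[] args) {
        int result = combine(SAMPLE, 0x0301);
        System.out.println(Integer.toHexString(composite(result)));
    }
}
```

        A caller checks for a negative return value before decoding, because -1 signals that lead and trail do not combine.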

      • addComposites

        private void addComposites​(int list,
                                   UnicodeSet set)
        Parameters:
        list - some character's compositions list
        set - recursively receives the composites from these compositions
      • composePair

        public int composePair​(int a,
                               int b)
      • hasCompBoundaryBefore

        private boolean hasCompBoundaryBefore​(int c,
                                              int norm16)
        Does c have a composition boundary before it? True if its decomposition begins with a character that has ccc=0 && NFC_QC=Yes (isCompYesAndZeroCC()). As a shortcut, this is true if c itself has ccc=0 && NFC_QC=Yes (isCompYesAndZeroCC()) so we need not decompose.
      • norm16HasCompBoundaryBefore

        private boolean norm16HasCompBoundaryBefore​(int norm16)
      • hasCompBoundaryBefore

        private boolean hasCompBoundaryBefore​(java.lang.CharSequence s,
                                              int src,
                                              int limit)
      • norm16HasCompBoundaryAfter

        private boolean norm16HasCompBoundaryAfter​(int norm16,
                                                   boolean onlyContiguous)
      • hasCompBoundaryAfter

        private boolean hasCompBoundaryAfter​(java.lang.CharSequence s,
                                             int start,
                                             int p,
                                             boolean onlyContiguous)
      • isTrailCC01ForCompBoundaryAfter

        private boolean isTrailCC01ForCompBoundaryAfter​(int norm16)
        For FCC: Given norm16 HAS_COMP_BOUNDARY_AFTER, does it have tccc<=1?
      • findPreviousCompBoundary

        private int findPreviousCompBoundary​(java.lang.CharSequence s,
                                             int p,
                                             boolean onlyContiguous)
      • findNextCompBoundary

        private int findNextCompBoundary​(java.lang.CharSequence s,
                                         int p,
                                         int limit,
                                         boolean onlyContiguous)
      • findPreviousFCDBoundary

        private int findPreviousFCDBoundary​(java.lang.CharSequence s,
                                            int p)
      • findNextFCDBoundary

        private int findNextFCDBoundary​(java.lang.CharSequence s,
                                        int p,
                                        int limit)
      • getPreviousTrailCC

        private int getPreviousTrailCC​(java.lang.CharSequence s,
                                       int start,
                                       int p)
      • addToStartSet

        private void addToStartSet​(MutableCodePointTrie mutableTrie,
                                   int origin,
                                   int decompLead)