Class UnicodeSetCloseOver


  • class UnicodeSetCloseOver
    extends java.lang.Object
    This class produces the data tables used by the closeOver() method of UnicodeSet. Whenever the Unicode database changes, this tool must be re-run (AFTER the data file(s) underlying ICU4J are updated). The output of this tool should then be pasted into the appropriate files:
      ICU4J: com.ibm.icu.text.UnicodeSet.java
      ICU4C: /icu/source/common/uniset.cpp
    • Field Summary

      Fields 
      Modifier and Type Field Description
      (package private) static java.lang.String C_SET_OUT  
      (package private) static java.lang.String C_UCHAR_OUT  
      (package private) static boolean DEFAULT_CASE_MAP  
      (package private) static java.lang.String JAVA_CHARPROP_OUT  
      (package private) static java.lang.String JAVA_OUT  
      (package private) static java.lang.String WARNING  
    • Method Summary

      Modifier and Type Method Description
      (package private) static void analyzeCaseData​(java.util.Map equivClasses, java.lang.StringBuffer pairs, java.util.Vector nonpairs, java.util.Vector lengths)
      Analyze the case fold equivalency classes.
      (package private) static java.util.Map createCaseFoldEquivalencyClasses()
      Create a map of String => Set.
      (package private) static void emitRangesString​(java.io.PrintStream out, UnicodeSet set, java.lang.String id)
      Given a UnicodeSet, emit it as a Java string.
      (package private) static void emitUCharRangesArray​(java.io.PrintStream out, UnicodeSet set, java.lang.String id)
      Given a UnicodeSet, emit it as an array of UChar pairs.
      (package private) static void generateCaseData()  
      (package private) static UnicodeSet getCaseSensitive()
      Create the set of case-sensitive characters.
      static void main​(java.lang.String[] args)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • UnicodeSetCloseOver

        UnicodeSetCloseOver()
    • Method Detail

      • main

        public static void main​(java.lang.String[] args)
                         throws java.io.IOException
        Throws:
        java.io.IOException
      • createCaseFoldEquivalencyClasses

        static java.util.Map createCaseFoldEquivalencyClasses()
        Create a map of String => Set. The String in this case is a folded string for which UCharacter.foldCase(folded, DEFAULT_CASE_MAP).equals(folded). The Set contains all single-character strings x for which UCharacter.foldCase(x, DEFAULT_CASE_MAP).equals(folded), as well as folded itself.
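        The grouping described above can be sketched without ICU4J. This is only an illustration: the class FoldClassesSketch, the method buildClasses, and the fold function approxFold are hypothetical names, and approxFold (upper-case then lower-case with the JDK) is an assumed stand-in for UCharacter.foldCase that does not match real case folding for every character. The sketch also assumes that characters which fold to themselves do not start a class on their own.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

public class FoldClassesSketch {
    // ASSUMPTION: JDK-only stand-in for UCharacter.foldCase(s, DEFAULT_CASE_MAP).
    // It is not real Unicode case folding and disagrees with it for some characters.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Maps each folded string to the single-character strings that fold to it,
    // plus the folded string itself. Characters that fold to themselves are skipped.
    static Map<String, Set<String>> buildClasses(int start, int end, UnaryOperator<String> fold) {
        Map<String, Set<String>> classes = new TreeMap<>();
        for (int cp = start; cp <= end; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue; // lone surrogates are not characters
            }
            String x = new String(Character.toChars(cp));
            String folded = fold.apply(x);
            if (folded.equals(x)) {
                continue;
            }
            Set<String> members = classes.computeIfAbsent(folded, k -> new TreeSet<>());
            members.add(folded);
            members.add(x);
        }
        return classes;
    }

    public static void main(String[] args) {
        // Restricted to ASCII 'A'..'z', where the stand-in fold is reliable.
        Map<String, Set<String>> classes = buildClasses(0x41, 0x7A, FoldClassesSketch::approxFold);
        System.out.println(classes.get("a")); // [A, a]
        System.out.println(classes.size());   // 26
    }
}
```

        Passing the fold function as a parameter keeps the grouping logic separate from the choice of folding; with ICU4J on the classpath, the real fold could be supplied instead.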
      • analyzeCaseData

        static void analyzeCaseData​(java.util.Map equivClasses,
                                    java.lang.StringBuffer pairs,
                                    java.util.Vector nonpairs,
                                    java.util.Vector lengths)
        Analyze the case fold equivalency classes. Break them into two groups: 'pairs' and 'nonpairs'. Record which length configurations occur among the nonpairs.

        Length configurations of equivalency classes, as of Unicode 3.2, written as configuration:count:

          11:656  111:16  1111:3  112:28  113:2  12:31  13:12  22:38

        Most of the classes (83%) have two single codepoints. Here "112:28" means there are 28 equivalency classes with 2 single codepoints and one string of length 2.

        Note: This method does not count the frequencies of the different length configurations (as shown above after ':'); it merely records which configurations occur.
        Parameters:
        pairs - Accumulate equivalency classes that consist of exactly two codepoints here. This is 83+% of the classes. E.g., {"a", "A"}.
        nonpairs - Accumulate other equivalency classes here, as lists of strings. E.g., {"st", "ﬅ" (U+FB05), "ﬆ" (U+FB06)}.
        lengths - Accumulate a list of unique length structures, not including pairs. Each length structure is represented by a string of digits. The digit string "12" means the equivalency class contains a single code point and a string of length 2. Typical contents of 'lengths': { "111", "1111", "112", "113", "12", "13", "22" }. Note the absence of "11".
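        The split into pairs and nonpairs, and the digit-string length structures, can be sketched as follows. The names LengthStructureSketch, lengthStructure, and analyze are hypothetical, and the sketch uses generic collections rather than the raw Map, StringBuffer, and Vector of the actual signature. It assumes lengths are counted in code points, that digits are sorted in ascending order (consistent with "112" and "12" above), and that no member is longer than 9.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class LengthStructureSketch {
    // Builds the digit string for one class, e.g. {"st", U+FB05, U+FB06} gives "112".
    static String lengthStructure(List<String> equivClass) {
        int[] lens = new int[equivClass.size()];
        for (int i = 0; i < lens.length; i++) {
            String s = equivClass.get(i);
            lens[i] = s.codePointCount(0, s.length());
        }
        Arrays.sort(lens);
        StringBuilder sb = new StringBuilder();
        for (int n : lens) {
            sb.append(n); // one decimal digit per member, assuming n <= 9
        }
        return sb.toString();
    }

    // Classes of exactly two single code points go to 'pairs'; everything else
    // goes to 'nonpairs', and its length structure is recorded once in 'lengths'.
    static void analyze(List<List<String>> classes, StringBuilder pairs,
                        List<List<String>> nonpairs, Set<String> lengths) {
        for (List<String> c : classes) {
            String structure = lengthStructure(c);
            if (structure.equals("11")) {
                pairs.append(c.get(0)).append(c.get(1));
            } else {
                nonpairs.add(c);
                lengths.add(structure);
            }
        }
    }

    public static void main(String[] args) {
        List<List<String>> classes = new ArrayList<>();
        classes.add(Arrays.asList("a", "A"));
        classes.add(Arrays.asList("st", "\uFB05", "\uFB06"));
        StringBuilder pairs = new StringBuilder();
        List<List<String>> nonpairs = new ArrayList<>();
        Set<String> lengths = new TreeSet<>();
        analyze(classes, pairs, nonpairs, lengths);
        System.out.println(pairs);   // aA
        System.out.println(lengths); // [112]
    }
}
```

        Because 'lengths' is a set, a structure that occurs many times is stored once, which matches the note that frequencies are not counted.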
      • generateCaseData

        static void generateCaseData()
                              throws java.io.IOException
        Throws:
        java.io.IOException
      • getCaseSensitive

        static UnicodeSet getCaseSensitive()
        Create the set of case-sensitive characters. These are characters that participate in any case mapping operation as a source or as a member of a target string.
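        One way to read "source or member of a target string" is sketched below with JDK case mappings only. The names CaseSensitiveSketch and caseSensitive are hypothetical, the result is a plain TreeSet of code points rather than a UnicodeSet, and the JDK's lower, upper, and title mappings are an assumed substitute for whatever mappings the real tool consults.

```java
import java.util.Locale;
import java.util.TreeSet;

public class CaseSensitiveSketch {
    // Adds every code point of s to the set.
    static void addAll(TreeSet<Integer> set, String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            set.add(cp);
            i += Character.charCount(cp);
        }
    }

    // Collects code points in [start, end] that change under some case mapping,
    // together with every code point appearing in the mapped result.
    static TreeSet<Integer> caseSensitive(int start, int end) {
        TreeSet<Integer> result = new TreeSet<>();
        for (int cp = start; cp <= end; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue; // lone surrogates are not characters
            }
            String x = new String(Character.toChars(cp));
            String[] targets = {
                x.toLowerCase(Locale.ROOT),
                x.toUpperCase(Locale.ROOT),
                new String(Character.toChars(Character.toTitleCase(cp)))
            };
            for (String t : targets) {
                if (!t.equals(x)) {
                    result.add(cp);     // participates as a source
                    addAll(result, t);  // members of the target string
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<Integer> ascii = caseSensitive(0x00, 0x7F);
        System.out.println(ascii.size());              // 52
        System.out.println(ascii.contains((int) 'a')); // true
        System.out.println(ascii.contains((int) '1')); // false
    }
}
```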
      • emitUCharRangesArray

        static void emitUCharRangesArray​(java.io.PrintStream out,
                                         UnicodeSet set,
                                         java.lang.String id)
        Given a UnicodeSet, emit it as an array of UChar pairs. Each pair will be the start/end of a range. Code points >= U+10000 will be represented as surrogate pairs.
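        The surrogate-pair encoding of range endpoints can be sketched as below. This page does not show the real output layout, so the C array declaration, the class UCharRangesSketch, and the method formatRanges are hypothetical; ranges are passed as an int[][] of {start, end} code points instead of a UnicodeSet.

```java
public class UCharRangesSketch {
    // Appends one code point as one or two 16-bit units in hex.
    static void appendUChars(StringBuilder sb, int codePoint) {
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("0x%04X, ", (int) unit));
        }
    }

    // Formats each {start, end} range on its own line. The declaration text is
    // an assumed layout, not the tool's actual output.
    static String formatRanges(int[][] ranges, String id) {
        StringBuilder sb = new StringBuilder();
        sb.append("static const UChar ").append(id).append("[] = {\n");
        for (int[] range : ranges) {
            sb.append("    ");
            appendUChars(sb, range[0]);
            appendUChars(sb, range[1]);
            sb.append("\n");
        }
        sb.append("};\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] ranges = { {0x41, 0x5A}, {0x10400, 0x1044F} };
        System.out.print(formatRanges(ranges, "EXAMPLE_RANGES"));
    }
}
```

        In this sketch the range U+10400..U+1044F becomes the four units 0xD801, 0xDC00, 0xD801, 0xDC4F, so a reader of the array has to recombine surrogates before interpreting the pairs.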
      • emitRangesString

        static void emitRangesString​(java.io.PrintStream out,
                                     UnicodeSet set,
                                     java.lang.String id)
        Given a UnicodeSet, emit it as a Java string. The most economical format is not the pattern, but instead a pairs list, with each range pair represented as two adjacent characters.
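        A pairs list of this kind can be sketched as a Java string literal in which characters 2i and 2i+1 are the start and end of range i. The names RangesStringSketch, escape, and toJavaLiteral are hypothetical, ranges are given as an int[][] rather than a UnicodeSet, and the escaping policy (printable ASCII kept literal, everything else as a backslash-u escape) is an assumption about a reasonable output format, not the tool's documented behavior.

```java
public class RangesStringSketch {
    // Escapes one UTF-16 unit for use inside a Java string literal.
    static String escape(char c) {
        if (c == '"' || c == '\\') {
            return "\\" + c;
        }
        if (c >= 0x20 && c <= 0x7E) {
            return String.valueOf(c);
        }
        return String.format("\\u%04X", (int) c);
    }

    // Emits start and end of each range as adjacent characters in one literal.
    static String toJavaLiteral(int[][] ranges) {
        StringBuilder sb = new StringBuilder("\"");
        for (int[] range : ranges) {
            for (int codePoint : range) {
                for (char unit : Character.toChars(codePoint)) {
                    sb.append(escape(unit));
                }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        int[][] ranges = { {'a', 'z'}, {0xC0, 0xD6} };
        System.out.println(toJavaLiteral(ranges));
    }
}
```

        For the two ranges in main, the literal holds four characters: a, z, U+00C0, and U+00D6, the last two written as escapes. That is shorter than the equivalent pattern "[a-zÀ-Ö]" only by its brackets and hyphens, but the saving grows with the number of ranges and no pattern parsing is needed to read it back.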