Package com.ibm.icu.dev.tool.translit
Class UnicodeSetCloseOver
- java.lang.Object
-
- com.ibm.icu.dev.tool.translit.UnicodeSetCloseOver
-
class UnicodeSetCloseOver extends java.lang.Object
This class produces the data tables used by the closeOver() method of UnicodeSet. Whenever the Unicode database changes, this tool must be re-run (AFTER the data files underlying ICU4J are updated). The output of this tool should then be pasted into the appropriate files:
ICU4J: com.ibm.icu.text.UnicodeSet.java
ICU4C: /icu/source/common/uniset.cpp
-
-
Field Summary
Modifier and Type, Field
(package private) static java.lang.String C_SET_OUT
(package private) static java.lang.String C_UCHAR_OUT
(package private) static boolean DEFAULT_CASE_MAP
(package private) static java.lang.String JAVA_CHARPROP_OUT
(package private) static java.lang.String JAVA_OUT
(package private) static java.lang.String WARNING
-
Constructor Summary
UnicodeSetCloseOver()
-
Method Summary
Modifier and Type, Method, Description
(package private) static void analyzeCaseData(java.util.Map equivClasses, java.lang.StringBuffer pairs, java.util.Vector nonpairs, java.util.Vector lengths)
Analyze the case fold equivalency classes.
(package private) static java.util.Map createCaseFoldEquivalencyClasses()
Create a map of String => Set.
(package private) static void emitRangesString(java.io.PrintStream out, UnicodeSet set, java.lang.String id)
Given a UnicodeSet, emit it as a Java string.
(package private) static void emitUCharRangesArray(java.io.PrintStream out, UnicodeSet set, java.lang.String id)
Given a UnicodeSet, emit it as an array of UChar pairs.
(package private) static void generateCaseData()
(package private) static UnicodeSet getCaseSensitive()
Create the set of case-sensitive characters.
static void main(java.lang.String[] args)
-
-
-
Field Detail
-
JAVA_OUT
static final java.lang.String JAVA_OUT
- See Also:
- Constant Field Values
-
JAVA_CHARPROP_OUT
static final java.lang.String JAVA_CHARPROP_OUT
- See Also:
- Constant Field Values
-
C_SET_OUT
static final java.lang.String C_SET_OUT
- See Also:
- Constant Field Values
-
C_UCHAR_OUT
static final java.lang.String C_UCHAR_OUT
- See Also:
- Constant Field Values
-
WARNING
static final java.lang.String WARNING
-
DEFAULT_CASE_MAP
static final boolean DEFAULT_CASE_MAP
- See Also:
- Constant Field Values
-
-
Method Detail
-
main
public static void main(java.lang.String[] args) throws java.io.IOException
- Throws:
java.io.IOException
-
createCaseFoldEquivalencyClasses
static java.util.Map createCaseFoldEquivalencyClasses()
Create a map of String => Set. The String in this case is a folded string, that is, one for which UCharacter.foldCase(folded, DEFAULT_CASE_MAP).equals(folded). The Set contains all single-character strings x for which UCharacter.foldCase(x, DEFAULT_CASE_MAP).equals(folded), as well as folded itself.
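The grouping described above can be sketched with the JDK alone. This is a minimal illustration, not the tool's actual code: the class name CaseFoldClassesSketch and its methods are hypothetical, and simpleFold is a stand-in for UCharacter.foldCase that uses only the one-to-one mappings in java.lang.Character, so it never produces multi-character folds.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CaseFoldClassesSketch {
    // ASSUMPTION: stand-in for UCharacter.foldCase(x, DEFAULT_CASE_MAP).
    // It applies only simple one-to-one mappings, so results differ from
    // real Unicode case folding for characters with multi-character folds.
    static String simpleFold(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
            sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp))));
        return sb.toString();
    }

    // Map each folded string to the set holding the folded string itself plus
    // every single-code-point string in [start, end] that folds to it.
    static Map<String, Set<String>> buildClasses(int start, int end) {
        Map<String, Set<String>> classes = new TreeMap<>();
        for (int cp = start; cp <= end; cp++) {
            String x = new String(Character.toChars(cp));
            String folded = simpleFold(x);
            if (folded.equals(x)) {
                continue; // already folded; it enters a class only as a key
            }
            Set<String> members = classes.computeIfAbsent(folded, k -> new TreeSet<>());
            members.add(folded);
            members.add(x);
        }
        return classes;
    }
}
```

For the range U+0041 to U+007A this yields 26 classes, each holding one lowercase ASCII letter and its uppercase counterpart.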
-
analyzeCaseData
static void analyzeCaseData(java.util.Map equivClasses, java.lang.StringBuffer pairs, java.util.Vector nonpairs, java.util.Vector lengths)
Analyze the case fold equivalency classes. Break them into two groups: 'pairs' and 'nonpairs'. Create a tally of the length configurations of the nonpairs.
Length configurations of equivalency classes, as of Unicode 3.2. Most of the classes (83%) have two single codepoints. Here "112:28" means there are 28 equivalency classes with 2 single codepoints and one string of length 2.
11:656
111:16
1111:3
112:28
113:2
12:31
13:12
22:38
Note: This method does not count the frequencies of the different length configurations (as shown above after ':'); it merely records which configurations occur.
- Parameters:
pairs
- Accumulate equivalency classes that consist of exactly two codepoints here. This is 83+% of the classes. E.g., {"a", "A"}.
nonpairs
- Accumulate other equivalency classes here, as lists of strings. E.g., {"st", "ſt", "st"}.
lengths
- Accumulate a list of unique length structures, not including pairs. Each length structure is represented by a string of digits. The digit string "12" means the equivalency class contains a single code point and a string of length 2. Typical contents of 'lengths': { "111", "1111", "112", "113", "12", "13", "22" }. Note the absence of "11".
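The split into pairs and nonpairs, and the digit-string length structures, might look roughly like the following. It is a sketch under assumptions: AnalyzeCaseDataSketch and its method names are hypothetical, modern collection types replace the StringBuffer and Vector of the real signature, and sorting the digits in ascending order is inferred from the example strings rather than confirmed by the source.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

class AnalyzeCaseDataSketch {
    // Build the length structure of one class: the code point length of each
    // member, sorted ascending (ASSUMPTION) and concatenated as digits.
    static String lengthStructure(Collection<String> equivClass) {
        int[] lens = equivClass.stream()
            .mapToInt(s -> s.codePointCount(0, s.length()))
            .sorted()
            .toArray();
        StringBuilder sb = new StringBuilder();
        for (int n : lens) {
            sb.append(n);
        }
        return sb.toString();
    }

    // Classes shaped "11" go to 'pairs'; all others go to 'nonpairs', and
    // their shape is recorded once in 'lengths' (no frequency counting).
    static void analyze(Collection<? extends Collection<String>> classes,
                        StringBuilder pairs,
                        List<List<String>> nonpairs,
                        TreeSet<String> lengths) {
        for (Collection<String> c : classes) {
            String shape = lengthStructure(c);
            if (shape.equals("11")) {
                for (String s : c) {
                    pairs.append(s);
                }
            } else {
                nonpairs.add(new ArrayList<>(c));
                lengths.add(shape);
            }
        }
    }
}
```

Because 'lengths' is a set, a shape that occurs in many classes is stored once, which matches the note that configurations are recorded but not counted.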
-
generateCaseData
static void generateCaseData() throws java.io.IOException
- Throws:
java.io.IOException
-
getCaseSensitive
static UnicodeSet getCaseSensitive()
Create the set of case-sensitive characters. These are characters that participate in any case mapping operation as a source or as a member of a target string.
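As a rough illustration of "source or member of a target string", the sketch below collects code points using only the JDK's String case mappings. CaseSensitiveSketch is a hypothetical name, and the JDK mappings are an assumption standing in for the ICU case mapping and folding data the real method consults, so the resulting set can differ.

```java
import java.util.Locale;
import java.util.TreeSet;

class CaseSensitiveSketch {
    // Add every code point of s to the set.
    static void addAll(TreeSet<Integer> set, String s) {
        s.codePoints().forEach(cp -> set.add(cp));
    }

    // Collect code points in [start, end] that change under lowercasing or
    // uppercasing (sources), plus every code point in the mapped results
    // (members of target strings).
    static TreeSet<Integer> caseSensitive(int start, int end) {
        TreeSet<Integer> result = new TreeSet<>();
        for (int cp = start; cp <= end; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue; // skip lone surrogates
            }
            String s = new String(Character.toChars(cp));
            String lower = s.toLowerCase(Locale.ROOT);
            String upper = s.toUpperCase(Locale.ROOT);
            if (!lower.equals(s)) {
                result.add(cp);
                addAll(result, lower);
            }
            if (!upper.equals(s)) {
                result.add(cp);
                addAll(result, upper);
            }
        }
        return result;
    }
}
```

Target members can lie outside the scanned range; for example, scanning only Latin-1 still adds the uppercase forms of letters whose mappings leave that block.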
-
emitUCharRangesArray
static void emitUCharRangesArray(java.io.PrintStream out, UnicodeSet set, java.lang.String id)
Given a UnicodeSet, emit it as an array of UChar pairs. Each pair will be the start/end of a range. Code points >= U+10000 will be represented as surrogate pairs.
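A minimal sketch of the idea follows, with ranges passed as plain int pairs instead of a UnicodeSet. The class name UCharRangesSketch, the helper methods, and the exact layout of the emitted C declaration are assumptions for illustration; the real tool's output format is not shown in this document.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class UCharRangesSketch {
    // Append one code point as one or two UTF-16 code units in C hex
    // notation; code points >= U+10000 become a surrogate pair.
    static void appendUnits(StringBuilder sb, int cp) {
        for (char unit : Character.toChars(cp)) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(String.format("0x%04X", (int) unit));
        }
    }

    // Each row of 'ranges' is {start, end}, inclusive.
    // The declaration layout below is hypothetical.
    static void emit(PrintStream out, int[][] ranges, String id) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            appendUnits(sb, r[0]);
            appendUnits(sb, r[1]);
        }
        out.println("static const UChar " + id + "[] = { " + sb + " };");
    }

    // Convenience wrapper that captures the emitted text as a String.
    static String emitToString(int[][] ranges, String id) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer);
        emit(out, ranges, id);
        out.flush();
        return buffer.toString();
    }
}
```

With this layout, a range whose endpoints are supplementary code points occupies four array slots instead of two, so a consumer has to detect surrogates when reading the pairs back.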
-
emitRangesString
static void emitRangesString(java.io.PrintStream out, UnicodeSet set, java.lang.String id)
Given a UnicodeSet, emit it as a Java string. The most economical format is not the pattern, but instead a pairs list, with each range pair represented as two adjacent characters.
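The pairs-list encoding can be sketched as below: each range contributes its start and end as adjacent characters of one string literal. RangesStringSketch, its methods, and the shape of the emitted declaration are hypothetical, and escaping everything except printable ASCII is a choice made here for readability, not something the source specifies.

```java
class RangesStringSketch {
    // Render one code point for use inside a Java string literal. Printable
    // ASCII other than the double quote and the backslash is kept as is;
    // every other UTF-16 code unit becomes a four-digit hex escape.
    static String escape(int cp) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(cp)) {
            boolean plain = unit >= 0x20 && unit < 0x7F && unit != '"' && unit != '\\';
            if (plain) {
                sb.append(unit);
            } else {
                sb.append(String.format("\\u%04X", (int) unit));
            }
        }
        return sb.toString();
    }

    // Each row of 'ranges' is {start, end}, inclusive. The start and end
    // characters of every range are written next to each other.
    static String toLiteral(int[][] ranges, String id) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            sb.append(escape(r[0])).append(escape(r[1]));
        }
        return "static final String " + id + " = \"" + sb + "\";";
    }
}
```

A reader of such a string recovers the ranges by walking it two characters at a time, which is why this form is more compact than a UnicodeSet pattern with its brackets and hyphens.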
-
-