Class StringSearch
- java.lang.Object
-
- com.ibm.icu.text.SearchIterator
-
- com.ibm.icu.text.StringSearch
-
public final class StringSearch extends SearchIterator
StringSearch is a SearchIterator that provides language-sensitive text searching based on the comparison rules defined in a RuleBasedCollator object. StringSearch ensures that language eccentricities can be handled, e.g. for the German collator, the characters ß and SS will be matched if case is chosen to be ignored. See the "ICU Collation Design Document" for more information.
There are 2 match options for selection:
Let S' be the sub-string of a text string S between the offsets start and end [start, end].
A pattern string P matches a text string S at the offsets [start, end] if
option 1. Some canonical equivalent of P matches some canonical equivalent of S'.
option 2. P matches S' and, if P starts or ends with a combining mark, there exists no non-ignorable combining mark before or after S' in S respectively.
Option 2 is the default.
This search has APIs similar to those of other text iteration mechanisms such as the break iterators in BreakIterator. Using these APIs, it is easy to scan through text looking for all occurrences of a given pattern. This search iterator allows changing of direction by calling reset() followed by SearchIterator.next() or SearchIterator.previous(). Though a direction change can occur without calling reset() first, this operation comes with some speed penalty. Matches found in the forward direction are the same as the matches found in the backward direction, in reverse order.
SearchIterator provides APIs to specify the starting position within the text string to be searched, e.g. setIndex, preceding and following. Since the starting position will be set as it is specified, please take note that there are some danger points at which the search may render incorrect results:
- In the midst of a substring that requires normalization.
- If the following match is to be found, the position should not be the second character which requires swapping with the preceding character. Vice versa, if the preceding match is to be found, the position to search from should not be the first character which requires swapping with the next character. E.g. certain Thai and Lao characters require swapping.
- If a following pattern match is to be found, any position within a contracting sequence except the first will fail. Vice versa, if a preceding pattern match is to be found, an invalid starting point would be any character within a contracting sequence except the last.
A BreakIterator can be used if only matches at logical breaks are desired. Using a BreakIterator will only give you results that exactly match the boundaries given by the BreakIterator. For instance, the pattern "e" will not be found in the string "é" if a character break iterator is used.
Options are provided to handle overlapping matches. E.g. in English, overlapping matches produce the results 0 and 2 for the pattern "abab" in the text "ababab", whereas mutually exclusive matches produce only the result 0.
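The difference between overlapping and mutually exclusive matches can be illustrated without any collation. The following is a minimal sketch that uses plain String.indexOf from the JDK rather than StringSearch itself; the class OverlapDemo and the method matchStarts are hypothetical names chosen for this illustration, and the sketch compares code units literally instead of using collation rules.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapDemo {
    // Hypothetical helper: collects the start offsets of literal matches.
    // With overlapping == true the next search starts one position after the
    // previous match start; otherwise it starts after the previous match end.
    static List<Integer> matchStarts(String pattern, String text, boolean overlapping) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // an empty pattern is not searched for in this sketch
        }
        int from = 0;
        while (true) {
            int at = text.indexOf(pattern, from);
            if (at < 0) {
                break;
            }
            starts.add(at);
            from = overlapping ? at + 1 : at + pattern.length();
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("abab", "ababab", true));  // prints [0, 2]
        System.out.println(matchStarts("abab", "ababab", false)); // prints [0]
    }
}
```

In StringSearch the corresponding switch is the inherited setOverlapping method listed under the methods inherited from SearchIterator.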
Options are also provided to implement "asymmetric search" as described in UTS #10 Unicode Collation Algorithm, specifically the ElementComparisonType values.
Though collator attributes will be taken into consideration while performing matches, there are no APIs here for setting and getting the attributes. These attributes can be set by getting the collator from getCollator() and using the APIs in RuleBasedCollator. Lastly, to update StringSearch to the new collator attributes, reset() has to be called.
Restriction:
Currently there are no composite characters that consist of a character with combining class > 0 before a character with combining class == 0. However, if such a character exists in the future, StringSearch does not guarantee the results for option 1.
Consult the SearchIterator documentation for information on and examples of how to use instances of this class to implement text searching.
Note, StringSearch is not to be subclassed.
- See Also:
SearchIterator, RuleBasedCollator
-
-
Nested Class Summary
Nested Classes: Modifier and Type, Class, Description
private static class
StringSearch.CEBuffer
A circular buffer of CEs from the text being searched.
private static class
StringSearch.CEI
Java port of ICU4C CEI (usearch.cpp). CEI: Collation Element + source text index.
private static class
StringSearch.CollationPCE
Java port of ICU4C UCollationPCE (usrchimp.h).
private static class
StringSearch.Match
An object used for receiving matched index in search() and searchBackwards().
private static class
StringSearch.Pattern
Java port of ICU4C struct UPattern (usrchimp.h).
Nested classes/interfaces inherited from class com.ibm.icu.text.SearchIterator
SearchIterator.ElementComparisonType, SearchIterator.Search
-
-
Field Summary
Fields: Modifier and Type, Field, Description
private static int
CE_LEVEL2_BASE
private static int
CE_LEVEL3_BASE
private static int
CE_MATCH
private static int
CE_NO_MATCH
private static int
CE_SKIP_PATN
private static int
CE_SKIP_TARG
(package private) int
ceMask_
private RuleBasedCollator
collator_
private static int
INITIAL_ARRAY_SIZE_
private Normalizer2
nfd_
private StringSearch.Pattern
pattern_
private static int
PRIMARYORDERMASK
private static int
SECONDARYORDERMASK
private int
strength_
private static int
TERTIARYORDERMASK
private CollationElementIterator
textIter_
private StringSearch.CollationPCE
textProcessedIter_
private boolean
toShift_
private CollationElementIterator
utilIter_
(package private) int
variableTop_
-
Fields inherited from class com.ibm.icu.text.SearchIterator
breakIterator, DONE, matchLength, search_, targetText
-
-
Constructor Summary
Constructors: Constructor, Description
StringSearch(java.lang.String pattern, java.lang.String target)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the default locale to search for argument pattern in the argument target text.
StringSearch(java.lang.String pattern, java.text.CharacterIterator target, RuleBasedCollator collator)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text.
StringSearch(java.lang.String pattern, java.text.CharacterIterator target, RuleBasedCollator collator, BreakIterator breakiter)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text.
StringSearch(java.lang.String pattern, java.text.CharacterIterator target, ULocale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text.
StringSearch(java.lang.String pattern, java.text.CharacterIterator target, java.util.Locale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text.
-
Method Summary
Methods: Modifier and Type, Method, Description
private static int[]
addToIntArray(int[] destination, int offset, int value, int increments)
Direct port of ICU4C static int32_t * addTouint32_tArray(...) in usearch.cpp (except not taking destination buffer size and status param).
private static long[]
addToLongArray(long[] destination, int offset, int destinationlength, long value, int increments)
Direct port of ICU4C static int64_t * addTouint64_tArray(...) in usearch.cpp.
private boolean
checkIdentical(int start, int end)
Checks for identical match.
private static int
codePointAt(java.text.CharacterIterator iter, int index)
private static int
codePointBefore(java.text.CharacterIterator iter, int index)
private static int
compareCE64s(long targCE, long patCE, SearchIterator.ElementComparisonType compareType)
private int
getCE(int sourcece)
Getting the modified collation elements taking into account the collation attributes.
RuleBasedCollator
getCollator()
Gets the RuleBasedCollator used for the language rules.
int
getIndex()
Return the current index in the text being searched.
private static int
getMask(int strength)
Getting the mask for collation strength.
java.lang.String
getPattern()
Returns the pattern for which StringSearch is searching for.
private static java.lang.String
getString(java.text.CharacterIterator text, int start, int length)
Gets a substring out of a CharacterIterator. Java porting note: Not available in ICU4C.
protected int
handleNext(int position)
Abstract method which subclasses override to provide the mechanism for finding the next match in the target text.
private boolean
handleNextCanonical()
private boolean
handleNextCommonImpl()
private boolean
handleNextExact()
protected int
handlePrevious(int position)
Abstract method which subclasses override to provide the mechanism for finding the previous match in the target text.
private boolean
handlePreviousCanonical()
private boolean
handlePreviousCommonImpl()
private boolean
handlePreviousExact()
private void
initialize()
private int
initializePattern()
private int
initializePatternCETable()
Initializing the ce table for a pattern.
private int
initializePatternPCETable()
Initializing the pce table for a pattern.
private boolean
initTextProcessedIter()
private boolean
isBreakBoundary(int index)
boolean
isCanonical()
Determines whether canonical matches (option 1, as described in the class documentation) is set.
private static boolean
isOutOfBounds(int textstart, int textlimit, int offset)
Checks if the offset runs out of the text string range.
private int
nextBoundaryAfter(int startIndex)
void
reset()
Resets the iteration.
private boolean
search(int startIdx, StringSearch.Match m)
private boolean
searchBackwards(int startIdx, StringSearch.Match m)
void
setCanonical(boolean allowCanonical)
Set the canonical match mode.
void
setCollator(RuleBasedCollator collator)
Sets the RuleBasedCollator to be used for language-specific searching.
void
setIndex(int position)
Sets the position in the target text at which the next search will start.
protected void
setMatchNotFound()
Deprecated. This API is ICU internal only.
void
setPattern(java.lang.String pattern)
Set the pattern to search for.
void
setTarget(java.text.CharacterIterator text)
Set the target text to be searched.
Methods inherited from class com.ibm.icu.text.SearchIterator
first, following, getBreakIterator, getElementComparisonType, getMatchedText, getMatchLength, getMatchStart, getTarget, isOverlapping, last, next, preceding, previous, setBreakIterator, setElementComparisonType, setMatchLength, setOverlapping
-
-
-
-
Field Detail
-
pattern_
private StringSearch.Pattern pattern_
-
collator_
private RuleBasedCollator collator_
-
textIter_
private CollationElementIterator textIter_
-
textProcessedIter_
private StringSearch.CollationPCE textProcessedIter_
-
utilIter_
private CollationElementIterator utilIter_
-
nfd_
private Normalizer2 nfd_
-
strength_
private int strength_
-
ceMask_
int ceMask_
-
variableTop_
int variableTop_
-
toShift_
private boolean toShift_
-
INITIAL_ARRAY_SIZE_
private static final int INITIAL_ARRAY_SIZE_
- See Also:
- Constant Field Values
-
PRIMARYORDERMASK
private static final int PRIMARYORDERMASK
- See Also:
- Constant Field Values
-
SECONDARYORDERMASK
private static final int SECONDARYORDERMASK
- See Also:
- Constant Field Values
-
TERTIARYORDERMASK
private static final int TERTIARYORDERMASK
- See Also:
- Constant Field Values
-
CE_MATCH
private static final int CE_MATCH
- See Also:
- Constant Field Values
-
CE_NO_MATCH
private static final int CE_NO_MATCH
- See Also:
- Constant Field Values
-
CE_SKIP_TARG
private static final int CE_SKIP_TARG
- See Also:
- Constant Field Values
-
CE_SKIP_PATN
private static final int CE_SKIP_PATN
- See Also:
- Constant Field Values
-
CE_LEVEL2_BASE
private static int CE_LEVEL2_BASE
-
CE_LEVEL3_BASE
private static int CE_LEVEL3_BASE
-
-
Constructor Detail
-
StringSearch
public StringSearch(java.lang.String pattern, java.text.CharacterIterator target, RuleBasedCollator collator, BreakIterator breakiter)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text. The argument breakiter is used to define logical matches. See super class documentation for more details on the use of the target text and BreakIterator.
- Parameters:
pattern - text to look for.
target - target text to search for pattern.
collator - RuleBasedCollator that defines the language rules
breakiter - A BreakIterator that is used to determine the boundaries of a logical match. This argument can be null.
- Throws:
java.lang.IllegalArgumentException - thrown when argument target is null, or of length 0
- See Also:
BreakIterator, RuleBasedCollator
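To make the idea of a logical match concrete, the following minimal sketch keeps only those literal matches whose start and end both fall on word boundaries. It uses the JDK's java.text.BreakIterator and String.indexOf as stand-ins so that it runs without ICU4J; it is not the StringSearch implementation, and the class BoundaryFilterDemo and the method wordAlignedMatches are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryFilterDemo {
    // Hypothetical helper: returns the start offsets of literal matches of
    // pattern in text whose start and end both lie on word boundaries.
    static List<Integer> wordAlignedMatches(String pattern, String text) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts;
        }
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        for (int at = text.indexOf(pattern); at >= 0; at = text.indexOf(pattern, at + 1)) {
            int end = at + pattern.length();
            // A match counts only if it is delimited by boundaries on both sides.
            if (words.isBoundary(at) && words.isBoundary(end)) {
                starts.add(at);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // "cat" occurs as a whole word and also inside "concatenates".
        System.out.println(wordAlignedMatches("cat", "a cat concatenates"));
    }
}
```

With StringSearch the same restriction is expressed by passing a BreakIterator as the breakiter argument; passing null disables the boundary test.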
-
StringSearch
public StringSearch(java.lang.String pattern, java.text.CharacterIterator target, RuleBasedCollator collator)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text. No BreakIterators are set to test for logical matches.
- Parameters:
pattern - text to look for.
target - target text to search for pattern.
collator - RuleBasedCollator that defines the language rules
- Throws:
java.lang.IllegalArgumentException - thrown when argument target is null, or of length 0
- See Also:
RuleBasedCollator
-
StringSearch
public StringSearch(java.lang.String pattern, java.text.CharacterIterator target, java.util.Locale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text.
- Parameters:
pattern - text to look for.
target - target text to search for pattern.
locale - locale to use for language and break iterator rules
- Throws:
java.lang.IllegalArgumentException - thrown when argument target is null, or of length 0.
ClassCastException - thrown if the collator for the specified locale is not a RuleBasedCollator.
-
StringSearch
public StringSearch(java.lang.String pattern, java.text.CharacterIterator target, ULocale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text. See super class documentation for more details on the use of the target text and BreakIterator.
- Parameters:
pattern - text to look for.
target - target text to search for pattern.
locale - locale to use for language and break iterator rules
- Throws:
java.lang.IllegalArgumentException - thrown when argument target is null, or of length 0.
ClassCastException - thrown if the collator for the specified locale is not a RuleBasedCollator.
- See Also:
BreakIterator, RuleBasedCollator, SearchIterator
-
StringSearch
public StringSearch(java.lang.String pattern, java.lang.String target)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the default locale to search for argument pattern in the argument target text.
- Parameters:
pattern - text to look for.
target - target text to search for pattern.
- Throws:
java.lang.IllegalArgumentException - thrown when argument target is null, or of length 0.
ClassCastException - thrown if the collator for the default locale is not a RuleBasedCollator.
-
-
Method Detail
-
getCollator
public RuleBasedCollator getCollator()
Gets the RuleBasedCollator used for the language rules.
Since StringSearch depends on the returned RuleBasedCollator, any changes to the returned RuleBasedCollator should be followed by a call to either reset() or setCollator(RuleBasedCollator) to ensure the correct search behavior.
- Returns:
RuleBasedCollator used by this StringSearch
- See Also:
RuleBasedCollator, setCollator(com.ibm.icu.text.RuleBasedCollator)
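Collator attributes matter because they decide which strings compare as equal, and therefore what counts as a match. The sketch below shows the effect of the strength attribute using the JDK's java.text.Collator as a stand-in so that it runs without ICU4J; it does not call any ICU API, and the class StrengthDemo and the method equalAt are hypothetical names. With StringSearch, the analogous change would be made on the collator returned by getCollator(), followed by reset().

```java
import java.text.Collator;
import java.util.Locale;

class StrengthDemo {
    // Hypothetical helper: true if a and b compare as equal under an English
    // collator configured with the given strength (e.g. Collator.PRIMARY).
    static boolean equalAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // At primary strength case differences are ignored.
        System.out.println(equalAt(Collator.PRIMARY, "abc", "ABC"));
        // At tertiary strength case differences are significant.
        System.out.println(equalAt(Collator.TERTIARY, "abc", "ABC"));
    }
}
```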
-
setCollator
public void setCollator(RuleBasedCollator collator)
Sets the RuleBasedCollator to be used for language-specific searching.
The iterator's position will not be changed by this method.
- Parameters:
collator - to use for this StringSearch
- Throws:
java.lang.IllegalArgumentException - thrown when collator is null
- See Also:
getCollator()
-
getPattern
public java.lang.String getPattern()
Returns the pattern for which StringSearch is searching.
- Returns:
the pattern searched for
-
setPattern
public void setPattern(java.lang.String pattern)
Set the pattern to search for. The iterator's position will not be changed by this method.
- Parameters:
pattern - for searching
- Throws:
java.lang.IllegalArgumentException - thrown if pattern is null or of length 0
- See Also:
getPattern()
-
isCanonical
public boolean isCanonical()
Determines whether canonical matching (option 1, as described in the class documentation) is set. See setCanonical(boolean) for more information.
- Returns:
true if canonical matching is set, false otherwise
- See Also:
setCanonical(boolean)
-
setCanonical
public void setCanonical(boolean allowCanonical)
Set the canonical match mode. See class documentation for details. The default setting for this property is false.
- Parameters:
allowCanonical - flag indicating whether canonical matches are allowed
- See Also:
isCanonical()
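Canonical match mode relies on the notion of canonical equivalence from Unicode normalization. As a minimal sketch of that notion only (not of the StringSearch matching algorithm), two strings can be treated as canonically equivalent when their NFD forms are identical. The example uses java.text.Normalizer from the JDK; the class CanonicalEquivalenceDemo and the method canonicallyEquivalent are hypothetical names.

```java
import java.text.Normalizer;

class CanonicalEquivalenceDemo {
    // Hypothetical helper: compares the canonical decompositions (NFD) of
    // both strings; equal decompositions mean canonical equivalence.
    static boolean canonicallyEquivalent(String a, String b) {
        String nfdA = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nfdB = Normalizer.normalize(b, Normalizer.Form.NFD);
        return nfdA.equals(nfdB);
    }

    public static void main(String[] args) {
        // Precomposed U+00E9 versus "e" followed by U+0301 COMBINING ACUTE ACCENT.
        System.out.println(canonicallyEquivalent("\u00e9", "e\u0301"));
        // Two combining marks with different combining classes, in either order.
        System.out.println(canonicallyEquivalent("a\u0323\u0301", "a\u0301\u0323"));
        // A bare "e" is not equivalent to the accented letter.
        System.out.println(canonicallyEquivalent("e", "\u00e9"));
    }
}
```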
-
setTarget
public void setTarget(java.text.CharacterIterator text)
Set the target text to be searched. Text iteration will then begin at the start of the text string. This method is useful if you want to reuse an iterator to search within a different body of text.
- Overrides:
setTarget in class SearchIterator
- Parameters:
text - new text iterator to look for match
- See Also:
SearchIterator.getTarget()
-
getIndex
public int getIndex()
Return the current index in the text being searched. If the iteration has gone past the end of the text (or past the beginning for a backwards search), SearchIterator.DONE is returned.
- Specified by:
getIndex in class SearchIterator
- Returns:
current index in the text being searched.
-
setIndex
public void setIndex(int position)
Sets the position in the target text at which the next search will start. This method clears any previous match.
- Overrides:
setIndex in class SearchIterator
- Parameters:
position - position from which to start the next search
- See Also:
SearchIterator.getIndex()
-
reset
public void reset()
Resets the iteration. Search will begin at the start of the text string if a forward iteration is initiated before a backwards iteration. Otherwise, if a backwards iteration is initiated before a forward iteration, the search will begin at the end of the text string.
- Overrides:
reset in class SearchIterator
-
handleNext
protected int handleNext(int position)
Abstract method which subclasses override to provide the mechanism for finding the next match in the target text. This allows different subclasses to provide different search algorithms.
If a match is found, the implementation should return the index at which the match starts and should call SearchIterator.setMatchLength(int) with the number of characters in the target text that make up the match. If no match is found, the method should return SearchIterator.DONE.
- Specified by:
handleNext in class SearchIterator
- Parameters:
position - The index in the target text at which the search should start.
- Returns:
index at which the match starts; if no match is found, SearchIterator.DONE is returned
- See Also:
SearchIterator.setMatchLength(int)
-
handlePrevious
protected int handlePrevious(int position)
Abstract method which subclasses override to provide the mechanism for finding the previous match in the target text. This allows different subclasses to provide different search algorithms.
If a match is found, the implementation should return the index at which the match starts and should call SearchIterator.setMatchLength(int) with the number of characters in the target text that make up the match. If no match is found, the method should return SearchIterator.DONE.
- Specified by:
handlePrevious in class SearchIterator
- Parameters:
position - The index in the target text at which the search should start.
- Returns:
index at which the match starts; if no match is found, SearchIterator.DONE is returned
- See Also:
SearchIterator.setMatchLength(int)
-
getMask
private static int getMask(int strength)
Getting the mask for collation strength.
- Parameters:
strength - collation strength
- Returns:
collation element mask
-
getCE
private int getCE(int sourcece)
Getting the modified collation elements taking into account the collation attributes.
- Parameters:
sourcece
- Returns:
the modified collation element
-
addToIntArray
private static int[] addToIntArray(int[] destination, int offset, int value, int increments)
Direct port of ICU4C static int32_t * addTouint32_tArray(...) in usearch.cpp (except not taking destination buffer size and status param). This is used for appending a PCE to Pattern.PCE_ buffer. We probably should implement this in Pattern class.
- Parameters:
destination - target array
offset - destination offset to add value
value - to be added
increments - incremental size expected
- Returns:
new destination array, destination if there was no new allocation
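The contract described above (write a value at an offset, grow the array by an increment when needed, and return either the original or the newly allocated array) can be sketched as follows. This is an illustration written from the parameter descriptions only, not the ICU source; the class GrowableAppendDemo, the method append, and the exact growth condition are assumptions.

```java
import java.util.Arrays;

class GrowableAppendDemo {
    // Hypothetical helper: stores value at destination[offset]. If offset lies
    // past the end of the array, a larger copy is allocated first (assumed
    // growth rule: at least 'increments' more slots, and enough to hold offset).
    // Returns the same array when no allocation was needed.
    static int[] append(int[] destination, int offset, int value, int increments) {
        if (offset >= destination.length) {
            int newLength = Math.max(offset + 1, destination.length + increments);
            destination = Arrays.copyOf(destination, newLength);
        }
        destination[offset] = value;
        return destination;
    }

    public static void main(String[] args) {
        int[] table = new int[2];
        table = append(table, 0, 7, 4); // fits, no reallocation
        table = append(table, 1, 8, 4); // fits, no reallocation
        table = append(table, 2, 9, 4); // grows the table
        System.out.println(Arrays.toString(table));
    }
}
```

Callers must always use the returned reference, since the argument array is left behind whenever a reallocation happens.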
-
addToLongArray
private static long[] addToLongArray(long[] destination, int offset, int destinationlength, long value, int increments)
Direct port of ICU4C static int64_t * addTouint64_tArray(...) in usearch.cpp. This is used for appending a PCE to Pattern.PCE_ buffer. We probably should implement this in Pattern class.
- Parameters:
destination - target array
offset - destination offset to add value
destinationlength - target array size
value - to be added
increments - incremental size expected
- Returns:
new destination array, destination if there was no new allocation
-
initializePatternCETable
private int initializePatternCETable()
Initializing the ce table for a pattern. Stores non-ignorable collation keys. Table size will be estimated by the size of the pattern text. Table expansion will be performed as we go along. Adding 1 to ensure that the table size definitely increases.
- Returns:
total number of expansions
-
initializePatternPCETable
private int initializePatternPCETable()
Initializing the pce table for a pattern. Stores non-ignorable collation keys. Table size will be estimated by the size of the pattern text. Table expansion will be performed as we go along. Adding 1 to ensure that the table size definitely increases.
- Returns:
total number of expansions
-
initializePattern
private int initializePattern()
-
initialize
private void initialize()
-
setMatchNotFound
@Deprecated protected void setMatchNotFound()
Deprecated. This API is ICU internal only.
- Overrides:
setMatchNotFound in class SearchIterator
-
isOutOfBounds
private static final boolean isOutOfBounds(int textstart, int textlimit, int offset)
Checks if the offset runs out of the text string range.
- Parameters:
textstart - offset of the first character in the range
textlimit - limit offset of the text string range
offset - to test
- Returns:
true if offset is out of bounds, false otherwise
-
checkIdentical
private boolean checkIdentical(int start, int end)
Checks for identical match.
- Parameters:
start - offset of possible match
end - offset of possible match
- Returns:
true if identical match is found
-
initTextProcessedIter
private boolean initTextProcessedIter()
-
nextBoundaryAfter
private int nextBoundaryAfter(int startIndex)
-
isBreakBoundary
private boolean isBreakBoundary(int index)
-
compareCE64s
private static int compareCE64s(long targCE, long patCE, SearchIterator.ElementComparisonType compareType)
-
search
private boolean search(int startIdx, StringSearch.Match m)
-
codePointAt
private static int codePointAt(java.text.CharacterIterator iter, int index)
-
codePointBefore
private static int codePointBefore(java.text.CharacterIterator iter, int index)
-
searchBackwards
private boolean searchBackwards(int startIdx, StringSearch.Match m)
-
handleNextExact
private boolean handleNextExact()
-
handleNextCanonical
private boolean handleNextCanonical()
-
handleNextCommonImpl
private boolean handleNextCommonImpl()
-
handlePreviousExact
private boolean handlePreviousExact()
-
handlePreviousCanonical
private boolean handlePreviousCanonical()
-
handlePreviousCommonImpl
private boolean handlePreviousCommonImpl()
-
getString
private static final java.lang.String getString(java.text.CharacterIterator text, int start, int length)
Gets a substring out of a CharacterIterator. Java porting note: Not available in ICU4C.
- Parameters:
text - CharacterIterator
start - start offset
length - of substring
- Returns:
substring of text starting at start with length length
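Extracting a substring from a CharacterIterator can be sketched with JDK classes alone. The version below is an illustration of the described behavior, not the ICU source: the class CharIterSubstringDemo and the method substring are hypothetical names, and restoring the iterator's original index afterwards is an assumption about what a well-behaved helper would do. It expects start and start + length to lie within the iterator's range.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class CharIterSubstringDemo {
    // Hypothetical helper: copies 'length' characters beginning at 'start'
    // and then puts the iterator back where the caller left it.
    static String substring(CharacterIterator text, int start, int length) {
        StringBuilder result = new StringBuilder(length);
        int saved = text.getIndex();
        text.setIndex(start);
        for (int i = 0; i < length; i++) {
            result.append(text.current());
            text.next();
        }
        text.setIndex(saved);
        return result.toString();
    }

    public static void main(String[] args) {
        CharacterIterator it = new StringCharacterIterator("hello world");
        System.out.println(substring(it, 6, 5)); // prints world
    }
}
```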
-
-