Class UnicodeSetStringSpan

java.lang.Object
com.ibm.icu.impl.UnicodeSetStringSpan

public class UnicodeSetStringSpan extends Object
  • Field Details

    • WITH_COUNT

      public static final int WITH_COUNT
    • FWD

      public static final int FWD
    • BACK

      public static final int BACK
    • CONTAINED

      public static final int CONTAINED
    • NOT_CONTAINED

      public static final int NOT_CONTAINED
    • ALL

      public static final int ALL
    • FWD_UTF16_CONTAINED

      public static final int FWD_UTF16_CONTAINED
    • FWD_UTF16_NOT_CONTAINED

      public static final int FWD_UTF16_NOT_CONTAINED
    • BACK_UTF16_CONTAINED

      public static final int BACK_UTF16_CONTAINED
    • BACK_UTF16_NOT_CONTAINED

      public static final int BACK_UTF16_NOT_CONTAINED
    • ALL_CP_CONTAINED

      static final short ALL_CP_CONTAINED
      Special spanLength short values (used because Java has no unsigned byte type). All code points in the string are contained in the parent set.
    • LONG_SPAN

      static final short LONG_SPAN
      The spanLength is >=0xfe.
    • spanSet

      private UnicodeSet spanSet
      Set for span(). Same as parent but without strings.
    • spanNotSet

      private UnicodeSet spanNotSet
      Set for span(not contained). Same as spanSet, plus characters that start or end strings.
    • strings

      private ArrayList<String> strings
      The strings of the parent set.
    • spanLengths

      private short[] spanLengths
      The lengths of span(), spanBack() etc. for each string.
    • maxLength16

      private final int maxLength16
      Maximum lengths of relevant strings.
    • someRelevant

      private boolean someRelevant
      Are there strings that are not fully contained in the code point set?
    • all

      private boolean all
      Set up for all variants of span()?
    • offsets

      Span helper
  • Constructor Details

    • UnicodeSetStringSpan

      public UnicodeSetStringSpan(UnicodeSet set, ArrayList<String> setStrings, int which)
      Constructs for all variants of span(), or only for any one variant. Initializes as little as possible, for single use.
    • UnicodeSetStringSpan

      public UnicodeSetStringSpan(UnicodeSetStringSpan otherStringSpan, ArrayList<String> newParentSetStrings)
      Constructs a copy of an existing UnicodeSetStringSpan. Assumes which==ALL for a frozen set.
  • Method Details

    • needsStringSpanUTF16

      public boolean needsStringSpanUTF16()
      Do the strings need to be checked in span() etc.?
      Returns:
      true if strings need to be checked (call span() here), false if not (use a BMPSet for best performance).
    • contains

      public boolean contains(int c)
      For fast UnicodeSet::contains(c).
    • addToSpanNotSet

      private void addToSpanNotSet(int c)
      Adds a starting or ending string character to the spanNotSet so that a character span ends before any string.
    • span

      public int span(CharSequence s, int start, UnicodeSet.SpanCondition spanCondition)
      Spans a string.
      Parameters:
      s - The string to be spanned
      start - The index at which the span begins
      spanCondition - The span condition
      Returns:
      the limit (exclusive end) of the span
    • spanWithStrings

      private int spanWithStrings(CharSequence s, int start, int spanLimit, UnicodeSet.SpanCondition spanCondition)
      Synchronized method for complicated spans using the offsets. Avoids synchronization for simple cases.
      Parameters:
      spanLimit - Equal to spanSet.span(s, start, CONTAINED)
    • spanAndCount

      public int spanAndCount(CharSequence s, int start, UnicodeSet.SpanCondition spanCondition, OutputInt outCount)
      Spans a string and counts the smallest number of set elements on any path across the span.

      For proper counting, we cannot ignore strings that are fully contained in code point spans.

      If the set does not have any fully-contained strings, then we could optimize this like span(), but such sets are likely rare, and this is at least still linear.

      Parameters:
      s - The string to be spanned
      start - The index at which the span begins
      spanCondition - The span condition
      outCount - The count
      Returns:
      the limit (exclusive end) of the span
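
      The counting described above can be illustrated with a small dynamic-programming sketch for the CONTAINED case. This is not the ICU implementation: the class name `SpanAndCountSketch`, the `IntPredicate` standing in for the code point set, and the `{limit, count}` return array are all hypothetical, and the input is assumed to be well-formed UTF-16 with `0 <= start <= s.length()`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch, not the ICU code: an IntPredicate stands in for the
// code point part of the set, and a List<String> for the set strings.
final class SpanAndCountSketch {
    /**
     * Returns {limit, count}: the furthest position reachable from start by
     * concatenating set elements, and the smallest number of elements on any
     * path that reaches that position.
     */
    static int[] spanContainedAndCount(IntPredicate codePoints, List<String> strings,
                                       CharSequence s, int start) {
        int length = s.length();
        int[] minCount = new int[length + 1];
        Arrays.fill(minCount, Integer.MAX_VALUE); // MAX_VALUE marks "not reachable"
        minCount[start] = 0;
        for (int pos = start; pos < length; pos++) {
            if (minCount[pos] == Integer.MAX_VALUE) {
                continue; // no path of set elements ends here
            }
            int next = minCount[pos] + 1;
            // Try the single code point at pos.
            int c = Character.codePointAt(s, pos);
            if (codePoints.test(c)) {
                int end = pos + Character.charCount(c);
                minCount[end] = Math.min(minCount[end], next);
            }
            // Try every set string at pos, including ones that lie entirely
            // inside a code point span, since they can lower the count.
            for (String t : strings) {
                int end = pos + t.length();
                if (!t.isEmpty() && end <= length
                        && t.contentEquals(s.subSequence(pos, end))) {
                    minCount[end] = Math.min(minCount[end], next);
                }
            }
        }
        // The span limit is the furthest reachable position.
        int limit = length;
        while (minCount[limit] == Integer.MAX_VALUE) {
            limit--;
        }
        return new int[] {limit, minCount[limit]};
    }
}
```

      The sketch uses a full array of counts for clarity and visits each position once, trying every set string there; it shows why fully contained strings cannot be skipped when counting.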
    • spanContainedAndCount

      private int spanContainedAndCount(CharSequence s, int start, OutputInt outCount)
    • spanBack

      public int spanBack(CharSequence s, int length, UnicodeSet.SpanCondition spanCondition)
      Spans a string backwards.
      Parameters:
      s - The string to be spanned
      spanCondition - The span condition
      Returns:
      The string index which starts the span (i.e. inclusive).
    • spanNot

      private int spanNot(CharSequence s, int start, OutputInt outCount)
      Algorithm for spanNot()==span(SpanCondition.NOT_CONTAINED)

      Theoretical algorithm:
      - Iterate through the string, and at each code point boundary:
        + If the code point there is in the set, then return with the current position.
        + If a set string matches at the current position, then return with the current position.

      Optimized implementation: (Same assumption as for span() above.)

      Create and cache a spanNotSet which contains all of the single code points of the original set but none of its strings. For each set string add its initial code point to the spanNotSet. (Also add its final code point for spanNotBack().)
      - Loop:
        + Do spanLength=spanNotSet.span(SpanCondition.NOT_CONTAINED).
        + If the current code point is in the original set, then return the current position.
        + If any set string matches at the current position, then return the current position.
        + If there is no match at the current position, neither for the code point there nor for any set string, then skip this code point and continue the loop. This happens for set-string-initial code points that were added to spanNotSet when there is not actually a match for such a set string.
      Parameters:
      s - The string to be spanned
      start - The index at which the span begins
      outCount - If not null: Receives the number of code points across the span.
      Returns:
      the limit (exclusive end) of the span
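
      As an illustration only, the theoretical algorithm described for spanNot() might look like the following. The class `SpanNotSketch` and its helper `matchesAt` are hypothetical names, an `IntPredicate` stands in for the code point part of the set, and well-formed UTF-16 is assumed so that a string matched at a code point boundary also ends on one. The optimized variant with a cached spanNotSet is not shown.

```java
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of the theoretical (unoptimized) algorithm only.
final class SpanNotSketch {
    /** Returns the first position at or after start where a set element matches. */
    static int spanNot(IntPredicate codePoints, List<String> strings,
                       CharSequence s, int start) {
        int length = s.length();
        int pos = start;
        while (pos < length) {
            int c = Character.codePointAt(s, pos);
            if (codePoints.test(c)) {
                return pos; // the code point here is in the set
            }
            for (String t : strings) {
                if (matchesAt(s, pos, t)) {
                    return pos; // a set string matches here
                }
            }
            pos += Character.charCount(c); // advance to the next code point boundary
        }
        return length; // nothing in the set matched anywhere
    }

    /** True if t is non-empty and occurs in s starting at pos. */
    private static boolean matchesAt(CharSequence s, int pos, String t) {
        if (t.isEmpty() || pos + t.length() > s.length()) {
            return false;
        }
        for (int i = 0; i < t.length(); i++) {
            if (s.charAt(pos + i) != t.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

      This version tests every set string at every code point boundary. The optimized implementation described above avoids that by first skipping, via spanNotSet, to positions where a code point or the start of a string could match.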
    • spanNotBack

      private int spanNotBack(CharSequence s, int length)
    • makeSpanLengthByte

      static short makeSpanLengthByte(int spanLength)
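
      The method carries no description here. Going by the LONG_SPAN field description ("The spanLength is >=0xfe"), a plausible reading is that it clamps a span length so that it fits an unsigned byte value held in a short. The sketch below shows that reading; the class name is hypothetical and the value 0xff for ALL_CP_CONTAINED is an assumption that this page does not state.

```java
// Hypothetical sketch of one plausible reading; not taken from the ICU source.
final class SpanLengthSketch {
    // Assumed value: this page does not give the constant's value.
    static final short ALL_CP_CONTAINED = 0xff;
    // From the field description: stands for any spanLength >= 0xfe.
    static final short LONG_SPAN = 0xfe;

    /** Keeps short lengths as-is and maps every longer one to LONG_SPAN. */
    static short makeSpanLengthByte(int spanLength) {
        return spanLength < LONG_SPAN ? (short) spanLength : LONG_SPAN;
    }
}
```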
    • matches16

      private static boolean matches16(CharSequence s, int start, String t, int length)
    • matches16CPB

      static boolean matches16CPB(CharSequence s, int start, int limit, String t, int tlength)
      Compare 16-bit Unicode strings (which may be malformed UTF-16) at code point boundaries. That is, each edge of a match must not be in the middle of a surrogate pair.
      Parameters:
      s - The string to match in.
      start - The start index of s.
      limit - The limit of the subsequence of s being spanned.
      t - The substring to be matched in s.
      tlength - The length of t.
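
      A minimal sketch of the boundary check, assuming only the behavior described above: the code units must match, and neither edge of the match may split a surrogate pair. The class name `Matches16Sketch` is hypothetical, and the bounds guard in `matches16` is an addition for safety in this standalone example.

```java
// Hypothetical sketch; mirrors the documented signatures but is not the ICU code.
final class Matches16Sketch {
    /** Plain code unit comparison of t[0..length) against s starting at start. */
    static boolean matches16(CharSequence s, int start, String t, int length) {
        if (start < 0 || start + length > s.length()) {
            return false; // guard added for this sketch
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(start + i) != t.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** As matches16, but rejects matches whose edges fall inside a surrogate pair. */
    static boolean matches16CPB(CharSequence s, int start, int limit, String t, int tlength) {
        if (!matches16(s, start, t, tlength)) {
            return false;
        }
        // Leading edge: a high surrogate before start paired with a low one at start.
        boolean splitsAtStart = start > 0
                && Character.isHighSurrogate(s.charAt(start - 1))
                && Character.isLowSurrogate(s.charAt(start));
        // Trailing edge: the match ends on a high surrogate whose low half follows.
        int end = start + tlength;
        boolean splitsAtEnd = end < limit
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end));
        return !splitsAtStart && !splitsAtEnd;
    }
}
```

      For example, with s = "x\uD83D\uDE00y" a lone "\uDE00" has matching code units at index 2, but the match is rejected because it begins in the middle of the pair.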
    • spanOne

      static int spanOne(UnicodeSet set, CharSequence s, int start, int length)
      Does the set contain the next code point? If so, return its length; otherwise return its negative length.
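
      The signed-length convention can be sketched as follows. An `IntPredicate` stands in for the UnicodeSet, the class name `SpanOneSketch` is hypothetical, and `length` is assumed to be the number of code units available from `start`.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch; an IntPredicate replaces UnicodeSet.contains(int).
final class SpanOneSketch {
    /**
     * Returns +1 or +2 if the code point at start is in the set (its length in
     * code units), or -1 or -2 if it is not.
     */
    static int spanOne(IntPredicate set, CharSequence s, int start, int length) {
        char c = s.charAt(start);
        if (Character.isHighSurrogate(c) && length >= 2) {
            char c2 = s.charAt(start + 1);
            if (Character.isLowSurrogate(c2)) {
                // A well-formed surrogate pair: test the supplementary code point.
                int supplementary = Character.toCodePoint(c, c2);
                return set.test(supplementary) ? 2 : -2;
            }
        }
        // A BMP code point or an unpaired surrogate counts as one code unit.
        return set.test(c) ? 1 : -1;
    }
}
```

      Returning the length with a sign lets a caller advance by the absolute value in either case while still learning whether the code point was contained.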
    • spanOneBack

      static int spanOneBack(UnicodeSet set, CharSequence s, int length)