Class SimplePatternTokenizerFactory


public class SimplePatternTokenizerFactory extends TokenizerFactory
Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.

This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:

  • "pattern" (required) is the regular expression, according to the syntax described at RegExp
  • "determinizeWorkLimit" (optional, default 10000) is the limit on total effort spent to determinize the automaton computed from the regexp

The pattern matches the characters to include in a token (not the split characters), and the matching is greedy such that the longest token matching at a given point is created. Empty tokens are never created.

For example, to match tokens delimited by simple whitespace characters:

 <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
   </analyzer>
 </fieldType>
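The greedy, longest-match behavior described above can be sketched with standard java.util.regex as a stand-in (the demo class and its tokenize helper are hypothetical; the actual SimplePatternTokenizer compiles the pattern to a determinized automaton rather than using java.util.regex):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyTokenDemo {
    // Illustrative only: java.util.regex stands in for Lucene's RegExp
    // automaton. The pattern describes the characters to INCLUDE in a
    // token, not the split characters.
    static List<String> tokenize(String regexp, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regexp).matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {  // empty tokens are never created
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same pattern as the fieldType example: runs of non-whitespace
        System.out.println(tokenize("[^ \\t\\r\\n]+", "hello\tworld  foo"));
        // Greedy matching: the longest token at each point is created,
        // so "aaa" is one token, not three "a" tokens
        System.out.println(tokenize("a+", "aaab aa"));
    }
}
```

With the whitespace pattern, "hello\tworld  foo" yields the tokens hello, world, and foo; the double space produces no empty token between world and foo.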
Since:
6.5.0
  • Field Details

  • Constructor Details

    • SimplePatternTokenizerFactory

      public SimplePatternTokenizerFactory(Map<String,String> args)
      Creates a new SimplePatternTokenizerFactory
    • SimplePatternTokenizerFactory

      public SimplePatternTokenizerFactory()
      Default constructor for compatibility with SPI
  • Method Details