class Tokkens::Tokenizer

Converts a string to a list of token numbers.

Useful for computing with text, for example in machine learning. Before using the tokenizer, you are expected to have pre-processed the text as your application requires, for example by converting it to lowercase, removing non-word characters, and transliterating accented characters.

This class then splits the string into tokens on whitespace and removes tokens that do not pass the selection criteria (minimum length and stop words).
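The pre-processing steps mentioned above could look roughly like the following sketch. The `preprocess` helper is hypothetical and not part of Tokkens; it only illustrates one way to lowercase text, strip accents, and drop non-word characters with core Ruby string methods before handing the result to the tokenizer.

```ruby
# Hypothetical helper, not part of Tokkens: one possible pre-processing pass.
def preprocess(text)
  text
    .unicode_normalize(:nfd)    # split accented characters into base + combining mark
    .gsub(/\p{Mn}/, '')         # drop the combining marks, so "é" becomes "e"
    .downcase
    .gsub(/[^a-z0-9\s]/, ' ')   # replace remaining non-word characters with spaces
    .squeeze(' ')               # collapse runs of spaces
    .strip
end

cleaned = preprocess("Caf\u00e9, Ol\u00e9!")
# cleaned is now "cafe ole"
```

Which steps are appropriate depends on the application; this pass, for instance, discards any character outside a-z and 0-9, which would not suit non-Latin text.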

Constants

MIN_LENGTH

default minimum token length

STOP_WORDS

stop words to ignore by default (none)

Attributes

min_length[R]

@return [Fixnum] minimum length for tokens

stop_words[R]

@return [Array<String>] stop words to ignore

tokens[R]

@return [Tokens] object to use for obtaining tokens

Public Class Methods

new(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS)

Create a new tokenizer

@param tokens [Tokens] object to use for obtaining token numbers

@param min_length [Fixnum] minimum length for tokens

@param stop_words [Array<String>] stop words to ignore

# File lib/tokkens/tokenizer.rb, line 35
def initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS)
  @tokens = tokens || Tokens.new
  @stop_words = stop_words
  @min_length = min_length
end

Public Instance Methods

get(s, **kwargs)

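Tokenizes a string and looks up the number of each token that passes the selection criteria. Returns an empty array when the string is nil or blank; tokens for which the tokens object returns nil are dropped.

@param s [String] pre-processed text to tokenize

@param kwargs passed through to the get method of the tokens object

@return [Array<Fixnum>] array of token numbers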

# File lib/tokkens/tokenizer.rb, line 42
def get(s, **kwargs)
  return [] if !s || s.strip == ''
  tokenize(s).map {|token| @tokens.get(token, **kwargs) }.compact
end
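To illustrate how get turns text into numbers, here is a minimal, self-contained sketch. `SketchTokens` is a hypothetical stand-in for Tokkens::Tokens that simply hands out sequential numbers; the real class may number tokens differently or return nil for some input. `SketchTokenizer` mirrors the logic listed in this document, and its default `min_length` of 2 is an assumption, since the value of MIN_LENGTH is not stated here.

```ruby
# Hypothetical stand-in for Tokkens::Tokens: hands out sequential numbers.
# The real class may behave differently.
class SketchTokens
  def initialize
    @numbers = {}
  end

  # Returns the number for a token, assigning the next free one on first use.
  def get(token, **_kwargs)
    @numbers[token] ||= @numbers.size + 1
  end
end

# Mirrors the tokenizer logic shown in this document, for illustration only.
class SketchTokenizer
  # min_length: 2 is an assumed default, not the documented MIN_LENGTH.
  def initialize(tokens, min_length: 2, stop_words: [])
    @tokens = tokens
    @min_length = min_length
    @stop_words = stop_words
  end

  def get(s, **kwargs)
    return [] if !s || s.strip == ''
    s.split
     .select { |t| t.length >= @min_length && !@stop_words.include?(t) }
     .map { |t| @tokens.get(t, **kwargs) }
     .compact
  end
end

tokenizer = SketchTokenizer.new(SketchTokens.new, stop_words: %w[the])
numbers = tokenizer.get('the quick fox saw the quick dog')
# numbers is [1, 2, 3, 1, 4]: "the" is skipped and "quick" maps to 1 both times
```

Because the same tokens object is reused across calls, a repeated word keeps its number, which is what makes the output usable as features.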

Private Instance Methods

include?(s)

@return [Boolean] whether the token meets the minimum length and is not a stop word

# File lib/tokkens/tokenizer.rb, line 53
def include?(s)
  s.length >= @min_length && !@stop_words.include?(s)
end

tokenize(s)

@return [Array<String>] whitespace-separated tokens that pass include?

# File lib/tokkens/tokenizer.rb, line 49
def tokenize(s)
  s.split.select(&method(:include?))
end
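The selection step can be tried in isolation with plain Ruby. In this sketch the values of `min_length` and `stop_words` are made up for illustration; the filter itself is the same length and stop-word check used by include? above.

```ruby
# Illustrative settings, not the library defaults.
min_length = 2
stop_words = %w[the and]

# String#split without arguments splits on runs of whitespace.
words = "the cat  and a dog".split

kept = words.select { |w| w.length >= min_length && !stop_words.include?(w) }
# kept is ["cat", "dog"]: "the" and "and" are stop words, "a" is too short
```

Note that the stop-word comparison is exact, so "The" would not match "the" unless the text was lowercased beforehand.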