class TactfulTokenizer::Frag

A fragment is a candidate sentence, delimited solely by the presence of a period. For example, the text “Here in the U.S. Senate we prefer to devour our friends.” is split into the fragments “Here in the U.S.” and “Senate we prefer to devour our friends.”
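
The splitting described above can be sketched as follows. This is only an illustration: `split_fragments` is a hypothetical helper, and it assumes a fragment ends at any period followed by whitespace. The library's actual splitting logic is not shown on this page and may differ.

```ruby
# Hypothetical helper (not part of TactfulTokenizer): break the text after
# every period that is followed by whitespace. The lookbehind keeps the
# period attached to the fragment it ends.
def split_fragments(text)
  text.split(/(?<=\.)\s+/)
end

fragments = split_fragments("Here in the U.S. Senate we prefer to devour our friends.")
# => ["Here in the U.S.", "Senate we prefer to devour our friends."]
```

The first fragment is not a real sentence, which is why each fragment carries a pred attribute estimating whether it actually ends one.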

Attributes

cleaned[RW]

Array of the fragment's words after cleaning.

features[RW]

Array of the fragment's features.

next[RW]

The next word following the fragment.

orig[RW]

The original text of the fragment.

pred[RW]

Probability that the fragment is a sentence.

Public Class Methods

new(orig='')

Create a new fragment.

# File lib/tactful_tokenizer.rb, line 181
def initialize(orig='')
  @orig = orig
  clean(orig)
  @next, @pred, @features = nil, nil, nil
end

Public Instance Methods

clean(s)

Normalizes numbers and discards ambiguous punctuation, then splits the result into an array of words, since in practice only the first and last words are ever accessed.

# File lib/tactful_tokenizer.rb, line 189
def clean(s)
  @cleaned = String.new(s)
  tokenize(@cleaned)
  @cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  @cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  @cleaned.gsub!('--', ' ')
  @cleaned = @cleaned.split
end
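
To see what the substitutions do, here is a minimal standalone sketch. `clean_sketch` is a hypothetical name; it reuses the three `gsub!` calls and the final `split` from the listing but skips the `tokenize` step, whose definition is not shown on this page, so its output may differ from what `clean` stores in `cleaned`.

```ruby
# Hypothetical standalone version of the cleaning steps, without tokenize.
def clean_sketch(s)
  cleaned = String.new(s)
  # Collapse any run of digits, periods and commas ending in a digit.
  cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  # Drop every character outside the allowed set of letters, digits,
  # whitespace and a short list of punctuation marks.
  cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  # Treat a double hyphen as a word separator.
  cleaned.gsub!('--', ' ')
  cleaned.split
end

words = clean_sketch('It cost $1,250.50 (roughly)--so they said.')
# => ["It", "cost", "$<NUM>", "roughly", "so", "they", "said."]
```

The parentheses are removed because they are not in the allowed character set, while the angle brackets of the `<NUM>` placeholder survive because `<` and `>` are.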
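
push_w1_features(w1, model)

Pushes two features for w1 onto features, but only if w1 without its final character is alphabetic: the length of w1, capped at 10, and the model's non_abbrs entry for w1 without its final character.
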
# File lib/tactful_tokenizer.rb, line 198
def push_w1_features w1, model
  if w1.chop.is_alphabetic? 
    features.push "w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"
  end
end
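
push_w2_features(w2, model)

Pushes two features for w2 onto features, but only if w2 without its final character is alphabetic: whether the first character of w2 is upper case, and the model's lower_words entry for the downcased w2.
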
# File lib/tactful_tokenizer.rb, line 204
def push_w2_features w2, model
  if w2.chop.is_alphabetic?
    features.push "w2cap_#{w2[0,1].is_upper_case?}", "w2lower_#{model.lower_words[w2.downcase]}"
  end
end
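
The feature strings built by the two methods above can be illustrated with a small standalone sketch. Everything here except the string formats is an assumption: `Model` is a hypothetical stand-in for the real model object, the counts stored in it are made up, and `alphabetic?` and `upper_case?` only approximate the `is_alphabetic?` and `is_upper_case?` helpers, whose definitions are not shown on this page.

```ruby
# Hypothetical stand-in for the model; the real object is not shown here.
Model = Struct.new(:non_abbrs, :lower_words)

# Assumed behavior of the String helpers used in the listings.
def alphabetic?(s)
  s.match?(/\A[[:alpha:]]+\z/)
end

def upper_case?(s)
  s.match?(/\A[[:upper:]]+\z/)
end

# Mirrors push_w1_features, but returns the features instead of pushing them.
def w1_features(w1, model)
  return [] unless alphabetic?(w1.chop)
  ["w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"]
end

# Mirrors push_w2_features, but returns the features instead of pushing them.
def w2_features(w2, model)
  return [] unless alphabetic?(w2.chop)
  ["w2cap_#{upper_case?(w2[0, 1])}", "w2lower_#{model.lower_words[w2.downcase]}"]
end

# Made-up counts; unknown words default to 0 in this sketch.
non_abbrs = Hash.new(0)
non_abbrs["friends"] = 5
lower_words = Hash.new(0)
model = Model.new(non_abbrs, lower_words)

w1_result = w1_features("friends.", model)
# => ["w1length_8", "w1abbr_5"]
w2_result = w2_features("Senate", model)
# => ["w2cap_true", "w2lower_0"]
```

Each feature is a plain string that combines a feature name with its value, so a word such as “1999.” that fails the alphabetic check contributes no features at all.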