class TactfulTokenizer::Model

A model stores normalized probabilities of different features occurring.

Attributes

feats[RW]

feats = {feature => normalized probability of feature}

lower_words[RW]

lower_words = {token => log count of occurrences in lower case}

non_abbrs[RW]

non_abbrs = {token => log count of occurrences when not an abbreviation}

Public Class Methods

new(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar")

Initialize the model. feats, lower_words, and non_abbrs indicate the locations of the respective Marshal dumps.

# File lib/tactful_tokenizer.rb, line 51
def initialize(feats="#{File.dirname(__FILE__)}/models/features.mar", lower_words="#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs="#{File.dirname(__FILE__)}/models/non_abbrs.mar")
  @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
    File.open(file) do |f|
      Marshal.load(f.read)
    end
  end
  @p0 = @feats["<prior>"] ** 4  
end
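The model files are plain Marshal dumps of Ruby hashes. A minimal sketch of writing one out and reloading it the same way the initializer does, using made-up feature values rather than the gem's real model data:

```ruby
require "tempfile"

# Hypothetical feature probabilities, standing in for models/features.mar.
feats = { "<prior>" => 0.5, "w1_etc." => 0.1 }

file = Tempfile.new(["features", ".mar"])
file.binmode
file.write(Marshal.dump(feats))
file.close

# Reload it the same way Model#initialize does.
reloaded = File.open(file.path, "rb") { |f| Marshal.load(f.read) }
puts reloaded["<prior>"]  # 0.5
```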

Public Instance Methods

classify(doc)

Assign a prediction (a probability, to be precise) to each sentence fragment. For each feature in each fragment we look up the normalized probability and multiply them together. This is a fairly straightforward naive Bayes-style calculation.

# File lib/tactful_tokenizer.rb, line 79
def classify(doc)
  frag, probs, feat = nil, nil, nil
  doc.frags.each do |frag|
    probs = @p0
    frag.features.each do |feat|
      probs *= @feats[feat]
    end
    frag.pred = probs / (probs + 1)
  end
end
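The last line of classify converts an odds-style score into a probability in (0, 1). A standalone sketch of the same arithmetic, with invented per-feature values (the real values come from the @feats hash):

```ruby
# Prior term raised to the fourth power, as in Model#initialize.
p0 = 0.5 ** 4

# Invented odds ratios for three features of one fragment.
feature_odds = [2.0, 1.5, 0.8]

# Multiply the prior by each feature's odds, as classify does.
probs = feature_odds.reduce(p0) { |acc, p| acc * p }

# Map the accumulated odds to a probability in (0, 1).
pred = probs / (probs + 1)
puts pred.round(4)  # 0.1304
```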
featurize(doc)

Get the features of every fragment.

# File lib/tactful_tokenizer.rb, line 91
def featurize(doc)
  frag = nil
  doc.frags.each do |frag|
    get_features(frag, self)
  end
end
get_features(frag, model)

Finds the features in a text fragment of the form: … w1. (sb?) w2 … Features listed in rough order of importance:

  • w1: a word that includes a period.

  • w2: the next word, if it exists.

  • w1length: the number of alphabetic characters in w1.

  • both: w1 and w2 taken together.

  • w1abbr: logarithmic count of w1 occurring without a period.

  • w2lower: logarithmic count of w2 occurring lowercased.

# File lib/tactful_tokenizer.rb, line 107
def get_features(frag, model)
  w1 = (frag.cleaned.last or '')
  w2 = (frag.next or '')

  frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]

  unless w2.empty?
    frag.push_w1_features(w1, model)
    frag.push_w2_features(w2, model)
  end
end
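To see what the base feature strings look like, here is a sketch that builds them the same way as the first line of get_features, using a Struct as a stand-in for the gem's real fragment class (the field names here are illustrative, not the gem's API):

```ruby
# Illustrative stand-in for a text fragment; the real class lives in the gem.
Frag = Struct.new(:cleaned, :next)

# A fragment whose last cleaned token is "w1." and whose next word is "W2".
frag = Frag.new(["sentence", "w1."], "W2")

w1 = frag.cleaned.last || ""
w2 = frag.next || ""

# The three base features assigned by get_features.
features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]
puts features.inspect  # ["w1_w1.", "w2_W2", "both_w1._W2"]
```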
tokenize_text(text)

This is typically the only method you need to call directly.

m = TactfulTokenizer::Model.new
m.tokenize_text("Hey, are these two sentences? I bet they should be.")
#=> ["Hey, are these two sentences?", "I bet they should be."]

# File lib/tactful_tokenizer.rb, line 69
def tokenize_text(text)
  data = Doc.new(text)
  featurize(data)
  classify(data)
  return data.segment
end