class TactfulTokenizer::Model
A model stores normalized probabilities of different features occurring.
Attributes

  feats       = {feature => normalized probability of feature}
  lower_words = {token => log count of occurrences in lower case}
  non_abbrs   = {token => log count of occurrences when not an abbreviation}
Public Class Methods
Initialize the model. feats, lower_words, and non_abbrs indicate the locations of the respective Marshal dumps.
  # File lib/tactful_tokenizer.rb, line 51
  def initialize(feats="#{File.dirname(__FILE__)}/models/features.mar",
                 lower_words="#{File.dirname(__FILE__)}/models/lower_words.mar",
                 non_abbrs="#{File.dirname(__FILE__)}/models/non_abbrs.mar")
    @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
      File.open(file) do |f|
        Marshal.load(f.read)
      end
    end
    @p0 = @feats["<prior>"] ** 4
  end
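A minimal usage sketch, assuming the gem's bundled Marshal dumps are present at the default paths (the alternate file names in the comment are hypothetical):

  require 'tactful_tokenizer'

  # Load the bundled feature, lower-word, and non-abbreviation dumps.
  model = TactfulTokenizer::Model.new

  # Or point at your own Marshal dumps (hypothetical paths):
  # model = TactfulTokenizer::Model.new('my/features.mar',
  #                                     'my/lower_words.mar',
  #                                     'my/non_abbrs.mar')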
Public Instance Methods
Assign a prediction (a probability, to be precise) to each sentence fragment. For each feature in each fragment we look up its normalized probability and multiply. This is a fairly straightforward Bayesian probabilistic algorithm.
  # File lib/tactful_tokenizer.rb, line 79
  def classify(doc)
    frag, probs, feat = nil, nil, nil
    doc.frags.each do |frag|
      probs = @p0
      frag.features.each do |feat|
        probs *= @feats[feat]
      end
      frag.pred = probs / (probs + 1)
    end
  end
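The running product behaves like an odds score: it starts from the prior (@p0) and is scaled by each feature's normalized probability, and the final probs / (probs + 1) step maps that score into the [0, 1] range. A tiny sketch of the arithmetic with made-up numbers:

  # Hypothetical values: a prior of 0.5 scaled by two feature weights.
  prior = 0.5
  feature_weights = [4.0, 1.5]
  score = feature_weights.reduce(prior) { |acc, w| acc * w }  # => 3.0
  prediction = score / (score + 1)                            # => 0.75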
Get the features of every fragment.
  # File lib/tactful_tokenizer.rb, line 91
  def featurize(doc)
    frag = nil
    doc.frags.each do |frag|
      get_features(frag, self)
    end
  end
Finds the features in a text fragment of the form: ... w1. (sb?) w2 ... Features listed in rough order of importance:

- w1: a word that includes a period.
- w2: the next word, if it exists.
- w1length: the number of alphabetic characters in w1.
- both: w1 and w2 taken together.
- w1abbr: logarithmic count of w1 occurring without a period.
- w2lower: logarithmic count of w2 occurring lowercased.
  # File lib/tactful_tokenizer.rb, line 107
  def get_features(frag, model)
    w1 = (frag.cleaned.last or '')
    w2 = (frag.next or '')
    frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]
    unless w2.empty?
      frag.push_w1_features(w1, model)
      frag.push_w2_features(w2, model)
    end
  end
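A minimal sketch of the first three feature strings, using hypothetical tokens; the w1length, w1abbr, and w2lower features come from the push_w1_features and push_w2_features calls above:

  # "Mr." ends in a period, so "Smith" is a candidate next word.
  w1, w2 = "Mr.", "Smith"
  features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]
  # => ["w1_Mr.", "w2_Smith", "both_Mr._Smith"]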
This function is the only one that'll end up being used.

  m = TactfulTokenizer::Model.new
  m.tokenize_text("Hey, are these two sentences? I bet they should be.")
  # => ["Hey, are these two sentences?", "I bet they should be."]
  # File lib/tactful_tokenizer.rb, line 69
  def tokenize_text(text)
    data = Doc.new(text)
    featurize(data)
    classify(data)
    return data.segment
  end
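Since initialize loads three Marshal dumps, a single Model is typically reused across many texts rather than constructed per call. A small sketch (the input strings are made up):

  m = TactfulTokenizer::Model.new
  texts = ["First sentence. Second sentence?", "Just the one here."]
  sentences = texts.flat_map { |t| m.tokenize_text(t) }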