class TeRex::Classifier::BayesData

Public Class Methods

clean(text)

Return text with dates, times, and money amounts replaced by the tokens `datetime` and `moneyterm`, punctuation and extra whitespace removed, and ordinal terms (23rd, 42nd) removed. At one point we were replacing every non-word character except whitespace (`/[^\w\s]/`), like so: `gsub(/[^\w\s]/, "")`, but I took that out because it removed some punctuation needed to distinguish some classes.

# File lib/te_rex/bayes_data.rb, line 48
def clean(text)
  dt = date_time(text)
  mt = money_term(dt)
  rp = remove_punct(mt)
  sp = remove_big_space(rp)
  ss = remove_space_seq(sp)
  remove_cardinal(ss)
end
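
The order of the steps matters, so a standalone trace may help. The sketch below copies the patterns and replacement strings from the listings on this page into a hypothetical helper, `trace_clean`, and prints the text after each step. The sample sentence is an assumption chosen for illustration, and the helper is not part of TeRex.

```ruby
# Standalone trace of the clean pipeline. Patterns and replacement strings are
# copied from the listings on this page; trace_clean itself is hypothetical.
def trace_clean(text)
  steps = [
    [:date_time,        /(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime'],
    [:money_term,       /(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm'],
    [:remove_punct,     /(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' '],
    [:remove_big_space, /\n|\t|\r/, ' '],
    [:remove_space_seq, /\s{2,}/, ' '],
    [:remove_cardinal,  /[0-9]{2}[a-z,A-Z]{2}/, '']
  ]

  # Feed the output of each step into the next, as clean does.
  steps.reduce(text) do |s, (name, pattern, replacement)|
    out = s.gsub(pattern, replacement)
    puts format('%-17s %p', name, out) # text after this step
    out
  end
end

trace_clean("Paid $423.89 on 02-23-14,\tthanks!")
# final value: "Paid moneyterm on datetime thanks "
```

Because `money_term` runs before `remove_punct`, the dollar sign and decimal point are still present when the amount is matched. The trailing space in the result comes from the `!` being replaced by a space.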
clean_filtered_index(text)

Return a filtered word-frequency index of the cleaned text: punctuation is stripped, and stop words and short words are dropped.

# File lib/te_rex/bayes_data.rb, line 63
def clean_filtered_index(text)
  filtered_index clean(text).split
end
clean_naive_index(text)

Return a word-frequency index of the cleaned text without downcasing, stemming, or stop-list filtering.

# File lib/te_rex/bayes_data.rb, line 68
def clean_naive_index(text)
  naive_index clean(text).split
end
clean_stemmed_filtered_index(text)

Return a filtered word-frequency index of stemmed words from the cleaned text, without extra punctuation or short words.

# File lib/te_rex/bayes_data.rb, line 58
def clean_stemmed_filtered_index(text)
  stemmed_filtered_index clean(text).split
end
date_time(s)

Replace dates and times with the token `datetime` (09MAR04, 02-23-14, 2014/03/05, 10:30). A run of digits at the start of a line is also replaced.

# File lib/te_rex/bayes_data.rb, line 29
def date_time(s)
  s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
end
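
To see what the pattern matches in practice, here is a small standalone sketch. The pattern is copied from the listing above; the method name `date_time_sketch` and the sample strings are assumptions for illustration.

```ruby
# Pattern copied from date_time above; the method name is hypothetical.
def date_time_sketch(s)
  s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
end

puts date_time_sketch('Shipped 09MAR04') # Shipped datetime
puts date_time_sketch('due 02-23-14')    # due datetime
puts date_time_sketch('on 2014/03/05')   # on datetime
puts date_time_sketch('at 10:30')        # at datetime
puts date_time_sketch('at 9AM')          # atdatetime (the AM/PM branch also consumes the leading space)
puts date_time_sketch('42 units')        # datetime units (leading digits match the first branch)
```

The last two lines show side effects of the alternation that are easy to miss: the `(\s\d+(AM|PM))` branch includes the whitespace in the match, and `(^\d+)` replaces any digits that start a line, whether or not they belong to a date.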
index_frequency(text)

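Return a Hash index of word => instance_count. Each word in the string is interned as a Symbol and mapped to its count in the document. The stemmed index and the unstemmed filtered index are merged; where both contain the same key, the count from the unstemmed filtered index is kept.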

# File lib/te_rex/bayes_data.rb, line 40
def index_frequency(text)
  cfi = clean_stemmed_filtered_index(text)
  cni = clean_filtered_index(text)
  cfi.merge(cni)
end
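
The effect of `Hash#merge` on overlapping keys can be sketched without a stemmer. The two hashes below are hand-built stand-ins (hypothetical counts) for what the stemmed and unstemmed indexes might contain for the text "book books"; `merge_indexes_sketch` is an illustrative name, not part of TeRex.

```ruby
# Stand-in for the last line of index_frequency: plain Hash#merge with no
# block, so values from the second hash replace values from the first.
def merge_indexes_sketch(stemmed, filtered)
  stemmed.merge(filtered)
end

stemmed  = { book: 2 }            # assumed: "book" and "books" both stem to :book
filtered = { book: 1, books: 1 }  # assumed: unstemmed counts for the same words

merged = merge_indexes_sketch(stemmed, filtered)
p merged # {:book=>1, :books=>1} on older Rubies; newer versions print {book: 1, books: 1}
```

The stemmed count of 2 for `:book` is replaced by the unstemmed count of 1. If a different policy were wanted, `Hash#merge` also accepts a block that receives the key and both values.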
money_term(s)

Replace money amounts with the token `moneyterm` ($60, 120.00, $423.89). Any bare decimal number matches, with or without a dollar sign.

# File lib/te_rex/bayes_data.rb, line 34
def money_term(s)
  s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
end
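
A short standalone sketch of the three branches of the pattern follows. The pattern is copied from the listing above; `money_term_sketch` and the sample inputs are assumptions for illustration.

```ruby
# Pattern copied from money_term above; the method name is hypothetical.
def money_term_sketch(s)
  s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
end

puts money_term_sketch('$60')         # moneyterm
puts money_term_sketch('120.00')      # moneyterm
puts money_term_sketch('$423.89')     # moneyterm
puts money_term_sketch('version 2.5') # version moneyterm (any decimal matches)
puts money_term_sketch('$1,200.50')   # moneyterm,moneyterm (the comma splits the amount)
```

The last two inputs show that the pattern has no notion of thousands separators and cannot tell a price from any other decimal number.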
remove_big_space(s)

Replace each tab, newline, or carriage return with a single space.

# File lib/te_rex/bayes_data.rb, line 14
def remove_big_space(s)
  s.gsub(/\n|\t|\r/,' ')
end
remove_cardinal(s)

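Remove ordinal terms (23rd, 42nd). The pattern requires two digits directly before the two-letter suffix, so a one-digit ordinal such as 1st is left in place.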

# File lib/te_rex/bayes_data.rb, line 24
def remove_cardinal(s)
  s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
end
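
The two-digit requirement is easiest to see on a few inputs. The pattern below is copied from the listing above; `remove_cardinal_sketch` and the samples are assumptions for illustration.

```ruby
# Pattern copied from remove_cardinal above; the method name is hypothetical.
def remove_cardinal_sketch(s)
  s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
end

p remove_cardinal_sketch('the 23rd of May') # "the  of May"
p remove_cardinal_sketch('42nd')            # ""
p remove_cardinal_sketch('1st place')       # "1st place" (only one digit, so no match)
p remove_cardinal_sketch('101st')           # "1" (only "01st" is removed)
```

Note that the character class `[a-z,A-Z]` also contains a literal comma. Inside `clean` this has no effect, since commas have already been replaced by `remove_punct` when this step runs.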
remove_punct(s)

Replace each explicitly listed punctuation character with a space. Hyphens and apostrophes are not in the list, so they are kept.

# File lib/te_rex/bayes_data.rb, line 9
def remove_punct(s)
  s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
end
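
As a quick check of which characters survive, here is a standalone sketch. The pattern is copied from the listing above; `remove_punct_sketch` and the sample sentence are assumptions for illustration.

```ruby
# Pattern copied from remove_punct above; the method name is hypothetical.
def remove_punct_sketch(s)
  s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
end

p remove_punct_sketch("Hello, world! (it's well-known)")
# "Hello  world   it's well-known "
```

Each listed character becomes one space, so runs of spaces appear wherever punctuation sat next to whitespace. In `clean`, the later `remove_space_seq` step collapses those runs.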
remove_space_seq(s)

Collapse each run of two or more whitespace characters into a single space.

# File lib/te_rex/bayes_data.rb, line 19
def remove_space_seq(s)
  s.gsub(/\s{2,}/,' ')
end
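
The following sketch shows how this step combines with `remove_big_space`. Both patterns are copied from the listings on this page; the `_sketch` method names and the sample string are assumptions for illustration.

```ruby
# Patterns copied from remove_big_space and remove_space_seq; names are hypothetical.
def remove_big_space_sketch(s)
  s.gsub(/\n|\t|\r/, ' ')
end

def remove_space_seq_sketch(s)
  s.gsub(/\s{2,}/, ' ')
end

raw = "one\ntwo\t\tthree   four"

p remove_space_seq_sketch(raw)                          # "one\ntwo three four"
p remove_space_seq_sketch(remove_big_space_sketch(raw)) # "one two three four"
```

On its own, `remove_space_seq` leaves a single newline or tab untouched because the pattern needs at least two whitespace characters. Running `remove_big_space` first turns those into spaces, which is the order `clean` uses.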

Private Class Methods

filtered_index(word_array)

Downcase, filter against the stop list, and ignore words of 3 characters or fewer.

# File lib/te_rex/bayes_data.rb, line 87
def filtered_index(word_array)
  idx = Hash.new(0)
  word_array.each do |word|
    word.downcase!
    if !TeRex::StopWord::LIST.include?(word) && word.length > 3
      idx[word.intern] += 1
    end
  end

  idx
end
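
A standalone sketch of the same filtering logic, for trying the length threshold on sample input. The contents of `TeRex::StopWord::LIST` are not shown on this page, so the sketch takes a hypothetical stop list as a parameter; `filtered_index_sketch` is likewise an illustrative name.

```ruby
# Mirrors the filter in filtered_index. The stop list is passed in because the
# real TeRex::StopWord::LIST is not shown here (assumption).
def filtered_index_sketch(word_array, stop_list)
  idx = Hash.new(0)
  word_array.each do |word|
    w = word.downcase # non-mutating; the listing uses downcase! on the caller's strings
    idx[w.intern] += 1 if !stop_list.include?(w) && w.length > 3
  end
  idx
end

words = %w[The quick brown fox jumps over the lazy dogs]
p filtered_index_sketch(words, %w[over])
# keys kept: :quick, :brown, :jumps, :lazy, :dogs (each with count 1)
```

"The", "fox", and "the" are dropped by the length check (`length > 3`), and "over" is dropped only because it is in the sample stop list.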
naive_index(word_array)

Count everything in the word array.

# File lib/te_rex/bayes_data.rb, line 100
def naive_index(word_array)
  idx = Hash.new(0)
  word_array.each do |word|
    idx[word.intern] += 1
  end

  idx
end
stemmed_filtered_index(word_array)

Downcase, filter against the stop list, ignore single-character words, and stem the remaining words.

# File lib/te_rex/bayes_data.rb, line 74
def stemmed_filtered_index(word_array)
  idx = Hash.new(0)
  word_array.each do |word|
    word.downcase!
    if !TeRex::StopWord::LIST.include?(word) && word.length > 1
      idx[word.stem.intern] += 1
    end
  end

  idx
end