class TactfulTokenizer::Doc

A document represents the input text. It holds a list of fragments generated from the text.

Attributes

frags[RW]

List of fragments.

Public Class Methods

new(text)

Receives a text, which is then broken into fragments. A fragment ends with a period, question mark, or exclamation mark, optionally followed by right-hand punctuation such as quotation marks or closing brackets, and then a whitespace character. Failing that, it accepts something like “I hate cheese\n”: there is no period, but the line break marks the end of the paragraph.

Input assumption: Paragraphs delimited by line breaks.

# File lib/tactful_tokenizer.rb, line 133
def initialize(text)
  @frags = []
  res = nil
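  # Each non-blank line is treated as one paragraph.
  text.each_line do |line|
    unless line.strip.empty?
      # Split after ., ! or ?, any closing quotes, brackets or tags, and one whitespace character.
      # The capture group makes split return the matched fragments themselves.
      line.split(/(.*?[.!?](?:[”"')\]}]|(?:<.*>))*[[:space:]])/u).each do |res|
        unless res.strip.empty?
          frag = Frag.new(res)
          # Link the previous fragment to the first element of this fragment's cleaned list.
          @frags.last.next = frag.cleaned.first unless @frags.empty?
          @frags.push frag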
        end
      end
    end
  end
end
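
The splitting step can be tried on its own. The following is a minimal sketch, not part of the library: raw_fragments and FRAGMENT_PATTERN are hypothetical names, and the helper returns plain strings instead of building Frag objects. The pattern is the one from the listing above, with the right curly quote written as an escape.

```ruby
# The fragment-splitting pattern from Doc#initialize. \u201D is U+201D, the
# right curly double quote that appears literally in the library source.
FRAGMENT_PATTERN = /(.*?[.!?](?:[\u201D"')\]}]|(?:<.*>))*[[:space:]])/u

# Hypothetical helper: returns the raw fragment strings for a text,
# mirroring the loop in Doc#initialize without building Frag objects.
def raw_fragments(text)
  fragments = []
  text.each_line do |line|
    next if line.strip.empty?
    # split keeps the captured group, so each match comes back as a piece;
    # the empty strings between consecutive matches are skipped.
    line.split(FRAGMENT_PATTERN).each do |piece|
      fragments << piece unless piece.strip.empty?
    end
  end
  fragments
end

sample = "Dr. Smith went home. He slept.\n\nI hate cheese\n"
raw_fragments(sample).each { |piece| p piece }
# Prints:
# "Dr. "
# "Smith went home. "
# "He slept.\n"
# "I hate cheese\n"
```

Note that "Dr. " comes out as its own fragment: the split is deliberately over-eager, and rejoining such pieces is left to segment.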

Public Instance Methods

segment()

Segments the text. More precisely, it reassembles the fragments into sentences. A run of fragments is emitted as a sentence when its last fragment's prediction (pred) exceeds 0.5, that is, when it is more likely to end a sentence than not.

# File lib/tactful_tokenizer.rb, line 151
def segment
  sents, sent = [], []
  thresh = 0.5

  frag = nil
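  @frags.each do |frag|
    # Accumulate the original text of each fragment.
    sent.push(frag.orig)
    # Close the current sentence once the fragment's prediction exceeds the threshold.
    if frag.pred && frag.pred > thresh
      break if frag.orig.nil?
      sents.push(sent.join('').strip)
      sent = []
    end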
  end
  sents
end
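
To see the reassembly in isolation, here is a minimal sketch under stated assumptions: StubFrag is a hypothetical stand-in for Frag that carries only orig and pred, reassemble is a hypothetical free-function version of the loop above (without the nil check on orig), and the pred values are invented for illustration rather than produced by the library's classifier.

```ruby
# Hypothetical stand-in for TactfulTokenizer::Frag: only the two attributes
# that segment reads. The pred values used below are invented.
StubFrag = Struct.new(:orig, :pred)

# Same reassembly logic as Doc#segment, written as a free function.
def reassemble(frags, thresh = 0.5)
  sents = []
  sent = []
  frags.each do |frag|
    sent.push(frag.orig)
    # Close the sentence when the prediction exceeds the threshold.
    if frag.pred && frag.pred > thresh
      sents.push(sent.join.strip)
      sent = []
    end
  end
  sents
end

frags = [
  StubFrag.new("Dr. ", 0.1),               # low score: keep accumulating
  StubFrag.new("Smith went home. ", 0.9),  # high score: sentence ends here
  StubFrag.new("He slept.\n", 0.95)
]
p reassemble(frags)
# Prints: ["Dr. Smith went home.", "He slept."]
```

In this sketch, as in the listing, fragments that follow the last above-threshold fragment stay in sent and are not returned.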