class TactfulTokenizer::Doc
A document represents the input text. It holds a list of fragments generated from the text.
Attributes
frags[RW]
List of fragments.
Public Class Methods
new(text)
Receives a text, which is then broken into fragments. A fragment ends with a period, question mark, or exclamation mark, possibly followed by right-hand punctuation such as quotation marks or closing brackets, and then trailing whitespace. Failing that, a line with no terminal punctuation is also accepted as a fragment, for example “I hate cheese\n”. It doesn't have a period, but it is the end of a paragraph.
Input assumption: Paragraphs delimited by line breaks.
# File lib/tactful_tokenizer.rb, line 133
def initialize(text)
  @frags = []
  res = nil
  text.each_line do |line|
    unless line.strip.empty?
      line.split(/(.*?[.!?](?:[”"')\]}]|(?:<.*>))*[[:space:]])/u).each do |res|
        unless res.strip.empty?
          frag = Frag.new(res)
          @frags.last.next = frag.cleaned.first unless @frags.empty?
          @frags.push frag
        end
      end
    end
  end
end
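To see what the fragment boundaries look like in practice, the following is a minimal sketch that applies the same regular expression to a short sample. The names FRAGMENT_PATTERN and split_fragments are hypothetical and not part of TactfulTokenizer. The sketch only produces raw strings and skips the Frag construction and linking done by the constructor.

```ruby
# Hypothetical helper, not part of TactfulTokenizer: returns the raw
# fragment strings that Doc#initialize would wrap in Frag objects.
FRAGMENT_PATTERN = /(.*?[.!?](?:[”"')\]}]|(?:<.*>))*[[:space:]])/u

def split_fragments(text)
  fragments = []
  text.each_line do |line|
    next if line.strip.empty?
    # Because the pattern is one capture group, split keeps each match
    # in its result; the empty strings between matches are discarded.
    line.split(FRAGMENT_PATTERN).each do |piece|
      fragments << piece unless piece.strip.empty?
    end
  end
  fragments
end

sample = "Hello there. How are you?\nI hate cheese\n"
split_fragments(sample).each { |fragment| p fragment }
# prints:
# "Hello there. "
# "How are you?\n"
# "I hate cheese\n"
```

The last line has no terminal punctuation, so the pattern never matches and the whole line comes back as a single fragment, which is the end-of-paragraph case described above.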
Public Instance Methods
segment()
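Segments the text. More precisely, it reassembles the fragments into sentences. A run of fragments is closed off as a sentence whenever the prediction for its last fragment exceeds the threshold of 0.5, that is, whenever it is more likely to end a sentence than not.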
# File lib/tactful_tokenizer.rb, line 151
def segment
  sents, sent = [], []
  thresh = 0.5
  frag = nil
  @frags.each do |frag|
    sent.push(frag.orig)
    if frag.pred && frag.pred > thresh
      break if frag.orig.nil?
      sents.push(sent.join('').strip)
      sent = []
    end
  end
  sents
end
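The reassembly step can be tried in isolation with hand-assigned predictions. This is a sketch under stated assumptions: SketchFrag and reassemble are hypothetical names, the pred values are made up for illustration rather than produced by the library's classifier, and the nil check on orig is left out for brevity.

```ruby
# Hypothetical stand-in for TactfulTokenizer::Frag, holding only the two
# fields that segment reads: the original text and the prediction.
SketchFrag = Struct.new(:orig, :pred)

# Joins fragments into sentences, closing a sentence whenever a
# fragment's prediction exceeds the threshold.
def reassemble(frags, thresh = 0.5)
  sents = []
  sent = []
  frags.each do |frag|
    sent.push(frag.orig)
    next unless frag.pred && frag.pred > thresh
    sents.push(sent.join.strip)
    sent = []
  end
  sents
end

# Illustrative predictions only: "Mr. " is unlikely to end a sentence.
frags = [
  SketchFrag.new("Mr. ", 0.1),
  SketchFrag.new("Smith left. ", 0.9),
  SketchFrag.new("He was tired.\n", 0.95)
]
p reassemble(frags)
# prints ["Mr. Smith left.", "He was tired."]
```

As in the listing above, fragments accumulated after the last above-threshold fragment are not emitted, so a trailing fragment with a low or missing prediction is dropped from the result.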