module Greeb::Parser

It is often necessary to find entities such as URLs, e-mail addresses and names in a natural language text. This module provides several helpers for that task.

Constants

ABBREV

Another horrible pattern. Now for abbreviations.

APOSTROPHE

Apostrophe pattern.

EMAIL

A horrible e-mail pattern.

HTML

This pattern matches anything that looks like HTML. Or not.

TIME

Time pattern.

TOGETHER

Together pattern.

URL

A URL pattern. Not so precise, but IDN-compatible.

Public Instance Methods

abbrevs(text)

Recognize abbreviations in the input text.

@param text [String] input text.

@return [Array<Greeb::Span>] found abbreviations.

# File lib/greeb/parser.rb, line 63
def abbrevs(text)
  scan(text, ABBREV, :abbrev)
end
apostrophes(text, spans)

Retrieve apostrophes from the tokenized text. The algorithm could be optimized further.

@param text [String] input text.

@param spans [Array<Greeb::Span>] already tokenized text.

@return [Array<Greeb::Span>] retrieved apostrophes.

# File lib/greeb/parser.rb, line 95
def apostrophes(text, spans)
  apostrophes = scan(text, APOSTROPHE, :apostrophe)
  return [] if apostrophes.empty?

  apostrophes.each { |s| Greeb.extract_spans(spans, s) }.clear

  spans.each_with_index.each_cons(3).reverse_each do |(s1, i), (s2, j), (s3, k)|
    next unless s1 && s1.type == :letter
    next unless s2 && s2.type == :apostrophe
    next unless !s3 || s3 && s3.type == :letter
    s3, k = s2, j unless s3
    apostrophes << Greeb::Span.new(s1.from, s3.to, s1.type)
    spans[i..k] = apostrophes.last
  end

  apostrophes
end
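The merging step above can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: `Span` is a stand-in struct assumed to mirror the `from`, `to` and `type` fields of Greeb::Span, `join_apostrophes` is a hypothetical name, and the input is assumed to already contain separate `:apostrophe` spans. It uses a single left-to-right pass instead of the reverse iteration in the listing.

```ruby
# Stand-in for Greeb::Span (assumption: it exposes from, to and type).
Span = Struct.new(:from, :to, :type)

# Hypothetical helper: join letter + apostrophe + letter runs into one
# :letter span. Returns a new array and leaves the input untouched.
def join_apostrophes(spans)
  spans.each_with_object([]) do |span, out|
    if span.type == :letter && out.size >= 2 &&
       out[-1].type == :apostrophe && out[-2].type == :letter
      out.pop            # drop the apostrophe span, the merged span covers it
      head = out.pop     # the letter span that precedes the apostrophe
      out << Span.new(head.from, span.to, :letter)
    else
      out << span
    end
  end
end

# Spans for the text "rock'n'roll".
spans = [
  Span.new(0, 4, :letter), Span.new(4, 5, :apostrophe),
  Span.new(5, 6, :letter), Span.new(6, 7, :apostrophe),
  Span.new(7, 11, :letter)
]
p join_apostrophes(spans)
```

Because each merged span is pushed back onto the output, a following apostrophe and letter can be joined to it in the same pass, so chained forms collapse into one span.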
emails(text)

Recognize e-mail addresses in the input text.

@param text [String] input text.

@return [Array<Greeb::Span>] found e-mail addresses.

# File lib/greeb/parser.rb, line 53
def emails(text)
  scan(text, EMAIL, :email)
end
html(text)

Recognize HTML-alike entities in the input text.

@param text [String] input text.

@return [Array<Greeb::Span>] found HTML entities.

# File lib/greeb/parser.rb, line 73
def html(text)
  scan(text, HTML, :html)
end
time(text)

Recognize timestamps in the input text.

@param text [String] input text.

@return [Array<Greeb::Span>] found timestamps.

# File lib/greeb/parser.rb, line 83
def time(text)
  scan(text, TIME, :time)
end
together(spans)

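Merge neighbouring spans whose types are both listed in TOGETHER into a single span of type :together, repeating until no such pair remains.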

@param spans [Array<Greeb::Span>] already tokenized text.

@return [Array<Greeb::Span>] merged spans.

# File lib/greeb/parser.rb, line 119
def together(spans)
  loop do
    converged = true

    spans.each_with_index.each_cons(2).reverse_each do |(s1, i), (s2, j)|
      next unless TOGETHER.include?(s1.type) && TOGETHER.include?(s2.type)
      spans[i..j] = Greeb::Span.new(s1.from, s2.to, :together)
      converged = false
    end

    break if converged
  end

  spans
end
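As an illustration of the merging idea, here is a minimal sketch that is deliberately structured differently from the listing above: it makes one forward pass and returns a new array instead of editing `spans` in place. `Span` is a stand-in struct assumed to mirror Greeb::Span, `merge_adjacent` is a hypothetical name, and `JOINABLE` is a hypothetical type list; the actual contents of TOGETHER are not shown in this document.

```ruby
# Stand-in for Greeb::Span (assumption: it exposes from, to and type).
Span = Struct.new(:from, :to, :type)

# Hypothetical list of mergeable types; the real TOGETHER may differ.
JOINABLE = %i[letter integer together].freeze

# Hypothetical helper: fold each joinable span into the previous one
# when that one is joinable too.
def merge_adjacent(spans)
  spans.each_with_object([]) do |span, out|
    last = out.last
    if last && JOINABLE.include?(last.type) && JOINABLE.include?(span.type)
      # Replace the previous span with one covering both.
      out[-1] = Span.new(last.from, span.to, :together)
    else
      out << span
    end
  end
end

# Spans for the text "a1 b".
spans = [
  Span.new(0, 1, :letter), Span.new(1, 2, :integer),
  Span.new(2, 3, :space), Span.new(3, 4, :letter)
]
p merge_adjacent(spans)
```

Including `:together` in the hypothetical list lets an already merged span absorb the next joinable neighbour, so runs of any length end up as one span.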
urls(text)

Recognize URLs in the input text. Strictly speaking, the URL is an obsolete standard, and this code should be rewritten around the URI concept.

@param text [String] input text.

@return [Array<Greeb::Span>] found URLs.

# File lib/greeb/parser.rb, line 43
def urls(text)
  scan(text, URL, :url)
end

Private Instance Methods

scan(text, regexp, type, offset = 0)

Regexp-based scanner that produces Greeb::Span instances.

@param text [String] input text.

@param regexp [Regexp] regular expression to be used.

@param type [Symbol] type field for the new Greeb::Span instances.

@param offset [Fixnum] offset of the next match.

@return [Array<Greeb::Span>] found entities.

# File lib/greeb/parser.rb, line 145
def scan(text, regexp, type, offset = 0)
  Array.new.tap do |matches|
    while text and md = text.match(regexp)
      start, stop = md.offset(0)
      matches << Greeb::Span.new(offset + start, offset + stop, type)
      text, offset = text[stop..-1], offset + stop
    end
  end
end
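To make the offset bookkeeping concrete, here is a minimal self-contained sketch of the same match-then-slice loop. It does not load Greeb: `Span` is a stand-in struct assumed to mirror Greeb::Span, `scan_sketch` is a hypothetical name, and `NAIVE_TIME` is a deliberately simple illustrative pattern, not the library's TIME constant.

```ruby
# Stand-in for Greeb::Span (assumption: it exposes from, to and type).
Span = Struct.new(:from, :to, :type)

# Illustrative pattern only; not Greeb's TIME constant.
NAIVE_TIME = /\b\d{1,2}:\d{2}\b/

# Hypothetical helper: collect every match of regexp in text as a Span
# whose offsets are relative to the original string.
def scan_sketch(text, regexp, type)
  matches = []
  offset = 0
  while text && (md = text.match(regexp))
    start, stop = md.offset(0)   # character offsets inside the remainder
    break if stop.zero?          # an empty match at position 0 would never advance
    matches << Span.new(offset + start, offset + stop, type)
    text = text[stop..-1]        # continue after the match
    offset += stop               # keep track of how much was cut off
  end
  matches
end

text = "Meet at 9:30 or 14:05."
scan_sketch(text, NAIVE_TIME, :time).each do |span|
  puts "#{span.type}: #{text[span.from...span.to]}"
end
```

Each span stores a half-open range, so `text[span.from...span.to]` recovers the matched substring from the original text.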