class Docsplit::TextCleaner

Cleans up OCR'd text by using a series of heuristics to remove garbage words. Algorithms taken from:

Automatic Removal of "Garbage Strings" in OCR Text: An Implementation
  -- Taghva, Nartker, Condit, and Borsack

Improving Search and Retrieval Performance through Shortening Documents,
Detecting Garbage, and Throwing out Jargon
  -- Kulp

Constants

ACRONYM
ALL_ALPHA
ALNUM
CONSONANT
CONSONANT_5
LOWER
NEWLINE
PUNCT
REPEAT
REPEATED
SINGLETONS
SPACE
UPPER
VOWEL
VOWEL_5
WORD

Cached regexes used by the cleaning heuristics.

Public Instance Methods

clean(text)

For the time being, `clean` coerces the text to ASCII first and then scans it with the regular StringScanner rather than a multibyte-aware version.
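The coercion step can be tried in isolation. The sketch below assumes a hypothetical helper, `to_ascii`, which is not part of Docsplit; it passes the same options to `String#encode` but uses the non-bang form, so the caller's string is left untouched (`encode!` modifies the receiver in place).

```ruby
# Hypothetical helper, not part of Docsplit: coerce text to ASCII,
# replacing every character that has no ASCII equivalent with '?'.
def to_ascii(text)
  text.encode('ascii', invalid: :replace, undef: :replace, replace: '?')
end

original = "na\u00EFve caf\u00E9"
puts to_ascii(original)  # prints "na?ve caf?"
puts original            # the original string is unchanged
```

Because each non-ASCII character becomes a `?`, accented words reach the garbage heuristics with punctuation in place of their accented letters.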

# File lib/docsplit/text_cleaner.rb, line 35
def clean(text)
  if String.method_defined?(:encode)
    text.encode!('ascii', invalid: :replace, undef: :replace, replace: '?')
  else
    require 'iconv' unless defined?(Iconv)
    text = Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first
  end

  scanner = StringScanner.new(text)
  cleaned = []
  spaced  = false
  loop do
    if space = scanner.scan(SPACE)
      cleaned.push(space) unless spaced && (space !~ NEWLINE)
      spaced = true
    elsif word = scanner.scan(WORD)
      unless garbage(word)
        cleaned.push(word)
        spaced = false
      end
    elsif scanner.eos?
      return cleaned.join('').gsub(REPEATED, '')
    end
  end
end
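
The scanning loop above can be sketched on its own to show how whitespace is handled once a word is dropped. This is a minimal illustration, not the library's code: `filter_words` and the `*_SKETCH` regexes are hypothetical stand-ins (the real constants may be defined differently), the garbage test is supplied as a block, and the final `REPEATED` substitution is omitted.

```ruby
require 'strscan'

# Hypothetical stand-ins for the class constants; the real definitions may differ.
SPACE_SKETCH   = /\s+/
WORD_SKETCH    = /\S+/
NEWLINE_SKETCH = /[\r\n]/

# Split text into whitespace runs and words, dropping every word for which
# the block returns true.
def filter_words(text)
  scanner = StringScanner.new(text)
  kept    = []
  spaced  = false
  until scanner.eos?
    if (space = scanner.scan(SPACE_SKETCH))
      # Skip a whitespace run that follows another kept or skipped run,
      # unless it contains a newline.
      kept << space unless spaced && space !~ NEWLINE_SKETCH
      spaced = true
    elsif (word = scanner.scan(WORD_SKETCH))
      unless yield(word)
        kept << word
        spaced = false
      end
    end
  end
  kept.join
end

puts filter_words("good ### text\nnext") { |w| w.start_with?('#') }.inspect
# prints "good text\nnext"
```

Since `spaced` stays true after a rejected word, the whitespace on either side of it collapses into a single run, while line breaks are always preserved.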
garbage(w)

Is a given word OCR garbage?

# File lib/docsplit/text_cleaner.rb, line 62
def garbage(w)
  acronym = w =~ ACRONYM

  # More than 30 characters in length.
  (w.length > 30) ||

    # If there are three or more identical characters in a row in the string.
    (w =~ REPEAT) ||

    # More punctuation than alpha numerics.
    (!acronym && (w.scan(ALNUM).length < w.scan(PUNCT).length)) ||

    # Ignoring the first and last characters in the string, if there are three or
    # more different punctuation characters in the string.
    (w[1...-1].scan(PUNCT).uniq.length >= 3) ||

    # Five or more consecutive vowels (VOWEL_5), or five or more consecutive
    # consonants (CONSONANT_5).
    ((w =~ VOWEL_5) || (w =~ CONSONANT_5)) ||

    # Number of uppercase letters greater than lowercase letters, but the word is
    # not all uppercase + punctuation.
    (!acronym && (w.scan(UPPER).length > w.scan(LOWER).length)) ||

    # Single letters that are not A or I.
    (w.length == 1 && (w =~ ALL_ALPHA) && (w !~ SINGLETONS)) ||

    # All characters are alphabetic and there are 8 times more vowels than
    # consonants, or 8 times more consonants than vowels.
    (!acronym && (w.length > 2 && (w =~ ALL_ALPHA)) &&
      (((vows = w.scan(VOWEL).length) > (cons = w.scan(CONSONANT).length) * 8) ||
        (cons > vows * 8)))
end
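
Two of the counting heuristics can be sketched in isolation to make the comparisons concrete. The method names and the `*_SKETCH` regexes below are hypothetical; the real `ALNUM`, `PUNCT`, `VOWEL`, and `CONSONANT` constants may be defined differently, and the acronym exemption is left out.

```ruby
# Hypothetical stand-ins; the real ALNUM, PUNCT, VOWEL and CONSONANT may differ.
ALNUM_SKETCH     = /[a-z0-9]/i
PUNCT_SKETCH     = /[[:punct:]]/
VOWEL_SKETCH     = /[aeiou]/i
CONSONANT_SKETCH = /[b-df-hj-np-tv-z]/i

# True when punctuation characters outnumber letters and digits.
def mostly_punctuation?(word)
  word.scan(ALNUM_SKETCH).length < word.scan(PUNCT_SKETCH).length
end

# True for an all-alphabetic word longer than two characters whose vowel and
# consonant counts differ by more than a factor of eight.
def lopsided_letters?(word)
  return false unless word.length > 2 && word.match?(/\A[a-z]+\z/i)
  vowels     = word.scan(VOWEL_SKETCH).length
  consonants = word.scan(CONSONANT_SKETCH).length
  vowels > consonants * 8 || consonants > vowels * 8
end

puts mostly_punctuation?('a.,;')      # true: 1 alphanumeric, 3 punctuation
puts mostly_punctuation?('hello,')    # false: 5 alphanumerics, 1 punctuation
puts lopsided_letters?('the')         # false: 1 vowel, 2 consonants
puts lopsided_letters?('aaaaaaaaab')  # true: 9 vowels, 1 consonant
```

In this sketch a word with no vowels at all, such as `nth`, counts as lopsided, because any positive consonant count exceeds eight times zero.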