class Slaw::Parse::Cleanser

Helper class to run various cleanup routines on plain text.

Some of these routines can safely be run multiple times; others are meant to be run only once.

Public Instance Methods

chomp(s)

Remove trailing spaces at the end of each line, and any whitespace at the start and end of the entire string.

# File lib/slaw/parse/cleanser.rb, line 49
def chomp(s)
  # trailing whitespace at end of lines
  s = s.gsub(/ +$/, '')

  # whitespace on either side
  s.strip
end
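
As a rough illustration of the behaviour described above, the following standalone sketch copies the two steps into a hypothetical local method, chomp_sketch, instead of calling the library. It suggests that the per-line pattern only matches spaces, so a trailing tab inside the text survives unless tabs have been converted first.

```ruby
# Hypothetical standalone copy of the chomp logic, for illustration only.
def chomp_sketch(s)
  # remove trailing spaces on each line, then trim the whole string
  s.gsub(/ +$/, '').strip
end

text = "  first line   \nsecond line\t\nthird line  \n\n"
result = chomp_sketch(text)

# Trailing spaces are gone, but the tab after "second line" is kept.
puts result.inspect
```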
cleanup(s)

Run general cleanup, such as stripping unwanted characters and removing unnecessary whitespace. This is idempotent and safe to run multiple times.

# File lib/slaw/parse/cleanser.rb, line 14
def cleanup(s)
  s = scrub(s)
  s = correct_newlines(s)
  s = expand_tabs(s)
  s = chomp(s)
  s = enforce_newline(s)
end
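
The idempotence claim can be checked with a small standalone sketch. The method cleanup_sketch below is hypothetical: it inlines the whitespace-related steps from this page and leaves out scrub to keep the example short, so it is an approximation of the pipeline rather than the library's own method.

```ruby
# Hypothetical standalone pipeline; scrub is intentionally left out.
def cleanup_sketch(s)
  s = s.gsub(/\r\n/, "\n").gsub(/\r/, "\n")   # correct_newlines
  s = s.gsub(/\t/, ' ').gsub("\u00A0", ' ')   # expand_tabs
  s = s.gsub(/ +$/, '').strip                 # chomp
  s.end_with?("\n") ? s : (s + "\n")          # enforce_newline
end

raw   = "Section 1\t\r\nText of the section.  \r\n\r\n"
once  = cleanup_sketch(raw)
twice = cleanup_sketch(once)

# Running the pipeline a second time should change nothing.
puts once == twice
```

The final two steps balance each other: chomp strips the trailing newline and enforce_newline puts exactly one back, which is why a second pass leaves the text unchanged.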
correct_newlines(s)

Normalize line endings: convert Windows (CRLF) and old Mac (CR) line endings to Unix newlines (LF).

# File lib/slaw/parse/cleanser.rb, line 29
def correct_newlines(s)
  s.gsub(/\r\n/, "\n")\
   .gsub(/\r/, "\n")
end
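
The order of the two substitutions matters: if lone carriage returns were replaced first, every CRLF pair would turn into two newlines. A minimal standalone sketch, using a hypothetical local method correct_newlines_sketch with the same two steps:

```ruby
# Hypothetical standalone copy of the line-ending normalization.
def correct_newlines_sketch(s)
  # CRLF first, then any remaining lone CR
  s.gsub(/\r\n/, "\n").gsub(/\r/, "\n")
end

mixed      = "windows\r\nold mac\runix\n"
normalized = correct_newlines_sketch(mixed)

puts normalized.inspect
```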
enforce_newline(s)

Ensure the string ends with a newline.

# File lib/slaw/parse/cleanser.rb, line 57
def enforce_newline(s)
  # ensure string ends with a newline
  s.end_with?("\n") ? s : (s + "\n")
end
expand_tabs(s)

Replace each tab and each non-breaking space (U+00A0) with a single space.

# File lib/slaw/parse/cleanser.rb, line 42
def expand_tabs(s)
  s.gsub(/\t/, ' ')\
   .gsub("\u00A0", ' ') # non-breaking space
end
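
Despite the name, the code above does not align text to tab stops; each tab becomes exactly one space. The following sketch uses a hypothetical local method, expand_tabs_sketch, with the same two substitutions to show the effect on a single line.

```ruby
# Hypothetical standalone copy of the tab and non-breaking-space replacement.
def expand_tabs_sketch(s)
  s.gsub(/\t/, ' ').gsub("\u00A0", ' ')
end

line     = "1.\tShort title\u00A0and commencement"
expanded = expand_tabs_sketch(line)

puts expanded.inspect
```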
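remove_empty_lines(s)

Remove empty lines, and lines containing only whitespace, that follow another line. A trailing newline at the end of the string is removed as well.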

# File lib/slaw/parse/cleanser.rb, line 24
def remove_empty_lines(s)
  s.gsub(/\n\s*$/, '')
end
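
To make the effect of the pattern concrete, here is a standalone sketch with a hypothetical local method, remove_empty_lines_sketch, that applies the same substitution. It relies on Ruby's `$` matching at the end of every line, not only at the end of the string.

```ruby
# Hypothetical standalone copy of the empty-line removal.
def remove_empty_lines_sketch(s)
  # a newline plus any following whitespace that runs up to a line end
  s.gsub(/\n\s*$/, '')
end

text    = "one\n\n   \ntwo\n\n\nthree\n"
cleaned = remove_empty_lines_sketch(text)

puts cleaned.inspect
```

Because the final newline is also matched, callers that need a terminating newline would have to add it back afterwards, for example with enforce_newline.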
scrub(s)

Strip invalid bytes and other unwanted characters, such as the Unicode object replacement character (U+FFFC).

# File lib/slaw/parse/cleanser.rb, line 35
def scrub(s)
  # we often get this unicode codepoint in the string, nuke it
  s.gsub([65532].pack('U*'), '')\
   .gsub(/\n*/, '')
end
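
The expression [65532].pack('U*') builds a one-character UTF-8 string for code point 65532, which is U+FFFC, the object replacement character. The sketch below illustrates only that first substitution, on a made-up input string; it does not reproduce the second substitution.

```ruby
# Build the character the same way the method above does.
marker = [65532].pack('U*')

# Made-up sample text containing the unwanted character.
text     = "Chapter 1\uFFFC Definitions"
scrubbed = text.gsub(marker, '')

puts marker == "\uFFFC"
puts scrubbed.inspect
```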