class Slaw::Parse::Cleanser
Helper class to run various cleanup routines on plain text.
Some of these routines can safely be run multiple times, others are meant to be run only once.
Public Instance Methods
chomp(s)
Remove trailing spaces at the end of each line, and any whitespace at the start and end of the entire string.
# File lib/slaw/parse/cleanser.rb, line 49
def chomp(s)
  # trailing whitespace at end of lines
  s = s.gsub(/ +$/, '')

  # whitespace on either side
  s.strip
end
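As a rough illustration of what this method does, the sketch below copies the two substitutions into a standalone function. The name `chomp_sketch` is hypothetical and is not part of Slaw; it only mirrors the logic in the listing above.

```ruby
# Standalone copy of the chomp logic from the listing above.
# `chomp_sketch` is a hypothetical name, not part of Slaw.
def chomp_sketch(s)
  # remove runs of spaces at the end of every line
  s = s.gsub(/ +$/, '')

  # remove whitespace at the start and end of the whole string
  s.strip
end

p chomp_sketch("  first line   \nsecond line  \n\n")
# => "first line\nsecond line"

# The per-line pattern matches spaces only, so a tab at the end of an
# inner line is left in place.
p chomp_sketch("a\t\nb")
# => "a\t\nb"
```

Because the per-line pattern only matches spaces, trailing tabs on inner lines survive unless they have already been converted to spaces by an earlier step.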
cleanup(s)
Run general cleanup, such as stripping bad chars and removing unnecessary whitespace. This is idempotent and safe to run multiple times.
# File lib/slaw/parse/cleanser.rb, line 14
def cleanup(s)
  s = scrub(s)
  s = correct_newlines(s)
  s = expand_tabs(s)
  s = chomp(s)
  s = enforce_newline(s)
end
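The idempotency claim can be checked with a small sketch. The module below is a hypothetical stand-in (`CleanupSketch` is not part of Slaw) that copies four of the five steps from the listings on this page; `scrub` is deliberately left out, so this illustrates the pipeline shape and is not a replacement for the real class.

```ruby
# Hypothetical stand-in for the cleanup pipeline, minus the scrub step.
module CleanupSketch
  module_function

  def correct_newlines(s)
    s.gsub(/\r\n/, "\n").gsub(/\r/, "\n")
  end

  def expand_tabs(s)
    s.gsub(/\t/, ' ').gsub("\u00A0", ' ')
  end

  def chomp(s)
    s.gsub(/ +$/, '').strip
  end

  def enforce_newline(s)
    s.end_with?("\n") ? s : (s + "\n")
  end

  # Same order as the listing above, without scrub.
  def cleanup(s)
    s = correct_newlines(s)
    s = expand_tabs(s)
    s = chomp(s)
    enforce_newline(s)
  end
end

raw   = "  Chapter 1\t \r\nBody text  "
once  = CleanupSketch.cleanup(raw)
twice = CleanupSketch.cleanup(once)

p once          # => "Chapter 1\nBody text\n"
p once == twice # => true
```

Running the sketch a second time leaves the text unchanged because each step only removes or normalizes characters that the previous pass has already eliminated.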
correct_newlines(s)
Normalize line endings: convert \r\n and bare \r to \n.
# File lib/slaw/parse/cleanser.rb, line 29
def correct_newlines(s)
  s.gsub(/\r\n/, "\n")\
   .gsub(/\r/, "\n")
end
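A minimal sketch of the effect on mixed line endings, using a standalone copy of the two substitutions (`correct_newlines_sketch` is a hypothetical name, not a Slaw method):

```ruby
# Standalone copy of the line-ending logic from the listing above.
def correct_newlines_sketch(s)
  s.gsub(/\r\n/, "\n") # CRLF pairs first
   .gsub(/\r/, "\n")   # then any remaining bare CR
end

p correct_newlines_sketch("windows\r\nclassic mac\runix\n")
# => "windows\nclassic mac\nunix\n"
```

The order of the two substitutions matters: replacing bare `\r` first would turn each `\r\n` pair into two newlines.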
enforce_newline(s)
# File lib/slaw/parse/cleanser.rb, line 57
def enforce_newline(s)
  # ensure string ends with a newline
  s.end_with?("\n") ? s : (s + "\n")
end
expand_tabs(s)
Replace each tab and each non-breaking space with a regular space.
# File lib/slaw/parse/cleanser.rb, line 42
def expand_tabs(s)
  s.gsub(/\t/, ' ')\
   .gsub("\u00A0", ' ') # non-breaking space
end
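The following sketch shows the substitution on a short string. It assumes the replacement is a single space, as it appears in the listing above, and `expand_tabs_sketch` is a hypothetical name rather than a Slaw method.

```ruby
# Standalone copy of the tab/non-breaking-space logic from the listing above.
# Assumes each match is replaced by exactly one space.
def expand_tabs_sketch(s)
  s.gsub(/\t/, ' ')
   .gsub("\u00A0", ' ') # non-breaking space (U+00A0)
end

p expand_tabs_sketch("1.\tShort title\u00A0here")
# => "1. Short title here"
```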
remove_empty_lines(s)
# File lib/slaw/parse/cleanser.rb, line 24
def remove_empty_lines(s)
  s.gsub(/\n\s*$/, '')
end
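Since this method has no description, a small sketch may help show what the pattern does in practice. `remove_empty_lines_sketch` is a hypothetical standalone copy of the one-line body above, not a Slaw method.

```ruby
# Standalone copy of the pattern from the listing above.
# In Ruby, `$` matches at the end of every line, not only at the end of
# the string, so the pattern can match in the middle of the text.
def remove_empty_lines_sketch(s)
  s.gsub(/\n\s*$/, '')
end

# Runs of blank lines collapse to a single line break.
p remove_empty_lines_sketch("Section 1\n\n\nText\n")
# => "Section 1\nText"

# Lines that contain only spaces are treated as empty too.
p remove_empty_lines_sketch("a\n   \nb")
# => "a\nb"
```

With this pattern the final newline of the string is removed as well, which is visible in the first result.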
scrub(s)
Strip invalid bytes and characters we don't want.
# File lib/slaw/parse/cleanser.rb, line 35
def scrub(s)
  # we often get this unicode codepoint in the string, nuke it
  s.gsub([65532].pack('U*'), '')\
   .gsub(/\n*/, '')
end
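The codepoint 65532 is U+FFFC, the Unicode object replacement character. The sketch below illustrates only the first substitution from the listing above; the constant and function names are hypothetical, and the second substitution is not reproduced here.

```ruby
# 65532 == 0xFFFC, the Unicode OBJECT REPLACEMENT CHARACTER.
# Array#pack with 'U*' encodes the codepoint as a UTF-8 string.
OBJECT_REPLACEMENT = [65532].pack('U*')

# Hypothetical helper covering only the first substitution in scrub.
def strip_object_replacement(s)
  s.gsub(OBJECT_REPLACEMENT, '')
end

p OBJECT_REPLACEMENT == "\uFFFC"                  # => true
p strip_object_replacement("Section\uFFFC 1")     # => "Section 1"
```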