module Docsplit

The Docsplit module delegates to the Java PDF extractors.

Constants

DEPENDENCIES
ESCAPE
ESCAPED_ROOT
GM_FORMATS
METADATA_KEYS
ROOT
VERSION

Public Class Methods

clean_text(text)

Utility method to clean garbage characters out of OCR'd text.

# File lib/docsplit.rb, line 85
def self.clean_text(text)
  TextCleaner.new.clean(text)
end
extract_images(pdfs, opts={})

Use the ExtractImages Java class to rasterize each page of a PDF into an image.

# File lib/docsplit.rb, line 56
def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end
extract_info(pdfs, opts={})
# File lib/docsplit.rb, line 79
def self.extract_info(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  InfoExtractor.new.extract_all(pdfs, opts)
end
extract_pages(pdfs, opts={})

Use the ExtractPages Java class to burst a PDF into single pages.

# File lib/docsplit.rb, line 44
def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end
extract_pdf(docs, opts={})

Use JODConverter to convert the documents to PDFs. If a document is in an image format, use GraphicsMagick to produce the PDF instead.

# File lib/docsplit.rb, line 64
def self.extract_pdf(docs, opts={})
  PdfExtractor.new.extract(docs, opts)
end
extract_text(pdfs, opts={})

Use the ExtractText Java class to write out all embedded text.

# File lib/docsplit.rb, line 50
def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end

Private Class Methods

normalize_value(value)

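Normalize a value in an options hash for the command line. A Range is expanded into a comma-separated list (1..3 becomes 1,2,3), an Array is joined with commas after expanding any Ranges it contains ([1, 5..7] becomes 1,5,6,7), and any other value is converted with to_s. An Array argument is modified in place.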

# File lib/docsplit.rb, line 93
def self.normalize_value(value)
  case value
  when Range then value.to_a.join(',')
  when Array then value.map! {|v| v.is_a?(Range) ? normalize_value(v) : v }.join(',')
  else            value.to_s
  end
end
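
The normalization rules above can be tried in isolation. The following is a minimal standalone sketch, not the library's own method: normalize_pages is a hypothetical name, and it uses map rather than map! so that the caller's array is left untouched, which differs from the source shown above.

```ruby
# Hypothetical standalone helper mirroring the normalization logic above.
# Assumption: uses map (not map!) so the argument is not modified in place.
def normalize_pages(value)
  case value
  when Range then value.to_a.join(',')
  when Array then value.map { |v| v.is_a?(Range) ? normalize_pages(v) : v }.join(',')
  else            value.to_s
  end
end

puts normalize_pages(1..3)          # prints 1,2,3
puts normalize_pages([1, 5..7, 10]) # prints 1,5,6,7,10
puts normalize_pages(4)             # prints 4
```

Because a Range is expanded element by element, a very wide range produces a correspondingly long string.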