module Docsplit

The Docsplit module delegates to the Java PDF extractors.

Constants

DEPENDENCIES
ESCAPE
ESCAPED_ROOT
GM_FORMATS
METADATA_KEYS
ROOT
VERSION

Public Class Methods

clean_text(text)

Utility method to strip garbage characters from OCR'd text.

# File lib/docsplit.rb, line 83
def self.clean_text(text)
  TextCleaner.new.clean(text)
end

extract_images(pdfs, opts = {})

Use the ExtractImages Java class to rasterize each page of a PDF into an image.

# File lib/docsplit.rb, line 54
def self.extract_images(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end

extract_info(pdfs, opts = {})

Use InfoExtractor to extract all available metadata from the PDFs.

# File lib/docsplit.rb, line 77
def self.extract_info(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  InfoExtractor.new.extract_all(pdfs, opts)
end

extract_pages(pdfs, opts = {})

Use the ExtractPages Java class to burst a PDF into single pages.

# File lib/docsplit.rb, line 42
def self.extract_pages(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end

extract_pdf(docs, opts = {})

Use JODConverter to convert the documents to PDFs. If a document is in an image format, use GraphicsMagick to produce the PDF instead.

# File lib/docsplit.rb, line 62
def self.extract_pdf(docs, opts = {})
  PdfExtractor.new.extract(docs, opts)
end

extract_text(pdfs, opts = {})

Use the ExtractText Java class to write out all embedded text.

# File lib/docsplit.rb, line 48
def self.extract_text(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end

Private Class Methods

normalize_value(value)

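Normalize a value in an options hash for the command line. Ranges are expanded into a comma-separated list (1..3 becomes 1,2,3). Arrays are joined with commas, with any nested Ranges expanded the same way. Anything else is converted with to_s.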

# File lib/docsplit.rb, line 91
def self.normalize_value(value)
  case value
  when Range then value.to_a.join(',')
  when Array then value.map! { |v| v.is_a?(Range) ? normalize_value(v) : v }.join(',')
  else            value.to_s
  end
end
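
To make the normalization concrete, here is a minimal standalone sketch of the same logic. The name normalize_pages is hypothetical and not part of Docsplit. It also uses map rather than map!, on the assumption that a caller would not want its own array modified; the listing above uses map!, which rewrites the passed-in array in place.

```ruby
# Standalone sketch of the normalization shown above.
# `normalize_pages` is a hypothetical name, not a Docsplit method.
def normalize_pages(value)
  case value
  # A Range is expanded to every member, joined with commas.
  when Range then value.to_a.join(',')
  # An Array is joined with commas; nested Ranges are expanded first.
  # `map` (not `map!`) leaves the caller's array untouched.
  when Array then value.map { |v| v.is_a?(Range) ? normalize_pages(v) : v }.join(',')
  # Anything else is simply stringified.
  else            value.to_s
  end
end

puts normalize_pages(1..4)          # 1,2,3,4
puts normalize_pages([1, 3..5, 9])  # 1,3,4,5,9
puts normalize_pages(7)             # 7
```

In extract_images, this is the step applied to opts[:pages] before the options are handed to the extractor, so a caller may pass a Range, an Array, or a single page number.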