module Docsplit
The Docsplit module delegates to the Java PDF extractors.
Constants
- DEPENDENCIES
- ESCAPE
- ESCAPED_ROOT
- GM_FORMATS
- METADATA_KEYS
- ROOT
- VERSION
Public Class Methods
clean_text(text)
Utility method to clean garbage characters out of OCR’d text.
# File lib/docsplit.rb, line 78
def self.clean_text(text)
  TextCleaner.new.clean(text)
end
extract_images(pdfs, opts={})
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
# File lib/docsplit.rb, line 49
def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end
extract_info(pdfs, opts={})
# File lib/docsplit.rb, line 72
def self.extract_info(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  InfoExtractor.new.extract_all(pdfs, opts)
end
extract_pages(pdfs, opts={})
Use the ExtractPages Java class to burst a PDF into single pages.
# File lib/docsplit.rb, line 37
def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end
extract_pdf(docs, opts={})
Use JODConverter to convert the documents to PDFs. If the document is in an image format, use GraphicsMagick to produce the PDF instead.
# File lib/docsplit.rb, line 57
def self.extract_pdf(docs, opts={})
  PdfExtractor.new.extract(docs, opts)
end
extract_text(pdfs, opts={})
Use the ExtractText Java class to write out all embedded text.
# File lib/docsplit.rb, line 43
def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end
Private Class Methods
normalize_value(value)
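Normalize a value in an options hash for the command line. A Range is expanded into a comma-separated list (1..3 becomes 1,2,3), an Array is joined with commas (with any nested Ranges expanded the same way), and anything else is converted with to_s.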
# File lib/docsplit.rb, line 86
def self.normalize_value(value)
  case value
  when Range then value.to_a.join(',')
  when Array then value.map! {|v| v.is_a?(Range) ? normalize_value(v) : v }.join(',')
  else            value.to_s
  end
end
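To make the normalization concrete, here is a minimal standalone sketch of the same case analysis. It does not require the Docsplit gem. The name normalize_pages is hypothetical and not part of the library, and the sketch deliberately uses map instead of map!, so, unlike the listing above, it leaves the caller's array unmodified.

```ruby
# Standalone sketch of the normalization logic shown above.
# `normalize_pages` is a hypothetical name used only for illustration.
# Assumption: non-mutating `map` replaces the original `map!`.
def normalize_pages(value)
  case value
  when Range then value.to_a.join(',')   # expand the range, then join
  when Array then value.map { |v| v.is_a?(Range) ? normalize_pages(v) : v }.join(',')
  else value.to_s                        # scalars are simply stringified
  end
end

puts normalize_pages(1..4)          # prints 1,2,3,4
puts normalize_pages([1, 3..5, 9])  # prints 1,3,4,5,9
puts normalize_pages(7)             # prints 7
```

Because the original uses map!, an array passed as opts[:pages] is rewritten in place; a caller who reuses that array afterwards will see the expanded strings rather than the original Ranges.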