class Yomu
Read text and metadata from files and documents using Apache Tika toolkit
Henkei monkey patch for configuration support
Constants
- CONFIG_PATH
- DEFAULT_SERVER_PORT
- GEM_PATH
- JAR_PATH
- VERSION
Public Class Methods
# File lib/henkei/configuration.rb, line 5
def self.configuration
  @configuration ||= Configuration.new
end

# File lib/henkei/configuration.rb, line 9
def self.configure
  yield(configuration)
end
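The two methods above follow a common Ruby configuration pattern: a memoized settings object plus a setter that yields it to a block. The sketch below reproduces that pattern in isolation. `DemoLib` and its `Configuration` class are hypothetical stand-ins, and the default value is an assumption; only the `mime_library` setting name comes from the source shown on this page.

```ruby
# Hypothetical stand-in for a library that exposes block-style configuration.
class DemoLib
  # Holds the settings. The default value is an assumption for illustration.
  class Configuration
    attr_accessor :mime_library

    def initialize
      @mime_library = 'mini_mime'
    end
  end

  # Memoized: every call returns the same Configuration instance.
  def self.configuration
    @configuration ||= Configuration.new
  end

  # Yields the shared instance so callers can assign settings in a block.
  def self.configure
    yield(configuration)
  end
end

DemoLib.configure do |config|
  config.mime_library = 'mime/types'
end
```

With Henkei itself the equivalent call would presumably be `Henkei.configure { |config| config.mime_library = 'mini_mime' }`, which is what the deprecation message in Henkei.mimetype asks for.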
Kills server started by Henkei.server
Always run this when you're done, or else Tika might keep running until you kill it manually. Consider wrapping your extraction in a begin...rescue...ensure...end block and calling this method in the ensure clause.

Henkei.server(:text)
reports = ["report1.docx", "report2.doc", "report3.pdf"]
begin
  my_texts = reports.map { |report_path| Henkei.new(report_path).text }
rescue
  # handle or log the extraction error here
ensure
  Henkei.kill_server!
end
# File lib/henkei.rb, line 229
def self.kill_server!
  return unless @@server_pid

  Process.kill('INT', @@server_pid)
  @@server_pid = nil
  @@server_port = nil
end
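The spawn, use, and kill lifecycle described above can be tried without Tika. The sketch below is a minimal illustration under stated assumptions: the child is a sleeping Ruby interpreter rather than a Tika server, the helper names `with_child_process` and `process_alive?` are hypothetical, and a POSIX system is assumed. It also calls `Process.wait` to reap the child, which the source shown above does not do.

```ruby
require 'rbconfig'

# Spawns a stand-in child process (a sleeping Ruby interpreter, not Tika),
# yields its pid, and always stops it afterwards, even if the block raises.
def with_child_process
  pid = Process.spawn(RbConfig.ruby, '-e', 'sleep 30')
  yield pid
ensure
  if pid
    Process.kill('INT', pid) # the same signal kill_server! sends
    Process.wait(pid)        # reap the child so no zombie is left behind
  end
end

# Returns true while a process with the given pid still exists.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false
end
```

Putting the cleanup in `ensure` means the child is stopped on both the normal path and the error path, which is the point of the advice above.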
# File lib/henkei.rb, line 35
def self.mimetype(content_type)
  if Henkei.configuration.mime_library == 'mime/types' && defined?(MIME::Types)
    warn '[DEPRECATION] `mime/types` is deprecated. Please use `mini_mime` instead.'\
         ' Use Henkei.configure and assign "mini_mime" to `mime_library`.'
    MIME::Types[content_type].first
  else
    MiniMime.lookup_by_content_type(content_type).tap do |object|
      object.define_singleton_method(:extensions) { [extension] }
    end
  end
end
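The else branch above uses `tap` with `define_singleton_method` so that the looked-up object answers `extensions`, the same method the `mime/types` result offers. A small sketch of that technique follows. `MimeInfo` and `with_extensions` are hypothetical names standing in for the MiniMime result object, which is not reproduced here.

```ruby
# Hypothetical stand-in for a lookup result that exposes a single extension.
MimeInfo = Struct.new(:content_type, :extension)

# Adds an `extensions` method to this one object only, wrapping its single
# extension in an array. Other MimeInfo instances are left untouched.
def with_extensions(info)
  info.tap do |object|
    object.define_singleton_method(:extensions) { [extension] }
  end
end

pdf = with_extensions(MimeInfo.new('application/pdf', 'pdf'))
pdf.extensions # => ["pdf"]
```

Inside the block, `self` is the object the method was defined on, which is why the bare `extension` call resolves to that object's own value.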
Create a new instance of Henkei with a given document.
Using a file path:
Henkei.new 'sample.pages'
Using a URL:
Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
From a stream or any object that responds to read:
Henkei.new File.open('sample.pages')
# File lib/henkei.rb, line 78
def initialize(input)
  if input.is_a? String
    if File.exist? input
      @path = input
    elsif input =~ URI::DEFAULT_PARSER.make_regexp
      @uri = URI.parse input
    else
      raise Errno::ENOENT, "missing file or invalid URI - #{input}"
    end
  elsif input.respond_to? :read
    @stream = input
  else
    raise TypeError, "can't read from #{input.class.name}"
  end
end
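The constructor decides between path, URI, and stream in a fixed order, with an existing file taking precedence over a URI match. The sketch below mirrors that branching in a standalone method so the order can be tested without Tika. `classify_input` is a hypothetical name, and it returns a symbol instead of setting instance variables.

```ruby
require 'uri'
require 'stringio'

# Mirrors the branching in the constructor above, returning a symbol
# that names the branch taken.
def classify_input(input)
  if input.is_a?(String)
    return :path if File.exist?(input)
    return :uri if input =~ URI::DEFAULT_PARSER.make_regexp

    raise Errno::ENOENT, "missing file or invalid URI - #{input}"
  elsif input.respond_to?(:read)
    :stream
  else
    raise TypeError, "can't read from #{input.class.name}"
  end
end

classify_input('http://example.com/sample.docx') # => :uri
classify_input(StringIO.new('raw bytes'))        # => :stream
```

A string that is neither an existing file nor URI-like raises Errno::ENOENT, and anything that is not a String and lacks `read` raises TypeError.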
Read text or metadata from a data buffer.
data = File.read 'sample.pages'
text = Henkei.read :text, data
metadata = Henkei.read :metadata, data

# File lib/henkei.rb, line 53
def self.read(type, data)
  result = @@server_pid ? server_read(data) : client_read(type, data)

  case type
  when :text then result
  when :html then result
  when :metadata then JSON.parse(result)
  when :mimetype then Henkei.mimetype(JSON.parse(result)['Content-Type'])
  end
end
Starts a Tika server as a newly spawned process and returns its pid.
type: :html, :text or :metadata
custom_port: e.g. 9293

Henkei.server(:text, 9294)
# File lib/henkei.rb, line 206
def self.server(type, custom_port = nil)
  @@server_port = custom_port || DEFAULT_SERVER_PORT

  @@server_pid = Process.spawn(*tika_command(type, server: true))
  sleep(2) # Give the server 2 seconds to spin up.
  @@server_pid
end
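The method above waits a fixed two seconds for the server to start. One alternative, shown here only as a sketch and not as what Henkei does, is to poll the port until it accepts connections. `wait_for_port` is a hypothetical helper; whether a bare probe connection is harmless to Tika's socket server is an assumption that would need checking before using this against a real instance.

```ruby
require 'socket'

# Polls until something accepts TCP connections on host:port, or the timeout
# elapses. Returns true on success and false on timeout.
def wait_for_port(host, port, timeout: 10, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      TCPSocket.new(host, port).close
      return true
    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH
      return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

      sleep(interval)
    end
  end
end
```

Compared with a fixed sleep, polling returns as soon as the port is open and reports failure explicitly when the server never comes up.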
Private Class Methods
Internal helper for calling the Tika library directly
# File lib/henkei.rb, line 248
def self.client_read(type, data)
  Open3.capture2(*tika_command(type), stdin_data: data, binmode: true).first
end
Provide the path to the Java binary
# File lib/henkei.rb, line 241
def self.java_path
  ENV['JAVA_HOME'] ? "#{ENV['JAVA_HOME']}/bin/java" : 'java'
end
Internal helper for calling a running Tika server
# File lib/henkei.rb, line 255
def self.server_read(data)
  s = TCPSocket.new('localhost', @@server_port)
  file = StringIO.new(data, 'r')

  loop do
    chunk = file.read(65_536)
    break unless chunk

    s.write(chunk)
  end

  # tell Tika that we're done sending data
  s.shutdown(Socket::SHUT_WR)

  resp = String.new ''
  loop do
    chunk = s.recv(65_536)
    break if chunk.empty? || !chunk

    resp << chunk
  end
  resp
end
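The helper above relies on a half-close: the client writes everything, shuts down only the write side of the socket so the server sees end of input, and then reads the reply until the server closes. The sketch below demonstrates that exchange against a stand-in server that upper-cases its input. Nothing here talks to Tika, and `exchange` and `start_upcase_server` are hypothetical names. Unlike the source above, the sketch also closes the socket in an `ensure` clause.

```ruby
require 'socket'
require 'stringio'

# Sends data in fixed-size chunks, half-closes the socket to signal end of
# input, then reads the reply until the peer closes its side.
def exchange(host, port, data, chunk_size: 65_536)
  socket = TCPSocket.new(host, port)
  input = StringIO.new(data, 'r')
  while (chunk = input.read(chunk_size))
    socket.write(chunk)
  end
  socket.shutdown(Socket::SHUT_WR) # we are done sending; reading still works

  response = String.new
  loop do
    chunk = socket.recv(chunk_size)
    break if chunk.nil? || chunk.empty?

    response << chunk
  end
  response
ensure
  socket&.close
end

# Stand-in server (not Tika): reads until EOF, then replies in upper case.
def start_upcase_server
  server = TCPServer.new('127.0.0.1', 0)
  thread = Thread.new do
    client = server.accept
    client.write(client.read.upcase)
    client.close
  end
  [server, thread]
end
```

Without the `shutdown` call the stand-in server's `read` would never return, because it waits for end of input before replying.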
Internal helper that maps the requested type to the corresponding Tika command-line switches
# File lib/henkei.rb, line 291
def self.switch_for_type(type)
  {
    text: ['-t'],
    html: ['-h'],
    metadata: %w[-m -j],
    mimetype: %w[-m -j]
  }[type]
end
Internal helper for building the Java command to call Tika
# File lib/henkei.rb, line 282
def self.tika_command(type, server: false)
  command = [java_path, '-Djava.awt.headless=true', '-jar', Henkei::JAR_PATH, "--config=#{Henkei::CONFIG_PATH}"]
  command += ['--server', '--port', @@server_port.to_s] if server
  command + switch_for_type(type)
end
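To see what argument list these helpers produce, the sketch below rebuilds the same logic as a standalone method. The jar and config paths are placeholders, since the real values of Henkei::JAR_PATH and Henkei::CONFIG_PATH are not shown on this page, and `example_tika_command` is a hypothetical name. One deliberate difference: it uses `Hash#fetch`, so an unknown type raises KeyError instead of the TypeError that `command + nil` would raise in the source above.

```ruby
# Placeholder paths standing in for Henkei::JAR_PATH and Henkei::CONFIG_PATH.
EXAMPLE_JAR_PATH = '/path/to/tika-app.jar'
EXAMPLE_CONFIG_PATH = '/path/to/tika-config.xml'

# Same type-to-switch mapping as switch_for_type above.
EXAMPLE_SWITCHES = {
  text: ['-t'],
  html: ['-h'],
  metadata: %w[-m -j],
  mimetype: %w[-m -j]
}.freeze

# Builds the argv array: java binary, JVM flag, jar, config, optional
# server flags, then the switches for the requested type.
def example_tika_command(type, server_port: nil)
  java = ENV['JAVA_HOME'] ? "#{ENV['JAVA_HOME']}/bin/java" : 'java'
  command = [java, '-Djava.awt.headless=true', '-jar', EXAMPLE_JAR_PATH,
             "--config=#{EXAMPLE_CONFIG_PATH}"]
  command += ['--server', '--port', server_port.to_s] if server_port
  command + EXAMPLE_SWITCHES.fetch(type)
end
```

Keeping the command as an array rather than a single string means it can be passed to `Process.spawn` or `Open3.capture2` without going through a shell.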
Public Instance Methods
Returns the creation date of the Henkei document as a Time, parsed from the Creation-Date metadata entry, or nil if that entry is missing.
henkei = Henkei.new 'sample.pages'
henkei.creation_date
# File lib/henkei.rb, line 145
def creation_date
  return @creation_date if defined? @creation_date
  return unless metadata['Creation-Date']

  @creation_date = Time.parse(metadata['Creation-Date'])
end
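The creation_date body above reads the 'Creation-Date' metadata entry and hands it to `Time.parse`, returning nil when the entry is absent. A minimal standalone sketch of that step follows. `parse_creation_date` is a hypothetical helper, and the timestamp is a made-up sample rather than real Tika output.

```ruby
require 'time'

# Returns a Time for the 'Creation-Date' entry, or nil when the key is absent.
def parse_creation_date(metadata)
  value = metadata['Creation-Date']
  value && Time.parse(value)
end

sample = { 'Creation-Date' => '2017-09-01T12:30:00Z' } # made-up sample value
parse_creation_date(sample).year # => 2017
parse_creation_date({})          # => nil
```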
Returns the raw/unparsed content of the Henkei document.
henkei = Henkei.new 'sample.pages'
henkei.data
# File lib/henkei.rb, line 185
def data
  return @data if defined? @data

  if path?
    @data = File.read @path
  elsif uri?
    @data = Net::HTTP.get @uri
  elsif stream?
    @data = @stream.read
  end

  @data
end
Returns the text content of the Henkei document in HTML.
henkei = Henkei.new 'sample.pages'
henkei.html
# File lib/henkei.rb, line 110
def html
  return @html if defined? @html

  @html = Henkei.read :html, data
end
Returns the metadata hash of the Henkei document.
henkei = Henkei.new 'sample.pages'
henkei.metadata['Content-Type']
# File lib/henkei.rb, line 121
def metadata
  return @metadata if defined? @metadata

  @metadata = Henkei.read :metadata, data
end
Returns the mimetype object of the Henkei document.
henkei = Henkei.new 'sample.docx'
henkei.mimetype.content_type #=> 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
henkei.mimetype.extensions #=> ['docx']
# File lib/henkei.rb, line 133
def mimetype
  return @mimetype if defined? @mimetype

  content_type = metadata['Content-Type'].is_a?(Array) ? metadata['Content-Type'].first : metadata['Content-Type']

  @mimetype = Henkei.mimetype(content_type)
end
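The mimetype method above guards against 'Content-Type' being either a single string or an array of strings. As an illustration of the same normalization, not a proposed change to the library, `Kernel#Array` handles both shapes in one expression. `primary_content_type` is a hypothetical helper name.

```ruby
# Returns the first content type whether the value is a String or an Array,
# and nil when the key is missing.
def primary_content_type(metadata)
  Array(metadata['Content-Type']).first
end

primary_content_type('Content-Type' => 'application/pdf')            # => "application/pdf"
primary_content_type('Content-Type' => ['text/html', 'text/plain']) # => "text/html"
```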
Returns true if the Henkei document was specified using a file path.
henkei = Henkei.new '/my/document/path/sample.docx'
henkei.path? #=> true
# File lib/henkei.rb, line 157
def path?
  !!@path
end
Returns true if the Henkei document was specified from a stream or an object which responds to read.
file = File.open('sample.pages')
henkei = Henkei.new file
henkei.stream? #=> true
# File lib/henkei.rb, line 176
def stream?
  !!@stream
end
Returns the text content of the Henkei document.
henkei = Henkei.new 'sample.pages'
henkei.text
# File lib/henkei.rb, line 99
def text
  return @text if defined? @text

  @text = Henkei.read :text, data
end
Returns true if the Henkei document was specified using a URI.
henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
henkei.uri? #=> true
# File lib/henkei.rb, line 166
def uri?
  !!@uri
end