class Yomu

Read text and metadata from files and documents using the Apache Tika toolkit

Henkei monkey patch for configuration support

Constants

CONFIG_PATH
DEFAULT_SERVER_PORT
GEM_PATH
JAR_PATH
VERSION

Public Class Methods

configuration()
# File lib/henkei/configuration.rb, line 5
def self.configuration
  @configuration ||= Configuration.new
end
configure() { |configuration| ... }
# File lib/henkei/configuration.rb, line 9
def self.configure
  yield(configuration)
end
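The configuration/configure pair above follows a common memoized-settings pattern. The following is a minimal sketch of that pattern, not Henkei's own code: `Demo` and its `Configuration` class are hypothetical stand-ins, and the default value of `mime_library` is an assumption made for illustration. Only the `mime_library` attribute name is taken from the source of `mimetype`.

```ruby
# Hypothetical stand-in mirroring the configuration/configure pair above.
module Demo
  class Configuration
    attr_accessor :mime_library

    def initialize
      @mime_library = 'mime/types' # assumed default, for illustration only
    end
  end

  # Builds the settings object on first use and reuses it afterwards.
  def self.configuration
    @configuration ||= Configuration.new
  end

  # Yields the shared settings object so callers can mutate it in a block.
  def self.configure
    yield(configuration)
  end
end

Demo.configure { |config| config.mime_library = 'mini_mime' }
selected_library = Demo.configuration.mime_library
```

With the real gem, the equivalent call would presumably be `Henkei.configure { |config| config.mime_library = 'mini_mime' }`, as suggested by the deprecation message in `mimetype`.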
kill_server!()

Kills the server started by Henkei.server.

Always call this when you are done; otherwise Tika may keep running until you kill it manually.
Consider wrapping your extraction in a begin...rescue...ensure...end block and
  calling this method in the ensure block.

Henkei.server(:text)
reports = ["report1.docx", "report2.doc", "report3.pdf"]
begin
  my_texts = reports.map{ |report_path| Henkei.new(report_path).text }
rescue
ensure
  Henkei.kill_server!
end
# File lib/henkei.rb, line 229
def self.kill_server!
  return unless @@server_pid

  Process.kill('INT', @@server_pid)
  @@server_pid = nil
  @@server_port = nil
end
mimetype(content_type)
# File lib/henkei.rb, line 35
def self.mimetype(content_type)
  if Henkei.configuration.mime_library == 'mime/types' && defined?(MIME::Types)
    warn '[DEPRECATION] `mime/types` is deprecated. Please use `mini_mime` instead.'\
      ' Use Henkei.configure and assign "mini_mime" to `mime_library`.'
    MIME::Types[content_type].first
  else
    MiniMime.lookup_by_content_type(content_type).tap do |object|
      object.define_singleton_method(:extensions) { [extension] }
    end
  end
end
new(input)

Create a new instance of Henkei with a given document.

Using a file path:

Henkei.new 'sample.pages'

Using a URL:

Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'

From a stream or an object which responds to read:

Henkei.new File.open('sample.pages')
# File lib/henkei.rb, line 78
def initialize(input)
  if input.is_a? String
    if File.exist? input
      @path = input
    elsif input =~ URI::DEFAULT_PARSER.make_regexp
      @uri = URI.parse input
    else
      raise Errno::ENOENT, "missing file or invalid URI - #{input}"
    end
  elsif input.respond_to? :read
    @stream = input
  else
    raise TypeError, "can't read from #{input.class.name}"
  end
end
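The input dispatch in the constructor can be tried out without Tika. This is a minimal sketch, not Henkei's implementation: `classify_input` is a hypothetical helper that applies the same checks as `initialize` but returns a symbol instead of setting instance variables. The example URL is a placeholder and is never fetched.

```ruby
require 'uri'
require 'stringio'
require 'tempfile'

# Hypothetical helper mirroring the checks in Henkei#initialize.
def classify_input(input)
  if input.is_a?(String)
    # An existing file wins over a URI, as in the original constructor.
    return :path if File.exist?(input)
    return :uri if input =~ URI::DEFAULT_PARSER.make_regexp

    raise Errno::ENOENT, "missing file or invalid URI - #{input}"
  elsif input.respond_to?(:read)
    :stream
  else
    raise TypeError, "can't read from #{input.class.name}"
  end
end

sample = Tempfile.new('sample') # an existing file on disk
path_kind = classify_input(sample.path)
uri_kind = classify_input('http://example.com/sample.docx')
stream_kind = classify_input(StringIO.new('raw bytes'))
sample.close! # remove the temporary file
```

Note the order of the checks: a string that names an existing file is treated as a path even if it would also parse as a URI.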
read(type, data)

Read text or metadata from a data buffer.

data = File.read 'sample.pages'
text = Henkei.read :text, data
metadata = Henkei.read :metadata, data
# File lib/henkei.rb, line 53
def self.read(type, data)
  result = @@server_pid ? server_read(data) : client_read(type, data)

  case type
  when :text then result
  when :html then result
  when :metadata then JSON.parse(result)
  when :mimetype then Henkei.mimetype(JSON.parse(result)['Content-Type'])
  end
end
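The post-processing step in `read` depends only on the requested type, so it can be illustrated in isolation. The sketch below assumes a hypothetical helper, `parse_result`, and a made-up JSON string standing in for Tika's output; the `:mimetype` branch is left out because it needs a MIME library.

```ruby
require 'json'

# Hypothetical helper: post-processes raw output the way Henkei.read does
# for :text, :html and :metadata (the :mimetype branch is omitted here).
def parse_result(type, raw)
  case type
  when :text, :html then raw          # returned unchanged
  when :metadata then JSON.parse(raw) # decoded into a Hash
  end
end

# Made-up sample output, not captured from a real Tika run.
raw_metadata = '{"Content-Type":"application/pdf","Author":"Jane Doe"}'
metadata = parse_result(:metadata, raw_metadata)
text = parse_result(:text, "Hello\n")
unknown = parse_result(:other, 'ignored') # no branch matches, so nil
```

As in the original `case`, an unrecognized type falls through and yields `nil` rather than raising.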
server(type, custom_port = nil)

Starts the Tika server as a newly spawned process and returns its pid.

type :html, :text or :metadata
custom_port e.g. 9293

Henkei.server(:text, 9294)
# File lib/henkei.rb, line 206
def self.server(type, custom_port = nil)
  @@server_port = custom_port || DEFAULT_SERVER_PORT

  @@server_pid = Process.spawn(*tika_command(type, server: true))
  sleep(2) # Give the server 2 seconds to spin up.
  @@server_pid
end

Private Class Methods

client_read(type, data)

Internal helper for calling the Tika library directly

# File lib/henkei.rb, line 248
def self.client_read(type, data)
  Open3.capture2(*tika_command(type), stdin_data: data, binmode: true).first
end
java_path()

Provides the path to the Java binary

# File lib/henkei.rb, line 241
def self.java_path
  ENV['JAVA_HOME'] ? "#{ENV['JAVA_HOME']}/bin/java" : 'java'
end
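The lookup is simple enough to exercise directly. The following sketch is a hypothetical variant that takes the environment as an argument, so both branches can be shown without changing the real `ENV`; the `/opt/jdk` location is made up.

```ruby
# Hypothetical variant of java_path that accepts the environment as an
# argument so both branches can be exercised without touching the real ENV.
def java_path(env = ENV)
  env['JAVA_HOME'] ? "#{env['JAVA_HOME']}/bin/java" : 'java'
end

with_home = java_path({ 'JAVA_HOME' => '/opt/jdk' }) # made-up location
without_home = java_path({})                         # falls back to PATH lookup
```

When `JAVA_HOME` is unset, the bare `java` command is used, so the binary must be on the `PATH`.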
server_read(data)

Internal helper for calling a running Tika server

# File lib/henkei.rb, line 255
def self.server_read(data)
  s = TCPSocket.new('localhost', @@server_port)
  file = StringIO.new(data, 'r')

  loop do
    chunk = file.read(65_536)
    break unless chunk

    s.write(chunk)
  end

  # tell Tika that we're done sending data
  s.shutdown(Socket::SHUT_WR)

  resp = String.new ''
  loop do
    chunk = s.recv(65_536)
    break if chunk.empty? || !chunk

    resp << chunk
  end
  resp
end
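The exchange above relies on a half-close: the client shuts down only its write side to mark the end of the input, then keeps reading until the server closes the connection. The sketch below shows that pattern against a local stand-in server rather than Tika. The upcasing "server" and the `send_and_receive` helper are hypothetical and exist only for illustration.

```ruby
require 'socket'

# Stand-in for the Tika server: reads until the client half-closes, then replies.
server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
port = server.addr[1]

responder = Thread.new do
  client = server.accept
  request = client.read        # returns once the peer shuts down its write side
  client.write(request.upcase) # placeholder for real document processing
  client.close
end

# Hypothetical helper: send everything, half-close, then read the whole reply.
def send_and_receive(port, data)
  socket = TCPSocket.new('127.0.0.1', port)
  socket.write(data)
  socket.shutdown(Socket::SHUT_WR) # signal end of input, keep the read side open
  socket.read                      # read until the server closes the connection
ensure
  socket&.close
end

reply = send_and_receive(port, 'hello tika')
responder.join
server.close
```

Without the `shutdown` call, the stand-in server would block in `read` waiting for more input, and the client would block waiting for a reply.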
switch_for_type(type)

Internal helper that maps an extraction type to the Tika command-line switches

# File lib/henkei.rb, line 291
def self.switch_for_type(type)
  {
    text: ['-t'],
    html: ['-h'],
    metadata: %w[-m -j],
    mimetype: %w[-m -j]
  }[type]
end
tika_command(type, server: false)

Internal helper for building the Java command to call Tika

# File lib/henkei.rb, line 282
def self.tika_command(type, server: false)
  command = [java_path, '-Djava.awt.headless=true', '-jar', Henkei::JAR_PATH, "--config=#{Henkei::CONFIG_PATH}"]
  command += ['--server', '--port', @@server_port.to_s] if server
  command + switch_for_type(type)
end
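To see what the assembled command looks like, the two helpers can be re-created outside the class. This is a sketch under stated assumptions: `build_command` is a hypothetical name, the jar and config paths are placeholders, and the port is passed in explicitly instead of being read from a class variable. The switches are copied from `switch_for_type`.

```ruby
# Switch table copied from switch_for_type; fetch is used here so an unknown
# type raises KeyError (the original uses [] and would return nil instead).
def switch_for_type(type)
  { text: ['-t'], html: ['-h'], metadata: %w[-m -j], mimetype: %w[-m -j] }.fetch(type)
end

# Hypothetical re-creation of the command assembly; all paths are placeholders.
def build_command(type, java:, jar:, config:, port: nil)
  command = [java, '-Djava.awt.headless=true', '-jar', jar, "--config=#{config}"]
  command += ['--server', '--port', port.to_s] if port # server mode only
  command + switch_for_type(type)
end

client = build_command(:text, java: 'java', jar: 'tika-app.jar', config: 'tika-config.xml')
server = build_command(:metadata, java: 'java', jar: 'tika-app.jar',
                                  config: 'tika-config.xml', port: 9293)
```

Building the command as an array, as the original does, lets `Process.spawn` and `Open3.capture2` run it without going through a shell.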

Public Instance Methods

creation_date()

Returns the creation date of the Henkei document as a Time, parsed from the Creation-Date metadata entry, or nil if that entry is missing.

henkei = Henkei.new 'sample.pages'
henkei.creation_date
# File lib/henkei.rb, line 145
def creation_date
  return @creation_date if defined? @creation_date
  return unless metadata['Creation-Date']

  @creation_date = Time.parse(metadata['Creation-Date'])
end
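The parsing step in `creation_date` can be shown on a plain hash. The sketch below uses a hypothetical helper, `creation_date_from`, and a made-up timestamp; it leaves out the memoization that the method above performs.

```ruby
require 'time' # provides Time.parse

# Hypothetical helper: parses the 'Creation-Date' entry if it is present.
def creation_date_from(metadata)
  value = metadata['Creation-Date']
  value && Time.parse(value) # nil when the entry is missing
end

created = creation_date_from({ 'Creation-Date' => '2011-06-30T10:24:25Z' }) # made-up value
missing = creation_date_from({})
```

A document without a `Creation-Date` entry yields `nil` rather than raising an error.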
data()

Returns the raw/unparsed content of the Henkei document.

henkei = Henkei.new 'sample.pages'
henkei.data
# File lib/henkei.rb, line 185
def data
  return @data if defined? @data

  if path?
    @data = File.read @path
  elsif uri?
    @data = Net::HTTP.get @uri
  elsif stream?
    @data = @stream.read
  end

  @data
end
html()

Returns the text content of the Henkei document in HTML.

henkei = Henkei.new 'sample.pages'
henkei.html
# File lib/henkei.rb, line 110
def html
  return @html if defined? @html

  @html = Henkei.read :html, data
end
metadata()

Returns the metadata hash of the Henkei document.

henkei = Henkei.new 'sample.pages'
henkei.metadata['Content-Type']
# File lib/henkei.rb, line 121
def metadata
  return @metadata if defined? @metadata

  @metadata = Henkei.read :metadata, data
end
mimetype()

Returns the mimetype object of the Henkei document.

henkei = Henkei.new 'sample.docx'
henkei.mimetype.content_type #=> 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
henkei.mimetype.extensions #=> ['docx']
# File lib/henkei.rb, line 133
def mimetype
  return @mimetype if defined? @mimetype

  content_type = metadata['Content-Type'].is_a?(Array) ? metadata['Content-Type'].first : metadata['Content-Type']
  @mimetype = Henkei.mimetype(content_type)
end
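The method above accepts a `Content-Type` entry that is either a single string or an array of strings, and uses the first element in the array case. A minimal sketch of that normalization, with a hypothetical helper name and made-up metadata hashes:

```ruby
# Hypothetical helper: returns the first content type whether the metadata
# entry is a single string or an array of strings.
def primary_content_type(metadata)
  value = metadata['Content-Type']
  value.is_a?(Array) ? value.first : value
end

single = primary_content_type({ 'Content-Type' => 'application/pdf' })
multiple = primary_content_type({ 'Content-Type' => ['text/plain', 'text/html'] })
```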
path?()

Returns true if the Henkei document was specified using a file path.

henkei = Henkei.new '/my/document/path/sample.docx'
henkei.path? #=> true
# File lib/henkei.rb, line 157
def path?
  !!@path
end
stream?()

Returns true if the Henkei document was specified from a stream or an object which responds to read.

file = File.open('sample.pages')
henkei = Henkei.new file
henkei.stream? #=> true
# File lib/henkei.rb, line 176
def stream?
  !!@stream
end
text()

Returns the text content of the Henkei document.

henkei = Henkei.new 'sample.pages'
henkei.text
# File lib/henkei.rb, line 99
def text
  return @text if defined? @text

  @text = Henkei.read :text, data
end
uri?()

Returns true if the Henkei document was specified using a URI.

henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
henkei.uri? #=> true
# File lib/henkei.rb, line 166
def uri?
  !!@uri
end