class Hydra::Derivatives::Processors::FullText

Extracts the full text from the content using Solr's extract handler.

Public Instance Methods

process()

Run the full text extraction and save the result.

@return [TrueClass,FalseClass] whether the process was successful

# File lib/hydra/derivatives/processors/full_text.rb, line 8
def process
  output_file_service.call(extract, directives)
end
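As the listing shows, process hands the extracted text and the directives to whatever output_file_service is configured, so any object that responds to call(content, directives) would fit. A minimal sketch of such a stand-in, assuming a hypothetical in-memory store and a hypothetical :url key in the directives (neither is taken from this page):

```ruby
# Hypothetical in-memory store, used here instead of a real repository.
saved = {}

# Hypothetical stand-in for an output file service: it only needs to respond
# to call(content, directives), mirroring how process invokes it.
in_memory_service = lambda do |content, directives|
  saved[directives.fetch(:url)] = content
  true
end

# The :url key is an assumption made for this illustration.
result = in_memory_service.call('extracted text', { url: 'memory://full_text' })
puts saved.inspect
```

A lambda is used only because it is the shortest object that responds to call; a class with a class-level call method would serve equally well.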

Private Instance Methods

check_for_ssl()
# File lib/hydra/derivatives/processors/full_text.rb, line 69
def check_for_ssl
  uri.scheme == 'https'
end
connection_url()

@return [URI] path to the Solr collection

# File lib/hydra/derivatives/processors/full_text.rb, line 74
def connection_url
  ActiveFedora::SolrService.instance.conn.uri
end
extract()

Extract full text from the content using Solr's extract handler. The handler's JSON response is parsed and the text is read from the entry keyed by the empty string, with trailing whitespace removed.

@return [String] The extracted text

# File lib/hydra/derivatives/processors/full_text.rb, line 19
def extract
  JSON.parse(fetch)[''].rstrip
end
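The parsing step in extract can be tried in isolation. The sketch below assumes a made-up response body in which the extracted text sits under the empty-string key, as the listing implies; the sample text and the 'null_metadata' entry are illustrative assumptions, not captured Solr output.

```ruby
require 'json'

# Hypothetical response body shaped the way the listing expects: the
# extracted text is stored under the empty-string key (assumption).
sample_body = JSON.generate(
  '' => "Hello from the PDF\n\n",
  'null_metadata' => ['Content-Type', ['application/pdf']]
)

# Same expression as in extract, applied to the sample body instead of fetch.
extracted = JSON.parse(sample_body)[''].rstrip
puts extracted.inspect
```

Note that rstrip removes only trailing whitespace, so leading blank lines in the extracted text would be kept.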
fetch()

Send the request to the extract service and return the response if it was successful.

TODO: this pulls the whole file into memory. We should stream it from Fedora instead.

@return [String] the result of calling the extract service

# File lib/hydra/derivatives/processors/full_text.rb, line 26
def fetch
  resp = http_request
  raise "Solr Extract service was unsuccessful. '#{uri}' returned code #{resp.code} for #{source_path}\n#{resp.body}" unless resp.code == '200'

  file_content.rewind if file_content.respond_to?(:rewind)
  resp.body.force_encoding(resp.type_params['charset']) if resp.type_params['charset']
  resp.body
end
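The force_encoding call in fetch relabels the bytes of the body without converting them. A small sketch of that behaviour, assuming a hypothetical charset value of UTF-8 where the listing reads it from the response's type_params:

```ruby
# Bytes as they might arrive from the network: tagged as binary, although
# they are really UTF-8 text.
body = "r\u00E9sum\u00E9".b
charset = 'UTF-8' # hypothetical value; fetch takes it from resp.type_params

length_before = body.length # counted in bytes while tagged as binary

# Relabel the bytes in place, as fetch does; the bytes themselves are unchanged.
body.force_encoding(charset) if charset

puts "#{length_before} bytes, #{body.length} characters, #{body.encoding}"
```

Because nothing is transcoded, a wrong charset from the server would leave the string mislabelled; valid_encoding? is one way to detect that case.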
file_content()
# File lib/hydra/derivatives/processors/full_text.rb, line 46
def file_content
  @file_content ||= File.open(source_path).read
end
http_request()

Send the request to the extract service.

@return [Net::HTTPResponse] the result of calling the extract service

# File lib/hydra/derivatives/processors/full_text.rb, line 37
def http_request
  Net::HTTP.start(uri.host, uri.port, use_ssl: check_for_ssl) do |http|
    req = Net::HTTP::Post.new(uri.request_uri, request_headers)
    req.basic_auth uri.user, uri.password unless uri.password.nil?
    req.body = file_content
    http.request req
  end
end
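To see what http_request puts on the wire, the request object can be built without opening a connection. This is a sketch under stated assumptions: the host, credentials, payload, and literal header names are hypothetical, chosen only to illustrate the same Net::HTTP calls used in the listing.

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint and credentials, for illustration only.
uri = URI('https://user:secret@solr.example.org:8983/solr/hydra/' \
          'update/extract?extractOnly=true&wt=json&extractFormat=text')
payload = 'plain text standing in for the file bytes'

# Build the POST the same way the listing does, with literal header names.
req = Net::HTTP::Post.new(uri.request_uri,
                          'Content-Type' => 'text/plain',
                          'Content-Length' => payload.bytesize.to_s)
req.basic_auth(uri.user, uri.password) unless uri.password.nil?
req.body = payload

# Mirrors check_for_ssl: TLS is used only when the scheme is https.
use_ssl = uri.scheme == 'https'

puts req.path
puts req['Authorization']
puts use_ssl
```

Passing req to Net::HTTP.start, as the listing does, would then send it; that step is left out here so the sketch runs without a Solr server.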
mime_type()
# File lib/hydra/derivatives/processors/full_text.rb, line 56
def mime_type
  Hydra::Derivatives::MimeTypeService.mime_type(source_path)
end
original_size()
# File lib/hydra/derivatives/processors/full_text.rb, line 60
def original_size
  File.size(source_path)
end
request_headers()

@return [Hash] the request headers to send to the Solr extract service

# File lib/hydra/derivatives/processors/full_text.rb, line 51
def request_headers
  { Faraday::Request::UrlEncoded::CONTENT_TYPE => mime_type.to_s,
    Faraday::Adapter::CONTENT_LENGTH => original_size.to_s }
end
uri()

@return [URI] path to the extract service

# File lib/hydra/derivatives/processors/full_text.rb, line 65
def uri
  @uri ||= connection_url + 'update/extract?extractOnly=true&wt=json&extractFormat=text'
end
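The + in uri resolves the extract path relative to the connection URL, so the result depends on whether that URL ends in a slash. A brief sketch with hypothetical base URLs (the host and collection name are assumptions):

```ruby
require 'uri'

suffix = 'update/extract?extractOnly=true&wt=json&extractFormat=text'

# Hypothetical base URLs, differing only in the trailing slash.
with_slash = URI('http://localhost:8983/solr/hydra/') + suffix
without_slash = URI('http://localhost:8983/solr/hydra') + suffix

puts with_slash
# prints http://localhost:8983/solr/hydra/update/extract?extractOnly=true&wt=json&extractFormat=text
puts without_slash
# prints http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&extractFormat=text
```

Without the trailing slash the last path segment is replaced rather than extended, so the collection name is lost.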