module Wukong::Hadoop::HadoopInvocation

Provides methods for executing a map/reduce job on a Hadoop cluster via Hadoop streaming.

Public Instance Methods

hadoop_commandline()

Return the Hadoop command used to launch this job in a Hadoop cluster.

You should be able to copy, paste, and run this command unmodified when debugging.

@return [String]

# File lib/wukong-hadoop/runner/hadoop_invocation.rb, line 24
def hadoop_commandline
  [
   hadoop_runner,
   "jar #{hadoop_streaming_jar}",
   hadoop_jobconf_options,
   "-D mapred.job.name='#{job_name}'",
   hadoop_files,
   hadoop_other_args,
   "-mapper       '#{mapper_commandline}'",
   "-reducer      '#{reducer_commandline}'",
   "-input        '#{input_paths}'",
   "-output       '#{output_path}'",
   io_formats,
   hadoop_recycle_env,
  ].flatten.compact.join(" \t\\\n  ")
end
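
As an illustration of how the final `flatten.compact.join` step turns the parts into a command that can be pasted into a shell, here is a minimal sketch. The values in `parts` are hypothetical stand-ins for what helpers such as hadoop_runner and hadoop_streaming_jar would return; they are not taken from the library.

```ruby
# Hypothetical stand-ins for the values the real helper methods would return.
parts = [
  "hadoop",                             # stand-in for hadoop_runner
  "jar /path/to/hadoop-streaming.jar",  # stand-in for the streaming jar
  nil,                                  # a helper with nothing to add
  ["-D mapred.reduce.tasks=2"],         # a helper that returns an array
  "-mapper       'cat'",
  "-input        '/data/in'",
  "-output       '/data/out'",
]

# flatten merges nested arrays, compact drops nil entries, and the join
# string ends each part with a backslash and newline so the shell treats
# the whole thing as one continued command.
commandline = parts.flatten.compact.join(" \t\\\n  ")
puts commandline
```

Each part lands on its own line, and every line except the last ends in a backslash, which is why the printed command should remain runnable when copied as-is.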
input_format()

The input format to use.

Respects the value of --input_format.

@return [String]

# File lib/wukong-hadoop/runner/hadoop_invocation.rb, line 59
def input_format
  settings[:input_format]
end

job_name()

The job name that will be passed to Hadoop.

Respects the --job_name option if given; otherwise constructs a name from the given processors, the input paths, and the output path.

@return [String]

# File lib/wukong-hadoop/runner/hadoop_invocation.rb, line 48
def job_name
  return settings[:job_name] if settings[:job_name]
  relevant_filename = args.compact.uniq.map { |path| File.basename(path, '.rb') }.join('-')
  "#{relevant_filename}---#{input_paths}---#{output_path}".gsub(%r{[^\w/\.\-\+]+}, '')
end
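
To show what the constructed name looks like when --job_name is not given, the same two expressions can be run outside the module. The values of `args`, `input_paths`, and `output_path` below are made up for the example; in the module they come from the runner.

```ruby
# Hypothetical inputs; in the real module these come from the runner.
args        = ["tokenizer.rb", nil, "counter.rb"]
input_paths = "/data/logs/*.tsv"
output_path = "/tmp/word counts"

# Strip the .rb extension from each processor file and join with hyphens.
relevant_filename = args.compact.uniq.map { |path| File.basename(path, '.rb') }.join('-')

# Remove every character that is not a word character, '/', '.', '-' or '+'.
job_name = "#{relevant_filename}---#{input_paths}---#{output_path}".gsub(%r{[^\w/\.\-\+]+}, '')
puts job_name
# → tokenizer-counter---/data/logs/.tsv---/tmp/wordcounts
```

Note that the glob character and the space are dropped rather than replaced, so the name stays safe to embed in the single-quoted `-D mapred.job.name` option.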
output_format()

The output format to use.

Respects the value of --output_format.

@return [String]

# File lib/wukong-hadoop/runner/hadoop_invocation.rb, line 68
def output_format
  settings[:output_format]
end

remove_output_path()

Remove the output path.

Does nothing if the --dry_run option is also given.

# File lib/wukong-hadoop/runner/hadoop_invocation.rb, line 13
def remove_output_path
  execute_command("#{hadoop_runner} fs -rmr '#{output_path}'")
end
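
The dry-run behaviour is handled by execute_command, whose source is not shown on this page. The following is only a guess at the general shape of such a guard: the method name `run_or_print` and its `dry_run:` keyword are hypothetical, not the library's API.

```ruby
# Hypothetical sketch of a dry-run guard; the real execute_command is
# defined elsewhere and may work differently.
def run_or_print(command, dry_run:)
  if dry_run
    # Show the command that would have run, but do not touch the cluster.
    puts "[dry run] #{command}"
    return nil
  end
  system(command)
end

result = run_or_print("hadoop fs -rmr '/data/out'", dry_run: true)
```

With `dry_run: true` the command is only printed, so the output path is left in place.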