module Wukong::Hadoop::EnvMethods

Hadoop streaming exposes several environment variables to the scripts it executes. This module contains methods that make these variables easy to access from within a processor.

Since these environment variables are ultimately set by Hadoop's streaming jar when executing inside Hadoop, you'll have to set them manually when testing locally.
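One way to set these variables for a local test is sketched below. The helper name `with_streaming_env` and the sample values are hypothetical and not part of Wukong; the variable names are the ones read by the methods in this module.

```ruby
# Hypothetical helper (not part of Wukong): set environment variables for
# the duration of a block, then restore whatever was there before.
def with_streaming_env(vars)
  saved = vars.keys.map { |key| [key, ENV[key]] }.to_h
  vars.each { |key, value| ENV[key] = value }
  yield
ensure
  # Assigning nil to an ENV key deletes it, so variables that were
  # previously unset end up unset again.
  saved.each { |key, value| ENV[key] = value } if saved
end

# Made-up values standing in for what Hadoop streaming would provide.
fake_env = {
  'map_input_file'   => 'hdfs://namenode/logs/access-2010-01-15.log',
  'map_input_start'  => '0',
  'map_input_length' => '67108864'
}

with_streaming_env(fake_env) do
  puts ENV['map_input_file']
end
```

Restoring the previous values in an `ensure` clause keeps one test's fake environment from leaking into the next.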

Via @pskomoroch via @tlipcon:

"there is a little known Hadoop Streaming trick buried in this Python
 script. You will notice that the date is not actually in the raw log
 data itself, but is part of the filename. It turns out that Hadoop makes
 job parameters you would fetch in Java with something like
 job.get("mapred.input.file") available as environment variables for
 streaming jobs, with periods replaced with underscores:

   filepath = os.environ["map_input_file"]
   filename = os.path.split(filepath)[-1]

Public Instance Methods

attempt_id()

ID of the current map/reduce attempt.

@return [String]

# File lib/wukong-hadoop/hadoop_env_methods.rb, line 65
def attempt_id
  ENV['mapred_task_id']
end
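As an illustration of what can be done with the returned string, the sketch below splits an attempt ID into its parts. It assumes the classic `attempt_<timestamp>_<job>_<m|r>_<task>_<attempt>` layout; that layout, the sample ID, and the helper name `parse_attempt_id` are assumptions rather than anything this module guarantees.

```ruby
# Assumed layout: attempt_<jobtracker start>_<job number>_<m|r>_<task>_<attempt>
ATTEMPT_ID_PATTERN = /\Aattempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)\z/

# Hypothetical helper: returns a Hash of parts, or nil when the ID is
# missing or does not match the assumed layout.
def parse_attempt_id(id)
  match = ATTEMPT_ID_PATTERN.match(id.to_s)
  return nil unless match
  {
    job_id:  "job_#{match[1]}_#{match[2]}",
    phase:   match[3] == 'm' ? :map : :reduce,
    task:    match[4].to_i,
    attempt: match[5].to_i
  }
end

# Sample ID made up for illustration; inside Hadoop you would pass attempt_id.
info = parse_attempt_id('attempt_200707121733_0003_m_000005_0')
puts info.inspect
```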
curr_task_id()

ID of the current map/reduce task.

@return [String]

# File lib/wukong-hadoop/hadoop_env_methods.rb, line 72
def curr_task_id
  ENV['mapred_tip_id']
end
hadoop_streaming_parameter(name)

Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.

@param [String] name the '.' separated parameter name to fetch
@return [String] the value from the process' environment

# File lib/wukong-hadoop/hadoop_env_methods.rb, line 30
def hadoop_streaming_parameter name
  ENV[name.gsub('.', '_')]
end
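A minimal usage sketch follows. `EnvMethodsSketch` and `ExampleProcessor` are hypothetical stand-ins so the example runs without Wukong installed; the method body is copied from the listing above, and the job name is a made-up value.

```ruby
# Stand-in for the documented module so the sketch is self-contained.
module EnvMethodsSketch
  def hadoop_streaming_parameter(name)
    ENV[name.gsub('.', '_')]
  end
end

# Hypothetical processor class that mixes the method in.
class ExampleProcessor
  include EnvMethodsSketch
end

# Simulate what Hadoop streaming would place in the environment.
ENV['mapred_job_name'] = 'example-job'

processor = ExampleProcessor.new
# The dotted name is translated to ENV['mapred_job_name'].
job_name = processor.hadoop_streaming_parameter('mapred.job.name')
# A parameter that is not in the environment comes back as nil.
missing = processor.hadoop_streaming_parameter('no.such.parameter')

puts job_name
puts missing.inspect
```

Because a missing parameter yields `nil` rather than raising, callers that require a value may want to check for it explicitly.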
input_dir()

Directory of the (data) file currently being processed.

@return [String]

# File lib/wukong-hadoop/hadoop_env_methods.rb, line 44
def input_dir
  ENV['mapred_input_dir']
end
input_file()

Path of the (data) file currently being processed.

@return [String]

# File lib/wukong-hadoop/hadoop_env_methods.rb, line 37
def input_file
  ENV['map_input_file']
end
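The filename trick quoted at the top of this page can be sketched in Ruby as follows. The helper name `date_from_input_file`, the `YYYY-MM-DD` naming convention, and the sample path are assumptions for illustration only.

```ruby
# Hypothetical helper: pull a YYYY-MM-DD date out of the input file's name.
# Returns nil when the path is missing or the name contains no such date.
def date_from_input_file(path)
  return nil if path.nil?
  filename = File.basename(path)
  match = filename.match(/(\d{4}-\d{2}-\d{2})/)
  match && match[1]
end

# Simulated value; inside Hadoop you would pass input_file instead.
ENV['map_input_file'] = 'hdfs://namenode/logs/access-2010-01-15.log'
puts date_from_input_file(ENV['map_input_file'])
```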
map_input_length()

Length of the chunk currently being processed within the current input file.

@return [String]

# File lib/wukong-hadoop/hadoop_env_methods.rb, line 58
def map_input_length
  ENV['map_input_length']
end
map_input_start_offset()

Offset of the chunk currently being processed within the current input file.

@return [String]

# File lib/wukong-hadoop/hadoop_env_methods.rb, line 51
def map_input_start_offset
  ENV['map_input_start']
end
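Since both the offset and the length come back as Strings, arithmetic on them needs an explicit conversion. The sketch below shows one way to turn them into a byte range; the helper name `input_split_range` and the sample numbers are hypothetical.

```ruby
# Hypothetical helper: byte range of the current chunk as Integers.
# Accepts any Hash-like object so it can be exercised without touching ENV.
def input_split_range(env = ENV)
  start  = env['map_input_start']
  length = env['map_input_length']
  return nil if start.nil? || length.nil?
  first = Integer(start, 10)          # raises ArgumentError on non-numeric input
  first...(first + Integer(length, 10))
end

# Made-up values standing in for what Hadoop streaming would set.
range = input_split_range('map_input_start' => '134217728',
                          'map_input_length' => '67108864')
puts range.first
puts range.size
```

Using `Integer(value, 10)` instead of `to_i` makes a malformed value fail loudly rather than silently becoming 0.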