class OodCore::Job::Adapters::Slurm

An adapter object that describes the communication with a Slurm resource manager for job management.

Constants

STATE_MAP

Mapping of state codes for Slurm

Public Class Methods

new(opts = {})

@api private
@param opts [#to_h] the options defining this adapter
@option opts [Batch] :slurm The Slurm batch object
@see Factory.build_slurm

# File lib/ood_core/job/adapters/slurm.rb, line 371
def initialize(opts = {})
  o = opts.to_h.symbolize_keys

  @slurm = o.fetch(:slurm) { raise ArgumentError, "No slurm object specified. Missing argument: slurm" }
end
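
The option handling in the constructor can be tried in isolation. The following is a minimal sketch, not the adapter itself: `SketchAdapter` is a hypothetical stand-in, and `transform_keys(&:to_sym)` is used as a plain-Ruby substitute for ActiveSupport's `symbolize_keys` so the snippet runs without extra gems.

```ruby
# Hypothetical stand-in for the adapter; only the option handling is shown.
class SketchAdapter
  attr_reader :slurm

  def initialize(opts = {})
    # Plain-Ruby substitute for symbolize_keys (an assumption of this sketch)
    o = opts.to_h.transform_keys(&:to_sym)
    # Hash#fetch with a block runs the block only when the key is missing
    @slurm = o.fetch(:slurm) { raise ArgumentError, "No slurm object specified. Missing argument: slurm" }
  end
end

# String keys are accepted because they are symbolized first
adapter = SketchAdapter.new({ "slurm" => :fake_batch })

# A missing :slurm key raises ArgumentError
missing_message = begin
  SketchAdapter.new({})
rescue ArgumentError => e
  e.message
end
```

Using `fetch` with a block, rather than `o[:slurm]`, makes a missing option fail at construction time instead of surfacing later as a `nil` receiver.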

Public Instance Methods

delete(id)

Delete the submitted job.

@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong deleting a job
@return [void]
@see Adapter#delete

# File lib/ood_core/job/adapters/slurm.rb, line 568
def delete(id)
  @slurm.delete_job(id.to_s)
rescue Batch::Error => e
  # assume successful job deletion if can't find job id
  raise JobAdapterError, e.message unless /Invalid job id specified/ =~ e.message
end
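
The rescue clause above treats an unknown job id as a successful deletion and re-raises everything else; `hold` and `release` use the same pattern. A minimal sketch of that behavior, assuming hypothetical error classes `SketchBatchError` and `SketchAdapterError` in place of `Batch::Error` and `JobAdapterError`, with a block standing in for the call to the batch object:

```ruby
# Hypothetical stand-ins for Batch::Error and JobAdapterError.
class SketchBatchError < StandardError; end
class SketchAdapterError < StandardError; end

# Runs the block and swallows only the "Invalid job id specified" error.
def tolerate_missing_job
  yield
rescue SketchBatchError => e
  raise SketchAdapterError, e.message unless /Invalid job id specified/ =~ e.message
end

# Unknown job id: the error is swallowed and nil is returned
swallowed = tolerate_missing_job { raise SketchBatchError, "Invalid job id specified" }

# Any other batch error is converted to the adapter error
reraised = begin
  tolerate_missing_job { raise SketchBatchError, "some other failure" }
rescue SketchAdapterError => e
  e.message
end
```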
directive_prefix()
# File lib/ood_core/job/adapters/slurm.rb, line 575
def directive_prefix
  '#SBATCH'
end
hold(id)

Put the submitted job on hold.

@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong holding a job
@return [void]
@see Adapter#hold

# File lib/ood_core/job/adapters/slurm.rb, line 544
def hold(id)
  @slurm.hold_job(id.to_s)
rescue Batch::Error => e
  # assume successful job hold if can't find job id
  raise JobAdapterError, e.message unless /Invalid job id specified/ =~ e.message
end
info(id)

Retrieve job info from the resource manager.

@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong getting job info
@return [Info] information describing submitted job
@see Adapter#info

# File lib/ood_core/job/adapters/slurm.rb, line 474
def info(id)
  id = id.to_s
  info_ary = @slurm.get_jobs(id: id).map do |v|
    parse_job_info(v)
  end

  # If no job was found we assume that it has completed
  info_ary.empty? ? Info.new(id: id, status: :completed) : handle_job_array(info_ary, id)
rescue Batch::Error => e
  # set completed status if can't find job id
  if /Invalid job id specified/ =~ e.message
    Info.new(
      id: id,
      status: :completed
    )
  else
    raise JobAdapterError, e.message
  end
end
info_all(attrs: nil)

Retrieve info for all jobs from the resource manager.

@raise [JobAdapterError] if something goes wrong getting job info
@return [Array<Info>] information describing submitted jobs
@see Adapter#info_all

# File lib/ood_core/job/adapters/slurm.rb, line 461
def info_all(attrs: nil)
  @slurm.get_jobs(attrs: attrs).map do |v|
    parse_job_info(v)
  end
rescue Batch::Error => e
  raise JobAdapterError, e.message
end
info_where_owner(owner, attrs: nil)

Retrieve info for all jobs for a given owner or owners from the resource manager. The attrs keyword is accepted, but the source below does not forward it to get_jobs.

@param owner [#to_s, Array<#to_s>] the owner(s) of the jobs
@raise [JobAdapterError] if something goes wrong getting job info
@return [Array<Info>] information describing submitted jobs

# File lib/ood_core/job/adapters/slurm.rb, line 499
def info_where_owner(owner, attrs: nil)
  owner = Array.wrap(owner).map(&:to_s).join(',')
  @slurm.get_jobs(owner: owner).map do |v|
    parse_job_info(v)
  end
rescue Batch::Error => e
  raise JobAdapterError, e.message
end
release(id)

Release the job that is on hold.

@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong releasing a job
@return [void]
@see Adapter#release

# File lib/ood_core/job/adapters/slurm.rb, line 556
def release(id)
  @slurm.release_job(id.to_s)
rescue Batch::Error => e
  # assume successful job release if can't find job id
  raise JobAdapterError, e.message unless /Invalid job id specified/ =~ e.message
end
status(id)

Retrieve job status from the resource manager.

@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong getting job status
@return [Status] status of job
@see Adapter#status

# File lib/ood_core/job/adapters/slurm.rb, line 513
def status(id)
  id = id.to_s
  jobs = @slurm.get_jobs(
    id: id,
    attrs: [:job_id, :array_job_task_id, :state_compact]
  )
  # A job id can return multiple jobs if it corresponds to a job array
  # id, so we need to find the job that corresponds to the given job id
  # (if we can't find it, we assume it has completed)
  #
  # Match against the job id or the formatted job & task id "1234_0"
  if job = jobs.detect { |j| j[:job_id] == id || j[:array_job_task_id] == id }
    Status.new(state: get_state(job[:state_compact]))
  else
    # set completed status if can't find job id
    Status.new(state: :completed)
  end
rescue Batch::Error => e
  # set completed status if can't find job id
  if /Invalid job id specified/ =~ e.message
    Status.new(state: :completed)
  else
    raise JobAdapterError, e.message
  end
end
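
The lookup in `status` matches a requested id against either the plain job id or the formatted array task id. A minimal sketch of that matching step, assuming made-up rows shaped like the hashes the method reads (the ids and state codes are illustrative only):

```ruby
# Hypothetical rows; only the three requested attributes are present.
jobs = [
  { job_id: "1235", array_job_task_id: "1234_0", state_compact: "R" },
  { job_id: "1236", array_job_task_id: "1234_1", state_compact: "PD" }
]

# Same predicate as in status: match the job id or the "job_task" form
find_job = lambda do |id|
  jobs.detect { |j| j[:job_id] == id || j[:array_job_task_id] == id }
end

by_task_id = find_job.call("1234_1") # matched through :array_job_task_id
by_job_id  = find_job.call("1235")   # matched through :job_id
missing    = find_job.call("9999")   # nil, which status reports as :completed
```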
submit(script, after: [], afterok: [], afternotok: [], afterany: [])

Submit a job with the attributes defined in the job template instance.

@param script [Script] script object that describes the script and attributes for the submitted job
@param after [#to_s, Array<#to_s>] this job may be scheduled for execution at any point after dependent jobs have started execution
@param afterok [#to_s, Array<#to_s>] this job may be scheduled for execution only after dependent jobs have terminated with no errors
@param afternotok [#to_s, Array<#to_s>] this job may be scheduled for execution only after dependent jobs have terminated with errors
@param afterany [#to_s, Array<#to_s>] this job may be scheduled for execution after dependent jobs have terminated
@raise [JobAdapterError] if something goes wrong submitting a job
@return [String] the job id returned after successfully submitting a job
@see Adapter#submit

# File lib/ood_core/job/adapters/slurm.rb, line 392
def submit(script, after: [], afterok: [], afternotok: [], afterany: [])
  after      = Array(after).map(&:to_s)
  afterok    = Array(afterok).map(&:to_s)
  afternotok = Array(afternotok).map(&:to_s)
  afterany   = Array(afterany).map(&:to_s)

  # Set sbatch options
  args = []
  # ignore args, don't know how to do this for slurm
  args.concat ["-H"] if script.submit_as_hold
  args.concat (script.rerunnable ? ["--requeue"] : ["--no-requeue"]) unless script.rerunnable.nil?
  args.concat ["-D", script.workdir.to_s] unless script.workdir.nil?
  args.concat ["--mail-user", script.email.join(",")] unless script.email.nil?
  if script.email_on_started && script.email_on_terminated
    args.concat ["--mail-type", "ALL"]
  elsif script.email_on_started
    args.concat ["--mail-type", "BEGIN"]
  elsif script.email_on_terminated
    args.concat ["--mail-type", "END"]
  elsif script.email_on_started == false && script.email_on_terminated == false
    args.concat ["--mail-type", "NONE"]
  end
  args.concat ["-J", script.job_name] unless script.job_name.nil?
  args.concat ["-i", script.input_path] unless script.input_path.nil?
  args.concat ["-o", script.output_path] unless script.output_path.nil?
  args.concat ["-e", script.error_path] unless script.error_path.nil?
  args.concat ["--reservation", script.reservation_id] unless script.reservation_id.nil?
  args.concat ["-p", script.queue_name] unless script.queue_name.nil?
  args.concat ["--priority", script.priority] unless script.priority.nil?
  args.concat ["--begin", script.start_time.localtime.strftime("%C%y-%m-%dT%H:%M:%S")] unless script.start_time.nil?
  args.concat ["-A", script.accounting_id] unless script.accounting_id.nil?
  args.concat ["-t", seconds_to_duration(script.wall_time)] unless script.wall_time.nil?
  args.concat ['-a', script.job_array_request] unless script.job_array_request.nil?
  args.concat ['--qos', script.qos] unless script.qos.nil?
  args.concat ['--gpus-per-node', script.gpus_per_node] unless script.gpus_per_node.nil?
  # ignore nodes, don't know how to do this for slurm

  # Set dependencies
  depend = []
  depend << "after:#{after.join(":")}"           unless after.empty?
  depend << "afterok:#{afterok.join(":")}"       unless afterok.empty?
  depend << "afternotok:#{afternotok.join(":")}" unless afternotok.empty?
  depend << "afterany:#{afterany.join(":")}"     unless afterany.empty?
  args.concat ["-d", depend.join(",")]               unless depend.empty?

  # Set environment variables
  env = script.job_environment || {}
  args.concat ["--export", export_arg(env, script.copy_environment?)]

  # Set native options
  args.concat script.native if script.native

  # Set content
  content = if script.shell_path.nil?
              script.content
            else
              "#!#{script.shell_path}\n#{script.content}"
            end

  # Submit job
  @slurm.submit_string(content, args: args, env: env)
rescue Batch::Error => e
  raise JobAdapterError, e.message
end
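
The dependency handling in `submit` can be looked at on its own. The sketch below extracts that step into a hypothetical helper, `dependency_args` (not part of the adapter), to show how the four keyword arguments become a single `-d` value:

```ruby
# Hypothetical helper that reproduces only the "Set dependencies" step.
def dependency_args(after: [], afterok: [], afternotok: [], afterany: [])
  # Array() accepts a single id or a list; ids are normalized to strings
  after      = Array(after).map(&:to_s)
  afterok    = Array(afterok).map(&:to_s)
  afternotok = Array(afternotok).map(&:to_s)
  afterany   = Array(afterany).map(&:to_s)

  depend = []
  depend << "after:#{after.join(":")}"           unless after.empty?
  depend << "afterok:#{afterok.join(":")}"       unless afterok.empty?
  depend << "afternotok:#{afternotok.join(":")}" unless afternotok.empty?
  depend << "afterany:#{afterany.join(":")}"     unless afterany.empty?

  # Job ids within one type are joined with ":", types with ","
  depend.empty? ? [] : ["-d", depend.join(",")]
end

with_deps = dependency_args(afterok: [101, 102], afterany: "200")
# => ["-d", "afterok:101:102,afterany:200"]
no_deps = dependency_args
# => []
```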

Private Instance Methods

duration_in_seconds(time)

Convert duration to seconds

# File lib/ood_core/job/adapters/slurm.rb, line 581
def duration_in_seconds(time)
  return 0 if time.nil?
  time, days = time.split("-").reverse
  days.to_i * 24 * 3600 +
    time.split(':').map { |v| v.to_i }.inject(0) { |total, v| total * 60 + v }
end
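
A few sample conversions may help when reading the method above. The method body is copied unchanged; the input strings are illustrative values in the `days-hours:minutes:seconds` style the code splits on.

```ruby
def duration_in_seconds(time)
  return 0 if time.nil?
  # "1-02:03:04" splits into ["1", "02:03:04"]; reversing puts the clock
  # part first, so days is nil when no "-" is present
  time, days = time.split("-").reverse
  days.to_i * 24 * 3600 +
    time.split(':').map { |v| v.to_i }.inject(0) { |total, v| total * 60 + v }
end

with_days   = duration_in_seconds("1-02:03:04") # => 93784
min_and_sec = duration_in_seconds("10:30")      # => 630
only_sec    = duration_in_seconds("45")         # => 45
no_value    = duration_in_seconds(nil)          # => 0
```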
export_arg(env, copy_environment)

We default to --export=NONE, whereas Slurm defaults to ALL. We do this because Slurm then sets up a new environment for the job, loading /etc/profile and everything else a shell provides (such as the 'module' function), which the PUN's environment did not have. --export=ALL exports the PUN's environment.

# File lib/ood_core/job/adapters/slurm.rb, line 679
def export_arg(env, copy_environment)
  if !env.empty? && !copy_environment
    env.keys.join(",")
  elsif !env.empty? && copy_environment
    "ALL," + env.keys.join(",")
  elsif env.empty? && copy_environment
    # only this option changes behavior dramatically
    "ALL"
  else
    "NONE"
  end
end
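
The four branches of `export_arg` map onto four `--export` values. The sketch below copies the method and calls it with a made-up environment hash (`FOO` and `BAR` are illustrative names):

```ruby
def export_arg(env, copy_environment)
  if !env.empty? && !copy_environment
    env.keys.join(",")
  elsif !env.empty? && copy_environment
    "ALL," + env.keys.join(",")
  elsif env.empty? && copy_environment
    "ALL"
  else
    "NONE"
  end
end

env = { "FOO" => "1", "BAR" => "2" } # hypothetical job environment

vars_only     = export_arg(env, false) # => "FOO,BAR"
all_plus_vars = export_arg(env, true)  # => "ALL,FOO,BAR"
all_only      = export_arg({}, true)   # => "ALL"
none          = export_arg({}, false)  # => "NONE"
```

Only the variable names are passed on the command line; `submit` hands the values to `submit_string` separately through its `env:` argument.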
get_state(st)

Determine state from Slurm state code

# File lib/ood_core/job/adapters/slurm.rb, line 616
def get_state(st)
  STATE_MAP.fetch(st, :undetermined)
end
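
The contents of STATE_MAP are not listed on this page, so the sketch below uses a small assumed subset (`SKETCH_STATE_MAP`, with three Slurm compact state codes mapped to illustrative symbols) purely to show how `Hash#fetch` with a default yields `:undetermined` for unknown codes:

```ruby
# Assumed subset for illustration; the real STATE_MAP is not shown here.
SKETCH_STATE_MAP = {
  'R'  => :running,
  'PD' => :queued,
  'CD' => :completed
}.freeze

def sketch_get_state(st)
  # fetch with a second argument returns it when the key is absent
  SKETCH_STATE_MAP.fetch(st, :undetermined)
end

known   = sketch_get_state('R')   # => :running
unknown = sketch_get_state('ZZ')  # => :undetermined
missing = sketch_get_state(nil)   # => :undetermined
```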
handle_job_array(info_ary, id)
# File lib/ood_core/job/adapters/slurm.rb, line 655
def handle_job_array(info_ary, id)
  # If only one job was returned we return it
  return info_ary.first unless info_ary.length > 1

  parent_task_hash = {:tasks => []}

  info_ary.map do |task_info|
    parent_task_hash[:tasks] << {:id => task_info.id, :status => task_info.status}

    if task_info.id == id || task_info.native[:array_job_task_id] == id
      # Merge hashes without clobbering the child tasks
      parent_task_hash.merge!(task_info.to_h.select{|k, v| k != :tasks})
    end
  end

  Info.new(**parent_task_hash)
end
handle_null_account(account)

Replace '(null)' with nil

# File lib/ood_core/job/adapters/slurm.rb, line 651
def handle_null_account(account)
  (account != '(null)') ? account : nil
end
parse_job_info(v)

Parse hash describing Slurm job status

# File lib/ood_core/job/adapters/slurm.rb, line 621
def parse_job_info(v)
  allocated_nodes = parse_nodes(v[:node_list])
  if allocated_nodes.empty?
    if v[:scheduled_nodes] && v[:scheduled_nodes] != "(null)"
      allocated_nodes = parse_nodes(v[:scheduled_nodes])
    else
      allocated_nodes = [ { name: nil } ] * v[:nodes].to_i
    end
  end

  Info.new(
    id: v[:job_id],
    status: get_state(v[:state_compact]),
    allocated_nodes: allocated_nodes,
    submit_host: nil,
    job_name: v[:job_name],
    job_owner: v[:user],
    accounting_id: handle_null_account(v[:account]),
    procs: v[:cpus],
    queue_name: v[:partition],
    wallclock_time: duration_in_seconds(v[:time_used]),
    wallclock_limit: duration_in_seconds(v[:time_limit]),
    cpu_time: nil,
    submission_time: v[:submit_time] ? Time.parse(v[:submit_time]) : nil,
    dispatch_time: (v[:start_time].nil? || v[:start_time] == "N/A") ? nil : Time.parse(v[:start_time]),
    native: v
  )
end
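parse_nodes(node_list)

Convert a host list string to individual nodes. Handles single names such as "em082", comma-separated names such as "c427-032,c429-002", and a prefix such as "em", "c457-" or "c438-" followed by a bracketed list of numbers or ranges.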

# File lib/ood_core/job/adapters/slurm.rb, line 599
def parse_nodes(node_list)
  node_list.to_s.scan(/([^,\[]+)(?:\[([^\]]+)\])?/).map do |prefix, range|
    if range
      range.split(",").map do |x|
        x =~ /^(\d+)-(\d+)$/ ? ($1..$2).to_a : x
      end.flatten.map do |n|
        { name: prefix + n, procs: nil }
      end
    elsif prefix
      [ { name: prefix, procs: nil } ]
    else
      []
    end
  end.flatten
end
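
To make the regular expression easier to follow, here is the method copied as-is and applied to a few inputs. The bracketed host lists are illustrative values written for this sketch, chosen to exercise the list and range branches.

```ruby
def parse_nodes(node_list)
  # Each match is a prefix, optionally followed by "[...]" holding a
  # comma-separated list of numbers and "low-high" ranges
  node_list.to_s.scan(/([^,\[]+)(?:\[([^\]]+)\])?/).map do |prefix, range|
    if range
      range.split(",").map do |x|
        # String ranges keep zero padding: ("055".."056") => "055", "056"
        x =~ /^(\d+)-(\d+)$/ ? ($1..$2).to_a : x
      end.flatten.map do |n|
        { name: prefix + n, procs: nil }
      end
    elsif prefix
      [ { name: prefix, procs: nil } ]
    else
      []
    end
  end.flatten
end

single   = parse_nodes("em082").map { |n| n[:name] }
# => ["em082"]
mixed    = parse_nodes("em[014,055-056,161]").map { |n| n[:name] }
# => ["em014", "em055", "em056", "em161"]
ranged   = parse_nodes("c457-[011-012]").map { |n| n[:name] }
# => ["c457-011", "c457-012"]
separate = parse_nodes("c427-032,c429-002").map { |n| n[:name] }
# => ["c427-032", "c429-002"]
empty    = parse_nodes(nil)
# => []
```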
seconds_to_duration(time)

Convert seconds to duration

# File lib/ood_core/job/adapters/slurm.rb, line 589
def seconds_to_duration(time)
  "%02d:%02d:%02d" % [time/3600, time/60%60, time%60]
end
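
A short sketch of the formatting, with the method copied as-is and a few illustrative integer inputs. Note that the hours field is not wrapped at 24, so long wall times come out as a large hour count rather than a day prefix.

```ruby
def seconds_to_duration(time)
  # Integer division and modulo: hours, minutes within the hour, seconds
  "%02d:%02d:%02d" % [time/3600, time/60%60, time%60]
end

short     = seconds_to_duration(59)     # => "00:00:59"
hour_plus = seconds_to_duration(3661)   # => "01:01:01"
over_day  = seconds_to_duration(90000)  # => "25:00:00"
```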