class OodCore::Job::Adapters::Slurm
An adapter object that describes the communication with a Slurm
resource manager for job management.
Constants
- STATE_MAP
Mapping of state codes for Slurm
Public Class Methods
@api private
@param opts [#to_h] the options defining this adapter
@option opts [Batch] :slurm The Slurm batch object
@see Factory.build_slurm
# File lib/ood_core/job/adapters/slurm.rb, line 371
def initialize(opts = {})
  o = opts.to_h.symbolize_keys

  @slurm = o.fetch(:slurm) { raise ArgumentError, "No slurm object specified. Missing argument: slurm" }
end
Public Instance Methods
Delete the submitted job
@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong deleting a job
@return [void]
@see Adapter#delete
# File lib/ood_core/job/adapters/slurm.rb, line 568
def delete(id)
  @slurm.delete_job(id.to_s)
rescue Batch::Error => e
  # assume successful job deletion if can't find job id
  raise JobAdapterError, e.message unless /Invalid job id specified/ =~ e.message
end
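The delete, hold, and release methods share one error-handling pattern: a batch error whose message contains "Invalid job id specified" is treated as success, and any other batch error is re-raised as a JobAdapterError. The following is a minimal sketch of that pattern only. `BatchError`, `AdapterError`, and `tolerate_missing_job` are hypothetical stand-ins defined here so the example runs without ood_core; they are not part of the library.

```ruby
# Hypothetical stand-ins for Batch::Error and JobAdapterError.
class BatchError < StandardError; end
class AdapterError < StandardError; end

# Runs the block with the id converted to a string. A "job not found"
# failure is swallowed (returns nil); any other batch failure is
# re-raised as an adapter-level error.
def tolerate_missing_job(id)
  yield id.to_s
rescue BatchError => e
  raise AdapterError, e.message unless /Invalid job id specified/ =~ e.message
end

# A job that no longer exists is treated as already handled.
tolerate_missing_job(1234) { |_id| raise BatchError, "error: Invalid job id specified" }

# Any other failure surfaces to the caller.
begin
  tolerate_missing_job(1234) { |_id| raise BatchError, "error: Access denied" }
rescue AdapterError => e
  puts "adapter error: #{e.message}"
end
```

The message texts above are made up for the example; only the "Invalid job id specified" phrase comes from the source.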
# File lib/ood_core/job/adapters/slurm.rb, line 575
def directive_prefix
  '#SBATCH'
end
Put the submitted job on hold
@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong holding a job
@return [void]
@see Adapter#hold
# File lib/ood_core/job/adapters/slurm.rb, line 544
def hold(id)
  @slurm.hold_job(id.to_s)
rescue Batch::Error => e
  # assume successful job hold if can't find job id
  raise JobAdapterError, e.message unless /Invalid job id specified/ =~ e.message
end
Retrieve job info from the resource manager
@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong getting job info
@return [Info] information describing submitted job
@see Adapter#info
# File lib/ood_core/job/adapters/slurm.rb, line 474
def info(id)
  id = id.to_s
  info_ary = @slurm.get_jobs(id: id).map do |v|
    parse_job_info(v)
  end

  # If no job was found we assume that it has completed
  info_ary.empty? ? Info.new(id: id, status: :completed) : handle_job_array(info_ary, id)
rescue Batch::Error => e
  # set completed status if can't find job id
  if /Invalid job id specified/ =~ e.message
    Info.new(
      id: id,
      status: :completed
    )
  else
    raise JobAdapterError, e.message
  end
end
Retrieve info for all jobs from the resource manager
@raise [JobAdapterError] if something goes wrong getting job info
@return [Array<Info>] information describing submitted jobs
@see Adapter#info_all
# File lib/ood_core/job/adapters/slurm.rb, line 461
def info_all(attrs: nil)
  @slurm.get_jobs(attrs: attrs).map do |v|
    parse_job_info(v)
  end
rescue Batch::Error => e
  raise JobAdapterError, e.message
end
Retrieve info for all jobs for a given owner or owners from the resource manager
@param owner [#to_s, Array<#to_s>] the owner(s) of the jobs
@raise [JobAdapterError] if something goes wrong getting job info
@return [Array<Info>] information describing submitted jobs
# File lib/ood_core/job/adapters/slurm.rb, line 499
def info_where_owner(owner, attrs: nil)
  owner = Array.wrap(owner).map(&:to_s).join(',')
  @slurm.get_jobs(owner: owner).map do |v|
    parse_job_info(v)
  end
rescue Batch::Error => e
  raise JobAdapterError, e.message
end
Release the job that is on hold
@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong releasing a job
@return [void]
@see Adapter#release
# File lib/ood_core/job/adapters/slurm.rb, line 556
def release(id)
  @slurm.release_job(id.to_s)
rescue Batch::Error => e
  # assume successful job release if can't find job id
  raise JobAdapterError, e.message unless /Invalid job id specified/ =~ e.message
end
Retrieve job status from resource manager
@param id [#to_s] the id of the job
@raise [JobAdapterError] if something goes wrong getting job status
@return [Status] status of job
@see Adapter#status
# File lib/ood_core/job/adapters/slurm.rb, line 513
def status(id)
  id = id.to_s
  jobs = @slurm.get_jobs(
    id: id,
    attrs: [:job_id, :array_job_task_id, :state_compact]
  )
  # A job id can return multiple jobs if it corresponds to a job array
  # id, so we need to find the job that corresponds to the given job id
  # (if we can't find it, we assume it has completed)
  #
  # Match against the job id or the formatted job & task id "1234_0"
  if job = jobs.detect { |j| j[:job_id] == id || j[:array_job_task_id] == id }
    Status.new(state: get_state(job[:state_compact]))
  else
    # set completed status if can't find job id
    Status.new(state: :completed)
  end
rescue Batch::Error => e
  # set completed status if can't find job id
  if /Invalid job id specified/ =~ e.message
    Status.new(state: :completed)
  else
    raise JobAdapterError, e.message
  end
end
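The lookup in status matches either the plain job id or the formatted "job_task" id, and falls back to a completed state when nothing matches. A small sketch of that selection step follows, using plain hashes in place of the records returned by the batch object. The record values and the `find_state_code` helper are hypothetical and exist only for illustration.

```ruby
# Hypothetical records shaped like the hashes the adapter inspects:
# each carries :job_id, :array_job_task_id, and :state_compact.
jobs = [
  { job_id: "1235", array_job_task_id: "1234_0", state_compact: "R" },
  { job_id: "1234", array_job_task_id: "1234_1", state_compact: "PD" }
]

# Returns the compact state code of the matching record, or nil when
# no record matches (the adapter maps that case to :completed).
def find_state_code(jobs, id)
  id = id.to_s
  job = jobs.detect { |j| j[:job_id] == id || j[:array_job_task_id] == id }
  job && job[:state_compact]
end

puts find_state_code(jobs, "1234_0").inspect  # matched by formatted task id
puts find_state_code(jobs, 1234).inspect      # matched by plain job id
puts find_state_code(jobs, "9999").inspect    # no match
```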
Submit a job with the attributes defined in the job template instance
@param script [Script] script object that describes the script and attributes for the submitted job
@param after [#to_s, Array<#to_s>] this job may be scheduled for execution at any point after dependent jobs have started execution
@param afterok [#to_s, Array<#to_s>] this job may be scheduled for execution only after dependent jobs have terminated with no errors
@param afternotok [#to_s, Array<#to_s>] this job may be scheduled for execution only after dependent jobs have terminated with errors
@param afterany [#to_s, Array<#to_s>] this job may be scheduled for execution after dependent jobs have terminated
@raise [JobAdapterError] if something goes wrong submitting a job
@return [String] the job id returned after successfully submitting a job
@see Adapter#submit
# File lib/ood_core/job/adapters/slurm.rb, line 392
def submit(script, after: [], afterok: [], afternotok: [], afterany: [])
  after      = Array(after).map(&:to_s)
  afterok    = Array(afterok).map(&:to_s)
  afternotok = Array(afternotok).map(&:to_s)
  afterany   = Array(afterany).map(&:to_s)

  # Set sbatch options
  args = []
  # ignore args, don't know how to do this for slurm
  args.concat ["-H"] if script.submit_as_hold
  args.concat (script.rerunnable ? ["--requeue"] : ["--no-requeue"]) unless script.rerunnable.nil?
  args.concat ["-D", script.workdir.to_s] unless script.workdir.nil?
  args.concat ["--mail-user", script.email.join(",")] unless script.email.nil?
  if script.email_on_started && script.email_on_terminated
    args.concat ["--mail-type", "ALL"]
  elsif script.email_on_started
    args.concat ["--mail-type", "BEGIN"]
  elsif script.email_on_terminated
    args.concat ["--mail-type", "END"]
  elsif script.email_on_started == false && script.email_on_terminated == false
    args.concat ["--mail-type", "NONE"]
  end
  args.concat ["-J", script.job_name] unless script.job_name.nil?
  args.concat ["-i", script.input_path] unless script.input_path.nil?
  args.concat ["-o", script.output_path] unless script.output_path.nil?
  args.concat ["-e", script.error_path] unless script.error_path.nil?
  args.concat ["--reservation", script.reservation_id] unless script.reservation_id.nil?
  args.concat ["-p", script.queue_name] unless script.queue_name.nil?
  args.concat ["--priority", script.priority] unless script.priority.nil?
  args.concat ["--begin", script.start_time.localtime.strftime("%C%y-%m-%dT%H:%M:%S")] unless script.start_time.nil?
  args.concat ["-A", script.accounting_id] unless script.accounting_id.nil?
  args.concat ["-t", seconds_to_duration(script.wall_time)] unless script.wall_time.nil?
  args.concat ['-a', script.job_array_request] unless script.job_array_request.nil?
  args.concat ['--qos', script.qos] unless script.qos.nil?
  args.concat ['--gpus-per-node', script.gpus_per_node] unless script.gpus_per_node.nil?
  # ignore nodes, don't know how to do this for slurm

  # Set dependencies
  depend = []
  depend << "after:#{after.join(":")}" unless after.empty?
  depend << "afterok:#{afterok.join(":")}" unless afterok.empty?
  depend << "afternotok:#{afternotok.join(":")}" unless afternotok.empty?
  depend << "afterany:#{afterany.join(":")}" unless afterany.empty?
  args.concat ["-d", depend.join(",")] unless depend.empty?

  # Set environment variables
  env = script.job_environment || {}
  args.concat ["--export", export_arg(env, script.copy_environment?)]

  # Set native options
  args.concat script.native if script.native

  # Set content
  content = if script.shell_path.nil?
              script.content
            else
              "#!#{script.shell_path}\n#{script.content}"
            end

  # Submit job
  @slurm.submit_string(content, args: args, env: env)
rescue Batch::Error => e
  raise JobAdapterError, e.message
end
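To make the dependency handling in submit concrete, the sketch below reproduces just that step in isolation: each non-empty dependency list becomes a `type:id:id` group, and the groups are joined with commas into a single `-d` argument. `dependency_args` is a hypothetical helper written for this example, not a method of the adapter.

```ruby
# Builds the "-d" argument pair from the four dependency lists.
# Each list may be a single id or an array of ids; empty lists are
# skipped, and an empty array is returned when nothing is requested.
def dependency_args(after: [], afterok: [], afternotok: [], afterany: [])
  groups = {
    "after"      => after,
    "afterok"    => afterok,
    "afternotok" => afternotok,
    "afterany"   => afterany
  }

  depend = groups.map do |type, ids|
    ids = Array(ids).map(&:to_s)
    "#{type}:#{ids.join(':')}" unless ids.empty?
  end.compact

  depend.empty? ? [] : ["-d", depend.join(",")]
end

p dependency_args(afterok: [101, 102], afterany: "200")
# => ["-d", "afterok:101:102,afterany:200"]
p dependency_args
# => []
```

The job ids are arbitrary; in the adapter the resulting pair is appended to the rest of the sbatch arguments.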
Private Instance Methods
Convert duration to seconds
# File lib/ood_core/job/adapters/slurm.rb, line 581
def duration_in_seconds(time)
  return 0 if time.nil?
  time, days = time.split("-").reverse
  days.to_i * 24 * 3600 +
    time.split(':').map { |v| v.to_i }.inject(0) { |total, v| total * 60 + v }
end
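As a worked example, here is a standalone copy of the conversion with a few sample inputs. The method is private in the adapter, so it is redefined at top level here purely for illustration. The inputs assume the `days-hours:minutes:seconds` layout that the parsing logic implies.

```ruby
# Standalone copy of the conversion logic, for illustration only.
def duration_in_seconds(time)
  return 0 if time.nil?
  # "1-02:03:04" splits into ["1", "02:03:04"]; reversing puts the clock
  # part first so that `days` is nil when no day prefix is present.
  time, days = time.split("-").reverse
  days.to_i * 24 * 3600 +
    time.split(":").map(&:to_i).inject(0) { |total, v| total * 60 + v }
end

p duration_in_seconds("1-02:03:04")  # => 93784 (86400 + 7200 + 180 + 4)
p duration_in_seconds("05:30")       # => 330   (5 min 30 s)
p duration_in_seconds(nil)           # => 0
```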
We default to --export=NONE, whereas Slurm defaults to ALL. We do this because Slurm sets up a new environment, loading /etc/profile and everything else that provides the 'module' function (among other things shells provide), where the PUN did not. --export=ALL exports the PUN's environment.
# File lib/ood_core/job/adapters/slurm.rb, line 679
def export_arg(env, copy_environment)
  if !env.empty? && !copy_environment
    env.keys.join(",")
  elsif !env.empty? && copy_environment
    "ALL," + env.keys.join(",")
  elsif env.empty? && copy_environment
    # only this option changes behavior dramatically
    "ALL"
  else
    "NONE"
  end
end
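The four branches are easiest to read as a truth table over "is the environment hash empty?" and "copy the environment?". The sketch below is a standalone copy of the logic with one call per branch. The variable names `FOO` and `BAR` are placeholders.

```ruby
# Standalone copy of the branch logic, for illustration only.
def export_arg(env, copy_environment)
  if !env.empty? && !copy_environment
    env.keys.join(",")            # only the named variables
  elsif !env.empty? && copy_environment
    "ALL," + env.keys.join(",")   # caller's environment plus the named variables
  elsif env.empty? && copy_environment
    "ALL"                         # caller's environment as-is
  else
    "NONE"                        # nothing exported
  end
end

p export_arg({ "FOO" => "1", "BAR" => "2" }, false)  # => "FOO,BAR"
p export_arg({ "FOO" => "1" }, true)                 # => "ALL,FOO"
p export_arg({}, true)                               # => "ALL"
p export_arg({}, false)                              # => "NONE"
```

Only the variable names appear in the argument; the values travel separately through the `env:` option passed at submission.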
Determine state from Slurm state code
# File lib/ood_core/job/adapters/slurm.rb, line 616
def get_state(st)
  STATE_MAP.fetch(st, :undetermined)
end
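The lookup relies on `Hash#fetch` with a default, so any code missing from the table resolves to `:undetermined` rather than raising. The sketch below shows that behavior with a tiny stand-in table. `STATE_MAP_EXCERPT` and its three entries are assumptions made for this example and do not reproduce the adapter's full STATE_MAP.

```ruby
# Illustrative stand-in table; the entries are assumed for this sketch
# and the adapter's real STATE_MAP is not reproduced here.
STATE_MAP_EXCERPT = {
  "PD" => :queued,
  "R"  => :running,
  "CD" => :completed
}.freeze

# Same shape as the adapter's lookup, with the table passed in.
def get_state(st, map = STATE_MAP_EXCERPT)
  map.fetch(st, :undetermined)
end

p get_state("R")    # => :running
p get_state("???")  # => :undetermined
```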
# File lib/ood_core/job/adapters/slurm.rb, line 655
def handle_job_array(info_ary, id)
  # If only one job was returned we return it
  return info_ary.first unless info_ary.length > 1

  parent_task_hash = {:tasks => []}

  info_ary.map do |task_info|
    parent_task_hash[:tasks] << {:id => task_info.id, :status => task_info.status}

    if task_info.id == id || task_info.native[:array_job_task_id] == id
      # Merge hashes without clobbering the child tasks
      parent_task_hash.merge!(task_info.to_h.select{|k, v| k != :tasks})
    end
  end

  Info.new(**parent_task_hash)
end
Replace '(null)' with nil
# File lib/ood_core/job/adapters/slurm.rb, line 651
def handle_null_account(account)
  (account != '(null)') ? account : nil
end
Parse hash describing Slurm job status
# File lib/ood_core/job/adapters/slurm.rb, line 621
def parse_job_info(v)
  allocated_nodes = parse_nodes(v[:node_list])
  if allocated_nodes.empty?
    if v[:scheduled_nodes] && v[:scheduled_nodes] != "(null)"
      allocated_nodes = parse_nodes(v[:scheduled_nodes])
    else
      allocated_nodes = [ { name: nil } ] * v[:nodes].to_i
    end
  end

  Info.new(
    id: v[:job_id],
    status: get_state(v[:state_compact]),
    allocated_nodes: allocated_nodes,
    submit_host: nil,
    job_name: v[:job_name],
    job_owner: v[:user],
    accounting_id: handle_null_account(v[:account]),
    procs: v[:cpus],
    queue_name: v[:partition],
    wallclock_time: duration_in_seconds(v[:time_used]),
    wallclock_limit: duration_in_seconds(v[:time_limit]),
    cpu_time: nil,
    submission_time: v[:submit_time] ? Time.parse(v[:submit_time]) : nil,
    dispatch_time: (v[:start_time].nil? || v[:start_time] == "N/A") ? nil : Time.parse(v[:start_time]),
    native: v
  )
end
Convert host list string to individual nodes “em082” “em” “c457-” “c438-” “c427-032,c429-002”
# File lib/ood_core/job/adapters/slurm.rb, line 599
def parse_nodes(node_list)
  node_list.to_s.scan(/([^,\[]+)(?:\[([^\]]+)\])?/).map do |prefix, range|
    if range
      range.split(",").map do |x|
        x =~ /^(\d+)-(\d+)$/ ? ($1..$2).to_a : x
      end.flatten.map do |n|
        { name: prefix + n, procs: nil }
      end
    elsif prefix
      [ { name: prefix, procs: nil } ]
    else
      []
    end
  end.flatten
end
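The regular expression captures a prefix and an optional bracketed part; inside the brackets, comma-separated items are either single suffixes or `low-high` ranges that are expanded as zero-padded strings. The sketch below is a standalone copy of the method run against made-up host lists. The node names `n`, `a001`, `b002`, and `c1-` are hypothetical and are not taken from the original documentation.

```ruby
# Standalone copy of the host list expansion, for illustration only.
def parse_nodes(node_list)
  node_list.to_s.scan(/([^,\[]+)(?:\[([^\]]+)\])?/).map do |prefix, range|
    if range
      # "01-03,07" expands to ["01", "02", "03", "07"]
      range.split(",").map do |x|
        x =~ /^(\d+)-(\d+)$/ ? ($1..$2).to_a : x
      end.flatten.map do |n|
        { name: prefix + n, procs: nil }
      end
    elsif prefix
      [{ name: prefix, procs: nil }]
    else
      []
    end
  end.flatten
end

p parse_nodes("n[01-03,07]").map { |h| h[:name] }   # => ["n01", "n02", "n03", "n07"]
p parse_nodes("a001,b002").map { |h| h[:name] }     # => ["a001", "b002"]
p parse_nodes("c1-[011-012]").map { |h| h[:name] }  # => ["c1-011", "c1-012"]
p parse_nodes(nil)                                  # => []
```

Because the range endpoints stay strings, the expansion preserves leading zeros.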
Convert seconds to duration
# File lib/ood_core/job/adapters/slurm.rb, line 589
def seconds_to_duration(time)
  "%02d:%02d:%02d" % [time/3600, time/60%60, time%60]
end
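A short worked example of the formatting, again as a standalone copy since the method is private in the adapter. Note that hours are not rolled over into days, so values above 24 hours simply produce a larger hour field.

```ruby
# Standalone copy of the formatting logic, for illustration only.
# Expects an integer number of seconds.
def seconds_to_duration(time)
  "%02d:%02d:%02d" % [time / 3600, time / 60 % 60, time % 60]
end

p seconds_to_duration(3661)   # => "01:01:01"
p seconds_to_duration(90000)  # => "25:00:00"
```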