class Bosh::Monitor::Plugins::ResurrectorHelper::AlertTracker

Service which tracks alerts and decides whether or not the cluster is melting down. When the cluster is melting down, the resurrector backs off on fixing instances.

Attributes

minimum_down_jobs[RW]

Minimum number of down agents required before a meltdown is considered. With fewer down agents than this, the cluster is never treated as melting down.

percent_threshold[RW]

Fraction of the cluster that must be down for scanning to stop. Float between 0 and 1.

time_threshold[RW]

Window, as an integer number of seconds, within which an alert is considered “current”; alerts older than this are ignored.

Public Class Methods

new(args={})
# File lib/bosh/monitor/plugins/resurrector_helper.rb, line 44
def initialize(args={})
  @agent_manager       = Bhm.agent_manager
  @alert_times         = {} # maps JobInstanceKey to time of last Alert
  @minimum_down_jobs   = args.fetch('minimum_down_jobs', 5)
  @percent_threshold   = args.fetch('percent_threshold', 0.2)
  @time_threshold      = args.fetch('time_threshold', 600)
end
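
The constructor reads its options with `Hash#fetch` and string keys, so a symbol-keyed hash silently falls back to the defaults. The following is a minimal sketch of that lookup only; `tracker_options` is a hypothetical helper written for illustration and is not part of the class (the real constructor also needs `Bhm.agent_manager`, which is left out here).

```ruby
# Hypothetical helper that mirrors only the option lookup in initialize.
# It is not part of AlertTracker.
def tracker_options(args = {})
  {
    minimum_down_jobs: args.fetch('minimum_down_jobs', 5),
    percent_threshold: args.fetch('percent_threshold', 0.2),
    time_threshold:    args.fetch('time_threshold', 600),
  }
end

# No arguments: every option takes its default.
defaults = tracker_options

# String keys are the ones the lookup recognizes.
custom = tracker_options({ 'minimum_down_jobs' => 10, 'percent_threshold' => 0.5 })

# Symbol keys do not match the string lookups, so the defaults apply.
symbol_keys = tracker_options({ minimum_down_jobs: 10 })

puts defaults.inspect
puts custom.inspect
puts symbol_keys.inspect
```

If options come from a source that may produce symbol keys, converting them with `transform_keys(&:to_s)` before passing them in would avoid the silent fallback.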

Public Instance Methods

melting_down?(deployment)

“Melting down” means a large part of the cluster is offline and manual intervention may be required to fix it.

# File lib/bosh/monitor/plugins/resurrector_helper.rb, line 54
def melting_down?(deployment)
  agent_alerts = alerts_for_deployment(deployment)
  total_number_of_agents = agent_alerts.size
  number_of_down_agents = agent_alerts.select { |_, alert_time|
    alert_time > (Time.now - time_threshold)
  }.size

  return false if number_of_down_agents < minimum_down_jobs

  (number_of_down_agents.to_f / total_number_of_agents) >= percent_threshold
end
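
To see how the two thresholds interact, here is a rough standalone sketch of the same decision. It assumes a plain hash from an agent name to its last alert time in place of `alerts_for_deployment`, and takes the current time as a parameter so the result is reproducible. The method name `meltdown_sketch?` and the agent names are made up for illustration.

```ruby
# Hypothetical standalone version of the check. `alert_times` maps an agent
# name to the time of its last alert; agents that never alerted use Time.at(0).
def meltdown_sketch?(alert_times, now:, minimum_down_jobs: 5,
                     percent_threshold: 0.2, time_threshold: 600)
  total = alert_times.size
  # An agent counts as down only if its alert is newer than the time window.
  down = alert_times.count { |_, alert_time| alert_time > (now - time_threshold) }

  return false if down < minimum_down_jobs

  (down.to_f / total) >= percent_threshold
end

now    = Time.at(1_000_000)
recent = now - 60   # inside the 600 second window
stale  = Time.at(0) # treated as "no current alert"

# Builds a cluster of `total` agents, the first `down` of which alerted recently.
build = lambda do |total, down|
  (1..total).to_h { |i| ["agent-#{i}", i <= down ? recent : stale] }
end

few_down  = meltdown_sketch?(build.call(10, 4), now: now)  # 4 < 5 down agents
half_down = meltdown_sketch?(build.call(10, 5), now: now)  # 5 down, 0.5 >= 0.2
sparse    = meltdown_sketch?(build.call(100, 5), now: now) # 5 down, 0.05 < 0.2

puts [few_down, half_down, sparse].inspect # prints [false, true, false]
```

Both conditions must hold: enough agents are down in absolute terms, and they make up a large enough share of the deployment.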

record(agent_key, alert_time)
# File lib/bosh/monitor/plugins/resurrector_helper.rb, line 66
def record(agent_key, alert_time)
  @alert_times[agent_key] = alert_time
end

Private Instance Methods

alerts_for_deployment(deployment)
# File lib/bosh/monitor/plugins/resurrector_helper.rb, line 72
def alerts_for_deployment(deployment)
  agents = @agent_manager.get_agents_for_deployment(deployment)
  keys = agents.values.map { |agent|
    JobInstanceKey.new(agent.deployment, agent.job, agent.instance_id)
  }

  result = {}
  keys.each { |key| result[key] = @alert_times.fetch(key, Time.at(0)) }
  result
end
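
The lookup above gives every known agent an entry, using `Time.at(0)` for agents with no recorded alert so they never fall inside the time window. The sketch below illustrates that defaulting under one assumption: that `JobInstanceKey` compares by value, so a freshly built key finds the entry stored by `record`. `KeySketch` is a hypothetical `Struct` stand-in for it, not the real class.

```ruby
# Hypothetical stand-in for JobInstanceKey. A Struct compares and hashes by
# value, which is what the lookup needs (assumed, not confirmed by the source).
KeySketch = Struct.new(:deployment, :job, :id)

# One alert has been recorded, for web/0 only.
alert_times = { KeySketch.new('dep', 'web', '0') => Time.at(1_000_000) }

# Keys rebuilt from the agent list, as alerts_for_deployment does.
known_agents = [
  KeySketch.new('dep', 'web', '0'),
  KeySketch.new('dep', 'web', '1'),
]

# Agents without a recorded alert default to the epoch.
result = known_agents.to_h { |key| [key, alert_times.fetch(key, Time.at(0))] }

result.each { |key, time| puts "#{key.job}/#{key.id}: #{time.to_i}" }
```

Because every agent in the deployment appears in the result, its size is the total agent count used as the denominator in `melting_down?`.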