Alarms: Determining When Something Has Gone Wrong
Querying metrics is useful, but as operators we often want to monitor metrics, resources, and other system state continuously. Without automation, we watch for dangerous metric levels by repeatedly checking dashboards. Typically, we run kubectl, AWS, and Terraform commands to see which resources are deployed, and we run shell commands to assess system health, all by hand.
Instead of performing these checks manually, we can create an alarm in Op. Alarms can be defined on metrics, resources, and system state. Let's create an alarm on a CPU metric that triggers whenever CPU usage exceeds 65% for at least 15 seconds in a 30-second interval. The base Op command to configure this alarm looks like this and includes the following parameters:
op> alarm high_cpu_alarm = (cpu_usage > 65 | sum(30)) >= 15.0
- Alarm Name - must be globally unique and may contain only alphanumeric characters, underscores, and/or dashes
- fire_query - this is the condition that triggers the alarm. It must be a valid Op statement.
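To make the fire condition concrete, here is a minimal Python sketch of how a windowed query such as (cpu_usage > 65 | sum(30)) >= 15.0 could be evaluated. It is not Op and not how Shoreline implements alarms. It assumes one cpu_usage sample per second, and the function name and parameters are hypothetical, chosen only for illustration.

```python
from collections import deque


def fire_condition(samples, threshold=65.0, window=30, min_hits=15):
    """Return one boolean per sample: is the fire condition met at that point?

    Assumption: samples arrive once per second, so a window of 30 samples
    stands in for the 30-second interval described in the text.
    """
    recent = deque(maxlen=window)  # 1 for a sample above threshold, else 0
    results = []
    for value in samples:
        recent.append(1 if value > threshold else 0)
        # Equivalent of "sum(30) >= 15": count breaches in the window.
        results.append(sum(recent) >= min_hits)
    return results


# 15 seconds of high CPU followed by 30 seconds of low CPU.
readings = [80.0] * 15 + [20.0] * 30
fired = fire_condition(readings)
```

In this sketch the condition first becomes true on the 15th high sample and stays true until enough high samples have slid out of the 30-sample window.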
The alarm command above defines an alarm called high_cpu_alarm, but more details are needed to fully define and enable it. These are added by executing additional Op commands:
- clear_query - what condition clears the alarm?
op> high_cpu_alarm.clear_query = (cpu_usage < 65 | sum(30)) >= 15.0
- resource_query - which resources does this alarm act upon (host, pod, or container)?
op> high_cpu_alarm.resource_query = host
- resource_type - what is the specific resource type (HOST, POD, CONTAINER)?
op> high_cpu_alarm.resource_type = "HOST"
- raise_for - only "local" alarms are currently supported
op> high_cpu_alarm.raise_for = "local"
- raise_family - to make this alarm interchangeable and actionable between the CLI and UI, set the family type to "custom"
op> high_cpu_alarm.raise_family = "custom"
- metric_name - what specific resource metric is being monitored?
op> high_cpu_alarm.metric_name = "cpu_usage"
- condition_type - is the condition defined as above or below the threshold?
op> high_cpu_alarm.condition_type = "above"
- condition_value - what is the condition value?
op> high_cpu_alarm.condition_value = "65"
Executing these Op commands updates the alarm.
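The interplay between the fire and clear conditions can be sketched in Python as a two-state machine. This is an illustrative model only, not Op and not Shoreline's implementation. It assumes one cpu_usage sample per second and that an alarm that is already firing stays raised until the clear condition is met. The function name and parameters are hypothetical.

```python
from collections import deque


def alarm_states(samples, threshold=65.0, window=30, min_hits=15):
    """Return one boolean per sample: is the alarm raised after that sample?

    Assumptions (not confirmed by the source): one sample per second, and
    the alarm toggles only when the opposite condition is satisfied.
    """
    above = deque(maxlen=window)  # breaches of the fire comparison
    below = deque(maxlen=window)  # matches of the clear comparison
    firing = False
    states = []
    for value in samples:
        above.append(1 if value > threshold else 0)
        below.append(1 if value < threshold else 0)
        if not firing and sum(above) >= min_hits:
            firing = True   # models the fire condition
        elif firing and sum(below) >= min_hits:
            firing = False  # models the clear condition
        states.append(firing)
    return states


# 15 seconds above the threshold, then 15 seconds below it.
readings = [80.0] * 15 + [20.0] * 15
states = alarm_states(readings)
```

Under these assumptions the alarm is raised on the 15th high sample and cleared on the 15th low sample. A sample of exactly 65 counts toward neither condition, because both comparisons in the queries are strict.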
By default, all defined objects in Op start in a disabled state. This ensures that only authorized and fully prepared computation is performed, in keeping with Op's focus on safety and reliability. With the definition, clear condition, and resource statements in place, our alarm is ready to be deployed. To do this, we enable our alarm so that it runs on each of our hosts, constantly checking whether CPU usage has exceeded the threshold:
- enable - the default is false
op> enable high_cpu_alarm
To verify that the alarm is defined and enabled, we use the List command:
op> list alarm | name="high_cpu_alarm"
Our new alarm is also visible in the Shoreline UI, but it needs more information in order to be synchronized with and editable in the UI. This step is optional but highly recommended so that operators can use both the CLI and UI seamlessly.
The Op commands to synchronize this alarm to the UI are:
op> high_cpu_alarm.description = "cpu_usage exceeds 65% 15s out of 30s"
op> high_cpu_alarm.fire_title_template = "cpu usage above threshold"
op> high_cpu_alarm.fire_short_template = "cpu usage above threshold"
op> high_cpu_alarm.resolve_title_template = "cpu usage below threshold"
op> high_cpu_alarm.resolve_short_template = "cpu usage below threshold"
The alarm is now completely configured to be managed from both the CLI and the UI.
To see all the configured alarms, use the List command.
op> list alarm