Tying Everything Together: Op Self-Tutorial

In this 15-minute tutorial, we'll cover the fundamental building blocks of Shoreline. The tutorial focuses on Op, Shoreline's domain-specific language for operations. Op allows operators (SREs, DevOps engineers, SysAdmins, and SWEs) to:

  1. Interactively debug their systems
  2. Create automated remediations to detect and mitigate issues without operator involvement

This tutorial focuses on Shoreline running in a Kubernetes environment on AWS. As of this writing, Shoreline also runs directly on AWS VMs; however, that setup is not covered here.

Resources

Operators care about the resources they are responsible for. To that end, Shoreline discovers the hosts, pods, and containers in your environment. Let's begin with a basic resource query to see all of our hosts.

op> host

Not only does Shoreline discover your resources, it captures their metadata as well. Let's see all of the fields that we have captured.

op> columns host

That's quite a few fields: Shoreline captures metadata - tags from EC2 as well as Kubernetes - and makes it all available in one place. Let's turn on the az tag and then see the availability zone of each of our hosts.

op> config show resources.fields.az
op> host

We can filter on tags as well. Let's fetch only those hosts in the us-west-2c availability zone.

op> host | az="us-west-2c"

The pipe operator is inspired by the shell's pipe - you use it to chain together statements in Op, passing data from one to the next. In the above, a tag filter placed after a resource query narrows the hosts to only those whose tags match.

Let's combine more statements. Shoreline also discovers resource topology, e.g. which pods are scheduled to which hosts and which containers run in those pods. Let's fetch all of the pods scheduled on the hosts in zone us-west-2c:

op> host | az="us-west-2c" | .pod 

In the shell, we often create aliases for commands that we frequently run. Inspired by this, Op lets you define stored statements. Let's create a stored resource query that fetches all the hosts in us-west-2c.

op> resource uswest2c = host | az="us-west-2c"
op> uswest2c

After running the above, the uswest2c symbol resolves to all hosts in that zone. We can also attach metadata to these symbols and list all of the symbols stored in the collective memory.

op> uswest2c.description = "Hosts in us-west-2c"
op> list resource

These symbols are accessible to all operators, which prevents useful commands from being trapped in individual operators' heads or growing out of date on wikis.
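
Because a stored query behaves just like the statement it names, any operator can drop it into a larger pipeline. As a quick sketch using the uswest2c symbol defined above, fetching the pods scheduled on those hosts should be as simple as:

op> uswest2c | .pod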

Metrics

Operators care not only about which resources they are responsible for, but also about how those resources are performing. To that end, metrics are integrated into Shoreline as a first-class entity. We can combine resource and metric queries using the pipe operator. Let's fetch the CPU usage for all of our hosts:

op> host | cpu_usage

Like Prometheus, Shoreline polls Kubernetes exporters to ingest metrics, which makes it easy to drop into a Kubernetes environment. Shoreline also supports deployment on virtual machines; in that environment, it installs the same exporters to gather metrics.

Op contains a full metrics expression language. Its complete syntax is outside the scope of this tutorial, but we can do some simple manipulations of our CPU metric. Let's grab it for the last 30 seconds:

op> host | cpu_usage | window(30s)

Now let's take the average for each resource:

op> host | cpu_usage | window(30s) | mean(30)

Op also supports dynamic filters. For example, we can filter down to the hosts whose cpu_usage is currently greater than 10%.

op> host | filter(cpu_usage > 10)

Try raising the threshold higher and higher. You'll see more and more hosts filtered out.

op> host | filter(cpu_usage > 90)

Like resource queries, stored metrics have definitions that we can list. This is a pretty large list - Shoreline comes preloaded with common metric queries for cAdvisor and node exporters. We can filter these by name, matching on prefix.

op> list metric
op> list metric | name="container"

Actions

While Shoreline is useful as a resource inventory and observability tool, it really shines as a mitigation tool, i.e. Shoreline lets you take action. To that end, Shoreline supports executing distributed commands, which can run on hosts or in containers. Let's start by creating a resource query that returns our Shoreline agent containers; we'll use these to test actions.

op> resource shoreline = host | .pod | app="shoreline" | .container
op> shoreline

Actions fully support the shell: they are just snippets of shell, and those snippets can even reference entire scripts. Let's run an action in each of our Shoreline containers that lists the 10 processes using the most memory:

op> shoreline | `top -b -o -%MEM -n 1 | tail -n 10`
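
Since the backticks just wrap shell, an action can also invoke an existing script rather than an inline command. As a sketch (the script path here is hypothetical), that would look like:

op> shoreline | `bash /opt/scripts/check_memory.sh`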

Like resources and metrics, actions can be stored as well. To do this, we define an action:

op> action top_memory = `top -b -o -%MEM -n 1 | tail -n 10`

We can now use it by name:

op> shoreline | top_memory

Again, like resources and metrics, we can attach metadata to our actions and list all of the stored actions:

op> top_memory.description = "Top processes by memory usage"
op> list action

Alarms

Individual resources, metrics, and actions are useful for interactive debugging, but the power to perform automated remediation comes from combining them into higher-level constructs. The first of these is an alarm. Alarms can fire on metrics and/or Linux commands, which means you can create alarms that take into account fast-moving metrics along with system state.

Let's create the most basic type of alarm: an alarm on a single metric. We'll define the alarm to fire when the cpu_usage for a host has exceeded 70% at least 20 times in the last 30 seconds. In the expression below, (cpu_usage > 70) yields a 1 or 0 for each sample, sum(30) adds those values over a 30-second window, and the alarm fires once that count reaches 20.

op> alarm high_cpu_alarm = ((cpu_usage > 70) | sum(30)) >= 20

Next, we need to define how to clear the alarm. The fire and clear conditions don't have to be symmetrical; e.g. we might want it to be easier to enter the alarm state than to exit it, giving us some confidence that the alarm won't flap up and down.

In this case, though, we'll define a symmetrical alarm using the opposite condition, i.e. we'll clear the alarm when cpu_usage exceeds 70% fewer than 20 times in the last 30 seconds.

op> high_cpu_alarm.clear_query = ((cpu_usage > 70) | sum(30)) < 20
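
As an aside, an asymmetric clear would simply use a different threshold or count. A sketch (the 50% threshold and count of 25 are illustrative values, not part of this tutorial) that only clears once cpu_usage has stayed below 50% for at least 25 of the last 30 seconds might look like this; we'll stick with the symmetrical definition above:

op> high_cpu_alarm.clear_query = ((cpu_usage < 50) | sum(30)) >= 25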

Next, we need to instruct the alarm to monitor a set of our resources. Let's monitor all of the hosts by setting the alarm's resource query to host:

op> high_cpu_alarm.resource_query = host

Like the other objects that we discussed before, we can also attach metadata to our alarm:

op> high_cpu_alarm.description = "Alarm when host cpu usage is high"

Our alarm is now fully defined, but it is not yet running. To run the alarm so that it constantly checks whether its condition is met, we need to enable it. There is a special command for this called enable:

op> enable high_cpu_alarm

Finally, like all other objects, we can list alarms to see what we have created:

op> list alarm

Bots

So far we have created an alarm, but nothing happens if the alarm fires. To solve that problem, we need to link an action to the alarm. To do that, we'll define a bot. Bots are if-this-then-that style structures that link alarms to actions. Let's create a bot that runs our top_memory action from before whenever the high_cpu_alarm fires:

op> bot check_memory_bot = if high_cpu_alarm then top_memory fi

As usual, we'll add a description to the bot:

op> check_memory_bot.description = "If host has high cpu, check shoreline container memory"

Now, when the alarm fires, the top_memory action will be executed. But where will it run? By default, Shoreline tries to execute the action on the same resource that triggered the alarm, in this case a host. Let's override that so the top_memory action executes in the Shoreline container scheduled on the host that raised the alarm.

op> top_memory.resource_query = shoreline

When an action is used in a bot, we need to enable the action as well. This tells the system that it is safe to execute the action on your behalf. We'll also need to enable the bot so that it links the alarm and action:

op> enable top_memory
op> enable check_memory_bot

Finally, let's inspect our bot definitions before finishing the tutorial:

op> list bot

Conclusion

We hope that you found this a useful introduction to Op and Shoreline. There are many more capabilities and parameters that were not covered in this tutorial; you can find them throughout the rest of the documentation or by reaching out via Slack or email. We hope this gives you enough starting knowledge to begin automating your operational tasks!

