Getting Started With Op
Op has five core types of objects:
- Resources - the infrastructure objects in your environment.
- Metrics - time series data associated with your resources, such as CPU utilization, latency, throughput, or error rate.
- Actions - shell commands and shell scripts that act on your resources.
- Alarms - checks, built on metrics and shell commands, that determine whether the state of your resources indicates an issue.
- Bots - bind alarms and actions together: when an alarm fires, a mitigating action is automatically run to fix the issue.
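The relationship between alarms, actions, and bots can be sketched in a few lines of Python. This is only an illustration of the idea, not how Op is implemented: the class names, the `evaluate` method, and the sample alarm and action are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Alarm:
    name: str
    fires: Callable[[dict], bool]  # check over a resource's current state

@dataclass
class Action:
    name: str
    run: Callable[[dict], str]  # stands in for a shell command or script

@dataclass
class Bot:
    alarm: Alarm
    action: Action

    def evaluate(self, resource: dict) -> Optional[str]:
        """Run the action only when the alarm fires for this resource."""
        if self.alarm.fires(resource):
            return self.action.run(resource)
        return None

# Hypothetical alarm, action, and resources for illustration only.
high_cpu = Alarm("high_cpu", lambda r: r["cpu_usage"] > 0.8)
restart = Action("restart", lambda r: f"restarted {r['name']}")
bot = Bot(high_cpu, restart)

print(bot.evaluate({"name": "host-a", "cpu_usage": 0.95}))
print(bot.evaluate({"name": "host-b", "cpu_usage": 0.10}))
```

The bot holds no logic of its own beyond the binding: the alarm decides when, and the action decides what.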
We will start our Op walkthrough with Resources; each of these objects will be explained in detail in separate sections.
Introduction to Resources
Resources can be hosts, pods, containers, virtual machines, database instances, and so on. Different platforms have different resources, e.g. pods and containers for Kubernetes, and virtual machines for AWS, GCP, or Azure. Op explicitly models resources and their relationships. Let's start off with a simple resource query from the command line:
Above you can see each of the host resources that appear in your environment. In this example, we have 3 AWS EC2 instances, all in the us-west-2 region but in different availability zones. Each of these hosts is running a Shoreline agent. The agent is a program that communicates with the Shoreline backend, sends up telemetry and other environmental data, and receives commands to carry out. In the above output, you can see each resource's name, region, and zone. Every object in Shoreline has a globally unique id and a human-readable name.
Objects also have many other fields and you can configure their output. For example, hosts have tags that encode information such as geography or workload. We can interactively adjust this output for the session.
In this example, we have resources in AWS that have tags. Tags are name-value pairs that are used to categorize and organize resources. Let's print out the az tag (the availability zone tag, currently called “topology.kubernetes.io/zone”) for each host.
The resources_tags() function tells us what kind of tags our resources have. For example, running resources_tags(type="HOST") shows that we have a region and a zone tag for our hosts (among other tags). We can use this function to find existing tags and to better filter the resources we’re looking for.
Note: Currently, the CLI only outputs the JSON response for this function. Formatting the output is planned for a future release of Op.
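To make the idea of tag discovery and tag filtering concrete, here is a minimal Python sketch. It does not call Op: the host records, the `tag_names` helper, and the `filter_by_tag` helper are hypothetical stand-ins for what resources_tags() and a tag filter do conceptually.

```python
# Hypothetical in-memory model of tagged resources; not Op's actual data format.
hosts = [
    {"name": "host-a", "tags": {"region": "us-west-2", "zone": "us-west-2a"}},
    {"name": "host-b", "tags": {"region": "us-west-2", "zone": "us-west-2b"}},
    {"name": "host-c", "tags": {"region": "us-west-2", "zone": "us-west-2c"}},
]

def tag_names(resources):
    """Return the sorted set of tag names present on any resource."""
    return sorted({name for r in resources for name in r["tags"]})

def filter_by_tag(resources, name, value):
    """Keep only the resources whose tag `name` equals `value`."""
    return [r for r in resources if r["tags"].get(name) == value]

print(tag_names(hosts))
print([r["name"] for r in filter_by_tag(hosts, "zone", "us-west-2b")])
```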
So how can you find a particular tag, filter by it, and display its column in the CLI?
op> config hide resources.fields.sl_TestLabel
To change it permanently you can add it to the config (.ops.yaml), e.g.
NOTE: Slash characters '/' are converted to dots '.'.
NOTE: There is also a provision for shortening resource tags (e.g. "kubernetes.io" -> "kube").
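The two notes above describe a simple name rewrite, which can be sketched in Python as follows. This is an illustration under assumptions: the `display_name` function and the `ABBREVIATIONS` table are hypothetical, only the "kubernetes.io" to "kube" pair comes from the text, and the order in which Op applies the two rules is not stated here.

```python
# Assumed abbreviation table; only the "kubernetes.io" -> "kube" pair comes from the text.
ABBREVIATIONS = {"kubernetes.io": "kube"}

def display_name(tag, abbreviations=ABBREVIATIONS):
    """Convert a raw tag name into a dotted, optionally shortened field name."""
    name = tag.replace("/", ".")  # slashes become dots
    for long_form, short_form in abbreviations.items():
        name = name.replace(long_form, short_form)  # apply each shortening
    return name

print(display_name("topology.kubernetes.io/zone"))
```

Under these assumptions the zone tag would be rewritten to a shorter dotted name; the exact field name Op produces may differ.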
Let's look at the alarms, actions and bots associated with your resources. To do this, we use the list command.
op> list alarm
op> list action
op> list bot
For each of the above commands, we see the default output columns: the name of the definition, its description, and its formula. For alarms, we also see a more complex structure, including the resources to run the alarm on and the clear query that resolves the alarm. Each object's name is how we refer to that object, and these names are globally unique.
Above we ran a resource query to actually query for our resources. We can also get the definitions of our resource and metric queries, just like we do for alarms, actions, and bots:
op> list resource
op> list metric
The list metric command shows each metric, its description, and its underlying formula. As you can see in this example, Op allows you to define custom and complex metrics, specific to your environment.
Metric and resource definitions both support symbolic manipulation. As you can see above, the definition of host is actually resources(type="HOST"). What this means is that when you typed in host above, the host symbol was translated to that resource query. We will do much more of this later. Op fully supports user defined substitutions.
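Symbol substitution of this kind can be sketched in Python. This is a simplified illustration, not Op's parser: the `expand` function and the `SYMBOLS` table are hypothetical, and unlike a real implementation the sketch does not skip over quoted strings.

```python
import re

# Assumed symbol table; only the host entry is taken from the text above.
SYMBOLS = {"host": 'resources(type="HOST")'}

def expand(statement, symbols=SYMBOLS):
    """Replace each whole-word symbol with its definition (single pass)."""
    def lookup(match):
        word = match.group(0)
        return symbols.get(word, word)  # unknown words are left unchanged
    return re.sub(r"[A-Za-z_]\w*", lookup, statement)

print(expand('host | az = "us-west-2b"'))
```

Because matching is done on whole identifiers, a word such as `hosts` is not rewritten even though it starts with `host`.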
The Op Pipe Operator
The next major language feature of Op is the pipe '|' operator. The pipe operator looks just like the commonly used operator from the Bash shell and behaves in a similar way. For example, let's filter our hosts down to only those in the availability zone us-west-2b.
Like the conventional pipe operator from Bash, the pipe operator allows us to define processing pipelines that filter our results as they are passed through. But this pipe operator has some other features as well. Let's combine two different types of expressions: let's get all our hosts in us-west-2b and then get their CPU utilization over the last 5 seconds:
op> host | az = "us-west-2b" | cpu_usage | window(5s) | real_time=true
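The left-to-right flow of a pipeline like this one can be sketched in Python. This is a conceptual model only: the `where`, `select`, and `pipeline` helpers and the sample data are hypothetical, and the sketch ignores time windows and real-time streaming.

```python
from functools import reduce

# Hypothetical sample data; names, zones, and CPU values are made up.
HOSTS = [
    {"name": "host-a", "az": "us-west-2a", "cpu_usage": 0.35},
    {"name": "host-b", "az": "us-west-2b", "cpu_usage": 0.91},
    {"name": "host-c", "az": "us-west-2b", "cpu_usage": 0.12},
]

def where(field, value):
    """Stage that keeps rows whose `field` equals `value`."""
    return lambda rows: [r for r in rows if r[field] == value]

def select(field):
    """Stage that maps each row to a (name, field value) pair."""
    return lambda rows: [(r["name"], r[field]) for r in rows]

def pipeline(source, *stages):
    """Feed `source` through each stage in order, left to right."""
    return reduce(lambda rows, stage: stage(rows), stages, source)

result = pipeline(HOSTS, where("az", "us-west-2b"), select("cpu_usage"))
print(result)
```

Each stage receives the output of the stage before it, which is why the resource filter must come before the metric lookup.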
Op allows us to bring together different types of objects into a pipeline. Specifically, Op pipelines support everything an operator needs in order to do interactive debugging: resources, metrics, and shell commands. For example, for each host running at over 80% CPU utilization, to get all of its pods, the containers scheduled to it, and what they are running, we would use the following command with 'ps':
op> host | cpu_usage | window(1s) > 0.8 | .pod | .container | `ps aux`
How did Op do this? We can use the explain command to see what Op would do:
op> explain host | cpu_usage | window(1s) > 0.8 | .pod | .container | `ps aux`
Like SQL, explain shows what Op would do, but doesn't actually run the statement. Instead, it shows which steps the plan would run, on which hosts, and in what order. A step is a single piece of work that is performed. Steps depend on each other, which enforces an order of execution. Op determines the appropriate plan for your statement by decomposing it into the steps needed to carry out what you define. Each step has a sequence number and depends on the execution of certain other sequence numbers. As the steps execute, an Op object called a Scope is built up; the scope is a summary of all the computation performed by the statement.
Here are the steps for the above plan:
- The backend service runs the resource query step to determine which hosts are in us-west-2a.
- A metric query step is run on each of these hosts to fetch the live CPU usage.
- For each host where the metric query exceeds 80%, a local resource query step is run to obtain the pods and containers. Finally, a Linux command step "ps aux" is run in each of these containers.
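The idea of steps with sequence numbers and dependencies can be sketched in Python using the standard library's graphlib module (Python 3.9+). This is a simplified model, not Op's planner: the `PLAN` table, its step descriptions, and the plain dict standing in for the Scope are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: sequence number -> (description, sequence numbers it depends on).
PLAN = {
    1: ("resource query: find hosts", []),
    2: ("metric query: fetch cpu_usage per host", [1]),
    3: ("resource query: pods and containers on hot hosts", [2]),
    4: ("linux command: ps aux in each container", [3]),
}

def execution_order(plan):
    """Return sequence numbers in an order that respects every dependency."""
    graph = {seq: set(deps) for seq, (_, deps) in plan.items()}
    return list(TopologicalSorter(graph).static_order())

def run(plan):
    """Walk the plan in order, recording each finished step in a scope dict."""
    scope = {}
    for seq in execution_order(plan):
        description, deps = plan[seq]
        # Every dependency must already be recorded in the scope.
        assert all(d in scope for d in deps)
        scope[seq] = description
    return scope

scope = run(PLAN)
print(list(scope))
```

In this sketch the scope simply accumulates a record of each completed step, mirroring the way the Scope summarizes the computation performed by the statement.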