Metrics: Understanding System Behavior


In the Resources article, we learned how to discover resources, filter them, drill down for more detail, and explore their relationships. But beyond discovering resources, we also want to know how they are operating: host CPU, disk, memory, and network utilization, as well as service latency and error rates. To get this information, we need metrics. Op metrics are time series data carrying a name, a resource, and other tag metadata. Let's get the value of a metric right now. For example, the average CPU utilization for each of our hosts over the last 30 seconds:

op> host | cpu_usage | window(30s) | resolution=10 | real_time=true | mean(3)

Op resources and metrics naturally mesh together. Prefixing a metric query with a resource query narrows the returned metrics down to only those associated with the returned resources.
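Conceptually, the window, resolution, and mean stages above amount to a bucketed average over recent data points. The following is a minimal Python sketch of that idea (all names here are hypothetical; Op evaluates queries server-side, not like this):

```python
from dataclasses import dataclass

@dataclass
class Point:
    ts: float     # sample timestamp, seconds since epoch
    value: float  # sample value, e.g. CPU utilization

def windowed_mean(points, now, window_s=30, resolution_s=10):
    """Average the points that fall in [now - window_s, now],
    grouped into buckets of resolution_s seconds each.
    Hypothetical helper for illustration only."""
    buckets = {}
    for p in points:
        if now - window_s <= p.ts <= now:
            b = int((now - p.ts) // resolution_s)  # 0 = most recent bucket
            buckets.setdefault(b, []).append(p.value)
    return {b: sum(vals) / len(vals) for b, vals in sorted(buckets.items())}
```

With `window_s=30` and `resolution_s=10` this yields three buckets, mirroring the three-point mean in the query above.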

Defining New Metrics

Like resources, we can define and save useful metric queries for later. Let's define a metric for real-time CPU usage over a two-minute window:

op> metric cpu_2_min = cpu_usage | window(120s) | real_time=true

Definition statements like this are fully parameterized. Op's macro system allows complex substitutions to be expressed with a familiar syntax.
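To illustrate how a saved definition expands when used, here is a toy Python sketch of name-to-formula substitution (the registry class and its methods are hypothetical illustrations, not Op's implementation):

```python
class MetricRegistry:
    """Toy sketch of saving named metric queries and expanding them."""

    def __init__(self):
        self._defs = {}  # metric name -> saved query string

    def define(self, name, query):
        self._defs[name] = query

    def expand(self, query):
        # Substitute each defined metric name with its saved formula.
        for name, formula in self._defs.items():
            query = query.replace(name, formula)
        return query

reg = MetricRegistry()
reg.define("cpu_2_min", "cpu_usage | window(120s) | real_time=true")
reg.expand("host | cpu_2_min")
# expands to "host | cpu_usage | window(120s) | real_time=true"
```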

Now we can leverage both of the Op commands we have built up. Let's get the average CPU utilization over the last two minutes for each of our hosts:

op> host | cpu_2_min

The above examples show the power of Op: multiple layers of substitution let you express a complex query very succinctly. And in the heat of a Sev 1 or similarly critical incident, short, memorable commands are ones you can recall and execute quickly.

Listing Existing Metrics

Op allows the user to list all previously defined metrics. Imagine you are interactively debugging your cluster with Op and want the cpu_2_min metric defined above on each of your hosts, but you don't remember its name. Let's list our existing metrics:

op> list metric

We see that we have an entry in the table for our cpu_2_min metric. We know what this metric represents since we defined it and provided it with an appropriate name. However, what if someone new to the ops team wants more information about the metric? Let’s add a description to our metric.

op> cpu_2_min.description = "real time cpu usage over the last 2 minutes"

Now, if someone lists the defined metrics:

op> list metric

They will now see our metric description alongside the metric name and formula. Op supports standard CRUD (create, read, update, and delete) operations over metric definitions; refer to the Op Functions Glossary for syntax and supported operations. These features let an operations team rapidly build up a shared bank of commonly used metrics, spreading operational knowledge across the team and letting anyone quickly gather valuable system information without reinventing or misremembering common metric formulas.
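The CRUD lifecycle described above can be illustrated with a toy in-memory store (a hypothetical Python sketch, not Shoreline's implementation):

```python
class MetricStore:
    """Toy CRUD store for metric definitions with descriptions."""

    def __init__(self):
        self._metrics = {}  # name -> {"formula": ..., "description": ...}

    def create(self, name, formula, description=""):
        self._metrics[name] = {"formula": formula, "description": description}

    def read(self, name):
        return self._metrics[name]

    def update(self, name, **fields):
        # e.g. update("cpu_2_min", description="...")
        self._metrics[name].update(fields)

    def delete(self, name):
        del self._metrics[name]

    def list_all(self):
        # Mirrors `list metric`: name, formula, and description per row.
        return [(n, m["formula"], m["description"]) for n, m in self._metrics.items()]
```

Setting a description is then just an update on the stored definition, which is what `cpu_2_min.description = ...` accomplishes in Op.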

Metrics Exporter Support

Op plugs directly into the Prometheus exporter ecosystem: it can pull metrics from any Prometheus exporter as well as from Prometheus itself. In future releases, exporters such as Envoy and cAdvisor will be auto-discovered and ingested by the Shoreline agent.
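To show the shape of what a Prometheus exporter exposes, here is a simplified Python parser for the Prometheus text exposition format (a sketch that ignores HELP/TYPE metadata, timestamps, and label escaping; this is not Shoreline's ingestion code):

```python
def parse_prometheus_text(body):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples.
    Simplified: skips comment lines and assumes no commas inside label values."""
    samples = []
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if "{" in metric:
            name, rest = metric.split("{", 1)
            labels = dict(pair.split("=", 1)
                          for pair in rest.rstrip("}").split(",") if pair)
            labels = {k: v.strip('"') for k, v in labels.items()}
        else:
            name, labels = metric, {}
        samples.append((name, labels, float(value)))
    return samples
```

Each sample carries a metric name, a set of key/value labels, and a numeric value, which maps naturally onto Op's name, resource, and tag metadata.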

For further information on Envoy, please see Envoy Overview.

For examples of cAdvisor metrics, please see Monitoring container metrics using cAdvisor.
