Transform Functions for the Metric Query

Updated 1 month ago by Shoreline

When monitoring and troubleshooting, we typically work with data that is focused around a specific period of time. We look for indicators and patterns that give us clues about what led up to and caused an incident. The Op pipe operator (similar to the commonly used pipe operator from the Bash shell) allows us to filter, aggregate, and transform the results of our resource and metric queries.

For example:

  • op> host | metric_query(...) | sum(30) computes the sum of the values over aggregation windows of 30 data points.
  • op> host | limit=2 gets resources of type HOST but limits the result to two resources.
  • op> cpu_usage | real_time=true | window(30s) gets CPU usage from the last 30 seconds.

When parameters are piped to metric and resource queries, they are viewed as functions; hence, we refer to them as transform functions. Op provides a set of transform functions, described in the following sections.

When you interleave transform functions on a metric query or defined metric with arithmetic or comparison operators, you must use parentheses to make clear what the operands of the arithmetic expression are and what each transform function applies to.

For example, instead of:

  • metric_query(...) | sum(1) | window(2s) + metric_query(...) | sum(3) | window(4s)

Use parentheses to group the operands of the + operator and make explicit what each transform function applies to:

  • (metric_query(...) | sum(1) | window(2s)) + (metric_query(...) | sum(3) | window(4s))

To then apply a transform function, such as mean(5), to the result of the entire arithmetic expression, wrap the entire arithmetic expression in parentheses and apply mean(5) like this:

  • ((metric_query(...) | sum(1) | window(2s)) + (metric_query(...) | sum(3) | window(4s))) | mean(5)

To break the expression into smaller, bite-sized chunks with fewer parentheses, you can create defined metrics for incremental components of the computation and glue them together. That is, define the following in this order:

  • metric m1 = metric_query(...) | sum(1) | window(2s)
  • metric m2 = metric_query(...) | sum(3) | window(4s)
  • metric m3 = m1 + m2
  • metric m4 = m3 | mean(5)

Time Aggregates

The time aggregate transform functions take 3 parameters:

  • [aggregation_window_size]
  • [window_mode] (optional)
  • [drop_incomplete] (optional)

Aggregation Window Size: Each of these functions takes an aggregation window size (number of data points) as an argument. For example, if timestamps are [0, 1000, 2000, 3000, 4000, 5000] and values are [0, 1, 2, 3, 4, 5] then the sum aggregate with aggregation window size 3 gives timestamps [2000, 5000] and values [3, 12] ([0 + 1 + 2, 3 + 4 + 5]). A user would invoke this function by writing host | metric_query(...) | sum(3) to aggregate with aggregation function sum and aggregation window size 3.
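The worked example above can be reproduced with a short Python sketch. This is only an illustration of the arithmetic, not Shoreline's implementation: the function name block_sum is hypothetical, and the sketch assumes non-overlapping blocks and that each output point takes the timestamp of the last point in its block, as the example's numbers suggest.

```python
def block_sum(timestamps, values, size):
    """Sum values over consecutive, non-overlapping blocks of `size` points.

    Assumption: each output point carries the timestamp of the last
    point in its block, matching the worked example in the text.
    """
    out_ts, out_vals = [], []
    # Step through the series one full block at a time.
    for start in range(0, len(values) - size + 1, size):
        end = start + size
        out_ts.append(timestamps[end - 1])
        out_vals.append(sum(values[start:end]))
    return out_ts, out_vals


timestamps = [0, 1000, 2000, 3000, 4000, 5000]
values = [0, 1, 2, 3, 4, 5]
print(block_sum(timestamps, values, 3))  # ([2000, 5000], [3, 12])
```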

Window Mode: Either "SLIDING" or "FIXED". The default is "SLIDING" if not specified. Setting window_mode="FIXED" results in the aggregation function being computed over non-overlapping blocks of data, each with aggregation_window_size data points. For example, sum(5, "FIXED") computes the sum over data points 1-5, 6-10, 11-15, and so on. If window_mode="SLIDING", the aggregation function is computed over overlapping blocks of data, each with aggregation_window_size data points. For example, sum(5, "SLIDING") computes the sum over data points 1-5, 2-6, 3-7, and so on.
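To make the difference between the two window modes concrete, here is a minimal Python sketch that lists which data points fall into each full window. The helper window_ranges is hypothetical and only models the index ranges described above; it is not part of Op.

```python
def window_ranges(num_points, size, window_mode="SLIDING"):
    """Return (first, last) 1-based data point indices for each full window."""
    if window_mode not in ("SLIDING", "FIXED"):
        raise ValueError('window_mode must be "SLIDING" or "FIXED"')
    # FIXED windows advance by a whole window; SLIDING windows advance by one point.
    step = size if window_mode == "FIXED" else 1
    return [(start, start + size - 1)
            for start in range(1, num_points - size + 2, step)]


print(window_ranges(15, 5, "FIXED"))    # [(1, 5), (6, 10), (11, 15)]
print(window_ranges(7, 5, "SLIDING"))   # [(1, 5), (2, 6), (3, 7)]
```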

Drop Incomplete: Either true or false. The default is false if not specified. Consider what happens if there are 10 data points in a series, and a sum time aggregate is applied to this series with an aggregation window size of 3 and a "FIXED" window mode. The last aggregation bucket will only contain a single data point, because 10/3 yields a remainder of 1. If drop_incomplete=true, then this last partially filled aggregation bucket will be dropped, and the aggregated series will only contain 3 data points (the sums over the first three buckets). If drop_incomplete=false, the aggregated series will contain 4 data points, because the sum will be computed over the first 3 full buckets along with the last partially filled bucket.
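The 10-point scenario above can be sketched in Python as follows. The function fixed_sum is a hypothetical stand-in for sum with a "FIXED" window mode; it only illustrates how drop_incomplete changes the number of output points.

```python
def fixed_sum(values, size, drop_incomplete=False):
    """Sum over non-overlapping buckets of `size` points ("FIXED" mode)."""
    buckets = [values[i:i + size] for i in range(0, len(values), size)]
    if drop_incomplete:
        # Discard a trailing bucket that has fewer than `size` points.
        buckets = [b for b in buckets if len(b) == size]
    return [sum(b) for b in buckets]


series = list(range(1, 11))  # 10 data points: 1, 2, ..., 10
print(fixed_sum(series, 3, drop_incomplete=True))   # [6, 15, 24]
print(fixed_sum(series, 3, drop_incomplete=False))  # [6, 15, 24, 10]
```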

Here is example syntax for a sum time aggregate with all 3 parameters specified:
sum(aggregation_window_size=5, window_mode="FIXED", drop_incomplete=true). You can specify the parameters without their names, but they must be in the following order: [aggregation_window_size, window_mode, drop_incomplete]. Example syntax with all 3 parameters specified without name: sum(5, "FIXED", true).

Time Aggregate Name and Syntax Aggregation Function
sum<aggregation window size, window mode (optional), drop incomplete (optional)> computes the sum of values across the given aggregation window
Ex: op> cpu_usage | sum(5)
count<aggregation window size, window mode (optional), drop incomplete (optional)> computes the number of values across the given aggregation window
Ex: op> cpu_usage | count(5)
irate<aggregation window size, window mode (optional), drop incomplete (optional)> computes the per second rate of change between the values at the beginning and end of the aggregation window
Ex: op> cpu_usage | irate(5)
mean<aggregation window size, window mode (optional), drop incomplete (optional)> computes the mean of values across the given aggregation window
Ex: op> cpu_usage | mean(5)
max<aggregation window size, window mode (optional), drop incomplete (optional)> computes the max of values across the given aggregation window
Ex: op> cpu_usage | max(5)
min<aggregation window size, window mode (optional), drop incomplete (optional)> computes the min of values across the given aggregation window
Ex: op> cpu_usage | min(5)
stddev<aggregation window size, window mode (optional), drop incomplete (optional)> computes the standard deviation of values across the given aggregation window
Ex: op> cpu_usage | stddev(5)
p1<aggregation window size, window mode (optional), drop incomplete (optional)> computes the 1st percentile of values across the given aggregation window
Ex: op> cpu_usage | p1(5)
p10<aggregation window size, window mode (optional), drop incomplete (optional)> computes the 10th percentile of values across the given aggregation window
Ex: op> cpu_usage | p10(5)
p90<aggregation window size, window mode (optional), drop incomplete (optional)> computes the 90th percentile of values across the given aggregation window
Ex: op> cpu_usage | p90(5)
p99<aggregation window size, window mode (optional), drop incomplete (optional)> computes the 99th percentile of values across the given aggregation window
Ex: op> cpu_usage | p99(5)
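As a rough illustration of what several of these aggregation functions compute over a single window of 5 data points, consider the Python sketch below. It is not Shoreline's implementation. In particular, the irate helper assumes timestamps are in milliseconds, which the text does not state, and the sample values are made up for the example.

```python
def irate(window_ts_ms, window_vals):
    """Per-second rate of change between the first and last point of a window.

    Assumption: timestamps are in milliseconds.
    """
    elapsed_s = (window_ts_ms[-1] - window_ts_ms[0]) / 1000.0
    if elapsed_s <= 0:
        raise ValueError("window must span a positive amount of time")
    return (window_vals[-1] - window_vals[0]) / elapsed_s


# One aggregation window of 5 points (illustrative values only).
ts = [0, 1000, 2000, 3000, 4000]
vals = [10, 12, 15, 19, 30]

aggregates = {
    "sum": sum(vals),
    "count": len(vals),
    "mean": sum(vals) / len(vals),
    "max": max(vals),
    "min": min(vals),
    "irate": irate(ts, vals),  # (30 - 10) / 4 seconds
}
print(aggregates)
```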

Transforms

These functions change the data returned by a metric query (ex: cpu_usage, mem_usage, disk_available, packet_loss) by applying a function to each value in the series.

Metric Query Transform Name and Syntax Transform Function
<metric_query> | floor computes the floor of each value in the time series
Ex: op> cpu_usage | floor
<metric_query> | ceil computes the ceiling of each value in the time series
Ex: op> cpu_usage | ceil
<metric_query> | upper_bound computes the maximum of its argument and each value in the time series
Ex: op> cpu_usage | upper_bound(90.0)
<metric_query> | lower_bound computes the minimum of its argument and each value in the time series
Ex: op> cpu_usage | lower_bound(15.7)
<metric_query> | limit takes a number as an argument and limits all the time series in the result to have a number of points less than or equal to this number
Ex: op> cpu_usage | limit(4)
<metric_query> | shift shifts the values in the time series by the number of data points given by its argument

Ex: given timestamps [0, 1000, 2000, 3000, 4000] and values [0, 1, 2, 3, 4], piping into shift(2) gives timestamps [2000, 3000, 4000] and values [0, 1, 2], while piping into shift(-2) gives timestamps [0, 1000, 2000] and values [2, 3, 4].

Ex: op> cpu_usage | shift(-2)
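The shift example above can be reproduced with a small Python sketch. The function below is hypothetical and only models the behavior shown in the example: a positive amount pairs each value with a later timestamp, a negative amount pairs it with an earlier one, and points that no longer have a partner are dropped.

```python
def shift(timestamps, values, amount):
    """Shift values by `amount` data points relative to their timestamps."""
    if amount == 0:
        return list(timestamps), list(values)
    if abs(amount) >= len(values):
        return [], []  # nothing left after shifting
    if amount > 0:
        # Values move forward in time: drop the first timestamps and last values.
        return timestamps[amount:], values[:-amount]
    # Values move backward in time: drop the last timestamps and first values.
    return timestamps[:amount], values[-amount:]


ts = [0, 1000, 2000, 3000, 4000]
vals = [0, 1, 2, 3, 4]
print(shift(ts, vals, 2))   # ([2000, 3000, 4000], [0, 1, 2])
print(shift(ts, vals, -2))  # ([0, 1000, 2000], [2, 3, 4])
```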

Parameters

Metric Query Parameter Name Function
window (left end parameter, optional right end parameter) specifies the range of time (left end to right end) for the Op statement. If the optional right end parameter is not specified, the right end of the time range defaults to now.
Ex: cpu_usage | window(30s) gets cpu_usage from 30 seconds ago to now
Ex: cpu_usage | window(30s, 10s) gets cpu_usage from 30 seconds ago to 10 seconds ago
from, to limits the metric query to the interval between the from time and the to time
Ex: cpu_usage | from=1000 | to=3000 gets cpu_usage from time 1000 to time 3000
base, offset if offset is positive, limits the query to cover the interval (base, base + offset), and if offset is negative, limits the query to cover the interval (base + offset, base)
Ex: cpu_usage | base=3000 | offset=2000 gets cpu_usage from time 3000 to time 3000 + 2000 = 5000
resolution sets the resolution of the queried data. Allowed resolutions are 1 second, 10 seconds, 1 minute, and 1 hour.
Ex: cpu_usage | resolution=60 gets data at one minute resolution
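The time ranges selected by window and by base/offset can be sketched in Python as shown below. Both helper functions are hypothetical and simply restate the interval rules from the table; times are plain numbers here, and the value of now is passed in explicitly for the sake of the example.

```python
def window_range(now, left, right=0):
    """window(left, right): from `left` seconds ago to `right` seconds ago.

    With `right` omitted, the range ends at `now`.
    """
    return (now - left, now - right)


def base_offset_range(base, offset):
    """Interval covered by base/offset for positive and negative offsets."""
    if offset >= 0:
        return (base, base + offset)
    return (base + offset, base)


print(window_range(now=100, left=30))             # (70, 100)
print(window_range(now=100, left=30, right=10))   # (70, 90)
print(base_offset_range(3000, 2000))              # (3000, 5000)
print(base_offset_range(3000, -2000))             # (1000, 3000)
```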

Resource Aggregates

These functions aggregate metric query results across resources.  For example, to get the sum of cpu_usage across all hosts, you would write host | cpu_usage | r_sum.

Metric Query Aggregate Name Function
r_sum aggregates the metric query across resources by taking the sum of the values across each resource for each timestamp
Ex: op> cpu_usage | r_sum
r_count aggregates the metric query across resources by counting the number of values across each resource for each timestamp
Ex: op> cpu_usage | r_count
r_mean aggregates the metric query across resources by taking the mean of the value for each resource for each timestamp
Ex: op> cpu_usage | r_mean
r_max aggregates the metric query across resources by taking the max of the value for each resource for each timestamp
Ex: op> cpu_usage | r_max
r_min aggregates the metric query across resources by taking the min of the value for each resource for each timestamp
Ex: op> cpu_usage | r_min
r_stddev aggregates the metric query across resources by taking the standard deviation of the value for each resource for each timestamp
Ex: op> cpu_usage | r_stddev
r_p1 aggregates the metric query across resources by taking the 1st percentile of the value for each resource for each timestamp
Ex: op> cpu_usage | r_p1
r_p10 aggregates the metric query across resources by taking the 10th percentile of the value for each resource for each timestamp
Ex: op> cpu_usage | r_p10
r_p90 aggregates the metric query across resources by taking the 90th percentile of the value for each resource for each timestamp
Ex: op> cpu_usage | r_p90
r_p99 aggregates the metric query across resources by taking the 99th percentile of the value for each resource for each timestamp
Ex: op> cpu_usage | r_p99
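To illustrate how a resource aggregate differs from a time aggregate, the Python sketch below combines the values that several resources report at the same timestamp. The data layout (a dictionary of per-host series), the host names, and the helper resource_aggregate are assumptions made for the example, not Op's internal representation.

```python
from statistics import mean


def resource_aggregate(series_by_resource, fn):
    """Apply `fn` to the values all resources report at each timestamp."""
    by_ts = {}
    for series in series_by_resource.values():
        for ts, value in series.items():
            by_ts.setdefault(ts, []).append(value)
    return {ts: fn(vals) for ts, vals in sorted(by_ts.items())}


# Illustrative cpu_usage series for three hypothetical hosts.
cpu_usage = {
    "host-a": {0: 10.0, 1000: 20.0},
    "host-b": {0: 30.0, 1000: 40.0},
    "host-c": {0: 50.0},
}

r_sum = resource_aggregate(cpu_usage, sum)
r_count = resource_aggregate(cpu_usage, len)
r_mean = resource_aggregate(cpu_usage, mean)
r_max = resource_aggregate(cpu_usage, max)
print(r_sum, r_count, r_mean, r_max)
```

The result has one value per timestamp rather than one series per resource, which is the key difference from the time aggregates described earlier.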

Resource Query Parameters

These parameters limit the results of a resource query via pipes.  For example, the piped query host | limit=2 gets resources of type HOST and limits the result to two resources.

Resource Query Parameter Name Function
type specifies the type of resource the query is for
Ex: resources | type="POD" gets resources of type POD
from, to limits the resource query to the interval between the from time and the to time
Ex: host | from=1000 | to=3000 gets the resources that existed between 1000 and 3000
base, offset if offset is positive, limits the query to cover the interval (base, base + offset), and if offset is negative, limits the query to cover the interval (base + offset, base)
limit limits the number of results of a resource query, where each result is a resource or a combination of resources (referred to as a tuple).
Ex: host | .pod | .container | limit=5 limits the result and returns at most 5 tuples of type (host, pod, container).
random controls whether the results of the resource query are returned in random order. random can be either 0 or 1. If random is 1, the returned resources are shuffled (returned in random order). The default is 0.
Ex: host | random=1 | limit=10 results in 10 random hosts.
name includes only resources whose name is the same as its argument
Ex: host | name="example" gives all of the hosts with name "example"
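The interaction between name, random, and limit can be sketched in Python as follows. The function query_resources and its parameter names are hypothetical; the sketch assumes that filtering happens first, shuffling second, and limiting last, which matches the host | random=1 | limit=10 example but is not confirmed as Op's actual evaluation order.

```python
import random


def query_resources(resources, random_flag=0, limit=None, name=None, rng=None):
    """Filter by name, optionally shuffle, then keep at most `limit` results."""
    results = [r for r in resources if name is None or r == name]
    if random_flag == 1:
        # Shuffle in place; a seeded random.Random can be passed for repeatability.
        (rng or random).shuffle(results)
    return results if limit is None else results[:limit]


hosts = [f"host-{i}" for i in range(20)]
print(query_resources(hosts, limit=2))                  # first two hosts, in order
print(query_resources(hosts, random_flag=1, limit=10))  # 10 hosts in random order
```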

