Bots: Preventing Toil

Updated 1 week ago by Shoreline

The final piece in Op automation is connecting alarms to actions. When we detect a pattern of interest, let's kick off the corresponding mitigation. Runbooks describe this connection, but they don't automate detection or mitigation. Those activities are left as manual tasks for the operator. To truly have end-to-end automation, we create a bot. Bots bind alarms and actions together in an IF-THIS-THEN-THAT style.

Let's create a bot that kills the background-logger process when CPU usage is high. As we said before, the background logger will restart later, after CPU usage has come down, and finish pushing logs, so it is safe to stop it temporarily.

op> bot logger_stopper = if cpu_high then killall("background-logger") fi

Just as we enabled alarms, let's enable our bot:

op> enable logger_stopper
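
To make the IF-THIS-THEN-THAT binding concrete, here is a minimal conceptual sketch in Python. It is not how Op implements bots. The `Bot` class, the `tick` method, the 80% threshold, and the in-memory `state` and `killed` stand-ins are all assumptions made for illustration. It only shows the idea that a bot pairs an alarm with an action and does nothing until it is enabled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Bot:
    """Hypothetical model of a bot: an alarm bound to an action."""
    name: str
    alarm: Callable[[], bool]    # returns True while the alarm is firing
    action: Callable[[], None]   # mitigation to run when the alarm fires
    enabled: bool = False        # a bot is inert until it is enabled

    def tick(self) -> bool:
        """Evaluate once; run the action only if enabled and the alarm fires."""
        if self.enabled and self.alarm():
            self.action()
            return True
        return False


# Illustrative stand-ins for real metrics and real process control.
state = {"cpu_percent": 92.0}
killed: List[str] = []


def cpu_high() -> bool:
    # The 80% threshold is an assumption for this sketch.
    return state["cpu_percent"] > 80.0


def kill_background_logger() -> None:
    # Records the kill instead of touching a real process.
    killed.append("background-logger")


logger_stopper = Bot("logger_stopper", cpu_high, kill_background_logger)

fired_while_disabled = logger_stopper.tick()   # bot not yet enabled
logger_stopper.enabled = True                  # analogous to `enable logger_stopper`
fired_while_enabled = logger_stopper.tick()    # alarm firing, so the action runs
```

In this sketch the first `tick` does nothing because the bot is disabled, and the second runs the action because the bot is enabled and the alarm condition holds. How often a real bot re-evaluates its alarm, and whether it fires once or repeatedly, is not specified here.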

We have now fully automated what would formerly have been a runbook. Instead of checking for high CPU on a dashboard, we have an automatic alarm. Instead of manually killing a background process, we have a defined action. And instead of getting a page, we have a bot that watches the alarm and triggers the mitigating action automatically.
