A CoPE’s Guide to Alert Management

Alerts are a perennial topic, and a CoPE will need to engage with them. The bounds of this problem space are formed by two types of alerts: 

  • Reactive alerts (in Honeycomb, we call these Triggers): alerts that fire after some event has already occurred, such as crossing a pre-determined boundary. 
  • Proactive alerts (Burn Alerts based on Honeycomb’s SLO feature): These give notice before crossing a threshold; in the case of SLOs, that means before failing to meet the stated objective.
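
To make the distinction concrete, here is a minimal sketch of the two alerting conditions in Python. The names, types, and thresholds are purely illustrative; this is not Honeycomb’s API.

```python
from dataclasses import dataclass

@dataclass
class ServiceWindow:
    """A rolling window of request outcomes for one service."""
    total: int   # requests observed in the window
    failed: int  # requests that violated the SLI

# Reactive ("Trigger"-style): fires only after a pre-determined boundary is crossed.
def reactive_alert(window: ServiceWindow, max_failures: int) -> bool:
    return window.failed > max_failures

# Proactive ("Burn Alert"-style): fires before the SLO's error budget is gone,
# based on how much of the budget has already been consumed.
def proactive_alert(window: ServiceWindow, allowed_failure_ratio: float,
                    warn_fraction: float = 0.5) -> bool:
    budget = allowed_failure_ratio * window.total  # failures the SLO tolerates
    return window.failed > warn_fraction * budget  # warn well before exhaustion
```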

Understanding what these alerts are and how to configure them is one thing. Thinking about what each does for your organization, and how relying on one or the other affects it, is another. The latter will be the focus of this article.

Evaluating the utility of each type of alert

The great challenge of alerts is getting them to the right people at the right time. An alert conveys a signal, but if that signal isn’t transmitted to the right person at the right time, it’s just noise.

Such a situation is, in fact, much more prevalent with Triggers than with Burn Alerts. Why? 

Triggers are extremely particular in the signal that they convey. Consider the following scenario: you’ve set up an alert that sounds when a load-bearing column in a building exceeds its safe capacity. It’s entirely reasonable for someone who receives that alert to respond, “So what?” Only people with prior knowledge of the situation (which column, in which building, and what happens if it fails) can answer that question. To them, it’s information. To everyone else, it’s noise.

Burn Alerts generated by SLOs, by contrast, are much more likely to prove informative to a general audience. That’s because context is built into them through the SLO. A Burn Alert effectively tells its receivers either that they have some amount of time left before too many bad things will have happened, or that an unusually large number of bad things happened in a recent window of time, which puts them at risk of exceeding the limit. Returning to the column example above, it’s like being warned as the load approaches the safe capacity, rather than after it has been exceeded.

Burn Alerts and SLOs inform their receivers about what the organization values and the hierarchy of those values. They let organization members know what to prioritize when load shedding is necessary. For example, it’s simple to decide between working on a new feature or stabilizing API latency when a Burn Alert has indicated that you have 12 hours until you miss your API’s SLO for the month. This creates a shared understanding and common ground for communication and collaboration.
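
As a rough illustration of where a number like “12 hours until you miss your SLO” comes from, here is some back-of-the-envelope exhaustion-time math. The SLO target and traffic figures below are invented for the example.

```python
# Back-of-the-envelope exhaustion-time math; all numbers are made up for illustration.
slo_target = 0.999                # 99.9% of requests must succeed this month
monthly_requests = 100_000_000    # expected request volume for the SLO period

error_budget = (1 - slo_target) * monthly_requests  # ~100,000 "bad" requests allowed
budget_already_spent = 80_000                       # bad requests so far this period
current_bad_per_hour = 1_666                        # rate at which bad requests arrive now

remaining_budget = error_budget - budget_already_spent         # ~20,000 left
hours_to_exhaustion = remaining_budget / current_bad_per_hour  # ~12 hours

print(f"~{hours_to_exhaustion:.0f} hours until the SLO is missed")
```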

This is why Honeycomb advocates so strongly for using SLOs. That shared understanding works perfectly with a well-defined north star: customer experience. Working together to map critical user journeys and then establishing landmarks via SLOs with proactive alerts is the paradigmatic setup for achieving production excellence.


The problem

Despite these differences, many organizations don’t effectively differentiate their alerts. Reactive alerts are almost certainly the most common type in use, and their sheer volume and lack of context hinder prioritization. This induces alert fatigue and harmful stress.

A CoPE must understand that alerting is crucial to its success, and it needs to develop and implement a solid alerting strategy accordingly.

The solution

A CoPE should endeavor to turn every alert into a Burn Alert. In practice, that isn’t fully achievable, so the goal should be to optimize the ratio of Burn Alerts to Triggers given the organization’s needs.

Every Trigger is an operational risk. Triggers indicate that the organization hasn’t built a structure strong enough to depersonalize necessary information and activity, meaning the organization has a low bus factor. They also indicate a relative concentration of power, because the few people who can answer the “So what?” question for a given Trigger benefit from that information asymmetry. Those people form a pocket or silo within the organization. Our model accounts for such pockets and acknowledges them as necessary features of complex systems, so the best thing to do is to make them explicit and learn how to work them to the organization’s advantage.

The method

The first thing for a CoPE to do is to convert as many Triggers into Burn Alerts as possible (my colleague Fred really leans into the difference between Exhaustion Time alerts and Budget Rate alerts; don’t forget to take that into account). This means reformulating them into forward-looking goals, like the load-bearing column example from above. The new alerts should then be routed to public spaces, like a general #Ops channel in Slack or dedicated channels for specific on-call rotations.
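
For a rough sense of how those two flavors differ, here is a conceptual sketch (not Honeycomb’s implementation): an Exhaustion Time alert projects when the error budget will run out, while a Budget Rate alert reacts to how fast the budget is burning within a short window. All names and thresholds are illustrative.

```python
# Conceptual sketch of the two burn alert flavors; not Honeycomb's implementation.

def exhaustion_time_alert(remaining_budget: float, burn_per_hour: float,
                          warn_within_hours: float = 24.0) -> bool:
    """Fire when the error budget is projected to run out within warn_within_hours."""
    if burn_per_hour <= 0:
        return False  # nothing is burning, so no exhaustion in sight
    return remaining_budget / burn_per_hour < warn_within_hours

def budget_rate_alert(budget_spent_in_window: float, total_budget: float,
                      max_fraction_per_window: float = 0.02) -> bool:
    """Fire when an unusually large share of the budget burns in a short window."""
    return budget_spent_in_window / total_budget > max_fraction_per_window
```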

Any remaining Triggers should be routed as private DMs to the one person, or small set of people, who have the relevant context. This ensures they receive the reactive alerts that are relevant to them without creating alert fatigue for everyone else. These Triggers should be reviewed regularly with their recipients to check whether they’re still necessary, whether they can be converted into SLOs with Burn Alerts, or whether they can simply be deleted.
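
One lightweight way to encode that routing split is a simple lookup table. The alert names, channels, and user handles below are hypothetical; real routing would live in your alerting tool or chat integration.

```python
# Hypothetical routing table: Burn Alerts go to shared channels, Triggers go to the
# few people who hold the context to act on them.
BURN_ALERT_ROUTES = {
    "api-latency-slo": "#ops",
    "checkout-availability-slo": "#oncall-payments",
}

TRIGGER_ROUTES = {
    "legacy-batch-job-failed": ["@dana"],
    "warehouse-disk-pressure": ["@sam", "@lee"],
}

def route(alert_name: str) -> list[str]:
    if alert_name in BURN_ALERT_ROUTES:
        return [BURN_ALERT_ROUTES[alert_name]]
    return TRIGGER_ROUTES.get(alert_name, ["#ops"])  # unknown alerts default to the shared channel
```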

Finally, the CoPE needs to build social institutions that manage these pockets of asymmetrical information. I recommend Learning from Incidents-style incident reviews, because their whole point is to induce the circulation of information. When people feel they can speak openly about what they did and why it made sense to them at the time, in a way that benefits everyone, they’re encouraged to share their power and thereby redistribute it throughout the group. In this case, that power is information, and sharing it makes for a more resilient organization.

Next up: Prod-centricity

Don’t forget to join us next week for part six in our CoPE series. If you missed the prior posts or you’re just stumbling on this series now, you can find the other parts here:

Pt. 1: Establishing and Enabling a Center of Production Excellence

Pt. 2: Independent, Involved, Informed, and Informative: The Characteristics of a CoPE

Pt. 3: Staffing Up Your CoPE

Pt. 4-1: The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation

Pt. 4-2: The CoPE and Other Teams, Part 2: Custom Instrumentation and Telemetry Pipelines


Nick Travaglini

Senior Technical Customer Success Manager

Nick is a Technical Customer Success Manager with years of experience working with software infrastructure for developers and data scientists at companies like Solano Labs, GE Digital, and Domino Data Lab. He loves a good complex, socio-technical system. So much so that the concept was the focus of his MA research. Outside of work he enjoys exercising, reading, and philosophizing.
