Have you ever had an alert go off that you immediately ignore? It’s a nuisance alert—not actionable—but you keep it around just in case. Or maybe you’ve looked at a trace waterfall and wondered what exactly happened during a gap that the trace doesn’t drill into deeply enough to explain. Do you know the feeling of having just enough information to monitor what’s going on in your systems, but not quite enough to put your mind at ease?
These experiences are almost universally known—every engineer can relate to them at some point in their career. But here’s the good news: with tailored instrumentation and well-reasoned Service Level Objectives (SLOs), you can measure what matters, reduce the noise, and help your team focus on actionable signals.
The cost of alert fatigue
Does this sound familiar? An alert goes off. Your heart rate spikes. You’re annoyed, distracted, and after glancing at the alert, you do… nothing. You carry on with what you were doing before because the alert doesn’t actually require any action from you.
If no one responds to the alert, why does it exist? Alerts that don’t lead to action are worse than useless; they condition teams to ignore signals entirely. This isn’t just annoying; it’s dangerous.
Over time, you become desensitized to the alerts. Once alert fatigue sets in, none of the alerts mean anything anymore. Then, when a genuine failure happens, you don’t notice or react to it because you’ve learned to ignore all the signals. This phenomenon is called “normalization of deviance,” a phrase used by sociologist Diane Vaughan to describe the environment that led to the Challenger disaster. Everyone became insensitive to signals that things were going wrong, and it ended in catastrophe.
Prioritize actionable alerts
If you’re thinking, “We don’t even have that many alerts,” consider this: even a handful of useless alerts erodes confidence in your alerting system. The key is ensuring that every alert is actionable to the person being alerted.
One example of a non-actionable alert is in self-healing systems. If a pod gets OOM killed and automatically comes back online with little to no disruption to end users, the system is working as intended. Log these events for business-hour review. If they happen often, consider scheduling improvements into the roadmap, but don’t page anyone.
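To illustrate what “log these events for business-hour review” could look like, here is a minimal sketch using the official Kubernetes Python client. It assumes kubeconfig access and that you just want a periodic report of OOMKilled restarts rather than a page; it is one possible approach, not a prescription:

```python
# Sketch: surface OOMKilled restarts as a report instead of a page.
# Assumes the `kubernetes` Python client and cluster credentials via kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

oom_events = []
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated
        if terminated and terminated.reason == "OOMKilled":
            oom_events.append(
                f"{pod.metadata.namespace}/{pod.metadata.name} "
                f"({status.name}) restarted {status.restart_count}x, "
                f"last OOMKilled at {terminated.finished_at}"
            )

# Logged for business-hour review; a human decides whether this becomes
# roadmap work (e.g., raising memory limits). Nobody gets paged.
for line in oom_events:
    print(line)
```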
Another example comes from a scenario my team experienced. We had an alert on total end-to-end throughput for the service we owned, and it would page whenever throughput dropped. The problem was that it mostly fired when something upstream of us broke, even if our service was fine. The team responsible for the upstream service was already handling the problem, so we were getting paged for no reason. It turned out that throughput was not a good indicator of a problem we ourselves could tackle, so we turned the alert off while we worked out a better way to signal a problem with the service itself.
Adding instrumentation to see the whole picture
How do you identify and prioritize actionable alerts? This is where instrumentation comes into play.
Instrumentation ensures you have the data you need to better understand what’s going on in your system and allows you to slice and dice the data in different ways.
A good first step is to get some auto-instrumentation in place for tracing HTTP requests. With that, you can see useful information like endpoint latency and request volume. However, auto-instrumentation alone often leaves gaps. For example, consider average latency for an endpoint: you can tell at a glance if there’s a major outage, but without percentiles, you’re missing the experience of your slowest users. If 5% of users are having problems, you need a way to learn more about what is unique to their experience.
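Here’s a quick, synthetic illustration of that gap. The numbers are made up, but they show how an average can look acceptable while the tail tells a very different story:

```python
# Synthetic latencies: ~95% of requests are fast, ~5% hit a slow code path.
import numpy as np

latencies_ms = np.concatenate([
    np.random.normal(120, 20, 9_500),   # most requests: ~120 ms
    np.random.normal(2_500, 400, 500),  # the unlucky ~5%: multiple seconds
])

print(f"mean: {latencies_ms.mean():.0f} ms")              # ~240 ms, looks tolerable
print(f"p50:  {np.percentile(latencies_ms, 50):.0f} ms")  # ~120 ms
print(f"p99:  {np.percentile(latencies_ms, 99):.0f} ms")  # multi-second: the real tail
```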
To get more granular details, you need to customize the instrumentation. For areas of the app that aren’t covered by automatic instrumentation, add your own spans. Wrap your business logic in spans to find things like unexpected slowness in parsing requests. Add context-specific attributes to provide a fuller picture and make it easier to find anomalies; a short code sketch follows the list below. For example:
- Add geographical information to know if one region is experiencing higher latency.
- Add a logged-in vs. anonymous attribute to see if logged-in users experience slower responses because of additional database lookups.
- Add anything you can to describe the current state of an application at any given time for any given user. You don’t know what will be different the next time a failure mode occurs.
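Here’s a minimal sketch of what that kind of custom instrumentation can look like with the OpenTelemetry Python API. The span name, attribute keys, and the shape of the request are invented for illustration:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def handle_checkout(request):
    # Auto-instrumentation already gives you the inbound HTTP span; wrapping
    # your own business logic in a child span explains gaps in the waterfall.
    with tracer.start_as_current_span("parse_and_validate_cart") as span:
        span.set_attribute("app.region", request.get("region", "unknown"))
        span.set_attribute("app.user.logged_in", request.get("user") is not None)
        span.set_attribute("app.cart.item_count", len(request.get("cart_items", [])))
        # ... parse and validate the cart; unexpected slowness here now shows
        # up as its own span instead of an unexplained gap
        return request.get("cart_items", [])

handle_checkout({"region": "eu-west-1", "user": None, "cart_items": ["sku-123"]})
```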
Set up SLOs
An effective alerting strategy requires Service Level Objectives (SLOs). SLOs tie service health to user impact, providing thresholds for system performance to ensure alerts are meaningful and aligned with user expectations. A Service Level Indicator (SLI) is a metric that measures service health, such as “Checkout page loads in under 2 seconds.” An SLO is a target tied to an SLI, such as “Checkout page loads in under 2 seconds 99.5% of the time.”
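In code, the SLI is just the fraction of “good” events, and the SLO is the target you compare it against. A toy sketch using the checkout example above:

```python
# Toy SLI/SLO check: 99.5% of checkout page loads under 2 seconds.
SLO_TARGET = 0.995
THRESHOLD_MS = 2_000

def checkout_sli(load_times_ms: list[float]) -> float:
    good = sum(1 for t in load_times_ms if t < THRESHOLD_MS)
    return good / len(load_times_ms)

sli = checkout_sli([850, 1200, 2600, 900, 1100])
print(f"SLI: {sli:.3%}, SLO met: {sli >= SLO_TARGET}")
```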
SLOs should always tie into business impact and focus on user-critical paths. One size does not fit all. Outliers can skew metrics, and different SLOs may be needed for different use cases. I love the framing from one of our SREs: “The goal of an SLO is to provide a useful signal: if including a specific datapoint dilutes the signal, then it may be worth excluding them or creating a different measure for them.”
On an e-commerce website, a slower homepage load may be less critical than a slower checkout page load. A database lookup over a large volume of data will be more resource-intensive than a lookup against a small table. Use the added context from attributes and custom instrumentation to fine-tune SLOs to match business requirements.
One important consideration when designing SLOs is to be realistic. Do not aim for 100%. Perfection is not admirable; it is costly and unsustainable. Instead, be intentional about setting goals that balance reliability and resource investment.
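One way to make this concrete is to look at the error budget a target implies. A quick back-of-the-envelope calculation, assuming a 30-day window (a common but not universal choice):

```python
from datetime import timedelta

slo = 0.995
window = timedelta(days=30)
error_budget = (1 - slo) * window
print(error_budget)  # 3:36:00 — 0.5% of 30 days
```

At 99.5%, you can “spend” roughly three and a half hours of bad experience a month on deploys, experiments, and honest mistakes; at 100%, a single bad event blows the budget.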
Revisit and refine over time
Your SLOs and alerts are not written in stone. As your applications and customer expectations evolve, your alerting strategy needs to adapt. Establish a process to regularly review your alerts and determine whether they are still actionable and reflect current needs. If you’ve improved endpoint latency, consider tightening the target response time to match the new expectations.
Remember the scenario I mentioned earlier about my team disabling unactionable alerts on a service we owned? We ended up adding an attribute called `stress_level` that measures memory usage and queue sizes, and used that to create an SLO that was more reliable in predicting issues with the service’s health. Now, instead of getting paged for problems out of our control, we only get paged for sharp increases in `stress_level`, and smaller blips are logged for review over time.
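This isn’t our exact implementation, but as a rough sketch, a derived attribute like that might be computed and attached to the active span along these lines (the formula and weights here are invented for illustration):

```python
from opentelemetry import trace

def compute_stress_level(memory_used_pct: float, queue_depth: int, queue_capacity: int) -> float:
    # 0.0 = healthy, 1.0 = saturated; the 60/40 weighting is an assumption
    return 0.6 * (memory_used_pct / 100) + 0.4 * min(queue_depth / queue_capacity, 1.0)

def record_stress(memory_used_pct: float, queue_depth: int, queue_capacity: int) -> None:
    span = trace.get_current_span()
    span.set_attribute(
        "app.stress_level",
        compute_stress_level(memory_used_pct, queue_depth, queue_capacity),
    )

# An SLO on this attribute pages only on sharp increases; smaller blips stay
# in the data for later review.
record_stress(memory_used_pct=72.0, queue_depth=40, queue_capacity=500)
```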
Customers can also provide valuable feedback that will help you examine your setup. In Alerts are Fundamentally Messy, Fred Hebert describes conversations with our customer success team about customer complaints. Sometimes we realize we’ve overindexed on a problem that isn’t especially relevant to end users, and we can turn down the volume on those alerts. Other times, we learn of problems we weren’t aware of, which is “a sign we may have under-sensitive signals.”
So, to put all of this into practice:
- First, reduce the noise. Lower your stress. Take a step back.
- Now, find all the places you can add more details, more instrumentation, more attributes.
- If it’s something you may want to log at some point, add it to your active span.
- And once you’re armed with all those details and have more visibility… start setting SLOs.
By focusing on actionable alerts, investing in instrumentation, and setting thoughtful SLOs, your team can sleep better at night knowing you’re measuring what matters and reducing the noise of unhelpful alerts.
Here’s a link to the talk I gave at Observability Day during KubeCon + CloudNativeCon NA 24, if you’d like to watch it, along with the slides.