SLOs—or Service Level Objectives—can be pretty powerful. They provide a safety net that helps teams identify and fix issues before they reach unacceptable levels and degrade the user experience.
But SLOs can also be intimidating. Here’s how a lot of teams feel about them: we know we want SLOs, but we’re not sure how to really use them, and we don’t know how to debug SLO-based alerts.
Don’t worry, we’ve got your answer—observability! SLOs are much more effective when driven by observability event data.
In this post, I’ll explain that concept in more detail, and if you want to learn more, check out chapters 12 and 13 from our O’Reilly book or watch our webinar where Charity, George, Liz, and I share our favorite SLO stories and do a live demo.
Alert flood? Get to higher ground with observability
Does this sound familiar? “Heeeeeeeeeelp, I’m drowning in alerts!” Maybe you’re getting one from Memcache, one from that Ruby instance, one from Redis, and a bajillion others. But with so many alerts, pretty much all the time, it can turn into background noise. We call that trigger overload. The issue is that in a flood of triggers, it’s easy to accidentally miss real problems.
This is a common starting place for lots of our customers. Moving to SLOs let them break through that flood of alerts: when an SLO alert fired, they knew there was an imminent problem that needed their attention.
Maybe you’ve heard horror stories about teams MacGyvering SLOs together in spreadsheets, pulling daily stats to approximate Service Level Indicators (SLIs). Or maybe you’ve become so accustomed to alert flooding yourself that it’s scary to imagine that flood running dry. At the heart of these fears lies the central question of how you’re going to debug SLO-based alerts. And part of that fear comes from how we thought about SLOs initially.
SLOs based on time, good. SLOs based on events, awesome.
Early on, people tried to measure SLOs with time series and metrics. For example, tracking whether the 95th percentile latency of requests stayed under 500 milliseconds in each 5-minute window. But that introduced two problems. Number one, it’s not granular enough: either the whole 5-minute window passes or the whole window fails. Number two, it’s nearly impossible to debug, because an aggregated window can’t tell you which requests were slow or why.
The solution? Put observability and SLOs together, so you can use real event data to see what’s going wrong. With the arbitrarily wide structured event as your basic building block for observability, you can capture high-cardinality, high-dimensionality data with enough context to slice, dice, and answer unknown unknowns. We recommend one SLO per user journey, and that SLO should look at errors and duration.
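To make that concrete, here’s a rough sketch of what a single wide structured event might look like for one request. The field names are hypothetical and the event is deliberately small; in practice these events often carry hundreds of dimensions:

```python
# A hypothetical wide structured event for one request. Every field the
# service knows about the request travels together in one record, so you
# can slice by any combination of dimensions later.
event = {
    "request.path": "/home",
    "response.status_code": 200,
    "duration_ms": 83.4,
    "user.id": "user_48213",       # high-cardinality field
    "user.locale": "fr-CA",
    "device.os": "iOS 14.1",
    "build.id": "2021-08-05.3",
    "region": "ca-central-1",
    "cache.hit": False,
}
```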
Here’s an example of how to define an event-based SLI. Let’s say a good customer experience is when a user can successfully load your home page and see a result quickly. Expressing that with an SLI means qualifying events and then determining whether they meet your conditions. In this example, your SLI would do the following (there’s a short code sketch of this logic right after the list):
- Look for any event with a request path of /home
- Check each qualifying event against two conditions: was it served successfully, and did it complete in less than 100 milliseconds?
- If the event was served successfully and its duration is less than 100 milliseconds, consider it OK
- If the event duration is 100 milliseconds or more, consider that event an error, even if it returned a success code
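Here’s a minimal sketch of that SLI logic in Python. The field names (`request.path`, `response.status_code`, `duration_ms`) and the sample events are assumptions for illustration, not a real schema:

```python
# Classify one event for the home-page SLI:
#   True  -> qualifying and good
#   False -> qualifying but bad (burns error budget)
#   None  -> doesn't qualify for this SLI at all
def sli_for_home_page(event):
    if event.get("request.path") != "/home":
        return None
    served_ok = 200 <= event.get("response.status_code", 0) < 300
    fast_enough = event.get("duration_ms", float("inf")) < 100
    return served_ok and fast_enough

# Two toy events; real ones would be far wider.
events = [
    {"request.path": "/home", "response.status_code": 200, "duration_ms": 83.4},
    {"request.path": "/home", "response.status_code": 200, "duration_ms": 212.0},
]

# The SLI is simply the fraction of qualifying events that were good.
results = [r for r in (sli_for_home_page(e) for e in events) if r is not None]
sli = sum(results) / len(results)   # 0.5 for the toy data above
```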
In addition, SLO-based alerts must be actionable. An SLO built on wide structured event data can use those extra details to surface the attributes that the events burning your error budget have in common. Actionable SLOs like these lead to much faster remediation times and happier users.
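To make “error budget burn” concrete, here’s the basic arithmetic, assuming a made-up 99.9% target and made-up event counts:

```python
# Hedged example of error budget math; the target and counts are invented.
slo_target = 0.999                 # 99.9% of qualifying events should be good
total_events = 4_000_000           # qualifying events in this SLO window
bad_events = 2_600                 # events that failed the SLI

error_budget = (1 - slo_target) * total_events   # 4,000 bad events allowed
budget_burned = bad_events / error_budget        # 0.65 -> 65% of budget gone
print(f"{budget_burned:.0%} of the error budget has been burned")
```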
In an SLO-based world, you need observability because it gives you the ability to ask novel questions of your systems without having to add new instrumentation. With rich telemetry, you can start wide and then filter to reduce the search space. This allows you to determine the source of any problem, regardless of how novel or emergent the failure may be, in seconds if you’re using something like BubbleUp in SLOs.
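BubbleUp does this analysis for you inside Honeycomb. Purely to illustrate the “start wide, then narrow” idea (and not how BubbleUp is actually implemented), here’s a hand-rolled sketch that flags attribute values that are over-represented among failing events:

```python
from collections import Counter

def overrepresented_attributes(events, is_bad, fields):
    """Flag attribute values that appear far more often in bad events
    than in the overall population. A toy 'start wide, then narrow'
    pass, not the real BubbleUp algorithm."""
    bad = [e for e in events if is_bad(e)]
    if not bad:
        return []
    findings = []
    for field in fields:
        all_counts = Counter(e.get(field) for e in events)
        bad_counts = Counter(e.get(field) for e in bad)
        for value, n_bad in bad_counts.items():
            bad_share = n_bad / len(bad)
            overall_share = all_counts[value] / len(events)
            if bad_share > 2 * overall_share:    # crude "interesting" threshold
                findings.append((field, value, bad_share, overall_share))
    # Most suspicious first: biggest gap between bad share and overall share.
    return sorted(findings, key=lambda f: f[2] - f[3], reverse=True)
```

Feeding this hypothetical helper your qualifying events, an `is_bad` predicate built from the SLI above, and fields like `device.os`, `user.locale`, and `region` would surface the iOS-14.1-French-in-Canada kind of pattern from the earlier example.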
Why can’t I just use alerts from a monitoring tool?
At this point you may be thinking: aren’t monitoring alerts enough? Well, if you’re relying solely on monitoring alerts, you have to know ahead of time what will cause an issue and set up an alert for it. What if the problem was caused by people on iPhones running iOS 14.1 with a French language pack, hitting your service from Canada? Would you have created an alert for that combination of conditions?
That’s where SLOs pull ahead of cause-based alerts. They’re not influenced by hindsight bias, and they let you go on a debugging journey no matter where the issue is coming from, rather than presupposing which parts of your system are going to break.
Take us at Honeycomb, for instance. We realized the power of SLOs during a partial degradation that our SLOs caught but our monitoring missed, because the monitoring checks required two failed probes in a row. During a 1–2% brown-out, two consecutive probe failures are unlikely. Luckily, our SLO immediately started burning, and we realized, relatively quickly, that our users were being affected.
That was our tipping point for SLOs because they helped us understand the nuances of what was going on with users, especially in the long tail, rather than just the majority experience.
Get your SLO fix
This was just a quick overview of our SLO discussion. Watch the webinar recording or check out our O’Reilly book to learn all the details. Stay tuned for our next discussion in our Authors’ Cut series on August 23, where we’ll cover CI/CD architectures and how observability can be used to debug pipeline issues. If you want to give Honeycomb a try, sign up to get started.