Define Success: The Right SLOs for Your Org
#1 SLO Theory: Why the Business Needs SLOs | #2 Get Started: Build One Simple SLO
Transcript:
Liz Fong-Jones [Developer Advocate|Honeycomb]:
Hello, and welcome to our webcast on Production SLOs: Success Defined. This is part three of a three-part webcast series where we're explaining the value of service level objectives for your production systems. The theme of this series is success defined, and our goal is to take you on a journey and share our thinking about why it's important to create SLOs for your business, how to define the best ones, and how to iterate over time to get the most success out of them.
SLOs are unique because they bring together a number of different stakeholders from engineering to operations to product, to agree on what the important things are that we ought to measure about our applications and services. So with that, let’s dive into the third part of this series where we’re going to talk about how to solve your reliability fears with service level objectives.
My name is Liz Fong-Jones, and I’m joined by Kristina Bennett. We are here on behalf of Google and Honeycomb to tell you about some best practices that we developed over our years working together on the customer reliability engineering team at Google. Kristina, what worries you about your production services? What worries you about reliability?
Kristina Bennett [Site Reliability Engineer|Google]:
I worry about whether our service is continuing to behave and run fast enough while nobody’s watching it.
Liz Fong-Jones:
Yeah, that’s true. And in your case, you operate services on behalf of many, many important customers that use Google Cloud, right?
Kristina Bennett:
Absolutely.
Liz Fong-Jones:
So you just went through Black Friday and Cyber Monday, which I imagine are very important times of the year for those customers.
Kristina Bennett:
They absolutely are. And it's a uniquely challenging time of the year, because it's always a challenge when any customer is having some kind of peak activity going on. But events like Black Friday are uniquely set up to create that kind of resource scarcity and those load levels for a large swath of customers all at the same time, which creates a different sort of challenge.
Liz Fong-Jones:
Yeah, that totally makes sense. And in the case of Honeycomb, we worry a lot about the reliability of our services because they’re how our customers measure their own experiences. So we care about making sure that our customers are able to send us their telemetry to analyze. And when that doesn’t work, then our customers lose trust in their own ability to observe their systems. And that is really frustrating for them.
As we often say, observability is the ability to see into your systems, and if you can't see into them, it's much harder to understand what's happening inside of them. And I think beyond those immediate needs of our own services, both Kristina and I wind up thinking a lot about how we make our customers successful and how we help them get more reliability out of their services.
02:53
So today, we’re going to talk through four things. First of all, we’re going to recap what a service level objective is. Secondly, we’re going to talk about how to define the right service level objective. Third, we’re going to talk about how you enforce your SLOs and make them actually practical. And finally, we’ll talk about the benefits of adopting SLOs and how they make your engineering teams more agile. So let’s start by introducing ourselves. Kristina, tell us a little bit about yourself.
Kristina Bennett:
Hi, I'm Kristina Bennett, and I have been at Google for over 10 years now. And for over six of those, I've been an SRE. I've worked on a variety of things there, including ads and storage, but now I'm on the customer reliability engineering team, a special SRE team that is customer facing for Google Cloud.
Liz Fong-Jones:
Cool. And I'm Liz Fong-Jones, and I spent 10 years at Google as a site reliability engineer. My second-to-last team at Google was the customer reliability engineering team, where I worked with Kristina. And I still have the patch from that team. Today, I am a principal developer advocate at Honeycomb, where I help people adopt SRE practices by helping them gain better observability into their systems.
Kristina Bennett:
Now, it was mentioned in episode one that one of the key realizations that Ben Treynor Sloss, the founder of SRE at Google, had was that 100% is the wrong reliability target for basically everything. And why? Because as you make incremental increases in your reliability towards 100%, those increases cost exponentially more and deliver diminishing returns for you and your users.
And as you make these increases, you will be monopolizing more and more of your resources, which is going to slow you down and prevent you from delivering on other things besides the reliability of that one service. And the idea of choosing a target that is not 100% and then adjusting your behavior according to that is where error budgets come from. So these familiar numbers help us describe the consequences for setting a particular availability level for your service.
For example, take three nines, a 99.9% target. If you were perfectly reliable for the rest of the time, it would allow you 43 minutes of downtime per month. But to use that number as an error budget, we have to do a bit more math. First, engineering and the business stakeholders should talk to each other to determine the correct availability target for your service.
And once you’ve chosen that target, the downtime allowance for that target, as described in the table we just looked at, becomes your budget of unreliability or your error budget. When you have the error budget, you now need some monitoring, so you can measure what your actual performance is like. When you know what your actual performance is, the difference between that and the target performance represents how much of your budget is being spent by that deficit.
So now that you have some numbers that show what your actual performance is and how it relates to the performance that you would like to have, you now have a control loop for utilizing this budget to direct your actions and improve your service as best benefits you and your users.
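To make that arithmetic concrete, here is a minimal sketch of how an availability target translates into a downtime allowance over a window. The 30-day window and the example targets are illustrative assumptions, not recommendations.

```python
# Minimal sketch: turn an availability target into a downtime allowance.
# The 30-day window and the example targets are illustrative assumptions.

WINDOW_DAYS = 30
WINDOW_MINUTES = WINDOW_DAYS * 24 * 60  # 43,200 minutes

def downtime_allowance_minutes(target: float) -> float:
    """Minutes of full downtime allowed per window at this target."""
    error_budget = 1.0 - target  # e.g. 0.001 for "three nines"
    return WINDOW_MINUTES * error_budget

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {downtime_allowance_minutes(target):.1f} min per window")
# 99.00% -> 432.0 min per window
# 99.90% -> 43.2 min per window
# 99.99% -> 4.3 min per window
```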
06:32
Liz Fong-Jones:
So how does this tie into the idea of the service level objective we were talking about earlier? Because I heard you talking about error budgets, but not SLOs, over the past couple of minutes.
Kristina Bennett:
Well, the SLO is actually your availability target. Your availability is your service level and the target is your objective. So that’s your SLO. And the SLI, or service level indicator, is the metric that you’re using when you’re measuring your performance against that target. You may use or have heard of an SLA, a service level agreement, which is an important business extension of providing predictably reliable services for your users. But for this talk, we’re going to keep our focus on SLIs and SLOs.
Liz Fong-Jones:
To summarize, it sounds like the error budget is the complement of the service level objective: once you've set the objective, the error budget is the proportion of errors you're allowed. And the service level indicator defines whether a given interaction was successful or not. Great. I think that helps us get on the same page about what a service level objective is. So now, let's turn our attention to how we set up our SLOs in order to make sure that our users are happy.
Kristina Bennett:
To make sure that the users are happy, we need that measurement of what makes an interaction successful enough, and it needs to represent what the users are actually seeing. If the users visit your service and the landing page won’t load, the users are not going to care why. They don’t care if it’s your load balancer or your database. They only know that they went there, they wanted your page, and they didn’t get it in the time that they were willing to wait.
So we want SLIs that can quantify that experience and map to the user's overall happiness with the interaction. Well, what kind of metric could you use to build a good SLI? We usually have a lot of metrics that we use to manage our service day to day. A lot of them are things like CPU usage or queue length that represent the internal state of our servers and services, but they also tend to be very noisy and not representative of what the users are actually seeing moment to moment.
And they usually look like the graph on the left that we've labeled "bad" here. The graph on the right shows a smoother metric, one that we've aggregated enough to smooth out the noise and that correlates much more directly with user happiness. So let's imagine that the area indicated in red in the center here represents an actual outage our service has had, and that these are two metrics that we've been measuring.
The graph on the left does show a distinct downward slope during the incident, but the large variance of the metric obscures when exactly the incident started. And there's a lot of overlap between the values you see when the service was operating well and the ones you saw during the incident.
That really obscures at what point the operation is actually good or bad, and it's not very representative of what the users were seeing. The graph on the right shows a distinct dip during the time of the incident. The values go down when the users are experiencing poor interactions, and they go back up again as the service starts to recover.
Liz Fong-Jones:
The important property here is that we can set a concrete threshold, and that that threshold is measuring something about the user experience such that if it’s above or below, we can know to mark the experience as good or bad. On the graph on the left, what we’re seeing is that setting any limit, no matter what that limit is, does not enable us to classify the events during the outage in which users were unhappy as bad. Whereas, in the example on the right, we can say that anything below that limit is a bad experience. It doesn’t necessarily have to be below. We could also see, for instance, situations where values above a certain number would be bad rather than good.
Kristina Bennett:
That’s true. It depends on what you’re measuring, but the key is that when you’ve decided which side of the line is good, then you now can build a metric that maps to a binary outcome of whether each interaction was good or bad.
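As a rough illustration of that idea (a hypothetical sketch, not Honeycomb's derived-column syntax), classifying each event against a concrete threshold might look like this. The field names and the 500 ms latency threshold are assumptions.

```python
# Hypothetical sketch: turn each event into a binary good/bad outcome
# against a concrete threshold. The field names and the 500 ms latency
# threshold are illustrative assumptions.

LATENCY_THRESHOLD_MS = 500

def is_good_event(event: dict) -> bool:
    """An event is good if it succeeded and was fast enough."""
    return event["status_code"] < 500 and event["duration_ms"] <= LATENCY_THRESHOLD_MS

events = [
    {"status_code": 200, "duration_ms": 120},   # good
    {"status_code": 200, "duration_ms": 1800},  # bad: too slow
    {"status_code": 503, "duration_ms": 40},    # bad: server error
]
print([is_good_event(e) for e in events])  # [True, False, False]
```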
Liz Fong-Jones:
And this maps very well to our discussion in the previous SLO webinar episodes in which we talked about the idea that the Honeycomb derived column language can enable you, in combination with fields you’ve already set on your events, to classify things as good or bad. So let’s talk about maybe some of the things that we might want to measure about our services. What are the properties that we might want to have on our events in order to classify them as good or bad, Kristina?
11:37
Kristina Bennett:
Well, although there are many kinds of services at many levels of complexity, there are luckily a few key dimensions that apply very broadly across these different kinds of services and are the ones that users tend to care the most about. For many services, we can pick from what we on the SRE team call the SLI menu. For a synchronous request/response interaction, such as an interactive webpage or a synchronous API, a successful-enough interaction would usually be measured in terms of availability, latency, or the quality of the response that you got. Did it respond? Was it fast enough? And when it responded to you, was it correct and complete?
Liz Fong-Jones:
Let’s talk through an example of that. For the case of a request response service like a shopping website, what would be an example of availability, latency and quality?
Kristina Bennett:
Well, if you tried to visit the site and the page started to load, that would be an example of availability: you didn't get a 404 when you tried to visit the site. If you went there and it loaded very quickly, that would be an example of successful latency. And if it showed you the available purchases that you would expect to see, and gave you a complete and consistent list (so that, for instance, if you performed the same search multiple times, you got the same results, or at least a consistent response), then that would represent a high quality response.
If, on the other hand, you have a data processing service, "successful enough" is going to look more like whether the overall result was complete and correct enough. So for instance, if you had a feed tag processor or a cumulative report generator, you would want those reports to come out consistently with all of their information correctly displayed. And if it is a streaming service, you're probably going to care about the freshness of the results and make sure that the streaming lag isn't too long.
And for batch processing, you're going to care about how fast the end-to-end processing is going, and whether the results are arriving while they're still relevant and fast enough that the pipeline doesn't start to get backed up. And of course, there's also storage. For storage, we typically want to know whether we have good durability: what proportion of written data can we successfully reread?
Liz Fong-Jones:
And indeed, I think in episode one, we talked about a durability failure example. So these are all things that are relevant to the systems we run at both Google and Honeycomb.
Kristina Bennett:
Absolutely. And for any of those metrics, a convenient way that we can calculate them, which was again referenced in episode one, is as a proportion. This is very useful for us because when we represent it as a percentage where we can map zero to completely bad and 100% to completely good, we can really intuitively understand what’s going on.
It becomes very easy to reason about this number, to set a target and understand how the metric is performing relative to that target, and to have many SLIs that we can understand without constantly re-deriving their calculations. If we see 80% on this one and 85% on that one, we know those are percentages and we immediately understand their relative values. We're not doing any normalization or comparison across different calculations in our heads.
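In code, that proportion view of an SLI is very small. Here is a minimal sketch; the predicate passed in could be something like the hypothetical is_good_event() classifier from the earlier sketch.

```python
# Minimal sketch: an SLI expressed as the proportion of good events.
from typing import Callable

def sli_percent(events: list[dict], is_good: Callable[[dict], bool]) -> float:
    """Percentage of events classified as good (0 to 100)."""
    if not events:
        return 100.0  # no traffic: nothing failed
    good = sum(1 for e in events if is_good(e))
    return 100.0 * good / len(events)

# e.g. sli_percent(events, is_good_event) on the three events above -> ~33.3
```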
15:48
Liz Fong-Jones:
And it is true that there are many different ways of expressing SLIs and SLOs. However, in the case of Honeycomb's product, because we're an event-based product, we believe that a good event/bad event model works much better than other models. You may also see SLO formulations that are based on fractions of time, where, for instance, each five-minute window is classified as good or bad. However, we're not going to cover that in this particular series.
Kristina Bennett:
It's true. And even in those approaches, usually what determines the success of those minutes is an examination of events in the underlying calculations.
Liz Fong-Jones:
Yes, exactly. So it speaks to a question of granularity, where having access to those raw events tends to be better. So let's talk about how we classify events. How do we define what that SLI means? In order to define what an SLI means, we have to think about those good and bad user experiences. Think about it: is an individual user journey a happy one, or is it one that made the user dissatisfied with us in some way, big or small? And that's how we wind up defining service level indicators.
Kristina Bennett:
Right. We want them to be able to tell us whether each interaction was successful enough. Now we have a way we can measure everything, right?
Liz Fong-Jones:
Well, maybe not. We need to focus on user experiences, which means that we need to focus on what the most important user journeys are because it probably matters very, very much whether someone can use the home page of our shopping site but maybe not so much whether they can use an individual settings widget inside of their preferences. So therefore, we need to think about the overall business objectives and wind up setting one to three service level indicators for each distinct user journey that we expect our users to be able to achieve on a consistent basis.
Kristina Bennett:
Can you give me an example of what you mean by user journeys?
Liz Fong-Jones:
For instance, a user journey involving an eCommerce site might be that a new visitor to our website should be able to load the homepage to see items available for purchase, then be able to click through to view one or two items, and then be able to check out. That might be one core user journey of the purchaser experience.
However, there might be a distinct user journey for people who are listing items for sale where they might instead expect to be able to upload photos of their items and have those items turn up in the search index and be sellable to users. Those are a variety of different kinds of user journeys that you might expect even on a site such as an eCommerce site that is relatively simple to describe and that we hopefully all can understand based off of the description.
Kristina Bennett:
I see. So we could have several different kinds of user journeys for our service, but for each of those, we want to make sure we only have one to three indicators to avoid trying to overload ourselves with too many measurements.
Liz Fong-Jones:
Exactly. So for the example of the seller, we probably don't want one service level indicator for "can the seller update their mailing address?", another for "can the seller change their payment options?", and another for "can the seller upload an item?" Having that many distinct service level indicators would drown us in noise and we'd be overwhelmed.
Kristina Bennett:
Great. But we can certainly choose our most critical ones. We can measure how often their uploads are successful and perhaps take that as one of our SLIs because we’ve determined that that is a critical measurement that is directly relevant to our primary user happiness and interactions.
Liz Fong-Jones:
Yes, exactly. But Kristina, you work on Google Cloud. Google Cloud is a really complex product. How do you handle that situation? How can you manage to restrict yourself to one to three SLIs?
Kristina Bennett:
Well, it is difficult. There is a lot to prioritize, but as you've already illustrated in several ways, not every action or interaction is equally critical to the success of our service. So we really can't be afraid to prioritize which user journeys are the critical ones and make sure that we focus on those. We still want to measure everything else and keep an eye on it, but the things we prioritize, in terms of what alerts us most quickly and what we base the success of our service on, have to be our most critical interactions.
And there are also a couple of strategies we can use to help prune the measurements that we’re making or group them into categories so that we can broaden our coverage without overwhelming ourselves quite so much with different SLIs. For instance, a lot of services will have a lot of conceptually similar journeys. For instance, in the Play Store, there are many different ways that you might browse the content of different apps and look for things that you are interested in.
All of the activities pictured here could be considered as variants on a single browse-the-store journey. So as long as these different interactions all have about the same magnitude, you can group them together into a single browse measurement and use that as a way to simplify the number of things that we’re measuring individually.
A related strategy is to bucket together things that share the same threshold, as long as they're sufficiently similar. For instance, if you have many different types of latency SLIs, then you could consider grouping them together into categories that have similar magnitudes.
For instance, you might have a single category that you label as interactive requests. Those might have a very strict latency requirement, such as 500 milliseconds. But you might also have separate interactions in a write category that you allow to have a longer latency requirement and that you know users will also expect to take longer.
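As an illustration of that bucketing, here is a hypothetical sketch. All of the category names, paths, and thresholds are assumptions for the example, not recommendations.

```python
# Hypothetical sketch: bucket request types into latency categories that
# share a threshold, instead of one SLI per endpoint. All names and
# thresholds here are illustrative assumptions.

LATENCY_THRESHOLDS_MS = {
    "interactive": 500,     # e.g. page loads, clicking "next"
    "write": 3_000,         # e.g. checkout, photo upload
    "background": 10_000,   # e.g. report generation
}

REQUEST_CATEGORY = {
    "/home": "interactive",
    "/search": "interactive",
    "/checkout": "write",
    "/upload_photo": "write",
}

def within_latency_threshold(path: str, duration_ms: float) -> bool:
    """True if this request met the latency threshold for its category."""
    category = REQUEST_CATEGORY.get(path, "background")
    return duration_ms <= LATENCY_THRESHOLDS_MS[category]
```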
22:16
Liz Fong-Jones:
That's definitely correct. I expect running my credit card to take a few seconds; that's not a problem. Or if I'm uploading a large photo, of course that should take a little while. I'm willing to wait for that.
Kristina Bennett:
Sure. But if you click the next button on a series of text boxes, you don’t expect that to take several seconds. But we haven’t covered yet exactly how we should write these SLIs. We know we want to measure them, but what does that mean? Well, when we at the CRE team are specifying SLIs, what we strongly recommend is to distinguish the specification from the implementation.
And what I mean by that is to have a specification as a simple statement of what interaction you are trying to measure. For instance, as shown here, the profile page should load successfully. It’s a very concise and direct statement of what it is that we want to measure with as few words as possible. This framing is very understandable for both the engineers and the product people and leadership to understand what we are trying to measure. And it should represent something that’s important to you and your users.
But it really leaves a lot of questions about what exactly this means and how you're going to go about it. For instance, it says load successfully, but what does success mean here? And how are we measuring it? Where are we measuring it? Here's an example of what the implementation of the SLI might look like. In this example, we say that the SLI is the percentage of GET requests for these specific URLs whose responses are in this given list of status codes, measured at the load balancer.
Liz Fong-Jones:
Which makes perfect sense because in Honeycomb's view of the world, if you're ingesting your layer-seven (HTTP) load balancer logs, you want to make sure that you are analyzing those logs and asking: what's the status code? What's the path? Is the path something that I'm expecting to measure, or is it something extraneous? And then, did the status code match what I was expecting to see? An alternative might be, for instance, measuring it using instrumentation in your server-side code, or measuring it from the client. Those are all different perspectives with different trade-offs.
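A minimal sketch of that load-balancer implementation follows; the log record fields, paths, and status codes are assumptions for illustration.

```python
# Hypothetical sketch of the load balancer SLI implementation described
# above: percentage of GET requests to specific paths whose response
# codes are in an allowed list. Log record fields, paths, and status
# codes are illustrative assumptions.

MEASURED_PATHS = {"/profile", "/profile/view"}
GOOD_STATUS_CODES = {200, 301, 302}

def profile_load_sli(log_records: list[dict]) -> float:
    """Percent of in-scope GET requests whose response code is allowed."""
    in_scope = [
        r for r in log_records
        if r["method"] == "GET" and r["path"] in MEASURED_PATHS
    ]
    if not in_scope:
        return 100.0
    good = sum(1 for r in in_scope if r["status"] in GOOD_STATUS_CODES)
    return 100.0 * good / len(in_scope)
```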
Kristina Bennett:
Exactly. They're all valid approaches, but they're going to have different value for you depending on what your primary goals are with the measurements and what technology is available and practical for you in this instance. For instance, perhaps you'd really like to measure in the user's browser because you know that's going to represent what the users are seeing. But you're going to experience more latency in getting the results of that measurement than if you were measuring directly on your own servers, and it may present significantly more technical difficulty than using the metrics that are already readily available to you.
With each of these choices, we shape the efficacy and complexity of our SLIs. It becomes very easy here to get mired in the details of what we're trying to do. But this is when we go back to the specification, which is always there as a simple statement to remind us of what it was we were originally trying to measure. And with these techniques, we can now develop some really good SLIs. But how can we create the line that represents user happiness: our SLO, our target?
Liz Fong-Jones:
And part of the challenge here is that the SLI defines whether one user interaction is happy or sad, but Kristina and I shouldn't get paged for every single bad user interaction. So we have to think about an aggregate expectation of the number of happy user interactions.
Kristina Bennett:
That’s right. And that’s where our service level objectives are going to need both a target and a measurement window. We can use these measurement windows to think about how far back we’re looking in our measurements and what scope we’re looking at as we try to measure it.
In this example here, we’ve chosen that the target is going to be that we should be 99.9% successful in our availability. And we should measure that over the previous 30 days. This also goes back to the tables we were looking at where we said that 43 minutes was the amount of time that you would get over a month. This is where that applies, where our lookback window, our measurement window is 30 days and that’s going to help us define what our error budget’s going to be.
Liz Fong-Jones:
The other thing that I think is interesting here is this idea of decoupling the threshold that we use for a single request, for instance, saying a request is successful if it returns a 200 (okay, but what about 403s? What about 500s?), from the aggregate view of how many requests we expect to succeed.
The other interesting property of this is thinking about brownouts. If we measure on a per-event basis, periods of time that have more events matter more than periods of time that have fewer events, so it's not necessarily going to be exactly 43 minutes; it could be a smaller amount of time if you're receiving more traffic during that window, for instance, at noon on Friday. Let's talk now about how we define what that target ought to be, because we've talked about several numbers like 99.9% or 95%. How should I decide that threshold, Kristina?
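To make the event-based framing concrete, here is a small sketch of an error budget counted in events rather than minutes. The traffic and failure counts are made-up numbers for illustration.

```python
# Minimal sketch: an event-based error budget over a 30-day window.
# The traffic and failure counts are made-up numbers; with per-event
# accounting, busy periods consume budget faster than quiet ones.

SLO_TARGET = 0.999             # 99.9% of events should be good
total_events_30d = 50_000_000  # hypothetical traffic over the window
bad_events_30d = 38_000        # hypothetical failures over the window

allowed_bad = (1 - SLO_TARGET) * total_events_30d  # 50,000 events
budget_remaining = allowed_bad - bad_events_30d    # 12,000 events
print(f"budget remaining: {budget_remaining:,.0f} events "
      f"({100 * budget_remaining / allowed_bad:.0f}% of the budget)")
# budget remaining: 12,000 events (24% of the budget)
```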
28:20
Kristina Bennett:
Well, what we want to do is represent user happiness. So what we need to decide is what performance do the business and the users need. We want to set targets that make sure that our proportion of successful events will be high enough to keep the users happy and keep them with our service. On the other hand, we want to make sure we don’t set it too high and expend too many of our resources in providing reliability that isn’t actually adding additional value for them.
Liz Fong-Jones:
And that's exactly part of the historical tension between operators and developers: operators are constantly pushing for as much reliability as possible, and developers are pushing for as many features as possible. By refocusing on what performance the business needs, we can move as fast as possible while maintaining that reliability constraint. But let's suppose that I don't know where to start with this. Let's suppose I talk to my product manager and they don't know what reliability number to target either. Where should we start?
Kristina Bennett:
Absolutely. This is a difficult question for everyone, and it can be very easy to get into discussions where you start to say, oh, I think we're going to need a two-year user research study to understand what the correct level of happiness is. And that's actually probably a good thing to have, but you need an SLO now. So a good place to start is to remember that user expectations are strongly tied to your past performance.
If you are already performing at a particular level and the users seem happy, they’re not banging down your door and calling you up and telling you how unhappy they are with your service, but are generally satisfied, then a good place to start is just to set an achievable SLO at the place where you are performing right now. You have this data and you have all your infrastructure and your performance in place right now.
Start where you are, and then you can continuously improve from that level. You should set an initial SLO, and then once you've had some experience with it, it will start to reveal whether it's actually representative of your users and whether you should shift it to be tighter or even less aggressive, to strike a good balance between what makes your users happy and what you can achieve.
Liz Fong-Jones:
Speaking of achieving SLOs, if we start from historical performance, that usually means I can continue to achieve that in the future with less work than if I set something that would be completely unattainable in my current state. But how do I change my SLO over time? You talked earlier about adjusting your SLO.
Kristina Bennett:
Right. Well, as you continue, you may want to improve the reliability of your service. Perhaps you feel that your current SLO isn’t as high as you would like, or perhaps you have been evolving your service, or the needs of your users have evolved and you’ve decided that you want to be at a different level than where you are.
So you should be able to set a new SLO, a new target that we can call your aspirational SLO. This is where you want to be, without necessarily being where you are right now. Even if it's out of reach of your current technical capabilities, it gives you a goal and lets you make a technical plan for how to get from where you are right now to where you want to be.
Liz Fong-Jones:
Great. That makes a lot of sense. So we've talked a lot about how to set the SLOs, but how do we actually make sure that we manage our service according to that SLO? How do we make sure that we actually achieve it?
Kristina Bennett:
I think this would be a good time to start talking about how we enforce these SLOs.
Liz Fong-Jones:
Let’s suppose that I have a bunch of people who are concerned now that we’re tolerating errors, and let’s suppose that they say, is one million errors okay? Is 30 minutes of 100% errors okay? How do I answer those questions, Kristina?
Kristina Bennett:
Well, I think it’s time to go back and look at what your SLO is and how you’ve defined it for yourself.
Liz Fong-Jones:
That makes a lot of sense. So let's go and look at some math together. This is a vendor-neutral representation of four different views of the same service level objective. In the upper left-hand corner, we can see the burndown graph, which shows, starting from today and looking back over, say, the past 30 days or, in this particular case, the past three months, how much error budget we spent on each given day: starting from 100% of our error budget and burning down to its current state, which here looks like it's been overspent by 80%.
So we can see, for instance, that we had a large outage on the 16th, 17th, and 18th of October, and that it dropped our overall availability. The second graph, in the lower left-hand corner, shows for each day whether we exceeded our availability target or fell short of it, but it doesn't say by how much.
In the upper right-hand corner is the cumulative error budget, called the rolling-window success rate: starting on, for instance, October 18th, if I go back and look 90 days, what percentage of successes do I see over that 90-day window? And what do I see on the 19th looking back 90 days, then the 20th, and so forth? So we can see here that the outage had an immediate downward effect on our rolling 90-day availability.
And finally, in the lower right-hand corner, you can see the success rate over each individual day: how bad was it on that particular day? Over those three particular days, for instance, we can see we only achieved about 99.2% success, where our target was 99.95%. So this helps us get a picture of both the cumulative effect of all of our previous outages and the smaller individual outages.
34:56
So with the three-day major outage, what we can see is that the lower left hand view under-emphasizes the effect of the outage because it says that the outage happened on those days, but it doesn’t say how bad it was, and that we really need to look at the other graphs to get a sense of what impact it had on our error budget.
Conversely, we can see that after several days, or a week, or a month, our past outages eventually roll off of our error budget, and the rolling-window success rate recovers back closer to normal as those events age out of the 90-day window. But let's also talk about the more subtle kinds of failures, the slower-burn failures, where we're performing just short of where we need to be as a service.
For instance, if we are targeting 99.95% and we're actually achieving 99.9%, what does that mean for our service? Well, it means that on each individual day we're underperforming our target, but only by a little bit, and this is something we may want to address in the long term. On day one or two it isn't necessarily an emergency, but it still shows up in our error budget burn.
We can see that in the upper left hand corner that we’re burning down slightly faster than we should be in order to meet our SLO, and that cumulatively, over the rolling windows, we can see that the success rate remains about constant but is below our target. And finally, we can see on the view of each day that we are short of where we need to be.
All four of these views are, in one way or another, expressed within the Honeycomb SLO UI, although we believe the lower left-hand view can be slightly misleading, which is why it's omitted. Instead, we show you the burndown, the rolling success rate, and then the histogram and the heatmap together to express the success rate for each given day, or each given set of hours you're looking at.
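As a rough, vendor-neutral sketch (not any particular product's implementation), the burndown and rolling-window views can be derived from per-day good/total event counts; the daily counts below are made up for illustration.

```python
# Rough, vendor-neutral sketch: derive the burndown and rolling-window
# views from per-day event counts. The daily counts are made-up numbers.

SLO_TARGET = 0.9995  # 99.95%

# (good_events, total_events) per day, oldest first: 27 quiet days
# followed by a hypothetical three-day outage at ~99.2% success.
daily = [(999_600, 1_000_000)] * 27 + [(992_000, 1_000_000)] * 3

def budget_remaining_fraction(days) -> float:
    """Burndown view: fraction of the window's error budget left.
    Negative means the budget has been overspent."""
    total = sum(t for _, t in days)
    bad = sum(t - g for g, t in days)
    allowed = (1 - SLO_TARGET) * total
    return (allowed - bad) / allowed

def rolling_success_rate(days) -> float:
    """Rolling-window view: overall success rate across the window."""
    total = sum(t for _, t in days)
    good = sum(g for g, _ in days)
    return good / total

print(f"budget remaining: {budget_remaining_fraction(daily):+.0%}")
print(f"rolling success rate: {rolling_success_rate(daily):.4%}")
```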
And hopefully that helps tie these concepts together: how outages impact the SLO, and how the various visualizations of the SLO relate to each other. So with that, let's talk about the idea of slow and fast burn.
Kristina Bennett:
I can see that you’ve laid out here that there’s really a big difference between a fast burn that’s going to consume my budget very quickly, and a slow burn that may actually only consume my budget over several days or weeks.
Liz Fong-Jones:
Yes, that’s exactly correct. One of those is an emergency where, if we can halt it fast enough, we can avoid blowing through our entire error budget. Whereas in the second case, we potentially have days in order to stop it before it meaningfully impacts our error budget. And therefore, these two situations need to have dramatically different responses. They also are situations that are easier or harder to detect depending on how long you wait.
You wouldn't want to wait days to gain confidence that you're going to drain your error budget in hours, whereas you probably do need to wait days to gain confidence that you're going to drain your error budget by the end of the month. And thus, if we overspend our error budget by a factor of five, for instance, in a given hour, or, in Honeycomb's model, if we look at the past hour, project forward by four hours, and see that you're going to run out of error budget in four hours, we think you ought to know about it so you have a chance to stop it.
In the case of the slow burn alert, we need to sit and wait for several days in order to understand, is this actually a problem that is statistically significant and is differing from the target rate of error budget consumption that we intend? And that allows us to avoid sending you false alarms. So let’s look at what some of that math looks like.
In the case of a fast burn alert, we want to have confidence that if we project the current rate of failures of the past hour out, that it’s so far above our threshold that we know that you need to be woken up because otherwise you’re going to exhaust your error budget way too fast. For instance, in this diagram, we’re measuring the error rate over the past hour and comparing against your steady state expected error rate according to your SLO and using that to decide whether to wake you up.
In the second case, the slow burn case, we instead have to look over multiple days to get enough confidence that you are consistently exceeding your intended rate of error budget consumption and that you're going to deplete your error budget over the course of the month. That's the main difference between those two cases, and it's why good error budget alerting needs to employ a combination of those two scenarios.
Because if we don't have that combination of the two scenarios, we wind up either failing to respond fast enough to a fast burn or being unable to respond to situations that play out over too short a time window, even if they would cause us to deplete our error budget by the end of the month. Overall, the notion of an error budget and an error budget burn alert requires you to have a control loop. It requires you to know, as an operator, when things have gone too wrong so you can put them back into a good state.
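A minimal sketch of those two alerting conditions follows; the window lengths, the four-hour projection, and the thresholds are assumptions for illustration, not Honeycomb's or Google's exact parameters.

```python
# Minimal sketch of fast-burn vs slow-burn error budget alerts.
# Window lengths, the 4-hour projection, and thresholds are assumptions.

SLO_TARGET = 0.999
BUDGET_RATE = 1 - SLO_TARGET   # error rate allowed at steady state
WINDOW_HOURS = 30 * 24         # 30-day SLO window

def fast_burn_alert(bad_last_hour: int, total_last_hour: int,
                    budget_fraction_left: float) -> bool:
    """Page now if the last hour's burn rate, projected forward four
    hours, would exhaust the remaining error budget."""
    if total_last_hour == 0:
        return False
    # Fraction of the whole window's budget consumed per hour at this rate.
    hourly_burn = (bad_last_hour / total_last_hour) / BUDGET_RATE / WINDOW_HOURS
    return hourly_burn * 4 >= budget_fraction_left

def slow_burn_alert(bad_last_days: int, total_last_days: int) -> bool:
    """Raise a lower-urgency alert if the multi-day error rate sits above
    the budgeted rate, i.e. we're on track to miss the SLO this month."""
    if total_last_days == 0:
        return False
    return (bad_last_days / total_last_days) > BUDGET_RATE
```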
40:19
Kristina Bennett:
And I think that a lot of teams have pretty good coverage for what we call here fast burn alerts. They're the things where, if you can see over, say, 10 minutes that something is wrong with your service, it's time to react. But I think the slow burns are much harder for many of us to capture. It's much harder to see and quickly react to the fact that, say, our performance has dropped very slightly below where we want it to be, and it only starts to show how bad it is over several days, as users become increasingly frustrated. So what should we do about this?
Liz Fong-Jones:
One way of doing things is to react with emergency response mode. However, we can’t necessarily react with emergency response all the time. At a certain point, trying to defend a 99.9% SLO when your service is constantly serving 1% errors, that’s not going to work without talking to people. So the next section of this is really devoted to the art of talking to people in order to align your aspirational and achievable SLOs.
What we're going to look at here is four different case studies: what happens when you burn a little bit of error budget, what happens when you burn through all of your error budget, and what happens if you're not able to meet your error budget no matter how hard you try. So let's suppose that I'm doing a release, it reaches 50% of production, and I realize that it's serving errors to that 50% of production, and that it's therefore caused 30 equivalent bad minutes of 100% outage. Is this the end of the world, Kristina?
Kristina Bennett:
Well, if I look at it for a minute, I can see that you’ve said the target is 99.9%. And if I remember from our table from before, that means that we have an effective 43 minutes of downtime available to us per 30 day window. So it looks like we’ve maybe used only about two thirds of our budget so far. That’s not great, but I think we’re doing okay.
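As a worked sketch of that arithmetic, assuming the bad release ran for an hour while affecting 50% of traffic (one way to arrive at 30 equivalent bad minutes):

```python
# Worked sketch of "equivalent bad minutes", assuming the bad release
# ran for 60 minutes while serving errors to 50% of traffic.

SLO_TARGET = 0.999
BUDGET_MINUTES = 30 * 24 * 60 * (1 - SLO_TARGET)  # ~43.2 min per 30 days

outage_minutes = 60
fraction_of_traffic_affected = 0.5
equivalent_bad_minutes = outage_minutes * fraction_of_traffic_affected  # 30.0

remaining = BUDGET_MINUTES - equivalent_bad_minutes
print(f"budget {BUDGET_MINUTES:.1f} min, spent {equivalent_bad_minutes:.0f} min, "
      f"remaining {remaining:.1f} min")  # roughly 13 minutes left
```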
Liz Fong-Jones:
Yeah, I think that with 13 minutes of error budget left, we can make it through the rest of the month, as long as we're a little bit more cautious. Maybe we need to roll things out more slowly than 50% in an hour. Maybe we need to automatically roll back, and maybe those things will help us. So business as usual, we just need to prioritize a little more reliability in how we operate. Okay, but let's suppose that I push again, and this time I had appropriate canarying, so the outage only lasted 10 or 15 minutes and I've now incurred 12 bad minutes. Not so good. What should I do, Kristina?
Kristina Bennett:
Well, if this had been the only incident, I’d say we were in okay shape, but on top of the other ones, we’ve pretty much exhausted our budget. So I think we’re going to have to start looking at what we can do to improve things and avoid this happening again in the near future.
Liz Fong-Jones:
In practice, as an SRE, what you're saying is that I need to involve other team members, and maybe we need to collectively agree that we're not going to push new features and that we're instead going to prioritize working on reliability. That doesn't mean freezing everything: we do need to bring that issue that could cause us to serve errors to 20% of production under control as quickly as we can. Great. That's super helpful.
Okay, but now let's turn our attention to a different kind of scenario, one in which there's nothing concrete to roll back. We might have one database server that's falling over under load: it used to serve only one error in every 1,000 requests, but now it's serving two errors out of every 1,000 because of the increased scale. Or maybe it's become slow enough that two out of every 1,000 requests are served too slowly. There's nothing to roll back. I can't roll back the additional user traffic; that would be throwing away a bunch of users. So what should I do in that case?
Kristina Bennett:
Yeah, now we’ve got a problem. We’re starting to look at the fact that, as you said, our architecture itself is starting to hold us back from our ability to meet the SLO that we’ve set. It’s time to take a step back and look at what our goals are and what our options are, and really think about what kind of long term fixes are going to do something about this problem.
Liz Fong-Jones:
And maybe in the short term, I can buy another MySQL server, set up a read only replica, or otherwise try to mitigate and load shift some of this load. But I know that’s not going to last in the long term. So this is a situation where it’s tolerable, but it’s something that we know we need to fix.
Now, let's talk about the final situation: having an SLO that is so clearly out of whack with our users' expectations that we're aiming for one in a thousand failures, but actually one in 200 events is failing. That's not so good. In this case, if I had an error budget based alert, it would be going off all the time. So what do I do to make my pager stop ringing? What should I do?
Kristina Bennett:
Well, if the SLO is clearly out of sync with what you have right now, then what you have right now is an aspirational SLO. One thing you can do immediately to solve the pager problem is to look at what your actual achievable SLO is right now and make that your new current standard, so that you and your team can sustainably maintain the service where it is while you think about how to improve the situation. Setting off everyone's pager every five minutes for two weeks is not going to help. But once you've done that, it's time to have some really serious conversations about what your next steps are and how you're going to get there.
46:24
Liz Fong-Jones:
So what that means is that we need to talk about, for instance, prioritizing that major refactoring, or prioritizing working on version two of the service. Or, if we feel that the team is just not devoting enough energy towards the production sustainability of the service, and we were operating the service on their behalf, we could potentially choose to hand the service back to them and say, you know what, this is broken, please keep both pieces; we think you'll do a better job of supporting it than we can. That makes a lot of sense, that setting the aspirational SLO and reducing our achievable SLO is the right way to reduce that pager noise in the short term.
Great. To summarize, what we've discovered is that if our SLO is in danger but not broken, the on-call just needs to mitigate and bring things back as quickly as they can. If the SLO is violated, there are probably some changes we need to make to improve the operation of the service. It's only when that happens month after month after month that we need to do something more drastic than just freezing feature releases.
And for those situations where the SLO is repeatedly broken, setting a separate aspirational versus achievable SLO is one step we can take, as well as pushing for the product and backend work needed to make sure that we can achieve the SLO. And finally, we can choose to hand back the service, or negotiate in other ways, if we're not able to meet the SLO that we've set.
So this was all grim and depressing. We were talking about sticks and punishments, but it turns out there’s a happier side to error budgets. Earlier, we promised that we’re going to tell you about how SLOs can really empower your teams to move so much faster. Let’s talk about what that looks like.
One way that we can think about the operation of our services is that we set a service level objective based off of user expectations, and we may or may not get lucky on any given month, but that the combination of designing our service to be appropriately reliable and doing some amount of manual reactive work to keep it that way keeps us within our ideal SLO range of what we hope to achieve with our service.
And that means, for instance, that if we have a lack of luck, it's okay for acceptable errors to happen; those acceptable errors simply take the place of us getting lucky and overshooting our SLO, and that's okay. We say that it's okay to serve, for instance, one in 10,000 requests as an error. That's acceptable, and it's the cost of running a service at scale.
Kristina Bennett:
Maybe we can make use of those acceptable errors.
Liz Fong-Jones:
Yeah, maybe if we don’t have a number of errors happen in a given month, if we do get lucky, instead we can choose to run more experiments than we otherwise would. Maybe we can shorten the amount of time between releases. Maybe we can do feature flag based experimentation knowing that there is plenty of room for errors because we happen to have plenty of surplus error budget and we still will be within SLO based off of the construction and maintenance of our service.
All of this is good and this enables us to move faster, but what about the inverse case of where we undershoot our SLO, where we fail to do well enough. If we have a design defect, it means that our service is not as reliable as we hoped. Then it means that we have to do some amount of manual work to make up for it.
And if we can’t do enough manual work to make up for it, then instead we wind up in a situation where no amount of manual work will keep us within our SLO, and in that case, we need to really stop and re-prioritize things according to those error budget policies that we talked about earlier.
So we might, for instance, choose to do a feature freeze and spend time working on constructively improving the reliability of our service and the design of our service, so that ideally in the next month, what we’ll wind up having is a situation where we’ve made the reliability improvement and, with a normal amount of work rather than heroism, we’re able to achieve the acceptable number of errors, or even have a little bit of room for experimentation leftover.
That's really part of where this school of thought of site reliability engineering and service level objectives comes from: service level objectives were invented as part of the Google site reliability engineering practice. SREs at Google are responsible not just for setting SLOs but also for making sure they're achievable, by engineering the right reliability into the design of our services and then doing the right amount of manual work to keep ourselves within the acceptable SLO, and by using that to inform our decisions about both what the acceptable SLO is and what we should work on next from an engineering perspective.
So what Kristina and I want to leave you with here is this idea that SLOs cannot be taken in isolation, that SLOs are a tool in combination with making engineered improvements to your service and having a sustainable operations model, that this can really improve the productivity of your teams.
Kristina Bennett:
And because of this, on the SRE team we really believe that even a very simple SLO is better for you than no SLO. Having even a simple SLO that sets a target, and that helps you define and remember what matters most to your users and therefore to you, really helps you remain focused and understand more quantitatively how your performance is doing relative to where you would like it to be. It gives you something to measure and track.
Liz Fong-Jones:
And if you're an existing Honeycomb Enterprise customer, you already have access to our SLO feature, so please do make use of it. If you're already ingesting the telemetry from your users' critical user journey events into Honeycomb, we encourage you to track this view in aggregate, even if it's imperfect, because you can always improve your SLO as you discover things you want to work on. But as Kristina said, what you don't measure really will hurt you.
Kristina Bennett:
And as Liz just said, this is always an iterative process. Even if today you have the perfect SLO, you will still need to continue to improve it as both your business and your users evolve. So start with that simple SLO and just see how it works. You can start there and iterate on your metric: try to find a metric that is more representative of your users or more efficient to calculate. Then start to add better alerts for your different burn levels, and better policies for how you are going to respond when those alerts fire.
53:28
Liz Fong-Jones:
Yes, and this is definitely another area where, once you are confident in your service level objective and its quality, please do turn on the burn alert feature that we've built into Honeycomb SLO, because that'll let you know when you have less than four hours, or less than 24 hours, of error budget remaining based off of your past behavior. That'll enable you to have a higher signal-to-noise ratio in your alerts. But you also need those error budget policies. You need to decide in advance what your organization is going to do if that error budget is breached.
Kristina Bennett:
And remember that you really can do this. It's an iterative and incremental improvement all the time. You can start with something very achievable, and then from there start to understand how it relates to your feature velocity and your reliability levels, and how you can make it compatible with your users' needs and what is practical for your company and your team.
Liz Fong-Jones:
It’s definitely a really important SRE practice to have SLOs, but you don’t have to be an SRE to have SLOs. SLOs can benefit you if you want to understand what users are experiencing with your service and can benefit you if you want to move as fast as you can while maintaining an agreed upon level of reliability.
Kristina Bennett:
And while we're doing our best to give you a great overview here of how SLOs and error budgets can work for you, it's a complex topic, and we strongly recommend that you seek out more resources on it. Here are some books that can help, and the first two of them, Site Reliability Engineering and The Site Reliability Workbook, are available for online reading at google.com/sre.
Liz Fong-Jones:
And if you want to purchase a copy, they're also available from O'Reilly, the publisher. So please do have a look at those. In closing, I wanted to thank you very much for joining us, Kristina, and thank you very much to those of you who are watching on the webcast. This is part three of a three-part series. If you missed parts one and two, they're available for on-demand viewing. And as always, may the queries flow and the pagers stay silent. Thank you very much.