SLO Theory: Why the Business Needs SLOs
Danyel Fisher, Principal Design Researcher at Honeycomb, and Nathen Harvey, Developer Advocate at Google, speak in a three-part webcast series where they explain the value of Service Level Objectives (SLOs) for production systems and share their thinking about the importance of creating SLOs for your business.
In this first part of the series, Danyel and Nathen go over:
- What SLOs are exactly, and why you should care.
- What other “service level” terms mean, like SLIs and SLAs.
- The theory of SLOs and why your business needs them.
- Defining the right SLOs for your business and tracking/iterating on them over time.
- How Honeycomb uses SLOs and how we’ve designed them to work.
- The interrelationship between SLOs and observability.
Watch the full video below!
Transcript
Danyel Fisher [Principal Design Researcher | Honeycomb]:
Okay. Hello and welcome to our webcast on Production SLOs: Success Defined. This is a three-part webcast series where we’re going to explain the value of service level objectives for production systems. The theme of this series is Success Defined. Our goal is to take you on a journey and share our thinking about the importance of creating SLOs for your business. We want to help you figure out the best SLOs and go about defining them and tracking and iterating on them over time. SLOs are truly unique because they bring a number of different stakeholders together to agree on the important things to measure for your app or service. So let’s dive into the first part of this series.
In this first webcast, we’re going to talk about SLO theory, why your business needs SLOs. In this talk, we’re going to talk about what SLOs are and why you should care, we’re going to define some of the terms that you’ve heard like SLIs and SLOs, and talk about how to pick them, we’re going to argue that SLOs are a critical part of SRE practice, we’re going to talk about how Honeycomb uses SLOs and how we’ve designed them to work. And last, we’re going to talk about the interrelationship between SLOs and observability.
I’m Danyel Fisher, I’m the Principal Design Researcher from Honeycomb.
Nathen Harvey [Developer Advocate | Google]:
And hi, I’m Nathen Harvey. I’m a Developer Advocate at Google.
Danyel Fisher:
Together, we’re going to be talking about some of our work.
SLOs are a data-driven way to measure and communicate how production is performing. We’re basing those on measures that your customers care about because we want to describe how broken things are and to be able to budget the remediation. What I mean by all that really is that we’re trying to find a language that engineers and business stakeholders can share.
In my personal experience, I’ve worked with managers who are very concerned about reliability, but perhaps less eager to bring out new features. After all, they wanted to make sure that the system stayed up at all times and got very concerned about any alerts. And I’ve had the reverse, times when upper management was pushing for new technologies to come out while engineering was saying, “Hey guys, slow down a moment. We’re really trying to get things stable.” What we’d like to do is find a way that we can find a balance between how many noisy alerts we’re generating and that we can help people focus on where their effort is going to make the best success in order to make their systems more reliable and more usable. I’m going to turn it over to Nathen to explain a little more.
Nathen Harvey:
Yeah, Danyel. I love that way that you’ve set us up here, and I really think it comes down to this idea of incentives. How are we incentivizing the work that each part of our team is doing? In a typical sort of more traditional engineering organization, we might turn to developers and say, “Your job is agility, your job is to build and ship features as fast as you possibly can,” and then we might at the very same time, turn to the operators that are responsible for maintaining those systems and keeping them up and running and say to them, “Your job is system stability. Whatever you do, do not let the system fail. Do not let the system go down.” And when we say these things to each of those groups, we’re really setting up some real tension and friction between the groups. So we come back to this question of how can we actually incentivize reliability? How can we get these two teams working together towards the same end goal?
04:05
Nathen Harvey:
And like you mentioned, sometimes you even have folks in the business who are very concerned about the reliability of the system; they want it available all the time. Again, that can create some tension with the question of whether we should be shipping new features out to our customers. After all, it is what our customers demand and want. At the same time, they demand and want a reliable system that they can grow to depend on. So we really want to incentivize this reliability, but we don’t want to over-correct too much and freeze all feature development, and we do have to be realistic as well.
And one of the things that we know when we think about being realistic is, what is a realistic goal? And Ben Treynor Sloss, who helped found site reliability engineering within Google, is known for this quote, “100% is the wrong reliability target for basically everything.” It doesn’t matter what you’re building, we know that getting to 100% is probably not the correct reliability target that you should be shooting for. Danyel, what are some of the reliability targets that you have within Honeycomb?
Danyel Fisher:
That’s a great question. We’ve been thinking about that question a lot. So Honeycomb collects user data, for example, and it’s really important to us to not lose user data. If we were coming to anything that was close to 100%, we’d say that pretty much all user data should come in. On the other hand, people are used to websites that might have at least a little bit of flakiness or a little bit of delay once in a while, the internet is a difficult place, so our UI should almost always work or very often work. And behind that, we have a query engine that’s connected to a series of large and fast databases as well. They should almost always deliver a response. Once in a while, you’re going to have to wait for a query response.
Now, I’ve used terms like almost always versus sometimes, because I’m trying to get at this idea of not quite 100%. I’ll try and quantify those in a little bit.
Nathen Harvey:
I think this is also a really great example of how you can have these conversations with the engineering organization, with the operators and with the business side of the organization to have this principled way to agree on what is the desired reliability of a service. And in fact, having those discussions is fundamental to the practice of SRE or site reliability engineering. So one of the things that we have, or one of the practices that we utilize within SRE, is this idea of an error budget. And an error budget is exactly that, it’s a way to talk about the desired reliability of a service. It’s also something that can be measured over time.
So an error budget is an acceptable level of unreliability. If you think about 100% as being the top level of reliability, we recognize that we cannot get to 100% and in fact, trying to get there is probably a fool’s errand. Our customers don’t demand it and the engineering and other resources that we would have to put in to achieve that goal simply aren’t worth the investment. So we have to come up with an acceptable level of unreliability. And then with the error budget concept, we use the delta between 100% and whatever is the acceptable level of reliability, that becomes our error budget and we can actually allocate that budget.
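A quick back-of-the-envelope sketch of that allocation idea, in Python. The target and traffic numbers here are made up for illustration, not figures from the webcast:

```python
# Illustrative numbers only: an error budget expressed in events rather than time.

slo_target = 0.999              # say, 99.9% of requests should be good
expected_requests = 10_000_000  # requests expected over the SLO window

error_budget_fraction = 1 - slo_target            # 0.1% of traffic may fail
error_budget_events = error_budget_fraction * expected_requests

print(f"Error budget: {error_budget_fraction:.3%} of requests "
      f"(~{error_budget_events:,.0f} failed requests this window)")
# Error budget: 0.100% of requests (~10,000 failed requests this window)
```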
And our goal as engineers is to measure those objectives. We want to meet those targets, and maybe be slightly over them, but we certainly don’t want to go too far beyond our actual target. When we do, we actually start to reset the expectations of our customers. And then when we deliver something that we’ve decided is within the bounds of our objectives, if we’ve already reset our customers’ expectations to expect more, they may get disappointed.
08:22
Danyel Fisher:
Nathen, I’m going to pause you a moment there and say this is really unintuitive. At least it was when I was starting this off, the idea that I want to be building my system to sometimes disappoint people.
Nathen Harvey:
Yeah, that’s… I love that you put it that way. And Danyel, when I think about it, I always go back to the prototypical engineer, and one thing I know about every engineer is we are lazy people. We want to do just enough work to get the job accomplished. We want to write only as many lines of code as we need to. We want to put in the minimum effort to get the maximum return. And when you start thinking about your systems, this is also a way to encourage us to be lazy. We want to make sure that a typical user is happy with our service, but we’re not trying to really blow out their expectations and take them above and beyond. We have to agree what is reasonable, what is going to keep our customers happy, keep them using our service, and keep them coming back. And let’s make sure that we achieve that. When we overachieve, we do reset those customer expectations and then that really puts some constraints on us on the other side.
Danyel Fisher:
So there’s a trade-off where if I’m doing too well, then that forces me to slow down my velocity and stops me from delighting my users in other ways?
Nathen Harvey:
That’s exactly right. Yep. Yeah. But let’s look at what does this actually look like in practice. As I mentioned, we’re engineers, I want to talk about numbers. I don’t want to talk about sometimes and almost always and never. Let’s put that down to real numbers. And in fact, when we start asking these questions of what does it mean for a system to be working, how many nines do we want for our reliability? So as an example, you’ll often hear us talking in terms of three nines, four nines, five nines, two-and-a-half nines. What we’re really talking about there is the percentage of reliability. Maybe that reliability is translated as availability, so how often is the site available? And if we say that it’s available 99% of the time, that means we can be down for up to 7.3 hours across a month. So it all comes down to simple math, but it is good to understand the math so that you can set reasonable expectations and reasonable objectives for your services.
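For reference, here is a minimal sketch of that “nines” math. It assumes an average month of roughly 730 hours, which is where figures like 7.3 hours come from:

```python
# A minimal sketch of the "nines" math discussed above.
# Assumes an average month of ~730 hours (365.25 days / 12 months * 24 hours).

HOURS_PER_MONTH = 365.25 * 24 / 12  # ~730.5 hours

for availability in (0.99, 0.999, 0.9999, 0.99999):
    allowed_hours = (1 - availability) * HOURS_PER_MONTH
    if allowed_hours >= 1:
        budget = f"{allowed_hours:.1f} hours"
    elif allowed_hours * 60 >= 1:
        budget = f"{allowed_hours * 60:.1f} minutes"
    else:
        budget = f"{allowed_hours * 3600:.0f} seconds"
    print(f"{availability:.3%} available -> {budget} of downtime per month")

# 99.000% available -> 7.3 hours of downtime per month
# 99.900% available -> 43.8 minutes of downtime per month
# 99.990% available -> 4.4 minutes of downtime per month
# 99.999% available -> 26 seconds of downtime per month
```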
Danyel Fisher:
It’s really remarkable to watch how quickly each nine drops off, which I guess I mathematically knew, but seeing the leap and picturing my response team going from 43 minutes a month, okay, that’s something that I can sort of picture being on top of, to four nines is 4.3 minutes. That’s a terrifying number.
Nathen Harvey:
Right. And think about that five nines number. By the time your system has alerted you to an outage, before you can get your hands on a keyboard, you’ve blown through that 26 seconds. But this also comes to the question of what does it mean for a system to be working? And so we’ll talk soon about how to set appropriate SLOs and how to actually measure those things.
11:33
Nathen Harvey:
But one of the things that’s really important about the error budget is that we’re going to use it to determine what engineering work we should prioritize. When you start off on an iteration or a sprint, or however you organize your work, you may have this question all the time: should I work on features that are going to delight our customers, or should I focus my engineering effort on things that make the system more reliable? The answer is likely yes, but what happens when you’ve blown through your error budget or completely exhausted it? You have to have some consequences that you’ve agreed on for how you’re going to shift your behavior.
So in the most drastic of consequences, when you’ve blown through your error budget and have no budget remaining, what you do as an engineering organization is stop all feature releases. No new features. We’re only going to focus on reliability. I mentioned that’s the most drastic because, well, listen to it, it sounds pretty drastic. We’ve had an outage, therefore no new features for another month. That’s pretty drastic.
What are some other things that we could do? Some things that we could do might include items like prioritizing the postmortem items. So the last time we had an incident, we did an investigation, we found some actions that we wanted to take to help improve the system and we haven’t prioritized those yet. We’ve now blown through our error budget, so let’s put some engineering effort behind those. Maybe automating deployment pipelines. You’ll notice there’s an asterisk there. If your deployment pipeline is kind of bad, don’t automate it. All you’ll do is make the bad go faster and more consistently. Anytime you pick up an automation project, that’s a good time to reevaluate what are the goals of this system or this process. Let’s make sure that we’re automating the right things. Of course, you can also use this time to help improve your telemetry and observability for your applications. That’s a great place where Honeycomb can come in and help out as well.
And then the last item that I have on the list here is requiring SRE consultation. Perhaps when you’ve blown through your error budget, what you have to do as a development team is to invite an SRE into your daily standup. That SRE might start asking questions like, “Oh, I see you’re calling out to a third party service here. What’s your retry logic? How do you back off if that third party service seems to be overloaded? What are you going to do when they have an outage?” Just asking these questions to help build more operable applications can really help those development teams move forward. Now… Sorry, go ahead, Danyel.
Danyel Fisher:
As you’re saying this, I’m thinking about a timeline. Honeycomb had a recent outage and it was sitting on one of our backend systems that’s meant to take in user data, it’s the core system that brings in every user event and stores it. It’s interesting to look through this because I think we actually did a pretty good job of stepping through every one of these, but sometimes with an ironic twist. What I mean by that is every incident really is always a story of multiple failures simultaneously. In this one, we were attempting to observe the telemetry and observability of the build process itself, so we built a wrapper around our build process that would emit events when different stages of the build process failed or worked.
Unfortunately, it turned out that there was a small bug in there that swallowed error codes. Now that wouldn’t have been too huge a problem, except our system that deployed the code also had a small bug in there that didn’t notice if it was deploying empty code, which also wouldn’t have been a problem if someone hadn’t checked in code that didn’t build. The combination of these three bugs meant that somebody checked in code that didn’t build, the telemetry system very happily reported that as a successful build, and the deployment pipeline very happily shipped it out to our customers. And because we had a fully automated deployment pipeline, it copied the null program onto every server.
The good news is we’re very good at rollback. The bad news is that it took us a good five to six minutes to find out that this had happened. And it’s definitely true that after that we did freeze feature releases while we slowed down and took another look at how we were building our deployment pipeline, how we were monitoring this and tried to make sure that each and every one of those different failures was checking in so that the next time those… None of them would be able to take us down, much less all of them.
16:17
Nathen Harvey:
That’s a great story. I’m glad that you’ve lived through that, but it does certainly highlight this idea that, as you mentioned, when something goes wrong, it’s this confluence of failures. It’s not typically a single thing that failed that brought the system to its knees or caused an incident or required human intervention. But when we talk about an error budget and some of the consequences that you might face, it’s not always bad, especially from a product development point of view. If what I want to do is ship features, we can also use the error budget to allow us to do the things that we need to in the system.
So yes, when we’ve overspent our error budget, there are consequences, but while we have error budget remaining, we have this other question, what should we spend our error budget on? In fact, we know that when there’s too much remaining error budget, that’s an indicator from the system that we probably aren’t moving fast enough. And the challenge or the downside of not moving fast enough is you can’t put new features in front of your customers and learn from them, learn from that customer behavior as fast as you might like. So with this error budget in hand, we can actually accommodate a bunch of things, releasing new features of course, but also expected system changes.
Think about that last system update that you had to make, or the patching that you have to do across all of your systems; it may have been hard to prioritize that work. You can use the error budget as a guide to say, “Hey, we have error budget remaining, let’s go ahead and take on that system update or that system maintenance that we have to do.” You can also use that error budget to cover inevitable failures in hardware, networks, et cetera, and you can use it for planned downtime and even for risky experiments.
One of the things that I love about this idea of risky experiments is when you first set your goals and objectives, Danyel, you even talked about this… this idea that sometimes you’ll have to hit refresh on the page in order to see the data within the application. Well, how do we quantify sometimes, especially when we ask the question from a user’s perspective: what’s an acceptable level of sometimes? A risky experiment may allow us to play with those numbers. We could potentially introduce some latency or something that would cause more users to have to refresh the page, just so that we could test where user satisfaction really starts to fall off, to make sure that we are meeting that objective. So we can use the error budget to help fine-tune the goals that we’re setting as well.
So let’s talk about the actual process of creating an error budget. We’re going to see that an error budget is made up of a couple of different things; we’ll start with some of our implementation mechanics. So the first thing there is that we want to evaluate the performance of our objective over a time window. That time window might be something like 28 days, which is important because we want to always be sort of looking back on how we are performing against the objective. And remember that whatever we have left, that’s going to be driving the prioritization of our engineering effort.
19:43
So now I want to introduce a new term to you, the SLI, or service level indicator. Now at its heart, the service level indicator, an SLI, is simply a metric, but not every metric is an SLI. An SLI is a quantifiable measure of service reliability. It is important that we think about these SLIs from a customer’s perspective.
So as an example, a good metric that you might know about your system is something like the CPU utilization. I’ll be frank, none of your customers care how utilized or under-utilized your CPUs are. That’s simply not something that they care about. So when we talk about choosing a good SLI, we really need to think about what is an indicator that our customers will recognize or that they will feel that will tell them what’s good or bad about the system.
Now, there’s also this question of what does it mean to be good? Think about any service that you use: what does it mean for that service to be working? And as soon as you start thinking about that, you’re going to come up with so many different answers. So we like to really think about what are some of the events that indicate whether or not the system is working. As a very simple example, we might talk about the availability of a page. Now when you think about a page, is that page up or down? There are so many assets and artifacts that create that page that we don’t really know; you can’t simply say it’s up or it’s down. What you want to do is measure all of the requests coming to that page, and understand that all of the valid requests are the denominator and the good requests are the numerator that we divide by it. And that will give us a percentage to help us understand what our SLI is.
So it looks like this. We count all of the valid events, all of the requests to that page, all of them that have a good response go on the top of this equation, and then we multiply that by 100 and we end up with, I don’t know, some percentage, maybe that’s 99%, maybe that’s five nines. Who knows? But then… Sorry, go ahead, please.
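To make that calculation concrete, here is a minimal sketch in Python. The event fields and helper name are illustrative assumptions, not any particular vendor’s implementation:

```python
# A minimal sketch of the SLI calculation described above:
# SLI = (good events / valid events) * 100.
# The event fields ("status_code", "duration_ms") are illustrative.

def sli_percentage(events, is_valid, is_good):
    """Compute an SLI over a list of event dicts."""
    valid = [e for e in events if is_valid(e)]
    if not valid:
        return None  # no valid events in this window
    good = [e for e in valid if is_good(e)]
    return 100.0 * len(good) / len(valid)

# Example: an availability SLI for a web endpoint.
events = [
    {"status_code": 200, "duration_ms": 120},
    {"status_code": 200, "duration_ms": 740},
    {"status_code": 503, "duration_ms": 30},
]

availability = sli_percentage(
    events,
    is_valid=lambda e: True,                   # count every request
    is_good=lambda e: e["status_code"] < 500,  # server errors are "bad"
)
print(f"Availability SLI: {availability:.1f}%")  # 66.7% in this toy example
```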
Danyel Fisher:
So this is fantastic because in Honeycomb language, we, in fact, think of the world as a series of events coming in.
Nathen Harvey:
Indeed.
Danyel Fisher:
But for any given dataset, I might have both, say, web events and also database events, and also the Kafka events that communicated between the two. So I guess that’s what we mean by the valid events: choosing the subset that I care about.
22:27
Nathen Harvey:
That’s exactly right, you’ll choose the subset that you actually care about for each one of your service level indicators.
And then once you have those, you can set some goals around them. So as a quick example here, let’s say I have a web service or an application that’s on the web, and I’ve worked with my business stakeholders and we’ve determined that getting to the user profile is super important for our users, and we want to measure the reliability of that in a number of different ways. First, is the user profile page available? But being available, as you know, simply isn’t enough. We also want to look at: is it fast? So we’ve decided that it should be available and it should be fast. What does that actually look like in terms of numbers?
So now we can set an objective for each one of those. We might say that 99.95%, or three and a half nines, of all of the requests across the previous 28 days should be successful. And when it comes to latency, we might say that 90% of the requests across those same previous 28 days should be served in less than 500 milliseconds.
Now, as long as we’re meeting these two criteria, you can say that we’re within our error budget or that we’re meeting our service level objectives. And when we stop meeting those, then we’re outside of our error budget or we’ve overspent our error budget, and that’s where we’re going to make the decision what should we prioritize from an engineering perspective?
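As a rough sketch of how those two objectives might be checked over a rolling 28-day window. The thresholds come from the example above; the field names and event shape are assumptions for illustration:

```python
# Hedged sketch: checking the two example objectives over a 28-day window.

from datetime import datetime, timedelta

WINDOW = timedelta(days=28)
now = datetime.utcnow()

def within_window(event):
    return now - event["timestamp"] <= WINDOW

def evaluate(events):
    """Evaluate the availability and latency objectives over the last 28 days."""
    recent = [e for e in events if within_window(e)]
    if not recent:
        return None
    total = len(recent)

    successful = sum(1 for e in recent if e["status_code"] < 500)
    fast = sum(1 for e in recent if e["duration_ms"] < 500)

    return {
        "availability_met": successful / total >= 0.9995,  # 99.95% successful
        "latency_met": fast / total >= 0.90,               # 90% under 500 ms
    }
```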
Danyel Fisher:
So I have a pretty intuitive sense of what it means to be 99.5% successful. Can you tell me more about how you would decide whether you wanted a 28-day window versus a “since the beginning of time” window?
Nathen Harvey:
Yeah, that’s a great question. So first, I don’t think you ever want a “from the beginning of time” window, because you’re never going to reset that. And so the minute you drop below that number, sure, you’re going to continue to move forward, but that may have a big impact on how you prioritize work. I recommend that when you’re just getting started, something like a 28-day window makes a reasonable place to start. 28 days, in that it’s not a calendar month, where you might have some seasonality, weekends as an example. I don’t know how your service is used, but you do, so does seasonality matter even across the week? Do you have higher traffic on the weekends than you do on weekdays? So if you use something like 28 days, you have consistent buckets that you’re always looking at. Whereas if you looked at a month, obviously some months are shorter and some months are longer, so that introduces some differences as you look back over time.
The other thing that I will say is important is that you can look at the existing data that you have, probably plot that data back across 28 days or across some period of time, and understand where you are currently performing. Use that to help set those objectives. And then the final thing I’ll say about how far back you should look: I think it’s important to remember that you’re going to use this information to help drive the priority of the work that you’re doing. This may, in fact, drive whether or not you wake someone up at night through some alert, and it may certainly help you decide what engineering work you’re going to prioritize, so you don’t want it to be too short.
You don’t want to look back, say, over the last hour, because if you only look back an hour or so, that’s an indication that you’re going to be switching your engineering priority potentially every hour. That seems like an unreasonably short amount of time. So you may let something like how frequently you’re running your iterations or your sprints determine how far back you look. But as a general rule, I would say start with 28 days. Learn from the system and then adjust as necessary.
Danyel Fisher:
Okay, that’s fantastic.
26:42
Nathen Harvey:
All right. And the most important thing to remember is that we’re doing this because we are trying to make our customers happy. We really have to take a look at these service level indicators and service level objectives from the customer’s point of view. Not from our system’s point of view, not from an engineering point of view, but it is all about how do we keep our users, the users of our service, happy and coming back to our service and hopefully growing with us.
Danyel Fisher:
So we’ve heard a couple of different terms this session that I’d like to go back and make sure that we’re all feeling comfortable with. The term that’s probably most externally familiar is the idea of a service level agreement. Those are business agreements with your customers and they often specify contractual terms. They tend to be a little bit more abstracted from engineering. The part that we care about in this conversation is the service level objectives. A service level objective is, as we just said, a measurable characteristic of an SLA. It’s that thing that we can describe as being 99% reliable over the last 28 days; a service level agreement might have many different SLOs embedded in it. And then, as Nathen was just explaining, we want service level indicators. Those are the metrics that we actually use to evaluate the SLO. An indicator tells you whether an event is valid, and if so, whether it’s good or not.
So let’s then come back to what we were saying earlier: Honeycomb doesn’t really want to lose customer data, and we’ve put that at four and a half nines. So API calls should be processed without error in less than a hundred milliseconds. When I said the app works, press refresh if it doesn’t, what we meant by that is we use three nines on that. One in a thousand times that people try to load a webpage, it’s okay for them to have to press refresh or to have to wait more than a second to get the page back. And the database is at 99% for a ten-second latency, so we’re pretty tolerant of a little bit of waiting after you press the run query button to get a response.
29:04
I’d like to show a little bit about how SLOs end up being implemented in the Honeycomb system. What you’re seeing here is the UI for our SLO feature, and we’re looking specifically at one failure. In fact, we’ve got a description of it on our blog, a time when our shepherd load balancer failed for us. You can see that, at the moment I grabbed the screenshot, we’re running at negative 12.6%. So in the top left corner, we’re looking at our budget: how much of the error budget is remaining over those last 30 days? And what we can see is that for the first four or five days things were doing pretty well, and then sometime around November 5th we had a pretty substantial outage that collapsed the budget. And ever since then, we’ve been doing our best to sort of recover, but we’re not quite there yet.
To the right of that, we have our historical compliance chart that shows, for each day of the past 30, how often the SLO succeeded over the preceding 30 days. So up until November 5th, we were looking like a five nines service there. After November 5th, we dropped down to just below our four and a half nines state, and we’ve been floating around there since. You can also see in the top right corner that we have an exhaustion time warning. What we’ve done is we’ve estimated, and I’ll show you this in a little more detail in just a moment, how long it will be before you run out of budget. Since we’ve already gone out of budget, that’s already labeled as triggered. At the bottom, we have a heat map and a BubbleUp, which I’ll get into in a little bit more detail.
So as I said, the burndown chart shows the total events as of today: where did our budget go over those last 30 days? So what we’re seeing here again is this history of where the budget went and which days you burned it down.
I had talked about the idea of alarms, so let’s extend that concept just a little bit. What we do is we take the last little while and we use that to predict when you’re going to run out of budget. It’s the “if things go this way, how long will it be before I run out of budget?” question. And the reason for that is because we’d like to be able to modulate these sorts of warnings. We want there to be a difference between, “Well, in another 24 hours you’ll probably be burning out and in two hours… ” If you’re going to run out of budget in two hours, it’s probably time to declare an incident, get a team together, and get things fixed right now. Whereas if you’ve got 24 hours, you can probably look at that in the morning. In fact, at Honeycomb, we’re finding ourselves setting our 24-hour alerts so that they put out a Slack notification somewhere that people will see in the morning, while the two-hour alerts go directly to PagerDuty and wake up the on-call person.
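Here is a hedged sketch of that exhaustion-time idea: extrapolate from the recent burn rate and route the alert based on how much time is left. The 2-hour and 24-hour thresholds mirror the discussion; the function names and numbers are illustrative, not Honeycomb’s implementation.

```python
# Sketch: estimate when the error budget runs out from the recent burn rate.

def hours_until_exhaustion(budget_remaining, budget_spent_recently, lookback_hours):
    """Extrapolate linearly from how fast budget was spent over the lookback window."""
    if budget_spent_recently <= 0:
        return float("inf")  # not burning budget right now
    burn_rate_per_hour = budget_spent_recently / lookback_hours
    return budget_remaining / burn_rate_per_hour

def route_alert(hours_left):
    if hours_left <= 2:
        return "page the on-call engineer"   # e.g. via PagerDuty
    if hours_left <= 24:
        return "post a Slack notification"   # look at it in the morning
    return "no alert"

# Example: 40% of the budget left, 5% burned over the last 4 hours.
hours_left = hours_until_exhaustion(0.40, 0.05, lookback_hours=4)
print(f"~{hours_left:.0f}h of budget left -> {route_alert(hours_left)}")
# ~32h of budget left -> no alert
```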
32:21
Nathen Harvey:
I love that you’re doing that, Danyel, because I think that in our industry we have this tendency to over-alert, and that really does a lot of harm to the humans that are responsible for looking after these systems. And there were plenty of times when I, as a sysadmin, was woken up about a problem that did not need immediate attention. And as you know, when you get woken up in the middle of the night, it’s not that you can ack an alert and immediately fall back asleep; it interrupts your sleep cycle. It makes you a less productive employee the next day. And so I think that this is a really great use case for how we take that error budget and our service level objectives and allow that to drive how we treat the humans in this system as well.
Danyel Fisher:
Oh, that’s a really nice phrasing. I like that a lot, especially because when you come in the morning well-rested, you’re actually probably more likely to be able to remediate things as opposed to at two hours you’re going to just be trying to plug the dam in whatever way you possibly can.
Nathen Harvey:
That’s exactly right.
Danyel Fisher:
The other thing that Honeycomb is pretty proud of is our ability to help break down precisely what happened and look at it across many dimensions, because we pull in high-dimensional events. What I’d like to do is take a look, for example, at this heat map. I talked about how we were looking at this failure on November 5th and 6th. When I zoom in on that period of time, and you can see in the top right corner that we’re now looking at November 5th to 7th, you can see that the entire SLO budget burn happened over a series of four incidents: one at about midnight, another at about three in the morning, and then two more later. Between the four of them, they burned through all of our error budget. The yellow dots are the ones that show errors, while the blue dots are points that were successful.
On the bottom, what we’ve done is we’ve taken the specific events that failed and we’re comparing them to the events that succeeded, and seeing whether there are systematic differences between them. So it won’t surprise us to see, for example, that the yellow dots had an error status while the blue dots don’t. That’s the very definition of being errors. It’s a little more interesting to look at the right and see that our Elastic Load Balancer says that it was specifically 502 and 504 errors that our system was generating during that period.
We can look at the other dimensions and quickly see that it doesn’t look like it was specific traces, or a specific client, or specific backends, or specific user agents that were generating these errors. So we’re narrowing down pretty quickly on what caused this, ruling out some of the hypotheses, which will allow us to remediate this issue and get ourselves back up.
Overall then, I think we’ve had a chance to look at a number of different things about SLOs and SLIs. Establishing SLOs is a group effort because it involves understanding both the user needs and engineering’s abilities, and bringing in SREs. It therefore takes cross-functional teams: people who have an interest in understanding the business needs as well as the ops needs.
We’d encourage you to start with SLOs that measure reliability, and this is a way of adhering to SRE practices. As Nathen was saying, we want to be realistic and choose meaningful targets that can reflect the sort of levels that we can maintain. We should understand the consequences of what happens when you blow through your error budget and you should be able to figure out what you’re going to be doing next. And one of the things that we didn’t talk about as much here, but I think is really critical, is the idea that you’re going to be measuring and iterating on these. These aren’t something that you set once and then are etched in stone.
36:34
Nathen Harvey:
Yeah, I think that’s a really important point as well. As you get started here, you have to go into this with an intentional mindset, intent on learning. You’re going to learn more about working with other parts of your organization, you’re going to learn more about how your systems work, and you’re going to learn more about how your customers are using the systems that you’re building for them. I think it’s super important that you recognize that you can’t set the right SLOs the first time you try. You’re going to iterate. You’re going to learn over time.
Danyel Fisher:
To give a really clear example of that, actually, when Honeycomb set our first SLOs, we were triggering on “did it produce an error 500?” Because an error 500 is an error. Very quickly we realized that some of the error 500s we were sending were, oops, it’s our fault, we broke something. And a larger percentage were, “Hey, the user sent us invalid data and we wanted to let them know.” We now have a new field in our telemetry called “is it their fault?”, and “is it their fault” doesn’t count against our SLOs. Interestingly, our customer success team has picked up that “is it their fault” field, and they now use it for alerting, to know which customers to reach out to. So I actually feel like doing that initial investigation both empowered the engineers to describe the scope of what they felt they could work on, and it also empowered customer success to be more focused on helping make sure that our users are being successful with the product. It was a really interesting side effect.
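As a sketch of how a field like that could plug into the SLI calculation from earlier, the field name follows the transcript, while the event shape and helper are assumptions:

```python
# Illustrative only: excluding "their fault" errors from the SLI.
# These predicates could be passed to a helper like the earlier
# sli_percentage(events, is_valid, is_good) sketch.

def is_valid(event):
    # Requests that failed because the caller sent bad data don't count
    # against the SLO, but they can still be surfaced to customer success.
    return not event.get("is_their_fault", False)

def is_good(event):
    return event["status_code"] < 500
```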
With that, I think I’m pretty much where I wanted to be. Nathen, anything that you wanted to add?
Nathen Harvey:
No, I think this has been great. Thanks so much, Danyel. It’s been fun chatting with you today.
Danyel Fisher:
Nathen, thank you so much for joining us. I’m going to encourage all of you to come back for part two. You’re going to hear a different Nathan, Nathan LeClaire, a solutions engineer at Honeycomb. He’s going to actually walk through the process of building an SLO inside Honeycomb, building it against our tools, looking at data, and making decisions about how we can set those performance objectives in order to diagnose how things are working. With that, I’d like to thank Nathen for joining us and thank you for listening in.