How to Create Happy User Experiences by Leveraging SLOs: The ecobee Story
Summary:
Service-Level Objectives (SLOs) reduce alert noise and help your team refocus on creating more resilient services by aligning engineering and the business on the goals that matter. They’re an invaluable feature that enables teams to provide the best customer experiences. But if SLOs are so great, why aren’t more teams using them? Some fear it will take too long, while others are concerned about vendor lock-in. Pierre Tessier, Solutions Architect at Honeycomb, and Chuck Daminato, Staff SRE at ecobee, take the virtual webinar stage for a discussion on how to get started with SLOs. Chuck shares the process his team followed to get up and running with SLOs. They also debunk common myths to help you feel more confident in the process and excited about the outcome. If you’re interested in learning how SLOs can help your team create happy customers, this webinar is for you. Join Pierre and Chuck to learn how to:
- Get started with SLOs: 3 easy steps to follow
- Navigate the fears and myths to get buy-in throughout the organization
- Speed up problem-solving using SLOs
Transcript
Sheridan Gaenger [Senior Director of Revenue Marketing | Honeycomb]:
All right. And we’re live. Good morning. Welcome, everyone. We’re happy to welcome everyone to today’s webinar, How to Create Happy User Experiences by Leveraging SLOs: The ecobee Story. I’m looking at the clock. It’s about 10:00 a.m. Pacific. We’re going to give everybody a couple of minutes to get from virtual meeting to virtual meeting, and we’ll get started in about two minutes. Pierre and Chuck are excited. They’re waiting in the wings. We have a great discussion ahead. We have a great demo ahead. Everybody gear up for a fantastic 45 minutes we’ll probably take today, maybe a little bit more, but we understand if you have to jump off. I can see those participants climbing. Everybody is getting that second or third cup of coffee for their Wednesday. I’ve kept it at two today. Don’t need me jumping off the webinar walls today. All right. We’ll give it just another 45 seconds here.
Again, welcome, everyone. Good morning or good afternoon. We’re so excited to host today’s webinar, How to Create Happy User Experiences by Leveraging SLOs: The ecobee Story. Pierre and Chuck are about to take the stage. Before that, I want to cover a couple of housekeeping items before we get started. I just want to give everybody a couple of minutes to jump from one virtual meeting to the next and give time for a cup of coffee before we dive in. All right. I think it’s time to kick it off. 10:02. Welcome, everyone. I’m Sheridan Gaenger, Head of Revenue Marketing here at Honeycomb. This is week six for me. So happy to be here and happy to be the Emcee of today’s webinar, How to Create Happy User Experiences by Leveraging SLOs: The ecobee Story.
Today we have Pierre Tessier and Chuck Daminato. We’re very excited about this event. We want to make it as interactive as possible, but before we get going and before I introduce Pierre, just a couple of things. One, we are recording this session. You can see the little recording button in the top left. So should you have to jump or should you have a colleague who you think would benefit from this recording, we will make sure that we send this link to everybody post-webinar so that you can watch it on demand.
Second, we’re definitely taking questions throughout the webinar, in fact. We don’t want to just leave it to the end. So we encourage you to submit your questions via the chat or the Q and A box in the Zoom webinar platform, and I will be peeking at them. And Pierre and Chuck will be peeking at them. So please, please submit your questions. We want to hear from you. We want to be able to answer any questions you have. We’re here. We’re waiting. Please submit those. We’re also going to create a couple of polls this session. So be on the lookout. Again, we’re interested in creating the content and curtailing it just for you. Keep an eye out for those. We’ll launch those as soon as we get started.
I also wanted to introduce Kimberly from Breaking Barriers Captioning who is going to be providing live captions throughout the webinar. To follow along with the live captions, just click the CC or live transcript button and view the full transcript. We just posted the link in the Zoom chat, and you can see the stream in a separate browser tab. All right. I think that’s it for me.
I guess the last thing I will say is we are all at home. I’m joining you from Lafayette, California. We never know how our home internet will serve us. Should something happen and we get knocked off, please bear with us, and we’ll work quickly to get things back up and running. That’s it for me. Who is ready to learn about leveraging SLOs to create happy user experiences? I know I am. So let’s get started. I’m thrilled to introduce, as I mentioned, Chuck Daminato, Staff SRE at ecobee. He’s joined by my colleague and actually my ramp buddy, Pierre Tessier, our Team Lead Solutions Architect at Honeycomb. So, Pierre, I’m going to turn it over to you. As I said, I am going to be waiting in the wings for questions. Please submit those so I can pop in and share them with Pierre and Chuck.
4:43
Pierre Tessier [Team Lead Solutions Architect | Honeycomb]:
Awesome. Thank you, Sheridan, so much for that. We’re here to talk about SLOs, but I want to start off by talking about why we got here in the first place. It starts off with something: the dashboard conundrum. What happens when something goes off and we walk in and open up our laptops? Hey, I’ve got to look at this. And you look at a dashboard. What happens when you look at these dashboards? It starts off with: I’m going to be deploying a new service. Here are several charts to monitor this new service. Raise your hand if this was you before, creating that dashboard.
Now, keep your hand raised and only lower it if you don’t follow the next step, because the next step is: the service breaks and the dashboard doesn’t tell me why. And this happens a lot. Right? You build a dashboard. You put eight charts on it, but none of the eight charts solve the problem. So what do we do? The dashboard now has a few more charts to monitor the service. This is typical.
Step five. And this is the loop. This is the dashboard conundrum. This is where we get ourselves into this problem where we keep on iterating and getting this sea of dashboards. The charts, we don’t know what they’re for. If your hand is still raised and you know you’ve walked into this before, you’ve walked into the dashboard conundrum. We depend on our on-call engineers to wake up and just know what to do, and all they’re staring at is a bunch of meaningless charts. Somehow, they’re connected to each other. Maybe they’re on the wrong dashboard because somebody else built a better dashboard that nobody knew about?
We’re here to talk about SLOs. They can come to the rescue so we’re not always getting into the dashboard conundrum and not always fighting our way through a sea of meaningless things. SLOs really help us understand: we want to measure what matters to your customer. And your customer here does not necessarily have to be an end user. It could be another service consuming the service you’ve created. It could be another person internally. What matters to them when they consume your service?
And the next thing is, while we’re talking about the service, let’s focus on the service as a whole, not the individual symptoms, because every single chart in that dashboard was looking at a different symptom. That’s great, but if I punch in a bunch of symptoms into WebMD it’s always going to say the same results, that I’m dying in three weeks. Right? This is the nature of it. Symptoms don’t tell us the entire picture.
And really, what matters most to make this complete, to have SLOs be something that’s useful, is that the platform you’re using has to lead to actionable information from the SLOs. It can’t just tell me the status of what’s happening. It has to tell me why it’s happening. And when you put these together with SLOs, the person on call, when something happens, they go to their SLO screen and they can see why. You’re launching a new feature? You go to your SLO screen and you understand whether it’s too risky, whether you’re making your SLAs or not. You know why.
SLOs are there to keep your customers happy, but also to keep your staff happier. They’re the ones that have to wake up at 3:00 a.m. to figure it out, and you want to make sure they have the right tools. You also don’t want to wake them up at 3:00 a.m. if it’s something that could have been dealt with at 11:00 a.m. Right? And, again, SLOs will really help us get there. Next, I want to start thinking in terms of events. What about uptime? Is it the same at 3:00 a.m. as at 3:00 p.m.? Do you get the same traffic at those times? Do you have the same user loads at those times? Probably not. So why do we always measure time?
Because it’s not really time we want to think about here. Right? Do you want to be successful in minutes, or do you want to be successful in the things you’ve serviced? That’s what we want to think about here. We want to be thinking in terms of events. Every time an event comes into my platform, anytime a transaction comes in, I want to understand whether I serviced that transaction per my SLAs, yes or no. Regardless of what time of day it is, that’s what matters: the events coming in. That velocity is what matters. And if we think about it that way, we can start thinking of what the flow of an SLO would look like. The first thing is to qualify it. Does this event even qualify for us to care about it for this SLO? Does service equal “greeting”? Yes, it qualifies. So once it qualifies, then we can say, Let’s measure it.
10:00
What matters to our customer’s happiness? This is a really simple statement: duration under 250 milliseconds and status code is 200. Typical thing. We see this a lot in SLOs at Honeycomb, or something along these lines. The qualifications will be different, of course. You’re looking at latency and no errors. If the event passes, event count goes up by one. If it fails, event count goes up by one and bad event count goes up by one. Now I can compute my availability in terms of events. If I’m going to do three nines, that means for every 1,000 events, I can have a bad one and still be at my SLA target. Four nines would be one out of every 10,000, and so on. So we can use this to build out that availability graph. A lot of times, we talk about error budgets. We’re going to show it in a demo. We’ll focus a little bit more on that.
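As a rough illustration of that counting, here is a minimal sketch in Python. It is not how Honeycomb computes SLOs; the sample events, field names, and thresholds are invented for illustration.

```python
# Minimal sketch of event-based SLI math -- not Honeycomb's implementation.
# Sample events, field names, and thresholds are invented for illustration.

def qualifies(event):
    # Qualification: does this event count toward the SLO at all?
    return event.get("service") == "greeting"

def is_good(event):
    # Measurement: what barely keeps the customer happy?
    return event["duration_ms"] < 250 and event["status_code"] == 200

events = [
    {"service": "greeting", "duration_ms": 120, "status_code": 200},
    {"service": "greeting", "duration_ms": 480, "status_code": 200},  # too slow -> bad
    {"service": "billing",  "duration_ms": 90,  "status_code": 200},  # doesn't qualify
]

qualified = [e for e in events if qualifies(e)]
bad = [e for e in qualified if not is_good(e)]

target = 0.999  # "three nines": one bad event allowed per 1,000 qualified events
allowed_bad = (1 - target) * len(qualified)
availability = 1 - len(bad) / len(qualified)
print(f"availability={availability:.4f}, bad={len(bad)}, allowed={allowed_bad:.3f}")
```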
But SLOs show how much availability you have left before you’ve violated your agreements. Before we talk about that, I would like to talk about adoption at Honeycomb. It’s a feature we rolled out a little over a year ago. Let’s just start with this one right here. This is a nice, fat, fancy number, 124. What 124 here is, is the number of SLOs for one of Honeycomb’s largest teams. It’s representative of about 200 services. They’re onboarding other services as well. But this is a good kind of way to think about it. Yes, this is a large customer. Yes, they have a lot. This is absolutely kind of an outlier, if you will, because they’re so bought in and focused on it. It’s working very well for them.
They’re able to service their customers both internally and externally. Internally is important here. They have contracts between the service consumer and service producers inside of this organization. And these SLOs are there to uphold those contracts for them. And, of course, they eventually have contracts when their customers are facing stuff and those SLOs bubble up to the executive branch. Here is another different number. This one here says greater than 8.5. The real number was like 8 point something, six, seven, whatever, some weird whatever, but it’s more than 8.5. This is the average number of SLOs per enterprise team, excluding the ones I just talked about.
And this is also not bad. We have a lot of small customers, a lot of medium customers, and even some larger ones. And this product is just over a year old. A lot of you are on this call today, or this webinar, because you’re trying to determine how to use SLOs to do better. There’s still a learning curve, with people trying to understand how to best apply them. Especially lately, in the past three months, we’ve seen a significant uptick in SLOs. SLO query load on Honeycomb systems has caught up to the query load from triggers, and it’s surpassing it.
We’re seeing fewer people build triggers and more people build SLOs. They’re focusing less on symptoms and more on looking at the service as a whole. And this is what we want you to focus on when you think about building SLOs. It’s not about having 4,000 alerts. It’s about having a manageable handful of things that you can act upon. They will go off no matter what symptom, or combination of symptoms, is affecting the service. A good SLO will also look into the future. It will predict that, hey, if you keep up with this trend, this slow burn, you’re going to violate your agreement in 24 hours. So, hey, we’re going to throw a ticket in your ticket system so when the person wakes up tomorrow morning, they can pick it up and fix it.
Those are the types of things we want to do with SLOs. So with that, a lot of you have come here. You probably have questions. You’re probably not using SLOs yet. We’re going to launch a poll right here. And this poll is going to ask everybody what it is that’s holding you back from getting started with SLOs. We’ve got a few options on there. I encourage everybody to please go ahead and answer that poll. Provide me with some feedback, if you will. One option was “other.” If you don’t fit into the other three categories, go ahead and check “other.” Throw what it is into the chat, because we want to address the items inside of this poll. We want to make sure that people get what they need to get going.
I’m going to give people five more seconds or so. I see people clicking away. That’s okay. So let’s go ahead and tackle this one: “We will ask the wrong questions.” Okay. This happens. You don’t know what to ask. Chuck, I know you’ve gone over this exercise at ecobee. I usually say, What keeps your users happy? That’s the first thing you should be thinking about. What is it that makes users happy when they use your service? Is it clicking through and getting the experience? And what keeps them happy? Because if it’s slow, is that going to keep them happy? Probably not. So oftentimes, you know, response times and error rates are the ones that you want to focus on right away because those are what make your users cringe.
15:56
Now, engineers don’t always have the answer here. A lot of times, it’s project managers and project owners. They’re a really good place to start. They had an idea when this thing was created. They have an idea of what that experience should be like, and they should absolutely be part of this conversation. But the wrong question, what you’re asking is, Are my users happy? That’s the biggest question you want to ask. The consumers here are consumers of your service.
Let’s take a look at another myth that we might have here. It takes too long to get started with SLOs. How much time have you spent firefighting? I promise it’s a lot longer than getting started with SLOs. Are you sending data to Honeycomb? Because if you are, you’ve already got 85% of the battle ready to get an SLO, 95% of the battle ready. You’re there. You’re ready to get going. We’re really just 10 minutes away from getting you running, 10 minutes away from getting you an SLO that will look back at your past 30 days of data, evaluate it in real time to tell you if you’ve gotten what you need.
So for all of you who ask that question or answered it that it takes too long, there’s a lot of great tooling out there that helps you get started with getting telemetry into Honeycomb. OpenTelemetry is a great ecosystem. We’re going to talk more about that too. Getting started with OpenTelemetry or any of the popular SDKs is just a few minutes long. Once that data flows into Honeycomb, we’re ready to go. We’re ready to really get started to help you put those definitions in place.
We have one more here. This is about vendor lock-in. People are scared. They might be used to the old ways of how vendors, especially in this space, would collect your data and say, Hey, use my agent because my agent is the best. Then you’re locked into that agent. That’s cobwebs. We don’t do that at Honeycomb. We will provide you with open source SDKs. But more so, we embrace the SDKs that are part of the Cloud Native Computing Foundation, the CNCF.
OpenTelemetry is the second largest project in the CNCF, second only to Kubernetes. And this week is actually CloudNativeCon week with KubeCon. So, if you want to learn more, if you want to watch more webinars, I would certainly encourage you to go check out what they have going on during the European hours. This is a very active project endorsed by dozens, maybe hundreds of different companies that are all contributing to it. It’s supported by a wide community, and there is no vendor lock-in when you’re using the OpenTelemetry set of SDKs. We’re not the only vendor in the space. I like to think we do it the right way, but once you have the data, you are able to take it and bring it anywhere you want to go.
These are the things that help you get started. When we ask people these questions, why is it taking so long to get started? It’s easy to get going. There’s great tooling out there for you. I’m scanning over the chat here. I can see things like other business priorities, or we don’t have an SRE culture really in place. There’s an event going on in a couple of months, and we’re going to toot our own horn here. It’s called o11ycon and hnycon. It is an event that’s going to be taking place in early June. Probably a good spot to send your engineers to go participate and learn about some of the SRE best practices to get that culture instilled into your organization.
Coming back to other business priorities: I often ask, what is more important than being down? If you’re down and can’t figure it out, how much money does that cost, and what business opportunities supersede that? Getting your applications running right and stable, and, when something does go wrong, being able to repair it and come back to a normal state quickly, these are very important things that we should all be focused on as top priorities. Certainly things we want to encourage people to do to get there. Now, I am also going to say that SLOs are meant to be iterated. Certainly, having no SLO is worse than having an imperfect one. It’ll take time, but you’ll eventually get to the perfect SLO. And the perfect SLO might actually be more than one SLO to cover the right angles. What we want you to focus on is understanding what barely keeps your users happy, so that one or two errors are fine, but five in a row is probably not fine and you want to be on top of those.
Again, when we’re building these out, when we’re defining your SLOs, get started and get something out there. Be close. You don’t need to be perfect because close is a lot better than nothing, and it’s probably better than your 28 alerts on 28 symptoms because it was the 30th one that actually triggered it. I’m going to pause here. I’ve been talking a lot. I probably want to grab a glass of water, a small, quick, little drink. I want to pause here. I want to talk to Chuck here. Chuck, you’ve done this, starting from scratch, effectively, getting up and running, and getting Honeycomb in on the back end. Maybe you can talk to us about how you went through it to get this done.
22:03
Chuck Daminato [SRE Staff | ecobee]:
Absolutely. Hello, everybody. My name is Chuck, Staff SRE at ecobee. My area is primarily focused on authentication and authorization on our platforms. So the first step: always instrument your application or your service and get that data into Honeycomb. The easiest way, as Pierre mentioned, is using the OpenTelemetry libraries, which are available for a variety of different languages, frameworks, whatever you need. If you don’t find what you’re looking for, Honeycomb has SDKs, which they call Beelines, which you can plug into your app. Very similar to the OpenTelemetry framework. Worst-case scenario, say you’re working in a serverless environment like Cloudflare, you can always use the API. The folks at Cloudflare have actually even come up with something for their Cloudflare Workers. We can provide a link later if that’s necessary. Ultimately, just get your data into Honeycomb.
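For example, a minimal OpenTelemetry setup in Python might look like the sketch below. The OTLP endpoint, header name, and span attributes here are assumptions pulled together for illustration; check Honeycomb’s current documentation before copying them.

```python
# Sketch: instrumenting a service with OpenTelemetry and exporting to Honeycomb
# over OTLP. Endpoint, header, and attribute names are assumptions; verify
# against Honeycomb's documentation for your environment.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="api.honeycomb.io:443",               # assumed OTLP/gRPC endpoint
            headers={"x-honeycomb-team": "YOUR_API_KEY"},  # assumed auth header
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("auth-service")
with tracer.start_as_current_span("login") as span:
    # Send it all: status, latency metadata, user IDs, whatever you have.
    span.set_attribute("http.status_code", 200)
    span.set_attribute("app.user_id", "12345")
```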
One question I’m often asked is: What data should I send? Should I send the endpoint? The API responses? The latencies? Any metadata? And my answer is, Yes. Send it all. One of the benefits of using a service like Honeycomb is that it allows for unstructured data. You can send whatever you need, whatever you want, and then you can filter through it later. The next step after that is to ask yourself the question: What behavior does our app or service provide that impacts the user experience? This is like writing user stories if you’re used to Agile. Ask yourself: What would make my customer happy?
For us, as a user, I want to be able to log in quickly. If I give the correct username and password and multi-factor authentication, I should be able to log in. This should happen relatively quickly. It doesn’t have to be lightning-fast, but if I’m sitting there for a few seconds, I’m going to be concerned. Or, if after I’m logged in, if I want to do something with the ecobee app or service, can I get my thermostat settings? Can I ask Alexa to change my temperature? Can I look at my camera to see what’s going on in my living room? Whatever the case may be. You can try this yourself. As a user, what do I want? How do you want this to go?
A happy customer gets a healthy, non-error response from your services in a timely manner. This is what Pierre had with HTTP code 200 in 150 milliseconds, whatever the case may be. These parameters will vary based on the service. You’re the subject matter expert, but this is kind of the formula. Once you know the question you’re asking, well, this is the slightly, quote/unquote, complicated part. You know the questions you want to ask. Now, you’ve got all your data in Honeycomb. You just need to make a derived column. Basically, this means that you’re taking your data and popping it into a formula. Honeycomb provides language around this.
What you want is a true or false. If it’s true, it’s a happy event. If it’s false, it’s a sad event. A question I could ask is, can my user log in with the correct information within Y seconds? Or can my user see the settings within X seconds? If there’s an error, this is a violation. If not, we’re good. This is all you have to do. Once you have this derived column, which just gives you a true/false on your behavior, you literally press a button. Create SLO. Select your derived column. Honeycomb does the rest. Profit.
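To make that concrete, here is what the true/false logic behind such a derived column might look like, written as a Python predicate rather than Honeycomb’s derived-column syntax. The field names and thresholds are invented for illustration.

```python
# Sketch of the "happy event" logic behind a derived column, expressed as a
# Python predicate for illustration (Honeycomb expresses this in its own
# derived-column language). Field names and thresholds are invented.

def login_sli(event):
    """None = doesn't qualify for this SLO; True = happy event; False = sad event."""
    if event.get("endpoint") != "/login":
        return None  # not a login request, so it neither helps nor hurts this SLO
    ok_status = event.get("status_code", 500) < 400
    fast_enough = event.get("duration_ms", float("inf")) <= 2000  # "within Y seconds"
    return ok_status and fast_enough
```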
25:42
Pierre Tessier:
I love that. Hit a button, create the SLO, profit. Sit back and watch it work. People are probably asking, Okay, show me. This is great. You showed us a bunch of slides. Google is great for providing a great interface for slides, but let’s see the product. What do SLOs do for us? I’m going to dive into a real-world example showing Honeycomb SLOs used to keep Honeycomb running. This is from our DevOps environment. We have a simple SLO defined here. It literally says: if the request itself took under 1,100 milliseconds, we win. If it was greater than 1,100 milliseconds, that’s bad; you chew away at that availability budget.
Our target is 99.98, three nines eight, over a rolling seven-day period. So over seven days, for every 10,000 requests, we’re allowed two bad ones. If we used up those two bad requests per 10,000, this chart would display zero. It would show that you have no available budget remaining. A lot of people call this the error budget: the number of errors you’re allowed to produce, or the availability you have left to spend. You have a lot of that budget to burn away. We’re down a bit, and it’s pretty slim. This is a DevOps environment. Every so often, we burn it up and make it go down. You can see my line; it’s a pretty consistent trend falling over time.
Now, you heard me earlier talk about wanting to keep your engineers happy. You don’t want to wake them up if you don’t have to, because a lot of times, we put the best engineers on call. We put on call the ones that can solve the problems. Do you know what sucks? If you keep waking up an engineer at 3:00 in the morning, they get unhappy and they may want to leave, especially if they didn’t need to wake up at 3:00 a.m. So what we have here are exhaustion time alerts. Honeycomb will take the trend of a slow burn and forecast it out: 24 hours, four hours, eight hours, whatever is right for you in your world. This goes to Slack. This goes to PagerDuty.
If we detect that, at the current rate, you would burn through your budget in the next 24 hours, we’re going to put a message in Slack and see if somebody picks it up, or put it in a ticket of some kind, because we have an incident about to happen or we’re currently experiencing one. I have a compliance chart here. This is just showing me how I’m staying compliant against my target. Generally speaking, I’m always above that 99.98. Sometimes we dip this line a little bit below. Another aspect of this is for the engineering manager. If you’ve got a 30-day SLO: in this space, we’re always disrupting. We’re always adding new features. If you’re not the one disrupting, your competitor is going to disrupt you.
We’re always adding new features, and sometimes it’s risky. Sometimes when you add new stuff, it comes at the sacrifice of stability. As an engineering manager, if you’ve only got a couple of percent of your error budget remaining and this next new feature you’re about to deploy is risky, you’ve got some data you can use to say, Hey, can we pause that release for a couple of weeks just so we can shore up a few more stability things? I don’t want to risk violating our agreements. Or maybe you already violated your agreements and you’re about to roll out new features; your data says, Let’s pause this and focus on platform stability for the next sprint.
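Coming back to those exhaustion time alerts for a moment, the arithmetic behind them can be sketched roughly like this. The numbers and the 24-hour and 4-hour thresholds are illustrative only, not Honeycomb’s actual algorithm.

```python
# Rough sketch of exhaustion-time reasoning: project the recent burn rate
# forward and warn if the remaining budget would run out within the alert
# window. Numbers and thresholds are illustrative only.

def hours_until_exhaustion(budget_remaining, bad_events_last_hour):
    """Estimate hours left if the last hour's burn rate continues."""
    if bad_events_last_hour <= 0:
        return float("inf")  # not burning budget at all
    return budget_remaining / bad_events_last_hour

remaining = 120      # bad events this window can still absorb
recent_burn = 7      # bad events seen in the last hour
eta = hours_until_exhaustion(remaining, recent_burn)
if eta <= 4:
    print(f"~{eta:.0f}h of budget left -- page the on-call engineer now")
elif eta <= 24:
    print(f"~{eta:.0f}h of budget left -- file a ticket for the morning")
```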
Now, below this: if this were all we provided, it would be great, but the why is part of the tooling. Remember, you need actionable information from what you get. And this is where we get the why. Down here, I have a heatmap showing all the events over the last 24 hours. We use dark blues to show you where the events are. And then, because it’s an SLO screen, we add one more color to the heatmap: yellow. These yellow, or gold, marks are the things that failed my measurement. These are the things that are chewing away at my error budget, or my availability budget.
30:27
Right here, I can take all these yellow squares and look at them with this magic tool, as I call it, that we have at Honeycomb. It’s called BubbleUp. It will take all of what you’ve selected and tell you what’s different about it. And that’s what it’s telling me here. It’s telling me 92 percent of all those errors are coming from a single endpoint. That’s great. This would be the chart on your dashboard that shows errors per endpoint. Hopefully, someone is looking at that chart. They might not be, though. They may see errors overall. You might just see a blip on there, probably not enough to move the needle. But if you looked at it by endpoint, you’d be all over that. Naming in our world is the endpoint with the HTTP verb prefixed to it. So, GET endpoints. Let me hover over this. GET v2 tickets export in this case.
We also show the service name. Everybody looks at this, right? Well, we can clearly see which service specifically is suffering right now. It’s our gateway service. Ninety-two percent of errors are happening right there. At Honeycomb, we talk about high-cardinality fields: fields that have lots of unique values, fields that traditionally break your monitoring because you have too many tags and then the database goes bad. These are exactly the fields we encourage you to send to Honeycomb. Send in the IDs, because you never know when it’s just one user or a couple of users having a bad time. Maybe you know something about the users. Maybe they belong to the same ISP. We were working with schools, and certain schools routed through an ISP and couldn’t get all the data in, so they were having problems. Well, now we can see that with Honeycomb.
Here I can see, looking at user ID, that 49% of the errors are from the same user. That’s pretty telling. We can use this to drill down and learn more. I can use any of these charts and continue drilling, but I want to talk more about SLOs. I want to pull up a page first. Sorry I didn’t pull this up earlier. I’m going to start with this: the Honeycomb status page. Honeycomb.io, yeah, we’re operational. We’re great today, all green. Go. Right? Somebody asked a question: How are you doing, Honeycomb? We’re doing great. For whom does it suck? Well, it doesn’t suck for anybody. We’re doing great. Is that true?
Here is our SLO for a service we call Shepherd. It ingests data into Honeycomb. This is the biggest thing we do. We have to take in your data, service it, respond back very quickly, and get that data ready for you to query. So we have our most stringent SLO right here. It’s four nines over 30 days. Now, the actual SLO definition itself is kind of long. All of this is qualification. All of it. Every line, except for the last, is qualifying the event. We don’t care about a lot of these events for this SLO because they’re user-generated errors: you did it wrong, you probably knew you were doing it wrong, and those aren’t things we want to count in this scenario.
The very last line, however, says response status code equals 200. So, really, once we’ve disqualified a bunch of kinds of events, for everything else, whether or not we return a 200 to you, that’s what we care about. That’s what we’re measuring. Over 30 days, we have about 37% of our budget remaining. We’ve got time. We get a lot of events, well over 10,000 a second. So this is manageable. Now, my heatmap is really different. First off, my measurement here is all about errors. It has nothing to do with latency. We really just care about errors here. Well, it’s hard to render errors in a heatmap, so we pick duration. I can change this field to other things if I had to. Here, I’m rendering duration.
You can see I’ve got yellow all over the place. This is a really busy heatmap. To help with that, we draw these little sparklines on the side to help you understand where the majority of your content is. Naturally, people may be drawn to the top line here, but it turns out the top band only has a little bit of it versus what’s in the middle band right there. And the majority of my errors are happening right there, in the middle band. But, of course, I can come down here, and I can look at other aspects of this, and I can learn more about what is generating those errors.
35:30
Let’s start off with handler.route. Batch here is part of the baseline. It looks like 5% of the errors are on the batch endpoint, but it’s the opposite when I look at the individual event-receiving endpoint. Now, this is telling, because the majority of our customers do use the batch endpoint. So it looks like someone, or a few someones, is using that individual endpoint and getting errors from it. Same thing here with status code. I get it: we just care about things that are not 200, so this should be that, and that should be the remainder. Then we get into app error: request body that’s too large. 79% had this error. Okay. That’s cool. Dataset slug. Now, this is an attribute specific to our world. It is the name that people send data to. They have to give it a name. We call them datasets. The slug is the URL-ified version of that name.
Now, I looked at this earlier. These may contain nice things in there, but this is a super generic name of “website.” We can tie it back to the original team that’s using it as well, but I have a good indication here. Seventy-one percent of the errors are going to the one dataset called “website.” This is really telling to us. I’m learning a lot. I’m learning I’m having a customer who is not having a good experience. My service is all green. For whom does it suck? It sucks for the person running the website. That’s a problem. What we do at Honeycomb, inside the customer field team, is we look at this on a periodic basis, and we try to see if we can find customers that are not having a great time.
Our engineers use it to keep the systems up and running. We use it to keep our customers happy. We get real business value from the SLOs that our engineers created. Now, we reached out to this team. We were unable to get ahold of them. It’s a free team. They probably left something running. You know, they did something wrong and forgot about it. But we have all other kinds of information about that. Like, I know which SDK they’re using. They’re using our Ruby Beeline to send in the data. We even have which IP address this is coming from so we could eventually deny it if we had to.
But this is all great information: I’ve learned of a customer having a bad experience when we’re all green. And this is powerful. This is the part of SLOs that I love the most, because it doesn’t just benefit the SRE team or the developers. It benefits the business people as well. And here is another one. That one was about errors. This one is about errors and duration combined. So we’re kind of looking at this here. First off, if the user is from honeycomb.io, we don’t care. You don’t qualify. You’re probably doing something stupid anyway, right? So we’re not even going to count that. Everyone else, we want to count. We want to make sure you’re doing things that are allowed, and we want to measure the query. If the duration is under 250 milliseconds and you do not get an error, that is, the response code is less than 400, then everything is great. This is important to note here.
Response codes are integer numbers. The higher the number, the worse the thing is. So we like to use that as a threshold sometimes. Not just saying response code equals 200; saying less than 400 helps you get there. Here, we’re doing pretty well. I don’t really see errors in there. There are a couple of dots here. I don’t have to chase them, but this is going to try to tell me where the dots are. It turns out we only have a couple of users who are not having a really great time. And I could probably hover over each one of these. This line is 17. A couple more here. They’re all about the same thing. All this is telling us is that we had a couple of requests that went bad. If I came in here and saw one bar, one user having a really bad time, I would be concerned. That would tell me that they’re trying to use the platform and not getting a lot of value from it. Again, we get business value out of this, but engineering gets value too: oh, we know why it’s happening. It’s this SDK or endpoint or their usage of the platform. So these are really great ways to get to where you want to go.
I’m going to pause there. We’ve done a lot of stuff, showing you how SLOs work, showing you how Honeycomb uses SLOs, and, also, the benefits of them and really how to get started. If you have any questions, I’m going to encourage you to plug them into the Q and A panel. Or chat. I’m good either way. We’re going to try to tackle them and answer them for you. I do want to come back and remind everybody about the poll. One answer we got was “unsure what questions to ask.” One thing you see here, and might have seen as a consistent thread: latency and errors. That’s what your customers care about. How long did it take? And did you respond with an error code? They don’t care if your CPUs are hot. They don’t care about anything else. They care about the speed and if it’s successful. That’s what you should start with when you’re asking those questions.
I see another question here. I’m just going to repeat the question and the answer that we had. Thank you, Chuck, for going ahead and answering this, but I would like to let people know. So, is there any evidence that adopting SLOs correlates with higher customer satisfaction? Our way to answer this is in a roundabout fashion. It’s typically derived from better features, but a better way to measure this is: when you’re down, your customers are not happy. When you’re not down, your customers are. So if you can minimize your downtime, you can probably increase your customer satisfaction there.
We, at Honeycomb, have found incidents where we’ve looked at data, and on the business side, we have found customers not having a great time. Sometimes they didn’t even know they weren’t doing great. We’ve reached out to them. They have absolutely enjoyed that experience. And I can only think they are now more satisfied by knowing that we reached out to them to say, Hey, you’re not doing it right. Let us help you out. Another question here: How does setting SLOs eliminate alert alarms? And there’s more in there about how they can be configured. I’m going to take that. Triggers and alerts, we typically create a lot of them for symptoms, one for every single symptom you can think of. Then you will probably miss one because someone wrote a new feature.
SLOs are meant to focus on the core service itself. Sometimes a symptom can go off and it’s fine. Maybe your CPU is hot, something rebooted, something restarted, but it came back down and it’s normal again. You woke somebody up for no reason. When we talk about alert fatigue, it’s probably because you have symptom alarms going off all the time when one SLO alarm could have solved that for you. Chuck, I don’t know if you want to chime in on that one as well, particularly with how SLOs helped with the alert fatigue aspect. We know SLO usage is going up at a much more rapid pace than new trigger development.
Chuck Daminato:
One thing we’ve been able to do with SLOs is to tune down alarms. For example, we have a pretty quiet service where someone will request a piece of information and the backend serves that up. It occasionally takes a long period of time, so it can time out. During the day, in the general flow of things, this is just noise. At night, when things are quiet, one error can be less than 0.1% of our requests failing, and we get paged for that one error. And that makes no sense. So instead of having this really, really fine-tuned alert on this specific behavior, we use SLOs to say: Is this happening a lot? Is it happening over a broader period of time? Is it happening at such a rate that we should be taking a look at it? So we’re not getting woken up for an error or two here and there. We’re getting woken up if there are a lot of errors happening.
Pierre Tessier:
That’s awesome. Thanks, Chuck. I see another question, How can you learn more about SLOs in general? We do have some literature at Honeycomb. I’m probably going to screw up this URL, but I’m going to try to do it by heart. Www.Honeycomb.io/production–there it is. We’re going to leave this up. This is the final slide. I should have shown this as well, but it’s a great start to learn about the SLO theory in super great detail. We even have more demos you can see on here as well. I encourage you to check this great resource out. I will also say, again, I keep on touting this: Honeycomb will have sessions that discuss SLOs, and, particularly, we’ll have more customers talking about their success using SLOs.
I see another question here about when the budget goes back up. All the charts are, you know, going down to the right. That’s right, because this is how much availability you have. The only thing that allows the chart to go back up is if you had a sharp decrease in the past and time rolled that decrease away from you. So you want to look at it like this: I’ve got a bucket, a moving time bucket of availability, of things that are allowed to go bad; and as time moves forward, events fall out of the bucket, whether they’re bad or good, but bad events will take away from that bucket no matter what while they’re in it. The only thing that will allow the line to go back up is really sharp decreases rolling out of the window over time.
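Here is a tiny sketch of that rolling-window behavior, with made-up numbers, just to show why the line only recovers when old bad events age out of the window.

```python
# Sketch of a rolling-window error budget: bad events only stop counting
# against you once they age out of the window. Window size and allowance are
# made-up numbers for illustration.
from collections import deque

WINDOW_DAYS = 7
ALLOWED_BAD = 200                  # bad events allowed per rolling window

bad_event_times = deque()          # timestamps (in days) of bad events

def record_bad_event(now_days):
    bad_event_times.append(now_days)

def budget_remaining(now_days):
    # Drop bad events that have rolled out of the window...
    while bad_event_times and bad_event_times[0] <= now_days - WINDOW_DAYS:
        bad_event_times.popleft()
    # ...only then does the remaining budget climb back up.
    return ALLOWED_BAD - len(bad_event_times)
```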
I see another question here. In an enterprise environment, there are thousands of services in general. Yes, alert teams do face alert fatigue. Can you imagine that? All the symptom alerts on thousands of services? That could be pretty hellacious. I understand that. And what do we have for suggestions on how to not add to that problem? Well, again, you would want to start looking into replacing those symptom-specific alerts with service SLO-style alerts instead. Like I was saying about that large customer of ours that has 124 SLOs defined. Last week, they were under 110. They’re continuing to define and add more. Particularly, they have about 200 services so far sending data into Honeycomb. So a little bit less than one SLO per service. Some of them were combined with others into one single SLO to cover them; but, generally speaking, that’s how we see people doing this. A couple of SLOs per service, and you’re off to the races.
I want to thank everybody for their time here. The questions are really good, and we do appreciate them. There are a couple of links here. If you want to learn more about Honeycomb and SLOs, certainly the bottom link, Honeycomb.io/productionslos. A really good blog post I recommend to many, many people is the Observability Maturity Model. It was written, I believe, a couple of years ago by Charity Majors, the CTO at Honeycomb. It’s really, really good for understanding how you uplevel, how you get that better SRE culture within your organization. Certainly, if you like any of this, we can work with you to do a 30-day trial of Honeycomb to allow you to get started, to play with this, to see if it works for you.
We understand sometimes people have limited time and teams are small. It’s hard to get started. We’re happy to work with you hand in hand through your trial period. With that, I want to thank everybody for attending. Thank you very much. We do webinars often. I heard Charity call it Webinar Wednesday. I like that. I don’t know if she’s going to commit to doing these every Wednesday. I doubt it.
Sheridan Gaenger:
I don’t know, I’m feeling it. Be on the lookout for a webinar next Wednesday.
Pierre Tessier:
Thanks, everybody. Happy observing!
Sheridan Gaenger:
Happy observing. I’m just going to close out with just a couple of final housekeeping items. Just to remind everybody, we did record this session. Thank you for sticking with us for 50 minutes today. We’ll send the link to the recording as well as the slides post-event. Yes, a bee did tell you that we are sending out some swag for everybody who attended today. Be on the lookout for a redemption link as part of the follow-up. So, yes, you will get some Honeycomb swag as part of your participation today. And definitely look forward to the next webinar Wednesday. More to come there. As Pierre has mentioned a handful of times, we hope to see you at o11ycon and hnycon come June. So we’ll also add the link to the registration in our follow-up. All right? Thanks, Chuck. Thanks, Pierre. Thanks, everybody, for joining.