What is Honeycomb.io?
See how Honeycomb turns production software from a black box into a fountain of insight about code and customer experience. This demo takes you from an SLO alert to quick isolation of the exact services and deploy causing latency issues.
Transcript
Jessica Kerr [Developer Advocate|Honeycomb]:
Oh, hey, something’s going on in production with a checkout endpoint. I got an alert from Honeycomb. Look at this Service Level Objective. At this rate of failure, our error budget will be used up in about four hours. The problem shows up on this latency graph. Some checkout requests are unacceptably slow. What’s taking so long? Let’s look closer. I can zoom in on when it got slow. Each of these dots represents some quantity of requests. Higher is slower. Darker is more. I click on a slow dot to see an example. Honeycomb shows the story of each request as a trace. See it coming in at the front end service through checkout and cart and product catalog. Oh, this one span of time took the longest, over a second inside “get discounts”. A Honeycomb trace makes it clear where the time is going. What is it doing over and over?
Honeycomb has no limit on the fields we can send and see. I can see the SQL query it’s running. It looks like it’s doing the same one a lot. Is that normal? Let’s go back and look at the slow requests in aggregate. What is different about them? Honeycomb’s BubbleUp finds distinguishing factors. I select the ones that look weird and Honeycomb looks at all the fields and tells me which ones might be causal. Yellow bars for the events I picked and dark blue is everything else. What is different? Discount code? It looks like all the slow requests have a single discount code. Why would that matter? Let’s drill into that. Group by field. Honeycomb lets me graph whatever I want, whenever I want, and it’s always fast. Even on fields you never expected to choose. I want to see the different discount codes and I want to see the 90th percentile of request latency.
I can see the P90 shoot up from about one second to over two seconds, but it’s only for one discount code. The other ones are fine. What caused this? I can click this graph to look at another example trace. The problem seems to be in “get discounts.” Zoom into that. Oh, look. Here in the checkout service, I see a marker. It says a deploy happened just before this slowness. Now, I know what I need to do. I’ll revert this deploy and I know where to look deeper, in the “get discounts” function in the code and this particular troublesome data. Honeycomb turns production software from a black box into a fountain of insight about code and customer experience.
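[Editor's illustration, not part of the spoken demo] The "group by discount code, show P90 latency" step in the transcript can be sketched in a few lines of Python. This is only a rough illustration of the idea, not how Honeycomb computes it: the field names `discount_code` and `duration_ms`, the sample discount codes, and the latency values are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
from collections import defaultdict


def p90(values):
    """Nearest-rank 90th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer form of ceil(0.9 * n), which avoids floating-point rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


def p90_by_field(events, field, duration_key="duration_ms"):
    """Group events by the value of `field` and return the P90 duration per group."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get(field)].append(event[duration_key])
    return {key: p90(durations) for key, durations in groups.items()}


# Hypothetical sample events: one discount code is much slower than the rest.
events = (
    [{"discount_code": "SPRING", "duration_ms": d} for d in (180, 200, 220, 250, 900)]
    + [{"discount_code": "BULK50", "duration_ms": d} for d in (1900, 2100, 2200, 2400, 2600)]
    + [{"discount_code": None, "duration_ms": d} for d in (150, 160, 170, 190, 210)]
)

result = p90_by_field(events, "discount_code")

# Print the slowest groups first, the way the demo's graph makes one code stand out.
for code, latency in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{code}: p90 = {latency} ms")
```

Grouping before aggregating is what makes the single troublesome discount code visible: a P90 over all requests would blend the slow group with the healthy ones.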