In our previous post about Honeycomb Tracing, we used tracing to better understand Honeycomb’s own query path. When doing this kind of investigation, you typically have to bounce back and forth between your analytics tool and your tracing tool, zooming out and back in again, and often losing context in the process.
To avoid this annoying and time-consuming context switch, use Honeycomb to do everything you need! For example, here’s how I diagnosed a single slow query:
95 seconds is a long time! Why was this query so slow? To find out, I’ll click on the trace ID and dig into the waterfall.
Most of the work is being done by the retriever service, and in this case the query is distributed across six hosts. To help distinguish them, I’ll tell the UI to color based on the default hostIPv4 column.
Rolling up the major spans, I get a pretty clear breakdown of where the time went:
Here we can see that the top-level Fetch took 95.152s, which includes FetchPartialPhaseOne for all of the underlying hosts. One of these took 95.144s, so that’s obviously the bottleneck, and, oddly, it took a full 40 seconds longer than the next slowest host, for what should be an evenly distributed workload. So I’ll open up that span and take a look inside.
Most of the interior spans are processSegment, each of which corresponds to loading about a million records from storage. They’re pretty quick (there are dozens in the 300-700ms range), until suddenly they aren’t.
Why does segment processing increase from ~600ms to ~15s?
To find out, I can click on the spans to take a look at their metadata. Here we see the value of adding extra fields to your spans; that cold: true field means this segment has rolled to Secondary Storage, which isn’t expected to be as fast. This query took a long time because it went far beyond the dataset’s Fast Query window.
While I’m here, I can check to see if the host in question was facing memory pressure. A mem_alloc of about 166MB means we’re not using much RAM, so that’s not an issue here.
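As an aside, here’s a minimal sketch of how a service might attach fields like these to its spans, assuming the Go beeline (github.com/honeycombio/beeline-go). The cold and mem_alloc field names come from the trace above; the Segment type, the wrapper function, and the write key and dataset placeholders are hypothetical, not the retriever’s actual instrumentation:

```go
package main

import (
	"context"
	"runtime"

	beeline "github.com/honeycombio/beeline-go"
)

// Segment is a hypothetical stand-in for one on-disk block of roughly a
// million records.
type Segment struct {
	Cold bool // true once the segment has rolled to Secondary Storage
}

// read stands in for the actual work of scanning the segment's records.
func (s Segment) read(ctx context.Context) error { return nil }

// processSegment wraps each segment read in its own child span and annotates
// it with the fields used in the investigation above.
func processSegment(ctx context.Context, seg Segment) error {
	ctx, span := beeline.StartSpan(ctx, "processSegment")
	defer span.Send()

	// Mark segments that have aged out to Secondary Storage, so slow
	// reads are easy to explain later.
	span.AddField("cold", seg.Cold)

	// Record current heap allocation so memory pressure can be ruled in
	// or out when looking at a slow span.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	span.AddField("mem_alloc", m.Alloc)

	return seg.read(ctx)
}

func main() {
	// Placeholder write key and dataset name.
	beeline.Init(beeline.Config{WriteKey: "YOUR_API_KEY", Dataset: "retriever-traces"})
	defer beeline.Close()

	// With no trace in the context yet, StartSpan begins a new trace.
	ctx, root := beeline.StartSpan(context.Background(), "Fetch")
	defer root.Send()

	_ = processSegment(ctx, Segment{Cold: true})
}
```

A few cheap fields added at instrumentation time are what make questions like “was this segment cold?” and “was this host under memory pressure?” answerable straight from the trace view.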
But that doesn’t explain why this particular host took so much longer than the others. To figure that out, I’ll return to Honeycomb’s normal aggregate view and start asking questions about this specific trace.
It doesn’t take long to find a smoking gun. Each host processed exactly 77 segments, but the number on Secondary Storage varies: the slow host had 8 more cold segments than the next highest. Given how long those cold segments take to read, that explains the 40-second discrepancy.
So now I have a clear answer for why my query took 95 seconds despite reading a mere half billion records. First, the query’s time range ran into Secondary Storage, and second, we have some inconsistency in how we age data out to secondary. I’ve learned all of this from a single tool and a single dataset. No need to switch back and forth!
What have you learned from Honeycomb Tracing? Drop us a note and maybe we’ll send you something in the mail 🙂