We recently released Refinery 2.9, which came with great performance improvements. Reading through the release notes, I felt the need to write a piece on one of these improvements, as it's quite important but easy to overlook: the collect loop taking too long. This is the story of how we used distributed tracing to find the slowdown in this loop.
What was the problem with the collect loop?
We were alerted to this performance issue by a customer. Their Refinery instances were handling a lot of traffic—more than our own Refineries—and they noticed their Refinery pods getting killed by Kubernetes as they were failing liveness checks. (Refinery expects two of its internal loops, stress relief and collect, to report back within a fixed time limit; otherwise, they report as unhealthy.)
We knew from Refinery's logs that the collect loop was taking too long. To solve the problem temporarily, we released Refinery 2.8.3 and 2.8.4, which respectively allowed users to configure this health check's timeout and added a new metric to record how long the collect loop took.
This solved the health check issue, but we were still surprised by the results of the new metric: it showed this customer's collect loop sometimes took over 20 seconds for a single cycle! With the new metric, we noticed the same problem in our own cluster.
But why?!
We needed to know why this loop sometimes took so long—and a metric wasn't going to be enough, as it couldn't give us the details of what was actually happening inside the loop. We decided to instrument the collect loop with distributed tracing.
The collect loop is Refinery's primary work loop. During each pass, Refinery may take one of several actions:
- Pull a span out of the incoming queue and process it
- Make sampling decisions and export spans
- Reload configs
We needed to know which action(s) caused the slowdown.
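To make the shape of that loop concrete, here is a minimal Go sketch of a collect-style work loop. It is not Refinery's actual code; the channel names and helpers are invented, and the real loop does far more per pass.

```go
package main

import (
	"fmt"
	"time"
)

type span struct{ name string }

// collect is a hypothetical, heavily simplified sketch of a collect-style work
// loop (not Refinery's actual code): each pass either processes one incoming
// span, makes sampling decisions for expired traces, or reloads configuration.
func collect(incoming <-chan span, reload <-chan struct{}, done <-chan struct{}) {
	sendTicker := time.NewTicker(100 * time.Millisecond)
	defer sendTicker.Stop()

	for {
		select {
		case sp := <-incoming:
			fmt.Println("processing span:", sp.name) // add the span to its trace in the cache
		case <-sendTicker.C:
			fmt.Println("sending expired traces") // sampling decisions and exports happen here
		case <-reload:
			fmt.Println("reloading configs")
		case <-done:
			return
		}
	}
}

func main() {
	incoming := make(chan span, 1)
	reload := make(chan struct{})
	done := make(chan struct{})

	go collect(incoming, reload, done)
	incoming <- span{name: "example"}
	time.Sleep(250 * time.Millisecond)
	close(done)
}
```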
We used OpenTelemetry to instrument the loop. With tracing, we could see not only the duration of the entire loop but also the duration of each action the loop took. Additionally, we could easily record metadata about the cycle, such as the number of traces being expired, the number of spans being exported, or the reason for a sampling decision.
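As a rough illustration of that approach (the span and attribute names below are ours, not Refinery's actual instrumentation), each pass of the loop can be wrapped in an OpenTelemetry span that carries this metadata:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// oneCollectPass is a hedged sketch (not Refinery's actual code) of wrapping a
// single pass of the loop in an OpenTelemetry span and attaching the metadata
// we care about, such as how many traces expired during the pass.
func oneCollectPass(ctx context.Context, tracer trace.Tracer, numExpired, numExported int) {
	_, span := tracer.Start(ctx, "collect_iteration")
	defer span.End()

	// ... pull a span off the incoming queue, send expired traces, etc. ...

	span.SetAttributes(
		attribute.Int("num_traces_expired", numExpired),
		attribute.Int("num_spans_exported", numExported),
	)
}

func main() {
	// With no TracerProvider configured, otel.Tracer returns a no-op tracer,
	// which keeps this example runnable; a real setup would register an
	// exporter first.
	tracer := otel.Tracer("collect-loop-example")
	oneCollectPass(context.Background(), tracer, 42, 1000)
}
```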
We were concerned about the amount of data this would produce, since we'd get a trace with many spans every time it went through the never-ending for loop, so we used head sampling to reduce the amount of data we emitted. Unfortunately, this resulted in unhelpful data. The loop didn't always take a long time, and with head sampling, those long traces would sometimes get discarded. We needed tail sampling—we needed Refinery. However, using the Refinery instance to tail sample its own traces surfaced an issue: if it was taking a long time and we sent it even more data, it could fall over. So we spun up a second temporary Refinery cluster to sample our main Refinery cluster's traces.
With this setup, we were able to guarantee that any trace longer than one second was kept.
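Conceptually, that guarantee is a simple rule. The Go sketch below shows its shape purely for illustration; in practice this lives in Refinery's sampling configuration, and the function and parameter names here are made up.

```go
package main

import (
	"fmt"
	"time"
)

// keepTrace is illustrative only: the shape of a tail-sampling decision that
// always keeps traces longer than one second. Refinery expresses this as a
// rule in its sampling configuration rather than as user code.
func keepTrace(duration time.Duration, sampleOneInN int, counter *int) bool {
	if duration > time.Second {
		return true // always keep slow traces
	}
	*counter++
	return *counter%sampleOneInN == 0 // crude deterministic stand-in for probabilistic sampling
}

func main() {
	var n int
	fmt.Println(keepTrace(1200*time.Millisecond, 100, &n)) // true: over one second
	fmt.Println(keepTrace(80*time.Millisecond, 100, &n))   // usually false
}
```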
Once the traces were tail sampled, it took a single query and BubbleUp to see what was taking a long time: sendExpiredTracesInCache.
Welp, there’s our culprit. But still, why?!
Thanks to the attributes we recorded, we could see a correlation between long collect loops and the number of traces that we were processing in sendExpiredTracesInCache. This function handles sampling decisions (i.e., dropping or exporting traces). The more traces we processed, the more work needed to be done, and the longer it took.
At this point, we knew the culprit—but we still didn't know why sendExpiredTracesInCache sometimes took a long time.
The traces showed us that sometimes, the decision for a trace would take 40ms—and other times, it would take 275ms. We knew something was going on in the decision-making process, but we weren’t sure what.
To find the problem, we leveraged distributed tracing once more.
We added more instrumentation deeper in the collect loop, this time trying to record which sampler was used to make sampling decisions when these decisions took a long time. With that attribute in place, we found our suspect: EMADynamicSampler.
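That extra instrumentation was essentially one more attribute on the span covering a sampling decision. Here is a hedged sketch of the idea; the tracer name, attribute name, and types are illustrative rather than Refinery's actual code.

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// emaSampler stands in for a sampler implementation; the attribute name and
// wrapper function are invented for this example.
type emaSampler struct{}

func makeDecision(ctx context.Context, sampler any) {
	_, span := otel.Tracer("decision-example").Start(ctx, "make_sampling_decision")
	defer span.End()

	// Record which concrete sampler made this decision so slow decisions can
	// be grouped by sampler type when querying.
	span.SetAttributes(attribute.String("sampler_type", fmt.Sprintf("%T", sampler)))
}

func main() {
	makeDecision(context.Background(), emaSampler{})
}
```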
Now that we knew which sampler to investigate, we went searching through its code and found locks.
It takes these locks when recalculating the sample weights, but also when being asked for a sampling decision. It turns out the sampling decisions were sometimes blocked while the sampler recalculated its weights.
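The contention pattern is easier to see in a stripped-down sketch. The type, method names, and timings below are invented and are not EMADynamicSampler's actual code, but the shape of the problem is the same: a sampling decision has to wait for the lock whenever the weights are being recalculated.

```go
package main

import (
	"sync"
	"time"
)

// emaState is a simplified, hypothetical sketch of the contention pattern:
// the same mutex guards both weight recalculation and per-key sampling
// decisions, so decisions block whenever the weights are being rebuilt.
type emaState struct {
	mu      sync.Mutex
	weights map[string]float64
}

func (s *emaState) recalculateWeights() {
	s.mu.Lock()
	defer s.mu.Unlock()
	time.Sleep(50 * time.Millisecond) // stands in for rebuilding the moving averages
}

func (s *emaState) getSampleRate(key string) int {
	s.mu.Lock() // waits here while recalculateWeights holds the lock
	defer s.mu.Unlock()
	if s.weights[key] > 0.5 {
		return 1
	}
	return 10
}

func main() {
	s := &emaState{weights: map[string]float64{"api": 0.9}}
	go s.recalculateWeights()
	time.Sleep(time.Millisecond) // let the recalculation grab the lock first
	_ = s.getSampleRate("api")   // this decision is blocked for ~50ms
}
```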
To resolve this issue, we opted to allow users to configure a maximum for how many expired traces can be processed during a single loop pass. We again used tracing to confirm this solution worked: as we set different values, we could correlate trace durations with the number of expired traces. The result was a new, proper default value of 5000—which, according to our tests, capped the collect loop at a max of three seconds.
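The shape of that fix is roughly the following sketch (names and types are ours, not Refinery's): make decisions for at most the configured number of expired traces per pass and leave the rest for the next pass.

```go
package main

import "fmt"

// sendExpiredTraces is a hedged sketch of the fix's shape, not Refinery's
// actual code: cap how many expired traces a single collect pass will make
// decisions for, and defer the rest to the next pass.
func sendExpiredTraces(expired []string, maxPerPass int) (processed, deferred []string) {
	if maxPerPass > 0 && len(expired) > maxPerPass {
		return expired[:maxPerPass], expired[maxPerPass:]
	}
	return expired, nil
}

func main() {
	expired := make([]string, 12000) // pretend 12,000 traces expired at once
	processed, deferred := sendExpiredTraces(expired, 5000)
	fmt.Printf("processed %d now, deferred %d to the next pass\n", len(processed), len(deferred))
}
```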
Tracing saves the day
Without tracing, it would have taken much longer to find these answers. With a good tail sampler like Refinery, we were able to ingest every interesting trace while sampling out the uninteresting ones, saving ourselves money and resources. This improvement (and more!) is now available in Refinery 2.9.