Refinery is a powerful tail-based sampler, but with great power come great challenges. We heard your feedback and are excited to announce the release of Refinery 2.9, a rather large update packed with goodies to make your life easier when running Refinery in your network.
Configuration
As our customers continue to scale their usage of Refinery, it’s often challenging for them to provision their Refinery cluster at the “right” size that both meets their performance requirements and is cost effective.
Before 2.9
Refinery stored all incoming spans belonging to a trace in memory using a circular buffer cache. The size of this cache was fixed and was determined by the number of traces Refinery was configured to store. When the cache filled up, Refinery evicted traces based on arrival time and memory consumption to make room for new ones.
The size of the cache was configured through the CacheCapacity option in Refinery’s configuration file. Traces from different services can have a wide range of sizes, so it was difficult to choose a cache size that would accommodate traces of all sizes without evicting traces too early or consuming too much memory.
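For reference, here is a minimal sketch of the pre-2.9 approach. It assumes the Refinery 2.x layout where cache options live under the Collection section of the YAML config; the value shown is purely illustrative.

```yaml
# Pre-2.9: the trace cache had a fixed number of trace slots.
Collection:
  # Too small, and traces are evicted before a sampling decision is made;
  # too large, and memory is wasted or exhausted. (Illustrative value.)
  CacheCapacity: 10000
```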
Solution in 2.9
To better adjust to unpredictable traffic patterns, we revamped the trace cache to be adaptive: it now dynamically adjusts its size based on incoming traffic. This removes the need for users to manually configure the cache size.
As such, we have now deprecated the CacheCapacity option in the configuration file. Before you remove this option from your Refinery config, make sure you have explicitly set PeerQueueSize and IncomingQueueSize, since their defaults were previously derived from CacheCapacity.
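If you were relying on those derived defaults, a 2.9 configuration might look something like the sketch below. Section placement again assumes the Refinery 2.x Collection group, and the queue sizes are example values rather than recommendations; check the Refinery configuration reference for values appropriate to your cluster.

```yaml
# Refinery 2.9: CacheCapacity is deprecated; the trace cache now sizes itself.
Collection:
  # Set these explicitly before dropping CacheCapacity, since their defaults
  # were previously derived from it. (Example values only.)
  IncomingQueueSize: 30000
  PeerQueueSize: 30000
```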
Stability
It’s important for operators to have confidence that Refinery can handle routine updates and node replacements. In this release, we fixed an important bug and, as a result, improved stability and minimized service disruption during maintenance activities.
Before 2.9
By default, all spans from a single trace were forwarded to a single Refinery node so that all the information for making a trace decision was in one place. This was achieved through a sharding algorithm based on the trace ID: when a new span arrived at Refinery, its trace ID determined which Refinery node the span should be routed to.
During a scaling event, the destination of a trace can change since the number of nodes available changes. This can result in the trace being processed by a different Refinery node than before. Different Refinery nodes might make different sampling decisions for the same trace, causing an incomplete trace to be sent to Honeycomb.
In version 2.8, we introduced a feature that supports trace redistribution when the number of nodes changes in a cluster. Refinery will recalculate trace ownership for all in-flight traces when it receives a cluster membership change signal.
This feature works well for customers with smaller clusters. However, when multiple cluster membership changes happen in quick succession, the repeated recalculation can create a storm of rerouted traces, causing a spike in traffic within the cluster and potentially leading to Refinery DDoSing itself.
Solution in 2.9
To address the issue of rerouting trace storms, we introduced RedistributionDelay, a new configuration option that allows operators to configure a delay before Refinery recalculates trace ownership. This delay lets operators control the frequency of trace redistribution and prevents a storm from happening.
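As a rough sketch, assuming RedistributionDelay sits under the Collection section alongside the other collection options and accepts a duration string, the setting might look like this (the 30s value is an example, not a recommendation):

```yaml
Collection:
  # Wait this long after a cluster membership change before recalculating
  # trace ownership, so a burst of changes triggers one redistribution
  # instead of many. (Example duration.)
  RedistributionDelay: 30s
```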
Scalability
With the new trace cache implementation, we’re hopeful this change to memory consumption will lead to a more reliable way to automatically scale a Refinery cluster, such as with Kubernetes horizontal pod autoscaling.
However, as previously mentioned, Refinery by default still routes all spans of a trace to the same node. This can lead to uneven load distribution across the cluster, especially for customers with exceptionally large traces. To address this, we have introduced a new experimental feature in 2.9 called TraceLocalityMode.
When TraceLocalityMode is set to distributed, spans of a trace are distributed across all Refinery nodes in the cluster. This can help balance the load across the cluster and reduce the chance of a single Refinery node being overloaded. In this mode, only the information necessary for making a sampling decision is transmitted between Refinery nodes, effectively reducing the network traffic between them. Sampling decisions are then shared between the nodes using a gossip protocol through Redis pub/sub.
This feature is experimental and is disabled by default (the default value is concentrated). We encourage operators to test it in a staging environment before enabling it in production. We will be working on it in our own clusters and expect to release more guidance in the future.
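For those who want to experiment in staging, a sketch of enabling the mode follows. It assumes TraceLocalityMode lives under the Collection section like the other options discussed above.

```yaml
Collection:
  # Experimental: spread the spans of a trace across all nodes instead of
  # concentrating each trace on a single node. The default is "concentrated".
  TraceLocalityMode: distributed
```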
Conclusion
We are excited to share these improvements with you and hope that they make your experience running Refinery smoother. Your feedback is what helps make us better, so we’d love to hear about your experiences with this update. Please file an issue on our GitHub repository, or reach out through Pollinators, our Slack community.