Refinery 2.9: A Love Letter to Refinery’s Operators

Refinery is a powerful tail-based sampler, but with great power come great challenges. We heard your feedback and are excited to announce Refinery 2.9, a rather large update packed with goodies to make your life easier when running Refinery in your network.

Configuration

As our customers continue to scale their usage of Refinery, it's often challenging for them to provision their Refinery cluster at the "right" size: one that meets their performance requirements while remaining cost-effective.

Before 2.9

Refinery stored all incoming spans that belong to a trace in memory, using a circular buffer cache. The size of this cache was fixed and determined by the number of traces Refinery was configured to store. When the cache filled up, Refinery evicted traces based on arrival time and memory consumption to make room for new ones.

The size of the cache was configured through the CacheCapacity option in Refinery's configuration file. Traces from different services can have a wide range of sizes, so it was difficult to choose a cache size that would accommodate traces of all sizes without evicting them too early or consuming too much memory.
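For reference, here's a minimal sketch of what that looked like in a pre-2.9 configuration file. The Collection section placement follows Refinery 2.x conventions, and the value is purely illustrative:

```yaml
# Pre-2.9 (illustrative): a fixed-size trace cache.
Collection:
  # Maximum number of in-flight traces kept in memory. Too small and traces
  # get evicted before a sampling decision is made; too large and Refinery
  # risks exhausting memory.
  CacheCapacity: 10000
```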

Solution in 2.9

To better adjust to unpredictable traffic patterns, we revamped the trace cache to be adaptive to incoming traffic: it now dynamically adjusts its size based on that traffic, removing the need for users to manually configure the cache size.

As such, we have now deprecated the CacheCapacity option. Before you remove it from your Refinery config, make sure you have explicitly set PeerQueueSize and IncomingQueueSize, since they were previously derived from CacheCapacity.
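If you do remove it, a 2.9 config might end up looking something like the sketch below. The queue sizes shown are placeholders rather than recommendations; size them for your own span throughput:

```yaml
# Refinery 2.9 (illustrative): CacheCapacity removed, with the queue sizes
# set explicitly because they are no longer derived from it.
Collection:
  IncomingQueueSize: 30000  # spans buffered from clients before processing
  PeerQueueSize: 30000      # spans buffered from peer Refinery nodes
```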

Stability

It’s important for operators to have confidence that Refinery can handle routine updates or node replacement. In this release, we fixed an important bug and, as a result, improved stability and minimized service disruption during maintenance.

Before 2.9

By default, all spans from a single trace were forwarded to a single Refinery node so that all the information needed to make a trace decision was in one place. This was achieved through a sharding algorithm based on the trace ID: when a new span arrived at Refinery, its trace ID determined which Refinery node the span should be routed to.

During a scaling event, the destination of a trace can change because the number of available nodes changes. The trace may then be processed by a different Refinery node than before, and different Refinery nodes might make different sampling decisions for the same trace, causing an incomplete trace to be sent to Honeycomb.
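As a rough illustration of why this happens (this is not Refinery's actual sharding code), consider the simplest possible trace-ID-based routing, where the chosen peer depends on the current size of the peer list:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickPeer deterministically maps a trace ID onto the current peer list, so
// every span of a trace lands on the same node -- but only as long as the
// peer list itself stays the same. (Illustrative sketch only.)
func pickPeer(traceID string, peers []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return peers[h.Sum32()%uint32(len(peers))]
}

func main() {
	traceID := "4bf92f3577b34da6a3ce929d0e0e4736"
	// The same trace ID can map to a different node once the cluster scales.
	fmt.Println(pickPeer(traceID, []string{"refinery-0", "refinery-1", "refinery-2"}))
	fmt.Println(pickPeer(traceID, []string{"refinery-0", "refinery-1", "refinery-2", "refinery-3"}))
}
```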

In version 2.8, we introduced a feature that supports trace redistribution when the number of nodes changes in a cluster. Refinery will recalculate trace ownership for all in-flight traces when it receives a cluster membership change signal.

This feature works well for customers with smaller clusters. However, when multiple cluster membership changes happen in quick succession, the recalculation can create a storm of rerouted traces, spiking traffic within the cluster and potentially leading to Refinery DDoSing itself.

Solution in 2.9

To address rerouting storms, we introduced RedistributionDelay, a new configuration option that adds a delay before Refinery recalculates trace ownership. This lets operators control how often trace redistribution happens and prevents a storm from forming.
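In the config file, that might look something like the sketch below. The section placement and the delay value are assumptions for illustration; pick a delay that matches how quickly your cluster membership tends to settle:

```yaml
# Illustrative: wait before recalculating trace ownership after a cluster
# membership change, so several changes in quick succession don't each
# trigger a full redistribution.
Collection:
  RedistributionDelay: 30s
```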

Scalability

With the new trace cache implementation and its effect on memory consumption, we're hopeful this change will lead to a more reliable way to automatically scale a Refinery cluster, such as with Kubernetes horizontal pod autoscaling.
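For teams running on Kubernetes, that could eventually look like a standard HorizontalPodAutoscaler pointed at the Refinery deployment. The sketch below uses the stock autoscaling/v2 API; the resource names and thresholds are assumptions for this example, not recommendations:

```yaml
# Illustrative HPA for a Refinery deployment (names and thresholds are examples).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: refinery
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: refinery
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```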

However, as previously mentioned, Refinery by default still routes all spans of a trace to the same node. This can lead to uneven load distribution across the cluster, especially for customers with exceptionally large traces. To address this, we have introduced a new experimental feature in 2.9 called TraceLocalityMode.

When TraceLocalityMode is set to distributed, spans of a trace are distributed across all Refinery nodes in the cluster. This helps balance load across the cluster and reduces the chance of a single Refinery node being overloaded. In this mode, only the information needed to make a sampling decision is transmitted between Refinery nodes, which reduces network traffic between them. Sampling decisions are then shared between the nodes using a gossip protocol over Redis pubsub.

This feature is experimental and is disabled by default (the default value is concentrated). We encourage operators to test it in a staging environment before enabling it in production. We will be working on it in our own clusters and expect to release more guidance in the future.
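If you do want to try it in a staging cluster, the setting itself is a single config value. The section placement below is an assumption based on where Refinery's other collection options live; distributed mode also relies on your existing Redis-backed peer management for sharing decisions:

```yaml
# Experimental in 2.9. "concentrated" (the default) keeps all spans of a
# trace on one node; "distributed" spreads spans across the cluster and
# shares sampling decisions over Redis pubsub.
Collection:
  TraceLocalityMode: distributed
```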

Conclusion

We are excited to share these improvements with you and hope that they make your experience running Refinery smoother. Your feedback is what helps make us better, so we’d love to hear about your experiences with this update. Please file an issue on our GitHub repository, or reach out through Pollinators, our Slack community.



Yingrong Zhao

Senior Software Engineer

Yingrong Zhao is a Software Engineer who enjoys problem-solving and learning about and working with distributed systems. When they are not coding, you can find them in rock climbing gyms or hole-in-the-wall restaurants.
