How Refinery Helps With Sampling Complex Event Data
Sampling is the practice of extracting a subset of data from a dataset in order to draw conclusions about the larger dataset. It’s far from a perfect solution, but when implemented with Refinery, Honeycomb’s trace-aware sampling proxy, sampling can help you manage very high volumes of complex event data.
GOAT, an e-commerce platform focused on designer sneakers, apparel, and accessories, is a great example of a company that needs an effective sampling solution. In his talk at the 2021 hnycon, Kevan Carstensen, Backend Engineer at GOAT, explained that GOAT’s small backend team is tasked with managing customer-facing services with an extremely high volume of requests, as well as internal tools with a handful of stakeholders and far fewer requests.
Sampling with Refinery helps Kevan and his team manage this huge event volume, cut through the noise, and resolve issues quickly. But sampling isn’t necessarily right for everyone, and implementing it correctly is a process unique to each organization’s infrastructure and team. So let’s follow GOAT’s journey of implementing sampling with Refinery and pull out important lessons you can use to do the same.
But first: Should you sample?
“Don’t sample by default” is a maxim Kevan mentioned in his talk. For some businesses it holds, but following it blindly, without understanding the nuances, may cause you to miss out on sampling as a valuable tool.
The benefits of sampling
Sampling is useful when you have a very high event volume. Whether it’s due to a large number of traces, very dense traces (i.e., traces with many events), or both, a high event volume is much more manageable with sampling: instead of having to dig through a huge influx of data, you’re only dealing with a representative sample. Kevan shared that GOAT’s higher-volume applications can generate 150 million or more traces per day. At a sample rate of 100, for example, that becomes roughly 1.5 million stored traces, and because each kept event records its sample rate, Honeycomb can weight query results back up to the true totals.
Sampling is great when you have events that are irrelevant most of the time. Kevan shared an example from GOAT: “We have a good deal of instrumentation on how Rails caches things. And for the one out of a hundred incidents where this is important, it’s really cool to have [sampling]. It helps us visualize whether our caching is working. But it’s also incredibly noisy. There are a lot of events that come from these [incidents], and, in most cases, you’re not going to be looking at them.”
Sampling effectively can provide great visibility while keeping costs manageable. As Kevan explained, a lot of GOAT’s applications are very well instrumented, which gives them great visibility into both database and cache access. “We can see so much of what’s going on in our applications when we’re looking for issues, but it also means each trace can have dozens or even hundreds of events,” Kevan said. “Sending each trace into Honeycomb and having them retained in our dataset are cost prohibitive, so we need to sample.”
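One way to strike that balance is a dynamic sampler, which keeps rare traffic at high fidelity while aggressively thinning common, noisy traffic. Here is a minimal sketch using Refinery’s EMADynamicSampler in the Refinery 2.x YAML rules format; the field names and goal rate are illustrative assumptions, not GOAT’s actual configuration:

```yaml
# rules.yaml -- illustrative sketch, not GOAT's real rules.
# The sampler keys traces by the listed fields and continuously adjusts
# per-key sample rates: rare keys are kept, common keys are thinned.
RulesVersion: 2
Samplers:
  __default__:
    EMADynamicSampler:
      GoalSampleRate: 100      # aim to keep roughly 1 in 100 traces overall
      FieldList:               # fields whose values define a traffic "key"
        - http.route
        - http.status_code
      AdjustmentInterval: 15s  # how often to recompute per-key rates
      Weight: 0.5              # how quickly the moving average adapts
```

With a configuration along these lines, the noisy-but-usually-irrelevant events get sampled heavily, while unusual routes and status codes stay close to fully sampled.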
What to be aware of when using sampling
If set up incorrectly, sampling can increase cognitive load. Anything that samples metrics in your application may not directly reflect the reality of what’s happening, and that discrepancy can weigh on your developers at critical times. Kevan gave a hypothetical example of a 3 a.m. call about an important issue where, if you use sampling, you’ll need to think through: “Are these metrics being sampled? Am I actually seeing the true frequency of how often this problem is happening?” (Honeycomb mitigates this by recording a sample rate on each kept event and weighting query results by it, so counts still approximate true frequencies.)
Sampling infrastructure is another thing to maintain. Someone has to run and look after your sampling infrastructure. For larger organizations with dedicated infrastructure teams, this is less of a problem; for smaller teams like GOAT’s, that responsibility falls on people who already have many other responsibilities.
Sampling with Refinery to manage high-volume event data
Refinery is Honeycomb’s trace-aware sampling proxy that collects spans emitted by your application, gathers them into traces, examines them as a whole, then makes a decision about what events to keep.
This functionality helps organizations like GOAT manage complex, high-volume event data using sampling because, as Kevan described, “A sampling decision will apply to every event within a trace and not just on individual events. You don’t have to worry about missing spans or certain types of events that didn’t make it into the trace.”
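Because the decision is made per trace, sampling rules can key off anything that appears on any span in the trace. As a hedged sketch, again in the Refinery 2.x YAML rules format with illustrative field names rather than GOAT’s real rules, a rules-based sampler might keep every errored trace and sample the rest:

```yaml
# rules.yaml -- illustrative sketch. Rules are evaluated in order,
# and the decision applies to every span in the trace.
RulesVersion: 2
Samplers:
  __default__:
    RulesBasedSampler:
      Rules:
        - Name: keep all errored traces
          SampleRate: 1            # keep 100% of traces containing a 5xx span
          Conditions:
            - Field: http.status_code
              Operator: ">="
              Value: 500
        - Name: default            # a rule with no conditions matches everything else
          SampleRate: 100          # keep 1 in 100 of the remaining traces
```

Because Refinery buffers the whole trace before deciding, a 500 on any single span is enough to keep all of that trace’s events together.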
Refinery helps GOAT use sampling for its benefits while mitigating its drawbacks because:
- It reduces the cognitive load sampling can create, since sampling decisions apply to whole traces: a kept trace arrives in the dataset with all of its events.
- It’s easy to integrate into the tech stack, making the sampling infrastructure simpler to maintain.
- It’s designed to help manage particularly large and dense event volumes that often contain irrelevancies.
GOAT’s process for implementing Refinery
The process for implementing Refinery will be unique to your organization’s infrastructure and goals. For GOAT, Kevan explained that they adapted Refinery to run in their internal Platform as a Service (PaaS), then tuned it to meet their load.
GOAT’s backend and observability systems
GOAT’s backend services are written in Ruby on Rails or Go, spanning dozens of services with different traffic patterns and types of requests. To keep track of all of this, GOAT uses a mix of observability tools, including:
- Bugsnag for exception reporting and grouping unhandled errors.
- StatsD and CloudWatch for infrastructure-level metrics, such as CPU utilization and memory utilization.
- A Grafana, ELK, and Kibana pipeline for monitoring metrics from within the application.
- Honeycomb for tracing and the ability to visualize the execution of a single request.
We have a number of alerts and service level objectives (SLOs) that are keyed off Honeycomb data. This is really useful for visualizing bottlenecks and is key for engineers who are new to the industry, new to the company, or both.
— Kevan Carstensen, Backend Engineer at GOAT
How GOAT deployed Refinery
- Dockerize Refinery for GOAT’s environment (a minimal sketch of what this can look like follows this list).
- Port legacy Refinery rules to their hosted Refinery. GOAT had the advantage of already having used a legacy version of Refinery for a while, so they already had quite a few rules in place to port over.
- Test and measure how Refinery reacted. GOAT regularly performed load testing on their applications that replicated the holiday gift-shopping rush. Working closely with the team performing the load test, Kevan’s team made sure their Refinery setup could keep up by adjusting replication and provisioning requirements as needed.
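As a concrete, and hypothetical, illustration of the first step, here is a minimal docker-compose sketch for running Refinery. The image name, flags, ports, and config paths are assumptions based on Refinery’s documented defaults, not GOAT’s actual PaaS setup:

```yaml
# docker-compose.yaml -- hypothetical local sketch, not GOAT's PaaS deployment.
services:
  refinery:
    image: honeycombio/refinery:2.9    # official image; pin the tag you've tested
    command: ["refinery", "-c", "/etc/refinery/config.yaml", "-r", "/etc/refinery/rules.yaml"]
    volumes:
      - ./config.yaml:/etc/refinery/config.yaml   # main config (listeners, peer management)
      - ./rules.yaml:/etc/refinery/rules.yaml     # sampling rules like the sketches above
    ports:
      - "8080:8080"   # default HTTP listener for incoming events
      - "4317:4317"   # default OTLP/gRPC listener
    depends_on:
      - redis
  redis:
    image: redis:7    # Refinery can use Redis for peer discovery when scaled out
```

Applications then send their events to Refinery’s listen address instead of directly to Honeycomb’s API. Tuning for load, as in the third step, is largely a matter of running more Refinery replicas: with Redis-based peer management, peers discover each other and forward spans so that all events for a given trace are judged on the same node.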