Sampling is the practice of extracting a subset of data from a dataset in order to draw conclusions about the larger whole. It’s far from a perfect solution, but when implemented with Refinery, Honeycomb’s trace-aware sampling proxy, sampling can help you manage very high volumes of complex event data.
GOAT, an e-commerce platform focused on designer sneakers, apparel, and accessories, is a great example of a company that needs an effective sampling solution. In his talk at the 2021 hnycon, Kevan Carstensen, Backend Engineer at GOAT, explained that GOAT’s small backend team manages customer-facing services with an extremely high volume of requests, as well as internal tools with a handful of stakeholders and far fewer requests.
Sampling with Refinery helps Kevan and his team manage this huge event volume, cut through the noise, and resolve issues quickly. But sampling isn’t necessarily right for everyone, and implementing it correctly will be a unique process that depends on your infrastructure and team.
Because that decision and process look different for every organization, in this blog we’ll detail GOAT’s journey with Refinery and highlight important lessons you can use to do the same.
But first: Should you sample?
“Don’t sample by default” is a maxim Kevan mentioned in his talk. For some businesses it holds true, but following it blindly, without understanding the nuances, may cause you to miss out on a valuable tool.
The benefits of sampling:
Sampling is useful when you have a very high event volume. Whether it’s due to a large number of traces, very dense traces (i.e., traces with a lot of events), or both, a high event volume is much more manageable with sampling: instead of digging through a huge influx of data, you’re working with a representative sample. And because each kept event records the rate at which it was sampled, Honeycomb can weight query results back up to approximate the full volume.
Sampling is great when you have events that are irrelevant most of the time. Kevan shared an example from GOAT: “We have a good deal of instrumentation on how Rails caches things. And for the one out of a hundred incidents where this is important, it’s really cool to have [sampling]. It helps us visualize whether our caching is working. But it’s also incredibly noisy. There are a lot of events that come from these [incidents], and, in most cases, you’re not going to be looking at them.”
Effective sampling can provide great visibility while keeping costs manageable. As Kevan explained, a lot of GOAT’s applications are very well instrumented, which gives them great visibility into both database and cache access. “We can see so much of what’s going on in our applications when we’re looking for issues, but it also means each trace can have dozens or even hundreds of events,” Kevan said. “Sending each trace into Honeycomb and having them retained in our dataset are cost-prohibitive, so we need to sample.”
What to be aware of when using sampling:
If set up incorrectly, sampling can increase cognitive load. Anything that samples metrics in your application may not directly reflect the reality of what’s happening, and that discrepancy can burden your developers at critical times. Kevan gave a hypothetical example of a 3 a.m. call about an important issue where, if you use sampling, you’ll need to think through questions like, “Are these metrics being sampled? Am I actually seeing the true frequency of how often this problem is happening?”
Sampling infrastructure is another thing to maintain. Someone has to run and care for it. For larger organizations with dedicated infrastructure teams, this is less of a problem; on smaller teams like GOAT’s, the responsibility falls on people who already have many other responsibilities.
Sampling with Refinery to manage high-volume event data
Refinery is Honeycomb’s trace-aware sampling proxy that collects spans emitted by your application, gathers them into traces, examines them as a whole, then makes a decision about what events to keep.
This functionality helps organizations like GOAT manage complex, high-volume event data using sampling because, as Kevan described, “A sampling decision will apply to every event within a trace and not just on individual events. You don’t have to worry about missing spans or certain types of events that didn’t make it into the trace.”
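To make that concrete, here’s a minimal sketch, in Go, of a deterministic, trace-level sampling decision. This is an illustration of the idea, not Refinery’s actual implementation: because the keep/drop choice is a pure function of the trace ID, every span in a trace gets the same answer.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// shouldKeep makes one deterministic keep/drop decision per trace.
// Because the decision depends only on the trace ID, every span in the
// trace gets the same answer. Minimal illustrative sketch, not
// Refinery's actual code.
func shouldKeep(traceID string, sampleRate uint32) bool {
	sum := sha1.Sum([]byte(traceID))
	hash := binary.BigEndian.Uint32(sum[:4])
	// Keep roughly 1 out of every sampleRate traces.
	return hash < ^uint32(0)/sampleRate
}

func main() {
	for _, id := range []string{"trace-a", "trace-b", "trace-c"} {
		fmt.Printf("%s: keep=%v (rate 10)\n", id, shouldKeep(id, 10))
	}
}
```

Hashing the trace ID, rather than rolling the dice per span, is what guarantees the whole-trace property Kevan describes.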
Refinery helps GOAT use sampling for its benefits while mitigating its drawbacks because it:
- Helps reduce the cognitive load sampling can create, since a kept trace arrives in the dataset complete, with all of its events.
- Integrates easily into the tech stack, making the sampling infrastructure simpler to maintain.
- Is designed to help manage particularly large and dense event volumes that often contain irrelevancies.
GOAT’s process for implementing Refinery
The process for implementing Refinery will be unique to your organization’s infrastructure and goals. For GOAT, Kevan explained that they adapted Refinery to run in their internal Platform as a Service (PaaS), then tuned it to meet their load.
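Kevan didn’t share GOAT’s exact settings, but for a sense of what tuning Refinery to meet your load involves, here’s a minimal sketch of a Refinery 1.x-style config.yaml. The values, hostnames, and sizing are illustrative assumptions, not GOAT’s configuration:

```yaml
# config.yaml -- illustrative values, not GOAT's actual settings
ListenAddr: 0.0.0.0:8080        # where instrumented apps send events
PeerListenAddr: 0.0.0.0:8081    # traffic between Refinery peers
HoneycombAPI: https://api.honeycomb.io
Collector: InMemCollector
InMemCollector:
  CacheCapacity: 1000           # in-flight traces buffered while awaiting a decision
PeerManagement:
  Type: redis                   # peers discover each other via Redis -- handy in a PaaS
  RedisHost: redis:6379
```

CacheCapacity is the knob most directly tied to load: it bounds how many in-flight traces Refinery holds in memory while waiting to make a sampling decision.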
GOAT’s backend and observability systems
GOAT’s backend is a mix of Ruby on Rails and Golang, with dozens of different services, traffic patterns, and types of requests. To keep track of all of this, GOAT uses a mixture of tools, including:
- Honeycomb for tracing and the ability to visualize the execution of a single request.
- Bugsnag for exception reporting and grouping unhandled errors.
- StatsD and CloudWatch for infrastructure-provided metrics such as CPU and memory utilization.
- Grafana and an ELK (Elasticsearch, Logstash, Kibana) pipeline for monitoring metrics from within the application.
Kevan’s three takeaways from using Refinery
Kevan and his team used Refinery as a way to get better visibility into their production environment while keeping costs low. Here are some takeaways he has for those looking to do the same with Refinery:
- Refinery is very low maintenance. Kevan shared that, “Once Refinery is running it’s very stable. It just keeps working.” Because Refinery just works, you can focus on what sampling is telling you rather than on keeping it running.
- Honeycomb itself is very useful for tuning. GOAT uses Refinery with Honeycomb’s Usage Center. “I can see the sample rate very easily,” Kevan said. “I can see whether Refinery is exhausting some memory buffers or something that points to more provisioning that’s needed.” Watching sample rates and memory buffers in Honeycomb closes the loop between Refinery’s behavior and how you tune it.
- Rules-based sampling makes the most of your event quota and budget. “It’s great having rules that allow us to very precisely indicate the rate of samples that we want. It lets us get the most of that event quota,” he explained. A rules file along the lines of the sketch below is how those rates get expressed.
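For a sense of what that looks like, here is a sketch in Refinery’s 1.x-style rules format. The dataset name, fields, and rates are illustrative, not GOAT’s actual rules:

```yaml
# rules.yaml -- illustrative dataset, fields, and rates
my-dataset:
  Sampler: RulesBasedSampler
  rule:
    - name: always keep errors
      SampleRate: 1             # keep every trace that matches
      condition:
        - field: http.status_code
          operator: ">="
          value: 500
    - name: drop health checks
      drop: true
      condition:
        - field: http.route
          operator: "="
          value: /healthz
    - name: everything else
      SampleRate: 100           # keep about 1 in 100 remaining traces
```

Rules are evaluated in order, so putting the “always keep errors” rule first ensures important traces are never sampled away, while noisy, low-value traffic like health checks never touches the quota.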
Improve your sampling with Refinery
If your organization has a high volume of complex, dense event data, Refinery may be the right choice to help you keep costs within budget while maintaining great visibility into your production environment. You can get started with Honeycomb today with a free Enterprise trial and begin your sampling journey just like Kevan and his team at GOAT.