Refinery and EMA Sampling

Refinery is Honeycomb’s sampling proxy, which our largest customers use to improve the value they get from their telemetry. It offers a variety of interesting samplers to choose from. One category of these is called dynamic sampling: a technique for adjusting sample rates to account for the volume of incoming data, in a way that gives rare events more priority than common events.

Honeycomb’s query engine can compensate for sampling rates on a per-event basis. If you sample common events at 1 in 100 (because common events are similar to each other, and thus it’s less important to keep as many of them), and keep nearly every rare event, then on balance you are increasing the value of the telemetry you keep.
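The compensation works by weighting each kept event by its sample rate. Here's a minimal sketch of the idea; the event shape and field name are illustrative, not Honeycomb's actual storage format:

```python
def estimated_count(kept_events):
    """Estimate the original event count by summing each kept
    event's sample rate: a row kept at 1-in-100 stands for 100."""
    return sum(e["sample_rate"] for e in kept_events)

# 1,000 common events sampled 1-in-100 leave 10 kept rows, while
# 5 rare events are kept unsampled (rate 1).
kept = [{"sample_rate": 100}] * 10 + [{"sample_rate": 1}] * 5
print(estimated_count(kept))  # 1005
```

The estimate (1,005) is close to the true total (1,005 events were seen), even though only 15 rows were stored.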

Refinery lets you create rules to do this explicitly using a rules-based sampler, but what if your data varies over time? That’s where dynamic sampling comes in.

Refinery’s dynamic sampling

When you configure a dynamic sampler in Refinery, you give it a list of “key fields” that you use to distinguish the different types of events you process. Good key fields are fields that have a relatively small set of unique values (this is known as “cardinality”) and are good at distinguishing between different types of events. For example, in a web service, you might use the request URL and the response status code as key fields.

Refinery watches for a while (a time called the AdjustmentInterval) and keeps track of the combinations of these key fields, as well as how many times each combination occurs. It then uses that to predict how many occurrences there will be of each of these keys in the next interval, and sets appropriate sample rates for each key.
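One simple way to turn those per-key counts into sample rates is to split the throughput goal evenly across keys and give each key a rate proportional to its count, floored at 1. This is a simplified sketch, not Refinery's exact algorithm (its samplers come from Honeycomb's dynsampler-go library, which uses log-scaled variants); the key names and counts are illustrative:

```python
def per_key_rates(counts, goal_throughput):
    """Assign each key a sample rate so the expected kept volume lands
    near goal_throughput. Rare keys get rate 1 (keep everything);
    common keys are sampled harder."""
    budget_per_key = goal_throughput / len(counts)
    return {key: max(1, round(n / budget_per_key))
            for key, n in counts.items()}

# Roughly Figure 1's scenario: 162 events across four keys, goal 15.
counts = {"/home 200": 120, "/api 200": 30, "/home 500": 8, "/api 500": 4}
rates = per_key_rates(counts, 15)
print(rates)  # {'/home 200': 32, '/api 200': 8, '/home 500': 2, '/api 500': 1}
print(sum(n / rates[k] for k, n in counts.items()))  # 15.5, near the goal
```

Note how the rare error keys are kept at (or near) rate 1, while the high-volume key is sampled 1-in-32.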

Figure 1 shows an example of the behavior of a dynamic sampler under really simple conditions: the key fields amount to two URLs and a couple of error codes. The first (“Unsampled”) bars show the volume of data sent to Refinery for each key: 162 events in total. Refinery chooses a sample rate for each key depending on its volume; in this case, the sampler is trying to achieve a throughput of 15 events per interval. The middle (“Sent”) bars show how many events were actually sent to Honeycomb, and the last (“Displayed”) bars show the quantities seen in Honeycomb’s UI, after multiplying the count of each event by its sample rate.

Dynamic sampling example graph.

Note that the displayed values are not exactly equal to the sent values, but they’re close. This is a side effect of sampling. The benefit is that you can send much less data and get nearly the same result.

EMA sampling

Some of the dynamic samplers are known as EMA (short for “Exponential Moving Average”) samplers. These samplers keep a bit of history, so that sample rates aren’t likely to change sharply between adjustment intervals.

As above, on each AdjustmentInterval, the EMA sampler keeps track of the quantity of each of the different keys. At the end of the interval, it mathematically combines these counts with the historical data. New keys are added to the list, and old keys might age out if they stop showing up as often.
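The blending step is an exponential moving average over per-key counts. Here's a minimal sketch; the weight and the age-out threshold are chosen for illustration, not to match Refinery's defaults:

```python
def ema_update(history, current_counts, weight=0.5):
    """Blend this interval's counts into the running average:
    new = weight * current + (1 - weight) * previous.
    Keys that stop appearing decay toward zero and are dropped
    once they fall below a small threshold."""
    merged = {}
    for key in set(history) | set(current_counts):
        value = (weight * current_counts.get(key, 0)
                 + (1 - weight) * history.get(key, 0))
        if value >= 0.1:  # age out faded keys (threshold illustrative)
            merged[key] = value
    return merged

history = {"GET /home": 10.0, "GET /old": 0.15}
result = ema_update(history, {"GET /home": 20, "GET /new": 4})
print(sorted(result.items()))
# [('GET /home', 15.0), ('GET /new', 2.0)] -- "GET /old" has aged out
```

Because each new interval only moves the average partway, a one-interval spike can't yank the sample rates around the way it would with no history.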

Figure 2 shows an EMA sampler working, with its average sample rate automatically adjusting (both up and down) when a spike in data doubled the volume of its input for a few minutes. You can see a smooth increase in the average sample rate when the volume increases, and a smooth decrease when the spike ends. There is a brief increase in the data sent to Honeycomb, but it quickly returns to the target rate.

The EMA sampler hard at work.

The EMA sampler works, and it can work well. But it can also work badly if it’s poorly tuned, and it can be a bit mysterious to figure out what’s going on. To understand why, let’s use an analogy.

Don’t rock the boat

Imagine you’re on a boat in the ocean. The waves are about two feet high, separated by about 40 feet, and pass you every eight seconds. How comfortable are you?

It depends entirely on the size of the boat! If your boat is a 26-foot sailboat, you’re bobbing up and down with each passing wave and you might not be happy about it. If you’re on a cruise ship, you probably won’t even notice the waves. That’s because each boat is floating on its own view of the average ocean level.

The cruise ship is long enough to stretch across multiple waves, so the individual waves can’t move it much. But the sailboat only sees part of a wave at any given time, so it has to ride up one side and down the other every few seconds.

The EMA sampler is like a boat with variable length. If the AdjustmentInterval is too short for the regular variations in telemetry, then the average will move around too much and it will be unstable. It will respond quickly to changes, but there’s always going to be some lag.

However, it’s even more subtle than just the interval. Remember, there’s a different sample rate for each key, and the set of keys in interval N will be used to predict the behavior in interval N+1. But if the keys are significantly different across the two intervals, the prediction won’t be—can’t be—valid.

What we need to do is choose a long enough AdjustmentInterval that we will see most of the keys in each interval. We’ve seen several customers with a pattern similar to this: 100 unique keys in 15 seconds, but 150 unique keys over 30 seconds, and 200 unique keys over one minute. After that, it flattens out; there are also about 200 keys over 5 minutes.

That tells us that if we set our EMA to 15 seconds, it will not see the same set of keys in each interval. This will make it impossible to stabilize the EMA. But if we set it to one minute, then it works well. 
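You can check this kind of pattern in your own data by counting distinct keys over windows of different lengths. A sketch with synthetic timestamps and keys:

```python
from collections import defaultdict

def mean_cardinality(events, window_secs):
    """Average number of distinct keys per fixed window of
    window_secs, over a list of (timestamp_secs, key) pairs."""
    windows = defaultdict(set)
    for ts, key in events:
        windows[int(ts // window_secs)].add(key)
    return sum(len(keys) for keys in windows.values()) / len(windows)

# Synthetic traffic whose full key set repeats every 30 seconds.
events = [(t, f"key-{t % 30}") for t in range(120)]
for w in (15, 30, 60):
    print(w, mean_cardinality(events, w))
# 15 15.0 / 30 30.0 / 60 30.0 -- cardinality flattens at 30 seconds,
# so an AdjustmentInterval of 30s or more sees the whole key set
```

The window where the curve flattens is the shortest AdjustmentInterval that can see a stable key set.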

Figure 3 shows this: On the left side, the sampler is receiving bursts of data every minute, but with an interval set to 15 seconds, it can only oscillate up and down with the waves. When we set the interval to 60 seconds, the sampler figures it out after about two minutes and is able to achieve a relatively constant sample rate.

How we’re hoping to improve it

Based on this research, we’ve been working on a system to make EMA easier to use. We’ve added new metrics to EMA samplers: we’re going to track the cardinality of the key space for the last AdjustmentInterval—also for twice that interval, and for four times that interval.

By comparing the values for 1x, 2x, and 4x the interval, Refinery can tell if the adjustment interval is too short—and if it is, automatically update it. It also slowly dials it back down if things are very stable.
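The comparison itself can be sketched as a simple heuristic. The growth threshold and the doubling/halving policy below are illustrative assumptions, not Refinery's actual logic:

```python
def suggest_interval(card_1x, card_2x, card_4x, interval_secs,
                     growth=1.2, min_interval=15):
    """Double the interval when the key space keeps growing at longer
    windows (the interval is too short to see most keys); halve it
    when cardinality is flat and there's room to dial back."""
    if card_2x > card_1x * growth or card_4x > card_2x * growth:
        return interval_secs * 2
    if interval_secs > min_interval:
        return interval_secs // 2
    return interval_secs

print(suggest_interval(100, 150, 200, 15))  # 30: still finding new keys
print(suggest_interval(200, 205, 210, 60))  # 30: stable, dial back down
```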

We’re still testing it and working out some details for how it will work in cluster mode. You should see it available in a future Refinery release.


Takeaways for EMA sampler users

1. Keep the key space as small as you can

As noted above, “cardinality” is the term we use for the size of your key space; it’s the number of distinct values. In our example above, the cardinality of our key was 4. The EMA sampler, by default, maxes out at a cardinality of 500 (this can be adjusted in config). It’s almost never going to be effective when there are that many keys.

The more keys an EMA sampler has to manage, the harder it is for it to achieve the desired throughput. One of the important things to remember is that a sampler can’t set the sample rate below 1, and it will always keep at least one of every key it sees. Therefore, if there are many low-volume keys, it will be harder for the sampler to reduce the volume of the big keys by enough, unless they’re very big.
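A quick bit of arithmetic shows why; the numbers here are illustrative:

```python
# 400 distinct low-volume keys in an interval, plus one hot key, with
# a throughput goal of 100 events per interval.
low_volume_keys = 400
goal = 100

# A sampler can't set a rate below 1, and keeps at least one event
# per key it sees, so kept volume is bounded below by the key count:
floor = low_volume_keys + 1  # one per small key, one from the hot key
print(floor, floor > goal)  # 401 True: the goal is out of reach here
```

No sample rate on the hot key can fix this; only shrinking the key space can.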

2. Look at the cardinality of your keys at different time scales

Set up your Refinery to send metrics to Honeycomb, and then execute this query:

COUNT_DISTINCT(meta.refinery.sample_key)

That graph will show you the cardinality of your sample key. If you pull down the time range dropdown, there is a “Granularity” setting within it. Look at this graph over different time ranges; this will give you a good idea of how consistent your keys are over different time periods.

The granularity setting in Honeycomb.

3. Set AdjustmentInterval to a value that will achieve stability

Make sure that AdjustmentInterval is long enough that the majority of your keys will arrive in this interval.

4. Read the docs

Refinery’s documentation pages have a lot of information. If you need more detailed technical information, Refinery is an Open Source project, and we welcome feedback!

Conclusion

EMA sampling is great at making sampling easier in a changing world—but it’s also subtle, and might not do what you think it should. If your telemetry boat is rocking too hard, try increasing the AdjustmentInterval, and look forward to an even smoother voyage in a future Refinery!
