Instrumenting High Volume Services: Part 3

This is the last of three posts focusing on sampling as a part of your toolbox for handling services that generate large amounts of instrumentation data. The first one was an introduction to sampling and the second described simple methods to explore dynamic sampling.

In part 2, we explored partitioning events based on HTTP response codes, and assigning sample rates to each response code. That worked because of the small key space of HTTP status codes and because it’s known that errors are less frequent than successes. What do you do when the key space is too large to easily enumerate, or varies in a way you can’t predict ahead of time? The final step in discussing dynamic sample rates is to build in server logic to identify a key for each incoming event, then dynamically adjust the sample rate based on the volume of traffic for that key.

In all the following examples, the key used to determine the sample rate can be as simple (e.g. HTTP status code or customer ID) or complicated (e.g. concatenating the HTTP method, status code, and user-agent) as is appropriate to select samples that can give you the most useful view into the traffic possible. For example, here at Honeycomb we want to make sure that, despite a small set of customers sending us enormous volumes of traffic, we’re still able to see our long-tail customers’ traffic in our status graphs. We use a combination of the Dataset ID (to differentiate between customers) and HTTP method, URL, and HTTP status code (to identify different types of traffic they send).
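As a rough illustration of a compound key, the sketch below joins a few request fields into one string. The `requestInfo` type, its field names, the `|` separator, and the `sampleKey` function are all hypothetical names picked for this example; they are not taken from Honeycomb's code.

```go
package main

import (
	"strconv"
	"strings"
)

// requestInfo is a hypothetical container for the fields we key on.
type requestInfo struct {
	DatasetID  string
	Method     string
	URL        string
	StatusCode int
}

// sampleKey joins the fields into a single string. Events that share a key
// are counted together and receive the same sample rate.
func sampleKey(r requestInfo) string {
	return strings.Join([]string{
		r.DatasetID,
		r.Method,
		r.URL,
		strconv.Itoa(r.StatusCode),
	}, "|")
}
```

Every field you add multiplies the number of possible keys, so it is worth including only the fields that separate traffic you actually want to see separately.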

The following methods all work by looking at historical traffic for a key and using that historical pattern to calculate the sample rate for that key. Specifically, the library linked at the end of this post implements all of these examples—it takes snapshots of current traffic for a short time period (say, 30 seconds) and uses the pattern of that traffic to determine the sample rates for the next period.
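The snapshot-then-apply cycle can be sketched roughly as follows. This is not the library's code: the `window` type, its methods, and the choice to give unseen keys a sample rate of 1 are assumptions made for illustration.

```go
package main

import "sync"

// rateCalculator turns one period's per-key counts into per-key sample rates.
type rateCalculator func(counts map[string]int) map[string]int

// window counts traffic per key for the current period and serves the
// sample rates computed from the previous period.
type window struct {
	mu     sync.Mutex
	counts map[string]int
	rates  map[string]int
	calc   rateCalculator
}

func newWindow(calc rateCalculator) *window {
	return &window{
		counts: map[string]int{},
		rates:  map[string]int{},
		calc:   calc,
	}
}

// observe records one event for key and returns the sample rate to apply.
// Keys not seen in the previous period get a rate of 1 (keep everything),
// which is an assumption of this sketch.
func (w *window) observe(key string) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.counts[key]++
	if rate, ok := w.rates[key]; ok {
		return rate
	}
	return 1
}

// rotate ends the current period: the counts gathered so far determine the
// rates for the next period, and counting starts over.
func (w *window) rotate() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.rates = w.calc(w.counts)
	w.counts = map[string]int{}
}
```

In a real service, `rotate` would be called on a timer (for example every 30 seconds), and each of the methods below would be plugged in as the `rateCalculator`.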

Enough with the preamble. Let’s sample!

Constant Throughput

For the constant throughput method, you specify the maximum number of events per time period you want to send for analysis. The algorithm then looks at all the keys detected over the snapshot and gives each key an equal portion of the throughput limit. It sets a minimum sample rate of 1, so that no key is completely ignored.

Example: for a throughput limit of 100, given 3 keys (a, b, and c, with 900, 90, and 10 events in the snapshot), each key should get to send 33 samples. Based on the level of traffic, the sample rate is calculated to try to get each key as close as possible to sending 33 events. For key a, 900 events divided by 33 rounds down to a sample rate of 27; for b, 90 events divided by 33 rounds to a sample rate of 3; and for c, 10 events divided by 33 is less than one, so it is raised to the minimum of 1. During the next cycle, assuming the incoming event numbers are the same, these sample rates are used. 900 events at a sample rate of 27 will actually send 33 events; 90 events at a sample rate of 3 will send 30 events, and finally 10 events at a sample rate of 1 will send 10 events.

I’ll use a table like this in all the following examples. The key and traffic are used to calculate the sample rate. During the following iteration, the key, traffic, and sample rate will determine the number of actual events sent.

key   traffic   sample rate   events sent
a     900       27            33
b     90        3             30
c     10        1             10
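Here is a minimal sketch of the constant throughput calculation, assuming each rate is rounded to the nearest integer (that choice reproduces the 27, 3, and 1 in the table). The function name and signature are made up for this example and are not the API of the library linked at the end of the post.

```go
package main

import "math"

// constantThroughputRates gives every key an equal share of limit and
// returns a sample rate (keep 1 in N) per key, never lower than 1.
func constantThroughputRates(counts map[string]int, limit int) map[string]int {
	rates := make(map[string]int, len(counts))
	if len(counts) == 0 {
		return rates
	}
	perKey := limit / len(counts) // integer division: 100 / 3 = 33
	if perKey < 1 {
		perKey = 1
	}
	for key, n := range counts {
		// Rounding to nearest is an assumption of this sketch.
		rate := int(math.Round(float64(n) / float64(perKey)))
		if rate < 1 {
			rate = 1 // minimum sample rate: never drop a key entirely
		}
		rates[key] = rate
	}
	return rates
}
```

Calling it with counts of 900, 90, and 10 and a limit of 100 yields the rates shown in the table.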

Advantages: If you know you have a relatively even split of traffic among your keys, and that you have fewer keys than your desired throughput rate, this method does a great job of capping the amount of resources you will spend sending data to your analytics.

Disadvantages: This approach doesn’t scale at all. As your traffic increases, the number of events you’re sending into your analytics doesn’t, so your view into the system gets more and more coarse, to the point where it will barely be useful. If you have keys with very little traffic (key c in the table above), you wind up under-sending the allotted samples for those keys and wasting some of your throughput limit. If your keyspace is very wide, you’ll end up sending more than the allotted throughput due to the minimum sample rate for each key.

Overall, this method can be useful as a slight improvement over the static map method because you don’t need to enumerate the sample rate for each key. It lets you contain your costs by sacrificing resolution into your data. It breaks down as traffic scales in volume or in the size of the key space.

Constant Throughput Per Key

This is a minor tweak on the previous method to let it scale a bit more smoothly as the size of the key space increases (though not as volume increases). Instead of defining a limit on the total number of events to be sent, this algorithm’s goal is a maximum number of events sent per key. If there are more events than the desired number, the sample rate will be set to correctly collapse the actual traffic into the fixed volume.

Example: set the desired throughput per key to 50. Each key will send up to 50 events per time cycle, with the sample rate set to that key's actual traffic divided by 50. The table here is the same as before: key and traffic are used to compute the sample rate, then the traffic and sample rate are used to show how many events would be sent during the next iteration (assuming the incoming traffic is the same):

key   traffic   sample rate   events sent
a     900       18            50
b     90        2             45
c     10        1             10
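The per-key variant is even simpler to sketch, since the share no longer depends on how many keys there are. As before, the function name is hypothetical and rounding to the nearest integer is an assumption chosen to match the table.

```go
package main

import "math"

// perKeyThroughputRates returns, for each key, the sample rate that brings
// that key's traffic down to roughly perKeyLimit events per period.
func perKeyThroughputRates(counts map[string]int, perKeyLimit int) map[string]int {
	rates := make(map[string]int, len(counts))
	if perKeyLimit < 1 {
		perKeyLimit = 1
	}
	for key, n := range counts {
		// Rounding to nearest is an assumption of this sketch.
		rate := int(math.Round(float64(n) / float64(perKeyLimit)))
		if rate < 1 {
			rate = 1 // keys under the limit are sent unsampled
		}
		rates[key] = rate
	}
	return rates
}
```

Note that the total number of events sent grows with the number of keys, which is the trade-off described below.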

Advantages: Because the throughput limit is fixed per key, you retain detail per key as the key space grows. When it’s simply important to get a minimum number of samples for every key, this is a good method to ensure that requirement.

Disadvantages: In order to avoid blowing out your metrics as your keyspace grows, you may need to set the per key limit relatively low, which gives you very poor resolution into the high volume keys. And as traffic grows within an individual key, you lose visibility into the details for that key.

This would be a good algorithm for something like an exception tracker, where more copies of the same exception don’t give you additional information (except that it’s still happening), but you want to make sure that you catch each different type of exception. When the presence of each key is the most important aspect, this works well.

Average Sample Rate

With this method, we’re starting to get fancier. The goal for this strategy is to achieve a given overall sample rate across all traffic. However, we want to capture more of the infrequent traffic to retain high fidelity visibility. We accomplish both these goals by increasing the sample rate on high volume traffic and decreasing it on low volume traffic such that the overall sample rate remains constant. This gets us the best of both worlds - we catch rare events and still get a good picture of the shape of frequent events.

Here’s how the sample rate is calculated for each key: we count the total number of events that came in and divide by the target sample rate to get the total number of events to send along to the analytics system. We then give each key an equal portion of that total, and work backwards to determine what the sample rate should be.

Sticking with the same example traffic as the previous two methods, we have keys a, b, and c with traffic of 900, 90, and 10 events coming in. Let’s use a goal sample rate of 20. (900+90+10) / 20 = 50, so our goal for the total number of events to send in to Honeycomb is 50. We have 3 keys, so each key should get 50 / 3 = 17 events (rounded). What sample rate would we need for key a to send 17 events? 900 / 17 = 52 (rounded down). For key b, 90 / 17 = 5 (rounded down), and for key c, 10 / 17 is less than 1, so it is raised to the minimum sample rate of 1. We now have our sample rates.

Here is the same table we used in the two previous examples:

key   traffic   sample rate   events sent
a     900       52            17
b     90        5             18
c     10        1             10
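The equal-share version of this calculation can be sketched as below. The function name is hypothetical, and the rounding choices (round the per-key share to the nearest integer, then round the division down) are assumptions picked only because they reproduce the 52, 5, and 1 in the table; this is not the library's implementation.

```go
package main

import "math"

// averageSampleRates splits total/goalRate events equally among the keys and
// works backwards to a per-key sample rate, never lower than 1.
func averageSampleRates(counts map[string]int, goalRate int) map[string]int {
	rates := make(map[string]int, len(counts))
	if len(counts) == 0 {
		return rates
	}
	if goalRate < 1 {
		goalRate = 1
	}
	total := 0
	for _, n := range counts {
		total += n
	}
	goalEvents := float64(total) / float64(goalRate) // 1000 / 20 = 50
	// Equal share per key, rounded to nearest: 50 / 3 rounds to 17.
	perKey := int(math.Round(goalEvents / float64(len(counts))))
	if perKey < 1 {
		perKey = 1
	}
	for key, n := range counts {
		rate := n / perKey // integer division rounds down: 900 / 17 = 52
		if rate < 1 {
			rate = 1
		}
		rates[key] = rate
	}
	return rates
}
```

With these rates the three keys send 17 + 18 + 10 = 45 events, slightly under the goal of 50, because key c has fewer events than its share.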

Advantages: When rare events are more interesting than common events, and the volume of incoming events across the key spectrum is wildly different, the average sample rate is an excellent choice. Picking just one number (the target sample rate) is as easy as constant sampling but you magically get wonderful resolution into the long tail of your traffic while still keeping your overall traffic volume manageable.

Disadvantages: High volume traffic is sampled very aggressively. In the example above, key a keeps only 1 in 52 events.

At Honeycomb (and in the library below) we actually apply one additional twist to the Average Sample Rate method. The description above weights all keys equally. But shouldn’t high volume keys actually have more representation than low volume keys? We choose a middle ground by using the logarithm of the count per key to influence how much of the total number of events sent into Honeycomb are assigned to each key—a key with 10^x the volume of incoming traffic will have x times the representation in the sampled traffic. For more details, take a look at the implementation for the average sample rate method linked below.
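To make the logarithmic weighting concrete, here is a simplified sketch that splits a goal number of events among keys in proportion to log10 of each key's count. It is an illustration of the idea only: the function name is hypothetical, and the real implementation in the library has to handle details this sketch ignores (for example, a key with a single event has a weight of zero here).

```go
package main

import "math"

// logWeightedShares splits goalEvents among keys in proportion to log10 of
// each key's count, so a key with 10^x events gets x times the share of a
// key with 10 events.
func logWeightedShares(counts map[string]int, goalEvents float64) map[string]float64 {
	weights := make(map[string]float64, len(counts))
	sum := 0.0
	for key, n := range counts {
		if n < 1 {
			continue // skip keys with no traffic
		}
		w := math.Log10(float64(n)) // note: a count of 1 yields weight 0
		weights[key] = w
		sum += w
	}
	shares := make(map[string]float64, len(weights))
	if sum == 0 {
		return shares
	}
	for key, w := range weights {
		shares[key] = goalEvents * w / sum
	}
	return shares
}
```

For the running example (900, 90, and 10 events with a goal of 50 events), this gives shares of roughly 25, 16.5, and 8.5 instead of 17 each, so key a ends up with a noticeably lower sample rate than under the equal split.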

Average Sample Rate with Minimum Per Key

To really mix things up, let’s combine two methods! Since we’re choosing the sample rate for each key dynamically, there’s no reason why we can’t also choose which method we use to determine that sample rate dynamically!

One disadvantage of the average sample rate method is that if you set a high target sample rate but have very little traffic, you will wind up over-sampling traffic you could actually send with a lower sample rate. For example, consider setting a target sample rate of 50 but then only actually having 30 events total! Clearly there’s no need to sample so heavily when you have very little traffic. So what should you do when your traffic patterns are such that one method doesn’t always fit? Use two!

For example: when your traffic is below 1,000 events per 30 seconds, don’t sample. When you exceed 1,000 events during your 30 second sample window, switch to average sample rate with a target sample rate of 100.
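Switching between the two behaviors might look like the sketch below. The `chooseRates` name is hypothetical, and the dynamic method is passed in as a function so that any rate calculation (such as an average sample rate with a target of 100) can be used once the threshold is exceeded.

```go
package main

// chooseRates keeps every event (rate 1 for all keys) while the window's
// total traffic is at or below threshold, and otherwise defers to the
// supplied dynamic calculation.
func chooseRates(
	counts map[string]int,
	threshold int,
	dynamic func(map[string]int) map[string]int,
) map[string]int {
	total := 0
	for _, n := range counts {
		total += n
	}
	if total <= threshold {
		rates := make(map[string]int, len(counts))
		for key := range counts {
			rates[key] = 1 // low traffic: no sampling needed
		}
		return rates
	}
	return dynamic(counts)
}
```

With a threshold of 1,000, a window containing 30 events is sent unsampled, while a busy window falls through to the dynamic method.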

By combining different methods together, you mitigate each of their disadvantages and keep full detail when you have the capacity, and gradually drop more of your traffic as volume grows.

Conclusion

Sampling is great. It is the only reasonable way to keep high value contextually aware information about your service while still being able to scale to a high volume. As your service grows, you’ll find yourself sampling at 100/1, 1000/1, 50000/1. At these volumes, statistics will let you be sure that any problem will eventually make it through your sample selection, and using a dynamic sampling method will make sure the odds are in your favor.

We’ve implemented the sampling methods mentioned here as a Go library and released them at https://github.com/honeycombio/dynsampler-go. We would love additional methods of sampling as contributions!

Instrument your service to create wide, contextual events. Sample them in a way that lets you get good visibility into the areas of your service that need the most introspection, and sign up for Honeycomb to slice up your data today!

Have thoughts on this post? Let us know via Twitter @honeycombio.