Calculating Sampling’s Impact on SLOs and More

Sampling

By: Max Aguirre | August 30th, 2023

What do mall food courts and Honeycomb have in common? We both love sampling! Not only do we recommend it to many of our customers, we do it ourselves. But once Refinery (our tail-based sampling proxy) is set up, what comes next?

Since sampling is inherently lossy, it’s good to be sure the organization’s most important measurements aren’t negatively affected. One option I recommend is to calculate the margin of error impact on your Service Level Objectives (SLOs) and triggers using Heinrich Hartmann’s Sampling Error Calculator.

What’s that about a calculator?

Heinrich Hartmann’s Sampling Error Calculator (source code) both estimates and simulates the margin of error introduced by a given sample rate, request rate, and time window. Keep in mind that the calculator gives us an estimate, so the results won’t show the exact impact of sampling on margin of error. Instead, it provides a general guideline for judging how much trust we can put into SLOs and triggers.
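To make the "simulate" idea concrete, here is a toy Monte Carlo run in Python. It is not the calculator's code. It assumes every event is kept independently with the same probability, which is a simplification of how Refinery actually decides, and the event count, keep probability, and function name are all hypothetical.

```python
import random
import statistics

def simulate_relative_errors(true_count, keep_probability, trials, seed=0):
    """Sample `true_count` events repeatedly and return each run's relative error."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    errors = []
    for _ in range(trials):
        # Assumption: keep each event independently with `keep_probability`.
        kept = sum(1 for _ in range(true_count) if rng.random() < keep_probability)
        # Scale the kept events back up to estimate the original count.
        estimated_count = kept / keep_probability
        errors.append(abs(estimated_count - true_count) / true_count)
    return errors

# Hypothetical inputs: 2,000 events in the window, roughly 1 in 10 kept.
errors = simulate_relative_errors(true_count=2000, keep_probability=0.10, trials=200)
print(f"mean relative error:  {statistics.mean(errors):.2%}")
print(f"worst relative error: {max(errors):.2%}")
```

Raising `true_count` (a higher request rate or a longer window) or `keep_probability` should shrink both numbers, which is the same trade-off the calculator lets you explore.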

Obtaining calculator inputs

Our example is a service that we use internally to track our batch processing. The threshold is set at 5ms processing time per batch, with the expectation that some will take longer than others.

Enter usage mode to access the Sample Rate field added by Refinery. Since SLOs are bound to a specific dataset, we’ll navigate to Team Settings > Usage > Usage Mode for the dataset this SLI belongs to. Once in usage mode, make the following query:

VISUALIZE COUNT, MAX(Sample Rate), AVG(Sample Rate), MIN(Sample Rate) WHERE sli exists GROUP BY sli

The time window can be adjusted based on your ingest. If the query returns lower counts (fewer events per second), go ahead and bump it up to a 28-day window.

So, what do these results even mean? For the sli column, true means that the event fell within the SLI specification and false means there was a miss. In this example, the average sample rate for true is ~12.81, or ~7.8% of events kept (1/12.81), and for false it’s ~6.05, or ~16.5% (1/6.05). We also need the count for true and false. To get it, hover over the count chart and find the peak value for each. For this example SLO, it’s 21011588 and 2026 events per second respectively.
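The conversion from a 1-in-N sample rate to the percentage the calculator expects is just a reciprocal. A minimal Python sketch using the averages quoted above (the function name is made up for illustration):

```python
def sample_rate_to_percent(sample_rate: float) -> float:
    """Convert a 1-in-N sample rate into the percentage of events kept."""
    if sample_rate < 1:
        raise ValueError("sample rate must be at least 1")
    return 100.0 / sample_rate

# Average sample rates quoted in the text for each SLI outcome.
for outcome, avg_rate in {"true": 12.81, "false": 6.05}.items():
    print(f"sli = {outcome}: ~{sample_rate_to_percent(avg_rate):.1f}% of events kept")
```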

Calculating the margin of error

Open up Heinrich Hartmann’s Sampling Error Calculator and plug these numbers in. We’ll begin with sli = true. Start by entering the sample rate of 7.8%, the request rate (our count/sec) of 2026, and the time window of 30 days (the SLO’s time period). This gives us a Relative Error Count of 0.07%. Plugging in the numbers for sli = false with the same time window gives us a Relative Error Count of 0.04%.

We have some numbers! Cool, but what do they mean? The sli = true calculated Relative Error Count tells us our margin of error for the SLO’s historical compliance and sli = false gives us the margin of error for the SLO’s budget burndown.

This Honeycomb SLO has a burndown margin of error of +/- 0.04% and compliance margin of error of +/- 0.07%. Pretty good numbers! It’s also useful to run this again with the time window matching that of a burn alert to see how the margin of error changes with smaller periods of time.
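To get a feel for why shorter windows widen the margin, here is a rough Python sketch. It is not the calculator's method: it assumes every event is kept independently with the same probability and uses the textbook standard error of a scaled-up count, so its output should not be expected to match the calculator's Relative Error Count. The request rate, kept fraction, and window lengths are hypothetical.

```python
import math

def relative_standard_error(kept_fraction, events_per_second, window_seconds):
    """Approximate relative standard error of a count estimated from a sample.

    Assumes independent, uniform sampling: with N events and keep
    probability p, the scaled-up count has relative standard error
    sqrt((1 - p) / (p * N)).
    """
    total_events = events_per_second * window_seconds
    return math.sqrt((1.0 - kept_fraction) / (kept_fraction * total_events))

# Hypothetical service: 50 events per second, 10% of events kept.
windows = {
    "30 days (SLO period)": 30 * 24 * 3600,
    "1 day": 24 * 3600,
    "1 hour (burn alert)": 3600,
    "5 minutes": 5 * 60,
}
for label, seconds in windows.items():
    error = relative_standard_error(0.10, 50, seconds)
    print(f"{label:>22}: +/- {error:.3%}")
```

The error grows as the window shrinks because fewer events land in it, which is why a burn alert can deserve less trust than the SLO's 30-day compliance number.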

To go one step further, repeat this calculation against triggers. The same method (going into usage mode and calculating from the trigger’s query results) gives you the margin of error for alerts. This can help you decide whether an alert will be trusted by those it wakes up in the middle of the night. A 0.5% margin of error? You’ll take that seriously. A 15% margin? You probably won’t.

Room to tune

If your margin of error turns out to be unacceptably high, then it’s time to consider making some changes. Reducing sampling to increase fidelity is likely the path forward, but modifying the SLI formula an SLO is based on—or changing what a trigger targets—are also options.

Before simply slashing the sample rate, consider how the current Refinery rules affect the SLO or trigger in question. Perhaps the deterministic sampler, which uses a fixed sample rate, is used and changing to another method would function better. Or, maybe traces containing a key field are sampled too aggressively and we could add a specific rule.

In the SLO used above, we look at response.status_code = 200. It’s normal to sample 200s fairly aggressively, as successful responses often don’t make for the most interesting traces. However, if we’re too aggressive, we could impact this SLO negatively. Rather than setting a fixed sample rate for all 200s, it might work better to treat those which match handler.route = /1/batch/{dataset_name} differently than others and either keep all or apply a lower sample rate to them with rules-based sampler conditions. Here’s an example that keeps all matching traces:

RulesBasedSampler:
    Rules:
      - Name: keep 200 responses with route match
        SampleRate: 1
        Conditions:
            - Field: response.status_code
              Operator: =
              Value: 200
              Datatype: int
            - Field: handler.route
              Operator: contains
              Value: /1/batch/

Conclusion

We all want to get the most value out of the least amount of cost and processing overhead. Using sampling is the best approach to maintaining trace integrity and depth while removing some percentage of similar spans. By applying the approach outlined here, you can optimize your event usage.

Please keep in mind that business goals change, instrumentation changes, and as such, sampling rules and SLIs will need adjustment. Whenever you add a new SLI or refine an existing SLI, take a minute to ensure your data is sufficient to support the SLI at the desired target.

Want to read more about sampling? We have a few articles for you:

Achieving Great Dynamic Sampling with Refinery
Don’t Let Observability Inflate Your Cloud Costs
The Evolution of Sampling in Honeycomb: Introducing Refinery 2.0
