Ask Miss O11y: Load Testing With Fidelity

By: Liz Fong-Jones | December 9th, 2021

Dear Miss O11y,

My developers and I can’t agree about what the right approach is for running load tests in production. Should we even be running load tests against our production infrastructure, or is it too risky? How do we ensure our service-level objectives (SLOs) stay correct? And how do we avoid overloading our observability provider or getting a surprise bill?

–Perplexed About Performance

Hey Perplexed,

We’re huge advocates for testing in production, and that includes performing chaos engineering/continuous verification in production. But you’re right to want to be cautious about exactly how you are performing these tests. After all, it’s chaos engineering, not just pure unbridled chaos.

In order to stress test responsibly, you’ll want to determine in advance a hypothesis you want to test, know how you intend to measure the results of the experiment, and have an emergency stop button should things go awry. Additionally, you’ll want to have plenty of spare error budget in case things do go wrong; if you’re running out of error budget already, chances are you have plenty of known unknowns you need to deal with first before you go in search of unknown unknowns.
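To make the "emergency stop button" idea concrete, here is a minimal sketch in Python of a load test loop with two guardrails: an operator-controlled stop check and an automatic abort when the error rate crosses a threshold. Every name here (`run_load_test`, `send_request`, `stop_requested`, the 1% threshold, the minimum sample of 100) is a hypothetical chosen for illustration, not part of any particular load testing tool.

```python
from typing import Callable, Dict, Union


def run_load_test(
    send_request: Callable[[], bool],    # returns True when the request succeeded
    stop_requested: Callable[[], bool],  # the "emergency stop button"
    total_requests: int,
    max_error_rate: float = 0.01,        # assumed abort threshold (1%)
    min_sample: int = 100,               # don't judge the error rate on tiny samples
) -> Dict[str, Union[int, str]]:
    """Send up to total_requests, stopping early if a guardrail trips."""
    sent = 0
    errors = 0
    reason = "completed"
    while sent < total_requests:
        # Check the stop button before every request so a human can always win.
        if stop_requested():
            reason = "stopped by operator"
            break
        if not send_request():
            errors += 1
        sent += 1
        # Automatic abort: protect the error budget if the test is hurting.
        if sent >= min_sample and errors / sent > max_error_rate:
            reason = "error rate exceeded"
            break
    return {"sent": sent, "errors": errors, "reason": reason}


# A healthy run completes; a failing run aborts as soon as it has enough data.
print(run_load_test(lambda: True, lambda: False, total_requests=500))
print(run_load_test(lambda: False, lambda: False, total_requests=500))
```

In a real test, `stop_requested` might read a feature flag or a file that anyone on call can flip, which is what makes it a stop button that everyone knows how to reach.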

Therefore, when it comes to load testing in production, you’ll need to understand how much headroom you have both off-peak and on-peak. If you keep slamming your service with the level of extra traffic that’s safe off-peak, you’ll wind up failing on-peak, when the load test is stacked on top of peak customer load. You’ll also want to ensure everyone is in the loop on when and where the test is happening, and how to disable it in the event of problems.
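The headroom arithmetic is simple enough to sketch. The Python below assumes you know roughly how many requests per second your service can handle and how much customer traffic you see at a given time; the function name, the numbers, and the 25% safety reserve are all made up for illustration.

```python
def safe_load_test_rps(
    capacity_rps: float,
    customer_rps: float,
    safety_margin: float = 0.25,  # assumed fraction of capacity held in reserve
) -> float:
    """Extra requests/sec a load test can add at the current customer load."""
    usable = capacity_rps * (1.0 - safety_margin)
    return max(0.0, usable - customer_rps)


# Hypothetical service: 10,000 rps capacity, 2,000 rps off-peak, 7,000 rps on-peak.
off_peak = safe_load_test_rps(capacity_rps=10_000, customer_rps=2_000)
on_peak = safe_load_test_rps(capacity_rps=10_000, customer_rps=7_000)
print(off_peak)  # 5500.0
print(on_peak)   # 500.0
```

With these made-up numbers, a test sized for off-peak (5,500 extra rps) would push the service well past its usable capacity if it were still running at peak.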

The question you’re alluding to around observability and SLOs is a great one. We feel SLOs should be measurements of the health of requests from end users, rather than artificial traffic. As our CTO Charity Majors likes to say, “Nines don’t matter if users aren’t happy.” If your SLO is flooded with millions of “successful” requests from your load test, but your users are seeing errors, then your SLO claiming that it’s in compliance is not reflecting actual user experience.

Conversely, if you need to back out your load test because it’s causing too many errors, but you have protected end user traffic with traffic prioritization or QoS headers, you shouldn’t be penalized for “missing” your SLO even when no real users have suffered! Honeycomb SLOs make it easy to qualify and exclude spans with a certain attribute from consideration in a service-level indicator (SLI). Set IF(AND(NOT(EXISTS($app.is_loadtest)), EQUALS($service.name, "my_service_name"), […]), […]) and proceed onward with your test, knowing you won’t be counting load test traffic towards your SLO.
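To see why excluding load test traffic matters, here is a rough Python model of that qualification logic. It is a sketch of the idea, not Honeycomb's implementation: an event that carries `app.is_loadtest`, or that belongs to another service, returns `None` and is left out of the SLI entirely. The success criterion used here (HTTP status below 500) is purely an assumption for the example, since the expression above elides that part.

```python
from typing import Dict, List, Optional


def sli(event: Dict, service_name: str = "my_service_name") -> Optional[bool]:
    """Return True/False for qualified events, None for events that don't count."""
    if "app.is_loadtest" in event:
        return None  # load test traffic never counts toward the SLO
    if event.get("service.name") != service_name:
        return None  # a different service's SLI
    # Assumed success criterion for illustration only.
    return event.get("http.status_code", 200) < 500


def compliance(events: List[Dict]) -> Optional[float]:
    """Fraction of qualified events that succeeded."""
    results = [r for r in (sli(e) for e in events) if r is not None]
    return sum(results) / len(results) if results else None


real = [
    {"service.name": "my_service_name", "http.status_code": 200},
    {"service.name": "my_service_name", "http.status_code": 200},
    {"service.name": "my_service_name", "http.status_code": 503},
]
load = [
    {"service.name": "my_service_name", "http.status_code": 200, "app.is_loadtest": True}
    for _ in range(1000)
]

# Only the three real requests qualify, so the one failure is clearly visible.
print(compliance(real + load))
```

Without the exclusion, the thousand "successful" load test requests would drown out the one real user who saw an error.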

And with regard to observability, you should have visibility into both the performance of load test traffic and real end-user traffic; however, not all that traffic is equally valuable. Real end-user traffic is multifaceted and diverse in all dimensions, whereas artificially generated traffic tends to look very self-similar and is of lower value. As long as you differentiate artificial from end-user traffic in your telemetry, for example by sending an HTTP header with the client request and setting an attribute on the root span, you can treat that traffic differently for sampling purposes.
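Here is a minimal sketch of that tagging step in Python, using plain dictionaries instead of a real tracing library. The header name `x-load-test` is a hypothetical choice; use whatever your load generator actually sends. The attribute is only set when the header is present, so an existence check like `EXISTS($app.is_loadtest)` stays false for real users.

```python
from typing import Dict, Mapping

LOADTEST_HEADER = "x-load-test"  # hypothetical header set by the load generator


def root_span_attributes(headers: Mapping[str, str], service_name: str) -> Dict[str, object]:
    """Build root span attributes, marking load test requests."""
    attrs: Dict[str, object] = {"service.name": service_name}
    # HTTP header names are case-insensitive, so normalize before looking up.
    normalized = {k.lower(): v for k, v in headers.items()}
    if normalized.get(LOADTEST_HEADER, "").strip().lower() in ("1", "true"):
        # Only add the attribute for load test traffic; real traffic omits it.
        attrs["app.is_loadtest"] = True
    return attrs


print(root_span_attributes({"X-Load-Test": "true"}, "my_service_name"))
print(root_span_attributes({"User-Agent": "browser"}, "my_service_name"))
```

With an instrumentation library such as OpenTelemetry, the equivalent step would be a `set_attribute` call on the root span in your request handler or middleware.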

Tail sampling with Honeycomb Refinery frees you from having to propagate that information all the way downstream: it can apply a different sampling rate to the entire trace based on the presence or absence of the load testing attribute in the root span. That lets you keep debugging end-user traffic in full fidelity without blowing out your observability bill, while still getting a snapshot of how the load test is performing and where it might be slowing down. Being able to visualize what’s genuine and what’s part of the test can also help clarify operational alerting and response: your operational dashboards should include and highlight the extra traffic, but your user experience reporting should exclude it.
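The decision a tail sampler makes can be modeled in a few lines of Python. This is a toy illustration of the concept, not Refinery's code or configuration: it looks only at the root span, picks a rate, and keeps or drops the whole trace together. The rates (1 in 100 for load tests, everything for real users), the field names, and the hash-based selection are all assumptions for the example.

```python
import hashlib
from typing import Dict, List

LOADTEST_SAMPLE_RATE = 100  # assumed: keep roughly 1 in 100 load test traces
DEFAULT_SAMPLE_RATE = 1     # assumed: keep every real end-user trace


def sample_rate_for(root_span: Dict) -> int:
    """Choose a rate from the root span alone; child spans need no tagging."""
    return LOADTEST_SAMPLE_RATE if "app.is_loadtest" in root_span else DEFAULT_SAMPLE_RATE


def keep_trace(trace_id: str, rate: int) -> bool:
    """Deterministically keep about 1 in `rate` traces based on the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


def sample(traces: Dict[str, List[Dict]]) -> List[Dict]:
    """Keep or drop each complete trace; record the rate on every kept span."""
    kept: List[Dict] = []
    for trace_id, spans in traces.items():
        root = next((s for s in spans if s.get("parent_id") is None), None)
        if root is None:
            continue  # toy simplification: skip traces whose root never arrived
        rate = sample_rate_for(root)
        if keep_trace(trace_id, rate):
            kept.extend({**s, "sample_rate": rate} for s in spans)
    return kept


def make_trace(trace_id: str, is_loadtest: bool) -> List[Dict]:
    root = {"trace_id": trace_id, "span_id": "a", "parent_id": None}
    if is_loadtest:
        root["app.is_loadtest"] = True
    # The downstream span carries no load test marker at all.
    child = {"trace_id": trace_id, "span_id": "b", "parent_id": "a"}
    return [root, child]


traces = {f"user-{i}": make_trace(f"user-{i}", False) for i in range(3)}
traces.update({f"load-{i}": make_trace(f"load-{i}", True) for i in range(1000)})
kept = sample(traces)
print(len({s["trace_id"] for s in kept}), "of", len(traces), "traces kept")
```

Recording the sample rate on each kept span is what lets a backend weight the surviving load test traces so that counts and graphs still reflect the full volume.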

(Load) testing in production is a great idea, as long as you have the appropriate guardrails. It might even help you exercise the paths that generate telemetry, so you can confirm that emitting and processing that telemetry doesn’t add too much overhead. Make sure you’re sampling load test traffic heavily (keeping only a small fraction of it) and excluding it from your SLIs and SLOs, and you’ll be in excellent shape.

May today be a good day to test in production!

Liz

Have a question for Miss O11y? Send us an email!

Liz Fong-Jones

Field CTO

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
