Dynamic Sampling by Example

Query count and sampling chance plotted


Recording the sample rate

What if we need to change the flagged value at some point in the future? The instrumentation collector wouldn’t know exactly when the value changed. Thus, it’s better to explicitly pass the current sampleRate when sending a sampled event — indicating the event statistically represents sampleRate similar events.

2x 'ok' at rate 100, 3x 'ok' at rate 80, and 2x 'err' at rate 1

// Note: sampleRate can be specific to this service and doesn't have to be universal!
var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
	start := time.Now()
	i, err := callAnotherService()
	resp.Write(i)

	r := rand.Float64()
	if r < 1.0/float64(*sampleRate) {
		RecordEvent(req, *sampleRate, start, err)
	}
}

This way, we can keep track of the sampling rate in effect when each sampled event was recorded. This gives us the data to compute accurate aggregates even when the sampling rate varies over time or differs between services. For example, if we were trying to estimate the total number of events matching a filter such as “err != nil“, we’d sum the sampleRate of each seen event with “err != nil”, rather than counting each event once. And if we were trying to calculate the sum of durationMs, we’d need to weight each sampled event’s durationMs, multiplying it by its sampleRate before adding the weighted figures all up.

200/1, and 240/1 after reweighting

There’s more to consider about how sampling rates and tracing work together, which we’ll cover in the next section.

Consistent sampling

We also need to consider how sampling interacts with tracing. Instead of independently generating a sampling decision inside of each handler, we should use a centrally generated “sampling/tracing ID” propagated to all downstream handlers. Why? This lets us make consistent sampling decisions between different manifestations of the same end user’s request. It would be unfortunate to discover that we have sampled an error far downstream for which the upstream context is missing because it was dropped. Consistent sampling guarantees that if a 1:100 sampling occurs, a 1:99, 1:98, etc. sampling preceding or following it also preserves the execution context. And half of the events chosen by a 1:100 sampling will be present under a 1:200 sampling.

bitcoin hash-like set of hashes, some of which end in '000' and are selected; others of which are dropped.

var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
	// Use the upstream-generated random sampling ID if it exists;
	// otherwise we're a root span, so generate (and pass down) a random ID.
	r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
	if err != nil {
		r = rand.Float64()
	}

	start := time.Now()
	// Propagate the Sampling-ID when creating a child span
	i, err := callAnotherService(r)
	resp.Write(i)

	if r < 1.0/float64(*sampleRate) {
		RecordEvent(req, *sampleRate, start, err)
	}
}

Now we have support for adjusting the sample rate without recompiling the service. But why adjust the rate manually at all? In the next chapter, we’ll discuss Target Rate Sampling.

Target Rate Sampling

We don’t need to manually flag-adjust the sampling rates for each of our services as traffic swells and sags; instead, we can automate this by tracking the incoming request rate that we’re receiving!

spiking graph of rate, reacting decrease in probability, and smoothed spike

var targetEventsPerSec = flag.Int("targetEventsPerSec", 5, "The target number of requests per second to sample from this service.")

// Note: sampleRate can be a float! doesn't have to be an integer.
var sampleRate float64 = 1.0
// Track requests from previous minute to decide sampling rate for the next minute.
var requestsInPastMinute *int

func main() {
	// Initialize counters.
	rc := 0
	requestsInPastMinute = &rc

	go func() {
		for {
			time.Sleep(time.Minute)
			newSampleRate := float64(*requestsInPastMinute) / float64(60 * *targetEventsPerSec)
			if newSampleRate < 1 {
				sampleRate = 1.0
			} else {
				sampleRate = newSampleRate
			}
			newRequestCounter := 0
			// Production code would do something less race-y, but this is readable
			requestsInPastMinute = &newRequestCounter
		}
	}()
	http.Handle("/", handler)
	[...]
}

func handler(resp http.ResponseWriter, req *http.Request) {
	r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
	if err != nil {
		r = rand.Float64()
	}

	start := time.Now()
	*requestsInPastMinute++
	i, err := callAnotherService(r)
	resp.Write(i)

	if r < 1.0 / sampleRate {
		RecordEvent(req, sampleRate, start, err)
	}
}

The previous code gives us a predictable retention window (or bill, if we pay another service for collection). However, it has one significant drawback, which we’ll address in the next chapter on per-key rates.

Liz Fong-Jones

Field CTO

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
