Dynamic Sampling by Example

By: Liz Fong-Jones | May 17th, 2019

Having more than one static sample rate

If the sample rate is high (that is, we keep only 1 in N events for a large N), whether it was set statically or dynamically, we will miss long-tail events such as errors or high-latency requests, because the chance that a 99.9th percentile outlier is also picked by random sampling is slim. Likewise, we may want at least some data for each distinct customer, rather than letting high-volume customers drown out low-volume ones.
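To put rough numbers on that, here is a small sketch. The request volume is made up for illustration, and `expectedSampled` and `probNoneSampled` are hypothetical helper names. It asks how many 99.9th percentile outliers we should expect to keep with 1 in 1000 sampling, and how likely it is that we keep none of them.

```go
package main

import (
	"fmt"
	"math"
)

// expectedSampled returns the expected number of events kept when each of
// `events` events is kept independently with probability 1/sampleRate.
func expectedSampled(events, sampleRate float64) float64 {
	return events / sampleRate
}

// probNoneSampled returns the probability that none of `events` events is
// kept under the same independent 1-in-sampleRate sampling.
func probNoneSampled(events, sampleRate float64) float64 {
	return math.Pow(1-1/sampleRate, events)
}

func main() {
	const total = 1_000_000.0 // hypothetical request volume
	const sampleRate = 1000.0 // keep 1 in 1000
	outliers := total * 0.001 // requests above the 99.9th percentile

	fmt.Printf("outliers: %.0f\n", outliers)
	fmt.Printf("expected outliers kept: %.1f\n", expectedSampled(outliers, sampleRate))
	fmt.Printf("chance of keeping none: %.2f\n", probNoneSampled(outliers, sampleRate))
}
```

Under these assumptions we would expect to keep about one outlier per million requests, and roughly a third of the time we would keep none at all.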

So, we’d like to sample events by a property of the event itself, such as the return status, latency, endpoint, or a high cardinality field like customer ID. For properties present in the request itself such as endpoint or customer ID, we can perform “head sampling” and make the decision to sample or not at the start of execution and propagate that decision further downstream (e.g. with a “require sampling” header bit) so we get full traces.

But for return status and latency, we know only in retrospect whether they’re interesting outliers; this is “tail sampling”. Downstream services already have independently chosen whether to discard or instrument, so at best we’ll have the outlying downstream spans, but none of the other context. To collect full traces and perform tail sampling, some collector-side logic is required to buffer entire traces and retrospectively decide what to keep. This buffered sampling technique is not feasible entirely from within the instrumented code.
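To make the buffering idea concrete, here is a rough sketch of collector-side logic. The `Span` and `Buffer` types are hypothetical and heavily simplified: a real collector would also need timeouts for traces whose root span never arrives, and memory limits. Spans are held per trace ID until the root span shows up, and only then do we decide whether to keep the whole trace.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span record.
type Span struct {
	TraceID    string
	Name       string
	DurationMs int
	Err        bool
	Root       bool
}

// Buffer holds spans per trace until the root span arrives.
type Buffer struct {
	pending map[string][]Span
}

func NewBuffer() *Buffer {
	return &Buffer{pending: map[string][]Span{}}
}

// isOutlier reports whether any span in the trace errored or took over 500ms.
func isOutlier(trace []Span) bool {
	for _, s := range trace {
		if s.Err || s.DurationMs > 500 {
			return true
		}
	}
	return false
}

// Add buffers a span. When the root span arrives, the trace is treated as
// complete: outlier traces are always kept, and the rest are kept 1 in
// baselineRate using the supplied random draw r.
func (b *Buffer) Add(s Span, r float64, baselineRate int) (trace []Span, done, keep bool) {
	b.pending[s.TraceID] = append(b.pending[s.TraceID], s)
	if !s.Root {
		return nil, false, false
	}
	trace = b.pending[s.TraceID]
	delete(b.pending, s.TraceID)
	return trace, true, isOutlier(trace) || r < 1.0/float64(baselineRate)
}

func main() {
	b := NewBuffer()
	b.Add(Span{TraceID: "t1", Name: "db", DurationMs: 40, Err: true}, 0.9, 1000)
	trace, done, keep := b.Add(Span{TraceID: "t1", Name: "GET /", DurationMs: 60, Root: true}, 0.9, 1000)
	fmt.Println(len(trace), done, keep)
}
```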

Let’s start varying the sample rates by key. We can sample the baseline non-outlier events at 1 in 1000 and choose to tail sample the errors & slow queries 1:1 or 1:5. This is still vulnerable to spikes of instrumentation cost if we get a spike in the rate of errors. Modifying the original flat sampling code, we get:

import (
	"flag"
	"math/rand"
	"net/http"
	"time"
)

var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")
var outlierSampleRate = flag.Int("outlierSampleRate", 5, "Outlier sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
	start := time.Now()
	i, err := callAnotherService(req)
	resp.Write(i)

	r := rand.Float64()
	if err != nil || time.Since(start) > 500*time.Millisecond {
		// Outlier (error or slow request): keep 1 in *outlierSampleRate.
		if r < 1.0/float64(*outlierSampleRate) {
			RecordEvent(req, *outlierSampleRate, start, err)
		}
	} else {
		// Baseline request: keep 1 in *sampleRate.
		if r < 1.0/float64(*sampleRate) {
			RecordEvent(req, *sampleRate, start, err)
		}
	}
}
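Note that each recorded event carries the sample rate it was kept at. That is what lets us mix rates without skewing our numbers: an event kept at 1 in N stands in for N original events. A minimal sketch of that weighting follows, where the `Event` type and `estimate` function are hypothetical names for illustration.

```go
package main

import "fmt"

// Event is a hypothetical recorded event that carries the sample rate it was
// kept at, as passed to RecordEvent in the handler above.
type Event struct {
	SampleRate int
	Outlier    bool
}

// estimate reconstructs approximate original request counts by letting each
// kept event stand in for SampleRate original events.
func estimate(events []Event) (baseline, outliers int) {
	for _, e := range events {
		if e.Outlier {
			outliers += e.SampleRate
		} else {
			baseline += e.SampleRate
		}
	}
	return baseline, outliers
}

func main() {
	kept := []Event{
		{SampleRate: 1000}, {SampleRate: 1000}, {SampleRate: 1000},
		{SampleRate: 5, Outlier: true}, {SampleRate: 5, Outlier: true},
		{SampleRate: 5, Outlier: true}, {SampleRate: 5, Outlier: true},
	}
	baseline, outliers := estimate(kept)
	fmt.Println(baseline, outliers) // prints "3000 20"
}
```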

So we can support having multiple different sample rates. But how does this work with Target Rate Sampling?

Sampling by key and target rate

Putting the two techniques together, let’s extend what we’ve already done to target specific rates of instrumentation. If a request is anomalous (it has latency above 500ms or is an error), we choose it for tail sampling at its own guaranteed rate, while rate limiting the other requests to fit within a budget of instrumented requests per second, as before:

var targetEventsPerSec = flag.Int("targetEventsPerSec", 4, "The target number of ordinary requests per second to sample from this service.")
var outlierEventsPerSec = flag.Int("outlierEventsPerSec", 1, "The target number of outlier requests per second to sample from this service.")

var sampleRate float64 = 1.0
var requestsInPastMinute *int

var outlierSampleRate float64 = 1.0
var outliersInPastMinute *int

func main() {
	flag.Parse()

	// Initialize counters.
	rc := 0
	requestsInPastMinute = &rc
	oc := 0
	outliersInPastMinute = &oc

	// Once a minute, recompute both sample rates from the traffic seen in the
	// previous minute, then reset the counters. For brevity the counters and
	// rates are not synchronized; real code should guard them with a mutex or
	// sync/atomic.
	go func() {
		for {
			time.Sleep(time.Minute)
			newSampleRate := float64(*requestsInPastMinute) / float64(60 * *targetEventsPerSec)
			if newSampleRate < 1 {
				sampleRate = 1.0
			} else {
				sampleRate = newSampleRate
			}
			newRequestCounter := 0
			requestsInPastMinute = &newRequestCounter

			newOutlierRate := float64(*outliersInPastMinute) / float64(60 * *outlierEventsPerSec)
			if newOutlierRate < 1 {
				outlierSampleRate = 1.0
			} else {
				outlierSampleRate = newOutlierRate
			}
			newOutlierCounter := 0
			outliersInPastMinute = &newOutlierCounter
		}
	}()
	http.HandleFunc("/", handler)
	[...]
}

func handler(resp http.ResponseWriter, req *http.Request) {
	// Use the value derived from the Sampling-ID header if one was sent;
	// otherwise fall back to a random number.
	r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
	if err != nil {
		r = rand.Float64()
	}
	start := time.Now()
	i, err := callAnotherService(req)
	resp.Write(i)
	if err != nil || time.Since(start) > 500*time.Millisecond {
		// Outlier: count it, then keep 1 in outlierSampleRate.
		*outliersInPastMinute++
		if r < 1.0/outlierSampleRate {
			RecordEvent(req, outlierSampleRate, start, err)
		}
	} else {
		// Ordinary request: count it, then keep 1 in sampleRate.
		*requestsInPastMinute++
		if r < 1.0/sampleRate {
			RecordEvent(req, sampleRate, start, err)
		}
	}
}

Whew. That has a number of awkward copy-pastes, so we probably shouldn’t paste again to support a third category. Instead, we need to support arbitrarily many keys.
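One possible shape for that, sketched under assumptions: keep the counter and the current rate for each key in a map, so the same two functions serve every category. The `Sampler` type, its method names, and the string keys are hypothetical, and like the code above this sketch is not safe for concurrent use without a mutex.

```go
package main

import "fmt"

// keyState tracks one sampling category (for example "ok", "error", "slow").
type keyState struct {
	count      int     // events seen in the current window
	sampleRate float64 // keep 1 in sampleRate; never below 1
}

// Sampler keeps an independent counter and rate for arbitrarily many keys.
type Sampler struct {
	defaultTarget int            // target kept events per second per key
	targets       map[string]int // optional per-key overrides
	keys          map[string]*keyState
}

func NewSampler(defaultTarget int, targets map[string]int) *Sampler {
	return &Sampler{defaultTarget: defaultTarget, targets: targets, keys: map[string]*keyState{}}
}

// Observe counts one event for key and reports the current rate and whether
// to keep the event. r is a uniform random number in [0, 1).
func (s *Sampler) Observe(key string, r float64) (rate float64, keep bool) {
	st, ok := s.keys[key]
	if !ok {
		st = &keyState{sampleRate: 1.0}
		s.keys[key] = st
	}
	st.count++
	return st.sampleRate, r < 1.0/st.sampleRate
}

// Rotate recomputes every key's rate from the window that just ended and
// resets the counters, mirroring the once-a-minute loop above.
func (s *Sampler) Rotate(windowSecs int) {
	for key, st := range s.keys {
		target := s.defaultTarget
		if t, ok := s.targets[key]; ok {
			target = t
		}
		if target < 1 {
			target = 1 // guard against a zero or negative target
		}
		rate := float64(st.count) / float64(windowSecs*target)
		if rate < 1 {
			rate = 1
		}
		st.sampleRate = rate
		st.count = 0
	}
}

func main() {
	s := NewSampler(4, map[string]int{"error": 1})
	for n := 0; n < 2400; n++ {
		s.Observe("ok", 0.5)
	}
	s.Rotate(60)
	rate, keep := s.Observe("ok", 0.05)
	fmt.Println(rate, keep)
}
```

In the handler, the key would be derived from the request outcome (for example "error" when `err != nil`), and the returned rate would be passed to `RecordEvent` as before.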

Liz Fong-Jones

Field CTO

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
