Dynamic Sampling by Example

Query count and sampling chance plotted


Recording the sample rate

What if we need to change the flagged value at some point in the future? The instrumentation collector wouldn’t know exactly when the value changed. Thus, it’s better to explicitly pass the current sampleRate when sending a sampled event — indicating the event statistically represents sampleRate similar events.

2x 'ok' at rate 100, 3x 'ok' at rate 80, and 2x 'err' at rate 1

// Note: sampleRate can be specific to this service and doesn't have to be universal!
var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
	start := time.Now()
	i, err := callAnotherService()
	resp.Write(i)

	r := rand.Float64()
	if r < 1.0/float64(*sampleRate) {
		RecordEvent(req, *sampleRate, start, err)
	}
}

This way, we can keep track of the sampling rate in effect when each sampled event was recorded. This gives us the data to compute accurate aggregates even when the sampling rate varies over time or differs between services. For example, if we were trying to estimate the total number of events matching a filter such as “err != nil“, we’d sum the sampleRate of each seen event with “err != nil”, rather than counting each event once. And if we were trying to calculate the sum of durationMs, we’d need to weight each sampled event’s durationMs, multiplying it by its sampleRate before adding the weighted figures all up.

200/1, and 240/1 after reweighting

There’s more to consider about how sampling rates and tracing work together, which we’ll cover in the next section.

Consistent sampling

We also need to consider how sampling interacts with tracing. Instead of independently generating a sampling decision inside of each handler, we should use a centrally generated “sampling/tracing ID” propagated to all downstream handlers. Why? This lets us make consistent sampling decisions between different manifestations of the same end user’s request. It would be unfortunate to discover that we have sampled an error far downstream for which the upstream context is missing because it was dropped. Consistent sampling guarantees that if a 1:100 sampling occurs, a 1:99, 1:98, etc. sampling preceding or following it also preserves the execution context. And half of the events chosen by a 1:100 sampling will be present under a 1:200 sampling.

bitcoin hash-like set of hashes, some of which end in '000' and are selected; others of which are dropped.

var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
	// Use the upstream-generated random sampling ID if it exists;
	// otherwise we're a root span, so generate (and pass down) a random ID.
	r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
	if err != nil {
		r = rand.Float64()
	}

	start := time.Now()
	// Propagate the Sampling-ID when creating a child span
	i, err := callAnotherService(r)
	resp.Write(i)

	if r < 1.0/float64(*sampleRate) {
		RecordEvent(req, *sampleRate, start, err)
	}
}

Now we have support for adjusting the sample rate without recompiling the service. But why adjust the rate manually at all? In the next chapter, we’ll discuss Target Rate Sampling.

Target Rate Sampling

We don’t need to manually flag-adjust the sampling rates for each of our services as traffic swells and sags; instead, we can automate this by tracking the incoming request rate that we’re receiving!

spiking graph of rate, reacting decrease in probability, and smoothed spike

var targetEventsPerSec = flag.Int("targetEventsPerSec", 5, "The target number of requests per second to sample from this service.")

// Note: sampleRate can be a float! doesn't have to be an integer.
var sampleRate float64 = 1.0
// Track requests from previous minute to decide sampling rate for the next minute.
var requestsInPastMinute *int

func main() {
	// Initialize counters.
	rc := 0
	requestsInPastMinute = &rc

	go func() {
		for {
			time.Sleep(time.Minute)
			newSampleRate := float64(*requestsInPastMinute) / float64(60 * *targetEventsPerSec)
			if newSampleRate < 1 {
				sampleRate = 1.0
			} else {
				sampleRate = newSampleRate
			}
			newRequestCounter := 0
			// Production code would do something less race-y, but this is readable
			requestsInPastMinute = &newRequestCounter
		}
	}()
	http.Handle("/", handler)
	[...]
}

func handler(resp http.ResponseWriter, req *http.Request) {
	r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
	if err != nil {
		r = rand.Float64()
	}

	start := time.Now()
	*requestsInPastMinute++
	i, err := callAnotherService(r)
	resp.Write(i)

	if r < 1.0 / sampleRate {
		RecordEvent(req, sampleRate, start, err)
	}
}

The previous code gives us a predictable retention window (or bill, if we pay another service for collection). However, it has one significant drawback, which we’ll address in the next chapter on per-key rates.

Liz Fong-Jones

Field CTO

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
