Sampling is a necessity for applications at scale. At Honeycomb, we sample our own data with our Refinery tool, and we recommend that you do too. But how do you get started? Do you simply set a rate for all data and add a handful of drop and keep rules, or is there more to it? What do these rules even mean, and how do you implement them?
To answer these questions, let’s look at a rules file template that we use for customers when first trying out Refinery.
```yaml
rules:
  RulesVersion: 2
  Samplers:
    __default__:
      RulesBasedSampler:
        Rules:
          #Rule 1
          - Name: Keep 500 status codes
            SampleRate: 1
            Conditions:
              - Fields:
                  - http.status_code
                  - http.response.status_code
                Operator: '>='
                Value: 500
                Datatype: int
          #Rule 2
          - Name: Keep where error field exists
            SampleRate: 1
            Conditions:
              - Field: error
                Operator: exists
          #Rule 3
          - Name: drop healthchecks
            Drop: true
            Scope: span
            Conditions:
              - Field: root.http.route
                Operator: starts-with
                Value: /healthz
              - Fields:
                  - http.status_code
                  - http.response.status_code
                Operator: "="
                Value: 200
                Datatype: int
          #Rule 4
          - Name: Keep long duration traces
            SampleRate: 1
            Scope: span
            Conditions:
              - Field: trace.parent_id
                Operator: not-exists
              - Field: duration_ms
                Operator: ">="
                Value: 5000
                Datatype: int
          #Rule 5
          - Name: Dynamically Sample 200s through 400s
            Conditions:
              - Fields:
                  - http.status_code
                  - http.response.status_code
                Operator: ">="
                Value: 200
                Datatype: int
            Sampler:
              EMADynamicSampler:
                GoalSampleRate: 10 # This is a sample rate itself
                FieldList:
                  - service.name
                  - root.http.route
                  - http.method
          #Rule 6
          - Name: Dynamically Sample Non-HTTP Request
            Conditions:
              - Field: status_code
                Operator: "<"
                Value: 2
                Datatype: int
            Sampler:
              EMADynamicSampler:
                GoalSampleRate: 10 # This is a sample rate itself
                FieldList:
                  - service.name
                  - grpc.method
                  - grpc.service
          #Rule 7
          - Name: Catchall rule
            Sampler:
              EMAThroughputSampler:
                GoalThroughputPerSec: 500 # This is spans per second for the entire cluster
                UseClusterSize: true # Ensures GoalThroughputPerSec is for the full refinery cluster and not per node
                FieldList:
                  - service.name
```
This file might look long, but it’s just seven rules that are general enough to help reduce event volume without much customization.
Sampling philosophy
TL;DR: drop boring data, keep rare/interesting data.
Boring data means things like fast, successful requests and health checks. Rare/interesting data would be things like anomalies, errors, unexpectedly slow requests, traces from especially important services or customers, etc. Things that you will take action on, want to alert on, or that are otherwise not the ideal state of things.
The above rules file follows this philosophy by keeping errors and abnormally long traces, dropping noisy health checks, and keeping only one out of every 10 fast, successful traces. That may sound like a lot of data to drop, but because Honeycomb weights its calculations by each event’s sample rate when Refinery is used, your query results still reflect your actual traffic.
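As a rough illustration of how that weighting works (the numbers here are hypothetical): if Refinery keeps one trace out of every 10 and stamps its events with SampleRate: 10, Honeycomb counts each of those stored events as 10 in COUNT and similar calculations, so a service that really handled 1,000 requests still shows up as roughly 1,000 in query results even though only around 100 traces were kept.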
Rules breakdown
This file uses three different samplers (in order): The Rules-Based Sampler, the EMA Dynamic Sampler, and the EMA Throughput Sampler. When a trace is assessed, the rules are reviewed sequentially from top to bottom and the first match (the first rule that applies to the trace) determines what action or sample rate is applied.
As such, we recommend starting with the most specific rules and working down to the most general. This file is no exception: it begins with keep and drop rules, followed by dynamic sampling rules, and ends with a catch-all rule utilizing the EMA Throughput Sampler.
Rule 1: Errors are (almost) always important. We want to make sure all status codes >= 500 are kept. A SampleRate of 1 means that 1 out of every 1 trace is kept, so traces that match are not sampled. In this rule, we’re looking for 500s in both the http.status_code and http.response.status_code fields.
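For reference, here is Rule 1 as it appears in the file above:

```yaml
#Rule 1
- Name: Keep 500 status codes
  SampleRate: 1
  Conditions:
    - Fields:
        - http.status_code
        - http.response.status_code
      Operator: '>='
      Value: 500
      Datatype: int
```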
Rule 2: Another error keep rule. We keep all traces where an error field exists.
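Here is Rule 2 from the file above:

```yaml
#Rule 2
- Name: Keep where error field exists
  SampleRate: 1
  Conditions:
    - Field: error
      Operator: exists
```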
Rule 3: Drop all health checks. Health checks are noisy, they skew your data, and they can lead to event overages, so we want to drop them all. First, we set the scope to span. This means that the full set of conditions must match on a single span. We do this because we want to be extra careful when dropping data. So, a trace will be dropped if any span contains a root.http.route that starts-with /healthz AND an http.status_code or http.response.status_code field equal to 200. Keep in mind that you may need to modify the field or the root.http.route value to match your own health check endpoint.
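Rule 3 from the file above, for reference:

```yaml
#Rule 3
- Name: drop healthchecks
  Drop: true
  Scope: span
  Conditions:
    - Field: root.http.route
      Operator: starts-with
      Value: /healthz
    - Fields:
        - http.status_code
        - http.response.status_code
      Operator: "="
      Value: 200
      Datatype: int
```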
Rule 4: Another keep rule (SampleRate: 1). We set the scope to span to define that both conditions must be true on a single span. The first condition looks for spans where trace.parent_id not-exists, meaning it only matches root spans. The second condition is that the duration of the root span (and therefore the entire trace) must be >= 5000 ms.
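Here is Rule 4 as it appears in the file above:

```yaml
#Rule 4
- Name: Keep long duration traces
  SampleRate: 1
  Scope: span
  Conditions:
    - Field: trace.parent_id
      Operator: not-exists
    - Field: duration_ms
      Operator: ">="
      Value: 5000
      Datatype: int
```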
Rule 5: Here, we start to actually sample things! Since status codes >= 500 are already kept, we’re going to sample the rest (the 200s through 400s) at a rate of 10 (meaning, one in 10 is kept).
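For reference, here is Rule 5 from the file above:

```yaml
#Rule 5
- Name: Dynamically Sample 200s through 400s
  Conditions:
    - Fields:
        - http.status_code
        - http.response.status_code
      Operator: ">="
      Value: 200
      Datatype: int
  Sampler:
    EMADynamicSampler:
      GoalSampleRate: 10 # This is a sample rate itself
      FieldList:
        - service.name
        - root.http.route
        - http.method
```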
We use the EMA Dynamic Sampler here, which requires us to set a FieldList. This setting determines which fields are used to build the sampling key. The key determines when the sampling rate should be increased or decreased based on how frequently it occurs. For example, if key x is represented 100 times in a time window and key y only 10 times, then traces matching key x will be dropped much more often than those matching key y.
The FieldList should use fields that have some, but not too much, cardinality. When cardinality is too high, everything is seen as rare and unique, so the sampler retains more traces than it should. After setting the FieldList, you can check the number of keys being created with VISUALIZE COUNT_DISTINCT(meta.refinery.sampler_key). We advise keeping the combined cardinality to less than 500.
Rule 6: The same as Rule 5, but with a different condition (non-HTTP requests) and a different FieldList to build the sampling key. These fields better represent the data targeted by this rule.
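Rule 6 from the file above, for reference:

```yaml
#Rule 6
- Name: Dynamically Sample Non-HTTP Request
  Conditions:
    - Field: status_code
      Operator: "<"
      Value: 2
      Datatype: int
  Sampler:
    EMADynamicSampler:
      GoalSampleRate: 10 # This is a sample rate itself
      FieldList:
        - service.name
        - grpc.method
        - grpc.service
```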
Rule 7: Lastly, we have our catchall rule, which uses the EMA Throughput Sampler to handle any traces that don’t fit any of the situations above. We use GoalThroughputPerSec here because we don’t know the traffic patterns of the incoming data; it works to keep throughput at roughly 500 spans per second. We also set UseClusterSize to true because we’re targeting our entire Refinery cluster rather than each individual node.
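Here is Rule 7 as it appears in the file above:

```yaml
#Rule 7
- Name: Catchall rule
  Sampler:
    EMAThroughputSampler:
      GoalThroughputPerSec: 500 # This is spans per second for the entire cluster
      UseClusterSize: true # Ensures GoalThroughputPerSec is for the full refinery cluster and not per node
      FieldList:
        - service.name
```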
Your rules will probably change over time
Refinery rules are meant to evolve as you bring new services online and instrument different parts of your code. While this file is a good place to start, you’ll get the best results by adding and customizing rules to better match your own data.
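For example, the philosophy above calls for keeping traces from especially important services or customers. A sketch of that kind of custom keep rule is below; the app.tenant_tier field and its enterprise value are hypothetical placeholders, so substitute an attribute that actually exists in your data, and place the rule ahead of the dynamic sampling rules so it wins the top-to-bottom, first-match evaluation:

```yaml
# Hypothetical keep rule for a high-priority customer tier.
# app.tenant_tier and the value "enterprise" are placeholders; use a field from your own data.
- Name: Keep traces for enterprise-tier customers
  SampleRate: 1
  Conditions:
    - Field: app.tenant_tier
      Operator: '='
      Value: enterprise
```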
Regularly reviewing your sample rates and the margin of error they introduce, and making sure new environments are accounted for, will help keep your sampling in optimal shape.
For help with rules customization or just to chat all things Refinery, check out the #discuss_sampling or #discuss_refinery channels in our community Slack!