Getting Started with Refinery: Rules File Template

Sampling is a necessity for applications at scale. We at Honeycomb sample our data through the use of our Refinery tool, and we recommend that you do too. But how do you get started? Do you simply set a rate for all data and a handful of drop and keep rules, or is there more to it? What do these rules even mean, and how do you implement them?

To answer these questions, let’s look at a rules file template that we use for customers when first trying out Refinery.

rules:
  RulesVersion: 2

  Samplers:
    __default__:
      RulesBasedSampler:
        Rules:
          #Rule 1
          - Name: Keep 500 status codes
            SampleRate: 1
            Conditions:
              - Fields: 
                  - http.status_code
                  - http.response.status_code
                Operator: '>='
                Value: 500
                Datatype: int
          #Rule 2
          - Name: Keep where error field exists
            SampleRate: 1
            Conditions:
              - Field: error
                Operator: exists
          #Rule 3
          - Name: drop healthchecks
            Drop: true
            Scope: span
            Conditions:
              - Field: root.http.route
                Operator: starts-with
                Value: /healthz
              - Fields: 
                  - http.status_code
                  - http.response.status_code
                Operator: "="
                Value: 200
                Datatype: int
          #Rule 4
          - Name: Keep long duration traces
            SampleRate: 1
            Scope: span
            Conditions:
              - Field: trace.parent_id
                Operator: not-exists
              - Field: duration_ms
                Operator: ">="
                Value: 5000
                Datatype: int
          #Rule 5
          - Name: Dynamically Sample 200s through 400s
            Conditions:
              - Fields: 
                  - http.status_code
                  - http.response.status_code
                Operator: ">="
                Value: 200
                Datatype: int
            Sampler:
              EMADynamicSampler:
                GoalSampleRate: 10              # This is a sample rate itself
                FieldList:
                  - service.name
                  - root.http.route
                  - http.method
          #Rule 6
          - Name: Dynamically Sample Non-HTTP Request
            Conditions:
              - Field: status_code
                Operator: "<"
                Value: 2
                Datatype: int
            Sampler:
              EMADynamicSampler:
                GoalSampleRate: 10              # This is a sample rate itself
                FieldList:
                  - service.name
                  - grpc.method
                  - grpc.service
          #Rule 7
          - Name: Catchall rule
            Sampler:
              EMAThroughputSampler:
                GoalThroughputPerSec: 500 # This is spans per second for the entire cluster
                UseClusterSize: true # Ensures GoalThroughputPerSec is for the full refinery cluster and not per node
                FieldList:
                  - service.name

This file might look long, but it’s just seven rules that are general enough to help reduce event volume without much customization.

Sampling philosophy

TL;DR: drop boring data, keep rare/interesting data.

Boring data means things like fast, successful requests and health checks. Rare/interesting data would be things like anomalies, errors, unexpectedly slow requests, traces from especially important services or customers, etc. Things that you will take action on, want to alert on, or that are otherwise not the ideal state of things.

The above rules file follows this philosophy by keeping errors and abnormally long traces, dropping noisy health checks, and keeping only one out of every 10 fast, successful traces. That may sound like a lot of data to drop, but because Honeycomb weights its calculations by sample rate when Refinery is used, a trace kept at a rate of 10 counts as 10 in your query results, so the numbers still reflect your expected traffic.


Rules breakdown

This file uses three different samplers (in order): the Rules-Based Sampler, the EMA Dynamic Sampler, and the EMA Throughput Sampler. When a trace is assessed, the rules are reviewed sequentially from top to bottom, and the first match (the first rule that applies to the trace) determines what action or sample rate is applied.

As such, we recommend starting with the most specific rules and working down to the most general. This file is no exception: it begins with keep and drop rules, followed by dynamic sampling rules, and ends with a catch-all rule utilizing the EMA Throughput Sampler.

Rule 1: Errors are (almost) always important. We want to make sure all status codes >= 500 are kept. A SampleRate of 1 means that one out of every one trace is kept, so traces that match are not sampled. In this rule, we’re looking for 500s in both the http.status_code and http.response.status_code fields.

Rule 2: Another error keep rule. We keep all traces that contain an error field that is not null.

Rule 3: Drop all health checks. Health checks are noisy, they skew your data, and they can lead to event overages, so we want to drop them all. First, we set the scope to span. This means that the full set of conditions must match on a single span. We do this because we want to be extra careful when dropping data. So, a trace will be dropped if any span contains a root.http.route that starts with /healthz AND an http.status_code or http.response.status_code field equal to 200. Keep in mind that you may need to modify the fields or the root.http.route value to match your own health check endpoint.
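
For example, if your service exposes its health check somewhere other than /healthz, the drop rule might be adapted along these lines (the /ping route here is just a placeholder, not part of the template above):

          #Rule 3 (adapted)
          - Name: drop healthchecks
            Drop: true
            Scope: span
            Conditions:
              - Field: root.http.route
                Operator: starts-with
                Value: /ping                  # replace with your own health check route
              - Fields:
                  - http.status_code
                  - http.response.status_code
                Operator: "="
                Value: 200
                Datatype: int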

Rule 4: Another keep rule (SampleRate: 1). We set the scope to span to define that both conditions must be true on a single span. The first condition looks for spans where trace.parent_id does not exist, meaning it only matches root spans. The next condition is that duration_ms for the root span (and therefore the entire trace) must be >= 5000ms.

Rule 5: Here, we start to actually sample things! Since status codes >= 500 are already kept, we’re going to sample the rest at a rate of 10 (meaning one in 10 is kept).

We use the EMA Dynamic Sampler here, which requires us to set a FieldList. This section sets which fields are used to build the sampling key. The key determines when the sampling rate should be increased or decreased based on how frequently it occurs. For example, if key x is represented 100 times in a time window and key y only 10 times, then traces matching key x will be dropped much more often than those matching key y.

The FieldList should use fields that have some, but not too much, cardinality. When cardinality is too high, everything is seen as rare and unique, so the sampler will retain more traces than it should. After setting the FieldList, you can check the number of keys being created with VISUALIZE COUNT_DISTINCT(meta.refinery.sampler_key). We advise keeping the combined cardinality to less than 500.
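
As a rough sketch of what that means in practice, compare the two FieldLists below (the high-cardinality field names are illustrative, not part of the template above):

                # Moderate cardinality: combines into a manageable number of keys
                FieldList:
                  - service.name        # a handful of services
                  - root.http.route     # dozens of routes
                  - http.method         # a few HTTP verbs

                # Too much cardinality: nearly every trace becomes its own unique key,
                # so the sampler keeps far more than intended
                # FieldList:
                #   - user.id
                #   - trace.trace_id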

Rule 6: Same as Rule 5, but with a different condition (non-HTTP requests) and a different FieldList to build the sampling key. These fields better represent the data targeted by the rule.

Rule 7: Lastly, we have our catchall rule, which uses the EMA Throughput Sampler to catch any traces that don’t fit any of the situations above. We use GoalThroughputPerSec here because we don’t know what the traffic patterns for the incoming data will be. This works to keep throughput at roughly 500 spans per second. We also set UseClusterSize to true because we’re budgeting for our entire Refinery cluster rather than for each individual node.
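
If you would rather budget throughput per node instead of for the whole cluster, a variant like the following would do it (the 100 spans per second value is purely illustrative):

            Sampler:
              EMAThroughputSampler:
                GoalThroughputPerSec: 100   # illustrative: spans per second for each node
                UseClusterSize: false       # the goal now applies to each node individually
                FieldList:
                  - service.name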

Your rules will probably change over time

Refinery rules are meant to evolve as you bring new services online and instrument different parts of your code. While this file is a good place to start, you’ll get the best results by adding and customizing rules to better match your own data.
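
One common customization is adding a sampler for a specific environment alongside __default__, since the Samplers block is keyed by environment (or dataset) name. Here is a hypothetical sketch, where the "production" environment and "checkout" service names are assumptions for illustration:

  Samplers:
    __default__:
      RulesBasedSampler:
        Rules:
          # ... the seven rules above ...
    production:                          # assumed environment name
      RulesBasedSampler:
        Rules:
          - Name: Keep everything for the checkout service
            SampleRate: 1
            Conditions:
              - Field: service.name
                Operator: "="
                Value: checkout          # hypothetical service name
          # ... followed by the same general rules as __default__ ...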

Regularly reviewing your sample rates and the margin of error they introduce, as well as making sure new environments are accounted for, will help keep your sampling in optimal shape.

For help with rules customization or just to chat all things Refinery, check out the #discuss_sampling or #discuss_refinery channels in our community Slack!

Max Aguirre

Customer Architect

Max prides themselves on being a true generalist obsessed with problem-solving. Max has seven years of customer-focused observability experience and a healthy sprinkle of Kubernetes and security. Since moving from Support to Customer Success, they’ve primarily worked with enterprise software companies. Max likes to dive head-first into tense situations to calm the vibe, identify the best path forward, and build customer partnerships. Maybe even make some friends along the way.
