OpenTelemetry Best Practices #3: Data Prep and Cleansing


Having telemetry is all well and good, amazing, in fact. It’s easy to do: add some OpenTelemetry auto-instrumentation libraries to your stack and they’ll fill your disks with data pretty quickly. Having good telemetry data, however, data that has been curated into being useful, takes more deliberate effort, and it’s what makes telemetry both cost-effective and genuinely valuable.

Observability is about getting answers about how your production system is functioning by using telemetry data. If that data isn’t in an accurate, curated state, then you’ll struggle to get the answers you need—even if you have a ton of data. Either the data is confusing, or it’s locked away because of security concerns, or there’s just too much data to find the context you need. Because of this, it’s easy to get overwhelmed with bad data and feel like OpenTelemetry isn’t actually useful. Enter data prep and cleansing. 

The Transform processor

With the Transform processor, you can drop attributes whose names suggest data you wouldn’t want in your observability backend, such as firstname or creditcard. You can also use it to search attribute values for sensitive content, such as passwords.

The processor allows you to perform the following actions on your spans (a configuration sketch follows this list):

  • Create a new attribute by parsing, searching, or combining existing attributes.
    • E.g., combine a primary and secondary product category into a single value.
  • Delete attributes entirely.
    • E.g., remove the social security number attribute when it goes to third parties.
  • Hash attributes to maintain their cardinality without keeping personally identifiable information (PII).
    • E.g., hash an auth token or API key used to access the system.
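
As a rough sketch, a Transform processor configuration covering those three actions might look like the following. The attribute names (product.category.*, user.ssn, auth.token) are illustrative, and the exact OTTL functions available depend on your Collector version:

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Combine two hypothetical category attributes into a single value
          - set(attributes["product.category"], Concat([attributes["product.category.primary"], attributes["product.category.secondary"]], ":"))
          # Delete a sensitive attribute entirely
          - delete_key(attributes, "user.ssn")
          # Replace an auth token with its hash, preserving cardinality
          - set(attributes["auth.token"], SHA256(attributes["auth.token"])) where attributes["auth.token"] != nil
```

Remember that the processor only takes effect once it’s also listed under the relevant pipeline in the Collector’s service section.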

There are some well-known fields that you should consider filtering (or not), depending on your context.

  • url.query and url.full: If you regularly use a query string for searching, it could include anything a user types, including things that would be considered PII. Think about whether you should filter this information out, either globally or only for specific URLs (see the sketch after this list). You should also consider whether the engineering team should extract the most pertinent information from the URL and add it as attributes in their code, as this would provide a better telemetry experience.
  • network.peer.address and client.address: These fields can sometimes be populated with the IP address of the client accessing your site, and in some regulatory contexts, that can be considered PII. You could choose to hash these values; however, since the set of possible values is known, hashing doesn’t give the protection you might expect.
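
For the URL-specific approach, a sketch using the Transform processor could drop url.query only for certain paths; the /search path here is purely illustrative:

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Drop the query string only on the hypothetical search endpoint
          - delete_key(attributes, "url.query") where IsMatch(attributes["url.path"], ".*/search")
```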

Processors can also enrich your telemetry data with static context, like the cloud region or availability zone, as well as additional information such as which Collector infrastructure processed the request.
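
One way to do this is with the resource processor; a minimal sketch, where the region, the Collector name, and the collector.name key itself are placeholder examples you’d set per deployment:

```yaml
processors:
  resource:
    attributes:
      # Static context stamped onto every resource this Collector handles
      - key: cloud.region
        value: eu-west-1
        action: upsert
      - key: collector.name
        value: gateway-collector-01
        action: upsert
```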

Redacting sensitive data

The redaction processor allows you to drop or redact span attributes that meet certain criteria. It can be configured in two modes that are important to understand. I’ll call one “aggressive” and one “passive.”

In passive mode, you tell the processor to look for specific patterns within your attribute values, such as anything that resembles a card number or a social security number. To do this, you provide regex patterns that are checked against each attribute.

You’ll need to apply your own context here. However, here’s a non-exhaustive list of what you should consider including (a configuration sketch follows the list):

  • Social Security Number (region-specific format)
  • National Insurance Number (UK-specific)
  • Credit card numbers (note: not all card numbers follow the same format)
  • Driver’s license numbers (region specific format)
  • Phone numbers
  • Postal codes / zip codes
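
As a sketch, a passive-style redaction processor configuration keeps all attribute keys and only masks values matching blocked patterns. The regexes below are simplified illustrations of the kinds of patterns above, not production-ready ones:

```yaml
processors:
  redaction:
    # Passive style: keep every attribute key, mask only matching values
    allow_all_keys: true
    blocked_values:
      - "\\d{3}-\\d{2}-\\d{4}"        # simplified US SSN-style pattern
      - "4[0-9]{12}(?:[0-9]{3})?"     # simplified Visa-style card number
      - "\\+?\\d[\\d\\s().-]{8,}\\d"  # rough phone number pattern
    summary: info  # log a summary of how many values were masked
```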

In aggressive mode, in addition to looking for patterns in attributes, you’ll also provide a list of allowed attribute names. This means that any attributes not in the list will be dropped.
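
A sketch of the aggressive style, where only an explicit allow list of keys survives (the keys listed here are just examples):

```yaml
processors:
  redaction:
    # Aggressive style: drop any attribute whose key isn't explicitly allowed
    allow_all_keys: false
    allowed_keys:
      - http.request.method
      - http.response.status_code
      - url.path
    blocked_values:
      - "\\d{3}-\\d{2}-\\d{4}"  # still mask SSN-like values in allowed keys
```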

It’s best practice to run, at the very least, passive mode with some patterns that are specific to your region and sector. Aggressive mode, on the other hand, is really only suited to some very specific hyper-secure environments, and even there its usefulness is limited: if data exfiltration is the concern, engineers could simply put the information they want to extract into attributes that are on the allow list.

Balancing cardinality and PII

While we want to keep PII out of our telemetry backends, it’s often important to know the number of individual users affected so that we can see whether there’s a widespread problem, or to see what an individual user might have done over their lifetime.

We can maintain the cardinality (the number of distinct values) of this data by applying a strong hash to the attribute with the Transform processor.
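
Applied to a hypothetical user.id attribute, that might look like this (assuming the SHA256 converter is available in your transform processor version):

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Replace the raw identifier with its hash, keeping per-user cardinality
          - set(attributes["user.id"], SHA256(attributes["user.id"])) where attributes["user.id"] != nil
```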

A word of caution: if the value you’re hashing has a small number of possible values and a predictable pattern, it’s relatively easy to reverse engineer the original value. As such, hashing may not be enough to stop it from being considered PII.



Filtering non-useful spans

On top of redacting sensitive data and removing attributes, it’s also good practice to drop spans that aren’t useful. The most common example is health check spans, which generally offer little value and can be dropped without affecting visibility into the system.
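
The filter processor can do this; a minimal sketch, assuming health checks hit a /healthz endpoint (adjust the condition to match your own probes):

```yaml
processors:
  filter:
    error_mode: ignore
    traces:
      span:
        # Drop spans for the hypothetical /healthz endpoint
        - 'attributes["url.path"] == "/healthz"'
```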

One thing to be careful of: the filter processor only evaluates one span at a time. If your health checks produce a full trace structure, you may need to think about sampling instead (we’ll cover this in a different post).

More best practices coming soon

Building good observability pipelines (and by good, we mean pipelines that are safe) is part of what makes for a robust observability strategy. The Collector and its processors are a core part of that, and fortunately, they’re easy to configure.

If you missed part one or part two of my best practices series, you can find them here: 

OpenTelemetry Best Practices #1: Naming

OpenTelemetry Best Practices #2: Agents, Sidecars, Collectors, Coded Instrumentation
