Written by Liz Fong-Jones and Phillip Carter.
OpenTelemetry, also known as OTel, is a CNCF open standard that enables distributed tracing and metrics collection from your applications. At Honeycomb, we believe that OpenTelemetry is the best way to ingest the high-cardinality and high-dimensional data that every system, no matter how complex or distributed, needs for observability. We also like that you can send telemetry to multiple backends and switch backends without rewriting your code.
Recently, Honeycomb held a roundtable discussion (available on demand) with Camal Cakar, Senior Site Reliability Engineer at Jimdo; Pierre Lacerte, Director of Software Development at Upgrade; and Kristin Smith, DevOps Engineer at Campspot. We talked about using OpenTelemetry and explained some important lessons the panelists learned—and are still learning in some cases.
Here are five best practices based on these lessons and Honeycomb’s own experience with OpenTelemetry.
1. Make your OTel reservation automatically: start with auto-instrumentation and add manual instrumentation later
When you’re getting started with OpenTelemetry (beginners: see our first blog in this series for more details), you should take advantage of OpenTelemetry’s automated instrumentation. Yes, auto-instrumentation casts a wide net and won’t necessarily provide a lot of specialized information, but you can get some quick wins from it—like Jimdo did. “We saw bad performing database queries,” Cakar told the roundtable, “and we could fix that by setting the correct index.”
Once you benefit from auto-instrumentation, manual instrumentation can come into play. For example, if the automatic instrumentation tells you a service you built is slower than it should be, you can add manual instrumentation to dig into why. Little by little, you use the manual option to get more details. Basically, this is like booking a hotel room at random online, but after staying a few times, you use the system to reserve a specific kind of room.
The roundtable agreed that automatic and manual instrumentation are best used together. Lacerte shared that Upgrade considered replacing auto-instrumentation with manual instrumentation, but it wasn’t optimal because of the time and effort involved. Cakar had this to say about both: “I think everyone has a problem they want to solve that’s complex, and instrumentation is great for that—in any kind of way.”
2. Turn on the OTel room overhead light first: start with all the auto-instrumentation you can, then dial it back
You should turn on as much automatic instrumentation as you can at first to see what you can get out of it. Then, dial back anything that is not useful. The reason for doing this is that it’s better to have too much information than not enough. If you have too little instrumentation, it is likely that you will miss important observability data and not even know it.
Cakar told us this story about not instrumenting enough: “Some weeks after we instrumented just one service, we were looking at the data and there was a broken trace. One of the engineers was pretty sure that this was not the correct picture.” So, he and his team more thoroughly instrumented their application, which then showed separated traces, and after reaching out to the OpenTelemetry community, they saw it was an issue in how they were using coroutines.
After auto-instrumentation runs for a while, teams get a sense of what’s valuable and what’s noise. For example, if it’s not valuable to include information about service health checks, you can turn off instrumentation that tracks them. Looking back at the instrumentation you have and paring it down a little bit lets you avoid the problem of not having instrumentation for things you need while avoiding needless data generation and transmission.
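To make the health check example concrete, here is a minimal sketch in plain Python of what "turning off" that noise amounts to. It is not the OpenTelemetry API: spans are modeled as simple dictionaries, and the paths in HEALTH_CHECK_PATHS and the "http.target" attribute key are assumptions chosen for illustration. In practice you would configure this in your instrumentation library or in the Collector.

```python
# Toy model: each span is a dict with a name and an attributes dict.
# The paths and the attribute key below are assumptions for illustration.
HEALTH_CHECK_PATHS = {"/healthz", "/ready"}


def is_noise(span: dict) -> bool:
    """Return True for spans that only record a health check request."""
    return span.get("attributes", {}).get("http.target") in HEALTH_CHECK_PATHS


def drop_noise(spans: list) -> list:
    """Keep every span except the ones identified as health check noise."""
    return [span for span in spans if not is_noise(span)]


spans = [
    {"name": "GET /healthz", "attributes": {"http.target": "/healthz"}},
    {"name": "POST /cart", "attributes": {"http.target": "/cart"}},
]
kept = drop_noise(spans)
```

The filter is deliberately narrow: it removes only what you have already confirmed is noise and lets everything else through, which matches the "start broad, then dial back" approach.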
3. Take advantage of a key OTel amenity: the OpenTelemetry Collector
All the roundtable panelists—and Honeycomb—agree: use the OpenTelemetry Collector. As a proxy for your telemetry, it can buffer data independently from your apps and let you manage secrets for that data in one place, which already makes it valuable for most teams.
But it also does so much more! It allows you to pull in data from sources other than your application and then correlate it with the data your application generates. The Collector can also export data to multiple backends at the same time. As a result, you can compare different observability solutions side by side and choose the one that works best for you. If that choice doesn’t work out or new solutions appear, you can easily set up multiple exporters to review your options again.
The Collector is the pocketknife of OpenTelemetry, receiving metrics and trace data in a variety of supported formats, filtering it, and processing it before it goes to a backend. Pre-processing telemetry data is where the Collector can truly shine.
Let’s say you have a healthcare application that deals in PII and PHI. You don’t want that data to get leaked anywhere, so you likely have several checks in place to make sure that doesn’t happen. The OpenTelemetry Collector can act as a line of defense against leaking this sensitive information via telemetry data. Using the Collector, you can establish an allow list for certain data and anonymize everything else. Alternatively, you can configure rules that say something to the effect of, “If this looks like a Social Security number, then remove it.” By using the Collector as a line of defense against data leaks, you can still achieve good observability and feel confident you’re doing so safely.
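The two approaches described above can be sketched in a few lines of plain Python. This is an illustration of the logic only, not Collector configuration: the attribute names in ALLOWED_KEYS, the example attributes, and the "[REDACTED]" placeholder are all assumptions, and a real deployment would express these rules with Collector processors.

```python
import hashlib
import re

# Assumed allow list: attributes that may pass through unchanged.
ALLOWED_KEYS = {"http.method", "http.status_code"}

# Matches values shaped like a US Social Security number (123-45-6789).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def anonymize(value) -> str:
    """Replace a value with its SHA-256 hex digest.

    Note: an unsalted hash of a low-entropy value can be guessed by brute
    force, so treat this as a sketch rather than a privacy guarantee.
    """
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def apply_allow_list(attributes: dict) -> dict:
    """Pass allow-listed attributes through and anonymize everything else."""
    return {
        key: value if key in ALLOWED_KEYS else anonymize(value)
        for key, value in attributes.items()
    }


def strip_ssn_like(attributes: dict) -> dict:
    """Remove anything that looks like a Social Security number."""
    return {
        key: SSN_PATTERN.sub("[REDACTED]", value) if isinstance(value, str) else value
        for key, value in attributes.items()
    }


attrs = {"http.method": "POST", "patient.note": "SSN 123-45-6789 on file"}
allow_listed = apply_allow_list(attrs)
pattern_scrubbed = strip_ssn_like(attrs)
```

The allow list is the stricter of the two: anything you have not explicitly approved is anonymized, so a new attribute added by a developer cannot leak by accident. The pattern rule keeps more data readable but only catches the shapes you thought to describe.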
When describing how Jimdo uses the Collector, Cakar gave a shout out to the Collector contributor repository, which includes features and functions added by other vendors. “If personally identifying information (PII) needs hashing, we use that feature from the Collector.”
4. Avoid staying in every OTel room to review it: sample your OpenTelemetry data
The volume of data OpenTelemetry generates can be mind-crushing. A lot of that is “situation is normal” data, such as every time someone successfully adds something to an online shopping cart. But a point of observability is to get enough “situation is normal” data to establish a baseline of good behavior and capture as much “situation is not normal” data as possible for comparison. Statistical sampling of your OpenTelemetry data can turn down the “situation is normal” volume by having, for example, one successful cart addition represent 100, while still letting you keep all the “situation is not normal” data.
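Here is a small sketch of that idea in plain Python. It is not an OpenTelemetry or Honeycomb API: the event dictionaries, the "sample_rate" field name, and the rate of 100 are assumptions for illustration. Every failed event is kept with a rate of 1, roughly one in 100 successful events is kept with a rate of 100, and the rates are summed to estimate how many events originally occurred.

```python
import random


def sample(events: list, rate: int, rng: random.Random) -> list:
    """Keep every failed event and roughly 1 in `rate` successful ones.

    Each kept event records how many original events it stands for.
    """
    kept = []
    for event in events:
        if event["error"]:
            kept.append({**event, "sample_rate": 1})
        elif rng.randrange(rate) == 0:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_total(kept: list) -> int:
    """Estimate the original event count by summing the sample rates."""
    return sum(event["sample_rate"] for event in kept)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
events = [{"name": "add_to_cart", "error": False} for _ in range(10_000)]
events += [{"name": "add_to_cart", "error": True} for _ in range(25)]

kept = sample(events, rate=100, rng=rng)
print(len(events), "events in,", len(kept), "events out")
print("estimated original total:", estimated_total(kept))
```

The estimate will not match the true count exactly, because the kept successes are chosen at random, but it stays close on average while the stored volume drops sharply and no error is lost.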
Lacerte explained why Upgrade made the decision to sample: “We had a ton of instrumentation, and the lag time for querying the volume of data that we were sending out was to the point where we could not use it to debug or investigate production incidents.”
OpenTelemetry SDKs offer a basic upfront sampling scheme known as head sampling, and the OpenTelemetry Collector also has a tail sampling processor, where the decision to sample a trace happens after all the spans in a request have been completed. While head sampling is often good enough for simple use cases, tail sampling is required for many production applications. However, tail sampling in the OpenTelemetry Collector today comes with many operational concerns, and it lacks certain metadata about sampling decisions that observability backends need to count and summarize data accurately.
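The difference between the two is easiest to see side by side. The sketch below is a toy model in plain Python, not the SDK sampler or the Collector processor: the span dictionaries, the CRC32-based head decision, and the 500 ms "slow" threshold are all assumptions for illustration. The head decision sees only the trace ID, while the tail decision sees every span in the finished trace.

```python
import zlib
from collections import defaultdict


def head_keep(trace_id: str, rate: int) -> bool:
    """Head sampling: decide up front from the trace ID alone.

    A stable hash means every service makes the same choice for a trace.
    CRC32 is a stand-in here, not the algorithm the SDKs use.
    """
    return zlib.crc32(trace_id.encode("utf-8")) % rate == 0


def group_by_trace(spans: list) -> dict:
    """Buffer spans by trace ID so each trace can be judged as a whole."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def tail_keep(trace_spans: list, slow_ms: float) -> bool:
    """Tail sampling: keep a finished trace if any span failed or was slow."""
    return any(s["error"] or s["duration_ms"] > slow_ms for s in trace_spans)


spans = [
    {"trace_id": "a", "name": "checkout", "duration_ms": 40.0, "error": False},
    {"trace_id": "a", "name": "charge-card", "duration_ms": 15.0, "error": True},
    {"trace_id": "b", "name": "checkout", "duration_ms": 35.0, "error": False},
]

traces = group_by_trace(spans)
tail_kept = sorted(
    trace_id
    for trace_id, trace_spans in traces.items()
    if tail_keep(trace_spans, slow_ms=500.0)
)
```

Notice that the head decision cannot know trace "a" will contain an error, so it may discard it. The tail decision can, but only by holding every span in memory until the trace completes, which is one source of the operational concerns mentioned above.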
This set of problems is currently being addressed by the OpenTelemetry sampling committee, but in the interim there are vendor-specific solutions that you can employ. This leads us nicely to our last best practice.
5. Get sweet OTel samples: use the Honeycomb Refinery
Honeycomb’s Refinery is a trace-aware sampling proxy. It collects spans emitted by your application, gathers them into traces, and examines them as a whole. It also re-weights samples, recording when a kept value represents 100 operations and not just one, so that Honeycomb can correctly count and summarize sampled data.
Smith shared how the Campspot team had a short timeframe for getting tracing data in. “We started sending output directly to Honeycomb straight from the applications without using the Collector. When Refinery became available, we used it for sampling.” Campspot now uses both the Collector and Refinery.
Want to tour the 5-star OTel with Honeycomb?
For more details about OpenTelemetry best practices, check out the discussion on demand and look for our next blog, which looks at OpenTelemetry in the wild and where it’s headed next.
If you want to give Honeycomb a try, sign up for free to get started.