The CoPE and Other Teams, Part 2: Custom Instrumentation and Telemetry Pipelines

The previous post laid out the basic idea of instrumentation and how OpenTelemetry’s auto-instrumentation can get teams started. However, you can’t rely only on auto-instrumentation. This post will discuss the limitations in more detail and how a CoPE can help teams overcome them.

Custom instrumentation

Once your teams begin working with telemetry from auto-instrumentation, they’ll soon realize something: it’s reeeeally barebones. While the specifics vary by language and library, most of the official OTel options are generic. They may provide information in a span about things like http.method, http.status_code, or the name of a unit of work, but that’s a far cry from anything specific to your business case. Auto-instrumentation is good for basic information, and it includes the pieces needed to construct a trace-like span hierarchy.

However, to get fine-grained details about your system’s behavior, your teams will need custom instrumentation—but improving telemetry in this way is a “wicked” problem, meaning there is no solution that doesn’t leave some problematic remainder (did someone say tech debt?).
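To make the gap concrete, here is a minimal sketch in Python using the OpenTelemetry API. The surrounding HTTP span and its http.* attributes come from auto-instrumentation for free; the business-specific fields below only exist if someone writes them in. The checkout example, field names, and Cart type are made up for illustration.

```python
from dataclasses import dataclass, field

from opentelemetry import trace

@dataclass
class Cart:
    items: list = field(default_factory=list)
    total_usd: float = 0.0

tracer = trace.get_tracer("checkout-service")

def process_cart(cart: Cart, customer_id: str, customer_tier: str) -> None:
    # Auto-instrumentation already produced the parent HTTP server span with
    # http.method, http.status_code, and so on. This child span carries the
    # high-cardinality, business-specific context it can't know about.
    with tracer.start_as_current_span("checkout.process_cart") as span:
        span.set_attribute("app.cart.item_count", len(cart.items))
        span.set_attribute("app.cart.total_usd", cart.total_usd)
        span.set_attribute("app.customer.id", customer_id)
        span.set_attribute("app.customer.tier", customer_tier)
        # ... the actual checkout logic would run here ...
```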

The problem

Organizations want high-fidelity data about what their system’s doing because it’s more useful than low-fidelity data. In Honeycomb, that means they want wide events with lots of dimensions (key-value pairs) and high-cardinality values.

Auto-instrumentation provides narrow events with some high-cardinality dimensions and a set of industry-standard semantic conventions. Getting better data requires additional work. It takes time for a team to decide on the appropriate units of work, to add instrumentation code to the main business logic, and to put in the creative effort to come up with the new fields to record. Furthermore, teams writing their own instrumentation may deviate from one another in the semantics they use when creating those new fields. These deviations can make it difficult to establish and maintain common ground between individuals and teams.

Finally, it’s impractical to attempt to instrument everything, given the tradeoff between the time devoted to instrumenting code and the value returned. This indeterminacy is similar to the halting problem in the theory of computing: there’s no clear way to know that one has sufficiently instrumented their code. And even if there were, when conditions change for the business (e.g., new services are added to meet new business imperatives), the instrumentation will need updates.

The solution

In light of these challenges, most organizations only see a mountain of work. However, our guiding model suggests that there is a tipping point where this goes from active effort to passive habit. It’s the role of the CoPE to get their organization to—and through—that point.

The way to go about this is two-fold: 

  1. make the approach to the summit easier to traverse, and 
  2. give the hikers the appropriate gear and provisions for the trek.

The method

One of the best ways to address this challenge is with SLOs! Honeycomb’s SLOs can be built upon data already produced from auto-instrumentation, but teams often find that auto-instrumentation doesn’t provide enough insights to really keep the parts of the system as reliable as they’d like. It also may not provide the information that partners like the product team want when using SLOs to make choices about prioritizing new features or reliability work. If enough people agree that it’s not sufficient, the CoPE can step in.
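In Honeycomb, an SLI boils down to a per-event judgment: each event is good, bad, or not eligible, and the SLO tracks that ratio over time. Expressed in plain Python purely for illustration (in Honeycomb it would be a derived column; the field names and thresholds here are made up), an SLI built only on auto-instrumented fields can only ever say things like this:

```python
from typing import Optional

def sli_checkout_ok(event: dict) -> Optional[bool]:
    """Classify one event for a hypothetical latency-and-errors SLI.

    Mirrors how a Honeycomb SLI behaves: True = good, False = bad,
    None = the event doesn't count toward the SLO at all.
    """
    if event.get("http.route") != "/checkout":
        return None  # only checkout requests are eligible
    is_fast = event.get("duration_ms", float("inf")) < 500
    is_ok = event.get("http.status_code", 599) < 500
    return is_fast and is_ok
```

Notice what it can’t say: nothing about which customers, which carts, or which feature flags were involved, because those fields don’t exist yet.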

The first step will be to convince the product team (and other parts of management) that the lack of information means they need to invest time in instrumentation. Reliability isn’t just about whether the team is meeting its SLOs today; it’s about whether they’ll meet them in the future, too. Having good enough instrumentation to even make that determination is reliability work.

Once time has been allotted, then comes the fun part. A CoPE can start to address #1 (make the approach to the summit easier to traverse) by defining a set of semantic conventions. Martin Thwaites advises keeping these flush with the OTel conventions wherever they abut (e.g., when adding fields related to HTTP traffic, prefix them with http.*). From there, the conventions should make sense locally, whether that’s relative to the teams or the larger organization. The CoPE will need to draw upon their own background working within the system, and the thoughts of their colleagues, to build this out.
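One lightweight way to make such conventions stick is to encode them as a shared module of constants rather than a wiki page, so field names are imported instead of retyped. A sketch, assuming a hypothetical app.* namespace (none of these names are prescribed by OTel or Honeycomb):

```python
# conventions.py -- a hypothetical shared module encoding the org's semantic
# conventions, so teams import field names rather than retype them.

# Extensions to areas OTel already covers keep the matching prefix.
HTTP_REQUEST_IS_RETRY = "http.request.is_retry"

# Organization-wide fields share one agreed-upon namespace.
CUSTOMER_ID = "app.customer.id"
CUSTOMER_TIER = "app.customer.tier"
BUILD_ID = "app.build.id"

# Team-specific fields nest under the owning team's name.
CHECKOUT_CART_TOTAL_USD = "app.checkout.cart.total_usd"
CHECKOUT_PAYMENT_PROVIDER = "app.checkout.payment.provider"
```

Teams then write span.set_attribute(conventions.CUSTOMER_ID, ...) rather than hand-typing strings, so the common ground is maintained by imports and code review instead of memory.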

With these conventions as axioms, the next step is to determine what exactly should have additional instrumentation, and what context should be included. As noted above, this will change over time. It’s the role of the CoPE to help teams create these and to figure out when they stop being effective.


To help teams start, the CoPE should build a custom library for each of the languages in use throughout the organization. These will supplement the auto-instrumentation already in place. The initial version will again draw on the CoPE members’ experiences and informal contributions from their colleagues. Later revisions should be driven by the teams that rely upon the libraries.
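What goes into such a library depends on the organization, but as a minimal Python sketch, it might start as little more than a decorator that wraps a unit of work in a consistently shaped span. The library name, fields, and structure below are illustrative assumptions, not a prescribed design:

```python
import functools

from opentelemetry import trace

# "acme.instrumentation" stands in for whatever the organization names its library.
tracer = trace.get_tracer("acme.instrumentation")

def traced_unit_of_work(name, attributes=None):
    """Wrap a function in a span that follows the org's conventions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name) as span:
                # Stamp the agreed-upon fields so every team's spans look alike.
                for key, value in (attributes or {}).items():
                    span.set_attribute(key, value)
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Make the failure visible on the span, not just in logs.
                    span.record_exception(exc)
                    span.set_status(trace.Status(trace.StatusCode.ERROR))
                    raise
        return wrapper
    return decorator

# Example usage in a service:
@traced_unit_of_work("billing.generate_invoice", attributes={"app.team": "billing"})
def generate_invoice(account_id):
    ...
```

The point isn’t the decorator itself; it’s that teams get spans with the agreed-upon fields without each re-deriving the OTel API and the conventions on their own.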

But if no team “owns” the resource, then who will do that work? Won’t it fall prey to the “tragedy of the commons”? Fortunately, that scenario is a fictionalized, simplified ideal, and in fact largely an edge case, as Elinor Ostrom argues. The key both to governing this common resource and to prompting teams to improve it is to create forums where it is discussed as the solution to a problem. This helps frame it as something worth putting effort into. The ideal setting for this is during post-incident reviews.

The type of post-incident review meeting that the CoPE should strive to create is the one advocated by the Learning from Incidents community. In those spaces, incident analysts lead discussions amongst incident responders (and other interested parties) where people explain what they did to respond and how it made sense for them to do those things. It is a chance for “frontline” people to share the expertise earned through normal work in a psychologically safe space, for others to learn about how their colleagues and technical components really work, and to contribute to a discussion regarding where instrumentation was insufficient or outdated. 

The CoPE can take this as feedback from the socio-technical system and work with the teams that depend on the libraries to update them, eventually handing this off to the teams completely. Beyond that, and based on what they now know they need, individual teams can add any instrumentation to the code’s nooks and crannies that even the libraries can’t reach.

Telemetry data strategy

An unsung hero in reliability is the pipeline that sends telemetry data to whatever backend analyzes it. While an SRE or platform team will probably take on the bulk of the work managing it, a CoPE can make a few key contributions to this crucial flow of data.

The problem

One of the great challenges that businesses face is managing the telemetry data their systems produce. There is a host of things to take into account: access control, availability and latency, content (e.g., user privacy and data hygiene), volume, and more. Honeycomb helps make sense of the data it receives, but that’s conditional on what actually makes it to the tool.

The heart of the problem is that many organizations don’t recognize that different data has different value to different teams. For example, sending events with lots of PII can be extremely valuable to an organization because it’s high cardinality and thus lets the team scope investigations quite tightly. However, certain regulatory environments encourage finding other means to achieve that scoping, so if the organization begins to do business in such an environment, then the PII’s value drops considerably.

The solution

A comprehensive solution would be too involved for this post. Nevertheless, a CoPE can begin by investigating and creating processes for sharing the value of different data throughout the organization. Why is this the place to start? Glad you asked!

Engineering is tasked with managing tradeoffs in the pursuit of a goal. In this case, telemetry data isn’t equally valued by all, so the relevant hierarchies of value must be accounted for in order to understand the ways they conflict. Once those conflicts are known, then the parties involved can negotiate towards the pursuit of their objective(s). Those negotiations produce a data strategy, which can serve to determine suitable tactics and techniques. Thus, a CoPE can facilitate the creation of this strategy and the derivative tactics that the organization will use.
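To make “tactics” concrete: suppose the negotiation lands on keeping per-user scoping while dropping raw PII from events, as in the regulated-environment example above. One way to do that, sketched in Python with illustrative names and key handling, is to pseudonymize identifiers before they are attached to telemetry:

```python
import hashlib
import hmac

# Illustrative only: in practice the key comes from a secrets manager and is
# rotated according to whatever the data strategy decides.
PSEUDONYMIZATION_KEY = b"replace-with-a-real-secret"

def pseudonymize(value: str) -> str:
    """Return a stable, keyed hash of a PII value.

    The output is still one value per user (high cardinality), so it can
    scope an investigation tightly, but the raw identifier never reaches
    the telemetry backend.
    """
    digest = hmac.new(PSEUDONYMIZATION_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# e.g., span.set_attribute("app.customer.id", pseudonymize(customer_email))
```

The same scrubbing could instead live in the telemetry pipeline itself (for example, in an OpenTelemetry Collector processor), so individual services don’t each have to remember to do it; which layer owns it is exactly the kind of decision the strategy should record.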

The method

In a certain sense, a CoPE has already begun this work by its very nature. As noted before, drawing staff from across the organization will bring local knowledge about these values into dialogue immediately. So a CoPE can begin developing a data strategy with its own constitutive expertise.

However, it can’t end there. Other people need to be interested in how the organization works and to resist the drive toward a myopic focus on just their own tasks and team. The organization needs incentives for curiosity. Once those are in place, conducting workplace studies into the values of a locality and sharing the results of that research is a fine way to circulate knowledge.

Up next: Alerts!

The data you produce will have a massive impact on how you understand your software systems, and what tools your organization has to make sure that your customers have an excellent experience. In this post, we looked at how a CoPE can support producing good data. In the next one, we’ll look at alerts, and how Honeycomb’s features enable cohesive teamwork.

Don’t miss the previous posts in this series:

Pt. 1: Establishing and Enabling a Center of Production Excellence

Pt. 2: Independent, Involved, Informed, and Informative: The Characteristics of a CoPE

Pt. 3: Staffing Up Your CoPE

Pt. 4-1: The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation

Nick Travaglini

Senior Technical Customer Success Manager

Nick is a Technical Customer Success Manager with years of experience working with software infrastructure for developers and data scientists at companies like Solano Labs, GE Digital, and Domino Data Lab. He loves a good complex, socio-technical system. So much so that the concept was the focus of his MA research. Outside of work he enjoys exercising, reading, and philosophizing.
