The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation

The CoPE is made to affect, that is, to change, how things work. The disruption it produces is a feature, not a bug. That disruption pushes things away from a locally optimal, comfortable state that generates diminishing returns. It sets things on a course of exploration to find new terrain which may benefit the organization more—and for longer.

Laurent Hébert-Dufresne and his co-authors produced a model of social organization that shows how changing behavioral norms are transmitted, which can bolster a group’s overall fitness. We take this model as our guide. Based on their research, we believe that an organization can successfully achieve its goals only through coordinated and mutually reinforcing changes. These changes should promote prosocial behaviors which decenter short-sightedly individualistic achievement.

The following sections consist of recommendations for a CoPE. A CoPE may use these to drive towards those desired outcomes; they especially focus on observability and the effective use of the Honeycomb product. The recommendations feature several socio-technical reforms at the level of individuals and teams (the bottom of the organization’s formal hierarchy), and institutional and policy changes at the management level (the top of the formal hierarchy).

We’ll cover “bottom-up” and “top-down” practices, which should mutually reinforce each other. The defining characteristic of bottom-up practices is that they diffuse through an organization’s network of frontline practitioners. The telltale sign of top-down practices, meanwhile, is that they are interventions which introduce an accelerant (or a dampener, as the case may be).

Additionally, the CoPE can interact with colleagues lower in the hierarchy in two different ways. The first is what I term “passive,” and the second is “active.” Both of these “modes” of engagement should affect the behavior of the team; the distinction between them is how. In the passive mode, the team receives materials produced by the CoPE and incorporates them into its existing habits and dispositions itself. In the active mode, the CoPE incites breaks from existing habits and dispositions.

The long-term impact of these interventions should be “cooperation,” which Hébert-Dufresne et al define as “behavior that carries group-level benefits.”

Telemetry instrumentation

The foundation of good observability is the instrumentation of your software. This instrumentation takes the form of additions to your services’ source code, either as software packages like libraries or new code handwritten by developers.

Our general recommendation is that folks instrument all of the services available to them in order to achieve at least basic visibility into their system’s behavior and performance characteristics. The next step is to customize that instrumentation to emit more detailed and information-dense telemetry. This instrumentation should produce what are, effectively, structured logs in JSON format. Each of these JSON objects is what Honeycomb refers to as an “event.”

Honeycomb, as an observability tool, allows developers and other interested parties to analyze that telemetry data by means of “queries” and to model their software system. Its columnar datastore and query engine allow its users to write new queries without needing to pre-aggregate data or to index it in advance. Furthermore, Honeycomb’s model encourages users to create what are termed “wide” events that make use of “high cardinality” values.

Events

Wide events are structured logs with lots of dimensions; Honeycomb supports events with up to 2000 dimensions at time of writing. Each dimension consists of a Key:Value pair. That Key is an “Attribute” or “Field” and is analogous to a column header in a spreadsheet. The more dimensions in a dataset, the more possible ways that one can segment and analyze the data.
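To make this concrete, here is a minimal sketch of a wide event as a flat collection of key:value pairs. The field names are entirely illustrative, not a Honeycomb schema:

```python
import json

# A hypothetical wide event: one structured log record with many dimensions.
# Real events can carry hundreds of fields; these names are made up.
event = {
    "timestamp": "2024-05-01T12:00:00Z",
    "service.name": "checkout",
    "http.method": "POST",
    "http.route": "/cart/checkout",
    "http.status_code": 200,
    "duration_ms": 83.4,
    "user.id": "u-20109",           # high-cardinality dimension
    "user.logged_in": True,         # low-cardinality dimension
    "cart.item_count": 3,
    "feature.flag.new_pricing": False,
}

# Each key is an attribute ("column header"); each value fills that column.
print(json.dumps(event, indent=2))
print(f"{len(event)} dimensions")
```

Adding a dimension is just adding another key:value pair, which is why wide events scale so naturally as instrumentation matures.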

Cardinality and dimensions

High cardinality refers to a quality of the data, specifically the values in each dimension. High-cardinality data is more detailed than low-cardinality data. What makes something “high” or “low” cardinality is the number of distinct elements in the set of possible values. A set with only three elements has lower cardinality than a set with 100,000; the value True in a set of {True, False, null} is less detailed than the value 20109 in a set of {0, 1, 2, …, 99999}. The latter is more distinctive, and therefore more informative.

For example, consider two events consisting of the same two dimensions. One dimension records whether a user is logged in and its data type is a Boolean, while the other records their user ID and is a string. Now suppose that each user logs in at the same time. The events will only be distinguishable by the user ID value. Building upon this, suppose that instead of two users, there are two thousand, and all of them log in at the same time. All of those events are distinguishable only because of the higher-cardinality set of possible user IDs. That greater distinguishability is what makes dimensions with higher-cardinality data more informative: they let us answer more specific questions, such as “Is user X logged in?”
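A small Python sketch (with made-up events) shows the point: projected onto the Boolean dimension alone, two thousand simultaneous logins collapse into a single indistinguishable value, while the user ID dimension keeps every event distinct:

```python
# 2,000 users all log in at the same moment.
events = [
    {"user.logged_in": True, "user.id": f"u-{i:05d}"}
    for i in range(2000)
]

# Projected onto the low-cardinality dimension, every event looks the same.
by_logged_in = {e["user.logged_in"] for e in events}
print(len(by_logged_in))   # 1 distinguishable value

# The high-cardinality user ID dimension keeps all 2,000 events distinct.
by_user_id = {e["user.id"] for e in events}
print(len(by_user_id))     # 2000 distinguishable values

# Only the high-cardinality dimension can answer "Is user u-00042 logged in?"
print(any(e["user.id"] == "u-00042" and e["user.logged_in"] for e in events))
```

The per-user question at the end is simply unanswerable from the Boolean dimension alone.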

With their powers combined…

Together, these serve as a basis for comparing the “information density” of different events. Wider events with higher cardinality values are more useful for analyzing system behavior because they enable finer-grained distinctions between segments and more flexibility in the scale of analysis. In other words, one can twist and turn the model in more ways and zoom in and out to a greater extent.

In the following sections, we’ll discuss two ways that Honeycomb recommends instrumenting your software, each with OpenTelemetry (OTel). The first makes use of OTel’s auto-instrumentation capabilities. The second builds upon OTel with custom work tailored to your system.

OpenTelemetry auto-instrumentation

The open-source OpenTelemetry (OTel) project offers a wide array of libraries and a robust suite of other tools for instrumenting software. Its documentation notes three native methods which a CoPE can utilize. Since the details of instrumentation depend on your organization’s services, this section will only briefly touch on them. Instead, the focus will be on the CoPE’s strategy for growing adoption of the span-and-trace-based mode of observability, distributed tracing—OTel’s sweet spot.


The problem

In our experience, many developers struggle to adjust to tracing. This may be due to experience with metrics or (unstructured) logs—they have trouble breaking from expected design patterns. Or it may be that they haven’t yet switched from prioritizing their system and its components (which is what many tools focus on) to prioritizing their customers’ or users’ experience. Or perhaps they simply don’t grok how to connect the data analysis part to system performance.

The solution

The first way to address this problem is to auto-instrument everything. Every service. Every proxy. Every library. 

CoPE: Instrument all the things!

(Ok, some things like ColdFusion monoliths can’t be instrumented. But do the things that can!) 
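Conceptually, auto-instrumentation wraps existing code paths so that every call emits a span without the service’s authors writing any telemetry code by hand. Here is a toy Python sketch of that idea—not the OTel implementation, and with invented field names—just to show the shape of what the wrapper produces:

```python
import functools
import time
import uuid

collected_spans = []  # stand-in for an exporter shipping spans to a backend

def auto_instrument(fn):
    """Wrap a handler so each call emits a span-like record automatically."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            # The handler's author wrote no telemetry code; the wrapper
            # records the name, a span ID, the duration, and any error.
            collected_spans.append({
                "name": fn.__name__,
                "trace.span_id": uuid.uuid4().hex[:16],
                "duration_ms": (time.perf_counter() - start) * 1000,
                "error": error,
            })
    return wrapper

@auto_instrument
def handle_checkout(cart_size: int) -> str:
    return f"charged for {cart_size} items"

handle_checkout(3)
print(collected_spans[0]["name"])  # handle_checkout
```

Real OTel auto-instrumentation does this at the library and framework level, which is why simply turning it on yields baseline visibility across every service that supports it.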

Developers will want access to data that represents what’s happening with the services they immediately work on or are responsible for. Without that data, there’s no way to move them toward caring about production excellence.

Once that data is available in Honeycomb, the next step is getting developers to look at it and make use of it. To that end, we recommend holding Introduction to Honeycomb trainings as an active tactic, and building structural constraints into existing processes to create the conditions for passive adoption.

The method

The precise constraints are, again, context-dependent: they hinge on the specific patterns of behavior that the CoPE wants to inflect. In a previous Honeycomb blog post, we discussed a few common types of social institutions that might be appropriate candidates, and how to change them. Here we add another example to the list:

Pull request “show me”

Many software development organizations rely on Git-based technologies. A frequent pattern is to use pull requests (PRs) to gate merging new code on a peer review, and part of submitting that PR is completing a templated form explaining the rationale and important details about the proposed code change.

One thing that a CoPE may do is modify this template to include a section requiring a Honeycomb query. The query should display the before; running the same query once the code change is merged should display the after. This lets everyone involved check how the code change has ‘moved the needle’ and affected the customer experience. Think of it as analogous to unit testing.
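As a sketch, the “show me” section of the template might capture the query in a form like the following. The field names mirror the general shape of a query (a time range, breakdowns, calculations, filters) but are illustrative, not a documented Honeycomb schema:

```python
import json

# Hypothetical "show me" block a PR author fills in: the same query is run
# before and after the merge. Field names are invented for illustration.
show_me_query = {
    "description": "P99 latency of the checkout endpoint, before vs. after",
    "time_range_seconds": 7200,
    "breakdowns": ["http.route"],
    "calculations": [{"op": "P99", "column": "duration_ms"}],
    "filters": [{"column": "http.route", "op": "=", "value": "/cart/checkout"}],
}

print(json.dumps(show_me_query, indent=2))
```

Embedding something this concrete in the template forces the author to express the expected customer-facing effect of the change, not just its implementation details.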

This benefits adoption because it requires the developers to learn how to represent their system via Honeycomb’s query builder and to focus their attention explicitly on how the production system’s characteristics impact customers. It also documents this for future reference and makes it transparent to all involved.

But wait, there’s more!

Auto-instrumentation is a great start, but that’s all it should be—a start. Join us next week as I dive into custom instrumentation and telemetry pipelines, and how the CoPE can effect change there.

If you missed the first few blogs, here’s a list for you: 

Part 1: Establishing and Enabling a Center of Production Excellence

Part 2: Independent, Involved, Informed, and Informative: The Characteristics of a CoPE

Part 3: Staffing Up Your CoPE


Nick Travaglini

Senior Technical Customer Success Manager

Nick is a Technical Customer Success Manager with years of experience working with software infrastructure for developers and data scientists at companies like Solano Labs, GE Digital, and Domino Data Lab. He loves a good complex, socio-technical system. So much so that the concept was the focus of his MA research. Outside of work he enjoys exercising, reading, and philosophizing.
