A CoPE’s Duty: Indexing on Prod

Odds are that a software engineer today is really focused on one place: pre-prod. Short for “pre-production,” this is slang for an environment where software code operates in a prototype phase of its development lifecycle. 

Common sense would have one believe that this is a safe space, a workbench of sorts, where problems can be found and remediated. Then, once engineers are reasonably certain everything’s working properly, they advance it to a matching environment called production, where the code behaves like it did in pre-prod and it merely needs to be managed by an operations team.

That story is a comforting lie.

The problem

A wise woman once said: “I always test my code and then I test it in production, too.” The truth is that code in prod and code in another environment may look the same, but behave differently. Prod and “lower environments” are purposely different because of the social aspects of their existence as sociotechnical systems. What does that mean?

Consider load testing. In a pre-prod environment, the traffic used to test performance is generated by an organization’s employees and runs along code paths that execute against synthetic data those employees created. In prod, the traffic is generated by actual customers and users. Employees sometimes use prod too, but most of the people using it won’t fall into that category. They’re mostly different people, whose knowledge, mental models, and interests can’t be reduced to those of the employees. What they do with the code will also be different and will change over time, so they’ll affect the system in different ways.

The tests run in lower environments will be different from what actually happens in prod. This means teams need to shift their focus to prod and build up their experience and tooling to effectively support it.

This contravenes much of the history of software engineering. So many of the different methodologies and technologies treat user behavior as disruptive and destructive. The underlying assumption is that the goal of engineering is to produce a system that works, and that’s so difficult to achieve that hermetically-sealed workspaces are required. But if the working system is so brittle that it can’t handle what customers and users do with it, then it’s as good as worthless to the organization. Organizations don’t just set a normative standard of behavior for their customers and users. They also serve customers and users. In other words, they dynamically co-create each other.

The solution

The way forward here is to treat prod as a sense organ for the organization, like a person’s ears or nose. It’s a medium for receiving information that the organization can process and then turn into definite actions that engineers and product are particularly attuned to. A CoPE therefore needs to ensure that these groups have the right signals coming into Honeycomb and that they’re transformed and shared with the rest of the org.

Some of this will look like what we’ve discussed before in terms of instrumentation and alerting. However, there are distinct interventions that a CoPE should consider if they find that the organization implicitly values work in lower environments and that it’s necessary to shift their colleagues’ center of gravity to prod.

The method

First things first: developers need to understand what effects their code deploys and releases have. Once they have that info, they can circulate it within their team, then across the eng org at large, and finally find ways to share it with the broader organization. The key to succeeding is to start small and let iterative changes compound.

Honeymarkers

One track that a CoPE might take is to begin adding Honeymarkers automatically as code deploys. These appear as annotations on visualizations in the Honeycomb UI and can include information like the deploy ID or commit ID. They’ll allow anyone to correlate behavioral changes with deploys, making it easy to see if a code change had the desired effects or if it needs to get rolled back. 
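As a rough illustration, a deploy pipeline step could create the marker with a single HTTP call. The sketch below assumes Honeycomb's Markers API (a POST to /1/markers/<dataset> authenticated with the X-Honeycomb-Team header). The function names, the dataset name, and the message format are hypothetical choices made for this example, not something prescribed by this series.

```python
import json
import urllib.parse
import urllib.request

HONEYCOMB_API = "https://api.honeycomb.io"


def build_marker_request(api_key, dataset, deploy_id, commit_sha):
    """Build (but do not send) a request that creates a deploy marker.

    The message format is an assumption for illustration; any short
    string that identifies the deploy would do.
    """
    body = {
        "message": f"deploy {deploy_id} ({commit_sha[:7]})",
        "type": "deploy",
    }
    dataset_path = urllib.parse.quote(dataset, safe="")
    return urllib.request.Request(
        url=f"{HONEYCOMB_API}/1/markers/{dataset_path}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Honeycomb-Team": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_marker(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A CI job would call build_marker_request with values taken from its own environment (API key, deploy ID, commit SHA) and pass the result to send_marker right after the deploy succeeds. Keeping the build and send steps separate makes the payload easy to check without touching the network.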

Custom instrumentation

Then, a CoPE could help teams to add that same ID as an attribute via custom instrumentation so they can actually include it in their query parameters. That permits triangulating between the ID, the marker, and any other parameter(s) serving as the dependent variable(s).
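One low-effort way to get that ID onto every event, sketched here under assumptions: if the services use an OpenTelemetry SDK, the standard OTEL_RESOURCE_ATTRIBUTES environment variable (a comma-separated list of key=value pairs) attaches resource attributes to all telemetry the process emits. The attribute keys app.deploy_id and app.commit_sha, the DEPLOY_ID and COMMIT_SHA variables, and the helper names are hypothetical.

```python
import os


def deploy_attributes(environ=None):
    """Collect deploy identifiers from the environment.

    DEPLOY_ID and COMMIT_SHA are assumed to be set by the deploy
    pipeline; the names are placeholders for illustration.
    """
    env = os.environ if environ is None else environ
    return {
        "app.deploy_id": env.get("DEPLOY_ID", "unknown"),
        "app.commit_sha": env.get("COMMIT_SHA", "unknown"),
    }


def with_deploy_attributes(existing, attrs):
    """Merge attrs into an OTEL_RESOURCE_ATTRIBUTES-style string.

    Existing pairs are kept; keys present in attrs are overwritten.
    Values containing ',' or '=' would need escaping, which this
    sketch does not handle.
    """
    merged = {}
    for pair in filter(None, existing.split(",")):
        key, _, value = pair.partition("=")
        merged[key.strip()] = value.strip()
    merged.update(attrs)
    return ",".join(f"{key}={value}" for key, value in merged.items())
```

The deploy tooling would export the resulting string as OTEL_RESOURCE_ATTRIBUTES before starting the service. With the same ID in both the marker message and an attribute, a query can group or filter by deploy and line the results up against the marker on the graph.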

Once those are in place, it becomes very easy for a team to put together the PR “show me” suggested in a previous article. The Honeycomb feature that makes this work, and that serves as a documentary paper trail, is the query permalink: a URL that lets anyone revisit the query results indefinitely.

Sharing out to the wider org

Finally, engineers can begin sharing their changes in places like sprint reviews, departmental all-hands, and even company-wide events like demo days.

These practices all reinforce the idea that engineering is working on things that improve the experience of the system’s users and that align with the organization’s goals. They also break down knowledge silos, because explaining what a change is and why it was done requires cross-functional contextualization (i.e., storytelling). Furthermore, they give everyone a chance to celebrate the excellent work they’re doing and express appreciation for one another’s achievements.

Additional steps beyond this point might include utilizing feature flags to distinguish between deployed and released code, or a mechanism like GitHub Actions Deployment Protection Rules to gate changes on the results of Honeycomb queries.
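To make the gating idea concrete, here is a minimal sketch of the decision a gate might make once it has query results in hand. How those results are fetched, and how the decision is reported back to the deployment system, are deliberately left out. The error-rate comparison, the threshold, and all names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    reason: str


def evaluate_gate(baseline_error_rate, candidate_error_rate, max_increase=0.01):
    """Allow a change unless the error rate rose by more than max_increase.

    Both rates are fractions between 0 and 1, e.g. taken from a query
    over the window before the deploy and a query over the window after.
    """
    delta = candidate_error_rate - baseline_error_rate
    if delta > max_increase:
        return GateDecision(
            False,
            f"error rate rose by {delta:.3f}, limit {max_increase:.3f}",
        )
    return GateDecision(
        True,
        f"error rate change {delta:+.3f} within limit {max_increase:.3f}",
    )
```

A real gate would likely look at more than one signal (latency, saturation, a business metric), but even a single comparison like this ties the go or no-go decision to what prod is actually reporting.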

Conclusion

These, and other techniques, grant teams and organizations the capacity to treat prod as a veritable source—and medium—of information. It’s not just an edge, but also a node in a network; in fact, it’s several edges and nodes. In other words, it’s complex. 

Maintaining this metastable system requires continual intervention and modulation. Too much focus on pre-prod hampers those efforts and dulls the senses. It’s much more valuable for organizations to enable and value engagement with prod.

This brings me to the final point of this series, to be discussed in the next post: how and why organizational leadership must play a role in a CoPE’s success.

In the meantime, some Honeycomb experts have begun sharing their thoughts and experiences with adopting observability 2.0. Check out their videos, and don’t forget to sign up for Honeycomb if you haven’t already!



For a list of the prior posts in the CoPE series, see below:

Pt. 1: Establishing and Enabling a Center of Production Excellence

Pt. 2: Independent, Involved, Informed, and Informative: The Characteristics of a CoPE

Pt. 3: Staffing Up Your CoPE

Pt. 4-1: The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation

Pt. 4-2: The CoPE and Other Teams, Part 2: Custom Instrumentation and Telemetry Pipelines

Pt. 5: A CoPE’s Guide to Alert Management

Nick Travaglini

Senior Technical Customer Success Manager

Nick is a Technical Customer Success Manager with years of experience working with software infrastructure for developers and data scientists at companies like Solano Labs, GE Digital, and Domino Data Lab. He loves a good complex, socio-technical system. So much so that the concept was the focus of his MA research. Outside of work he enjoys exercising, reading, and philosophizing.
