Adopting Observability: Lessons Learned On How to Reduce Cognitive Load
Introduction
At this year’s hnycon, we hosted a roundtable discussion with a few of our guest speakers about the lessons they’ve learned while implementing observability and Honeycomb at their organizations. The speakers included:
- Frank Chen, Senior Staff Software Engineer at Slack
- Glen Mailer, Senior Staff Software Engineer at CircleCI
- John Casey, Principal Software Engineer at Red Hat
- Michael Ericksen, Staff Site Reliability Engineer at Intelligent Medical Objects (IMO)
- Pierre Vincent, Head of SRE at Glofox
- Renato Todorov, Global VP of Engineering at HelloFresh
The discussion touched on quite a few things, but one of the main points everyone agreed on was the importance of reducing the cognitive load placed on teams when introducing observability. Renato Todorov shared his experience at HelloFresh:
“We weren’t expecting 300 people to immediately jump in when we said ‘observability is the thing,’ but we also didn’t expect such a low engagement. We didn’t consider the cognitive load that people were already dealing with. When we pushed for adoption, people were busy working on other stuff and we were just dumping them a Jira task.”
To smooth out the process of adopting observability, you need to reduce the cognitive load you place on your stakeholders by starting small, engaging people’s curiosity, and leveraging storytelling.
Lesson One: Start small and use a compelling hook
Start small with what Michael Ericksen (IMO) called a “hook.” If you think back to your high school English class, you’ll remember that a hook is the first two to three sentences in your essay that are supposed to grab your readers’ interest and give them a hint of what’s in store.
Applied to observability, your “readers” are your stakeholders and you should pick something small—in other words, something that’s easily achievable, compelling, and educational for everyone involved.
A good way to ensure your hook achieves all three is to tie it directly to business objectives. Frank Chen (Slack) recommended that you “motivate folks with their specific business problem.” For instance, what is the goal they’re trying to achieve and can observability remove obstacles?
If you don’t start small, your goal of implementing observability will likely be lost in the backlog of everything else the team is working on. Michael described the typical result of not starting small at IMO: “The engineering team [would say], ‘We’ve added the story in our backlog: Make application observable.’ You can start much smaller than, ‘The whole thing needs to be observable.’ You just need a hook into the system.”
An example of a great hook is one Rich Anakor shared in his keynote speech about a team at Vanguard performing a migration from on-prem to a cloud repository. They were struggling for months trying to figure out all the dependencies. Once Rich’s team used observability to help pinpoint the issues, they found their answers in minutes.
This is a great hook because it gave Rich’s team a small, contained way to try out observability, and it had an immediate impact on stakeholders. That kind of excitement, along with the time saved, is very important: according to John Casey (Red Hat), it’s key to inspiring curiosity among your stakeholders. “You don’t have to have the entire thing done to get the benefit. What we’re doing [at Red Hat] is starting in one place and trying to build our way out [from there]. You have to give people space and time to have curiosity. Time pressure kills curiosity.”
Starting small reduces the cognitive load from “implement observability” to “implement observability in this one, small instance.”
Perhaps most importantly, starting small has the potential to get people excited and curious about what’s truly possible with a full-fledged culture of observability.
Lesson Two: Engage your stakeholders’ curiosity
Curiosity makes work fun: it literally engages the dopamine pathways in your brain. Site reliability engineers (SREs) and people in other support roles know this all too well because it’s their responsibility to be curious when it comes to solving problems in production.
Unfortunately, this curiosity isn’t always shared among other engineers in the organization who might see implementing observability as just another Jira ticket adding to their cognitive load. In light of this, Pierre Vincent (Glofox) suggested that you should try to get a curiosity mindset going for everybody. “It’s actually a little bit of a treasure hunt. It’s kind of a game, right?”
The “treasure hunt” terminology struck a chord with the whole panel. Hunting down issues with observability is kind of like a game or, as Michael (IMO) described it in his talk, The Curious Case of the Latency Spike, a Knives Out–style sleuth.
To spread this curiosity mindset, there are a couple of practical things you can do beyond writing up a report or scheduling a meeting. Pierre recommended short videos. “A five-minute Loom video explaining some weird thing in production and how we figured it out with Honeycomb is useful. [Show] them how [you] found that needle in that haystack.”
Meanwhile, Glen Mailer mentioned that at CircleCI, chat logs were very effective. “I think [it’s powerful to be able] to see a flow of chat with the Honeycomb queries in it and then referring back to that later.”
However you share the story, make sure it’s referenceable later and that it engages your stakeholders’ curiosity. Treat observability like the treasure hunt it is and you’ll reduce their cognitive load.
Lesson Three: Combat complexity with storytelling
One way to make observability adoption easier on others is to use your storytelling skills to illustrate complex topics and educate stakeholders.
Because you’re starting small, you can build up examples of how observability benefits the business, not to mention the daily lives of engineers. Spinning these out into simple stories makes the technical aspects of observability easy for everyone to digest.
Frank gave a great example during the panel (which he explained in more detail during his talk) where his team implemented their first cross-service trace on the second day of a multi-day cascading failure incident. They were able to solve the incident quickly with a cross-service trace when other efforts failed. Afterward, he said this situation really helped build a lot of interest in how other teams inside of Slack could adopt tracing and use this tooling.
This single, focused application of tracing gave Frank something specific to point to when explaining the importance of observability without having to go into details about how it works.
A side effect of these stories is that you’ll start to build a library of anecdotes and metaphors, like Pierre’s “treasure hunt” and Michael’s “mystery.” In Frank’s case, he came up with this explanation for the value of Honeycomb:
These stories, anecdotes, and metaphors make it easier to internally champion the benefits of Honeycomb and observability over time, even among people who aren’t directly involved.
With storytelling, you can reduce the cognitive load on your stakeholders by making it easier to understand the