Negotiating Priorities Around Incident Investigations

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems arise in discussions with multiple stakeholders. The problems I’ll explore in this blog—from the SRE perspective—concern time pressures (when to ship the investigation) and the type of report people expect.

Investigation types

Incident investigations, reviews, and reports play multiple roles. The bullet list below is inspired by Sidney Dekker’s The Psychology of Incident Investigations (short annotated version), where he breaks down four roles assigned to investigations:

  1. Moral: explain transgressions, and reinforce moral and regulatory boundaries. These tend to refer to norms (what you should or shouldn’t do), and deviations from norms and processes are often seen as contributing factors.
  2. Existential: explain the suffering that occurred. These assume that incidents are not supposed to happen, and seek ways to reassure people that they do not have to happen either.
  3. Preventative: explain how to avoid recurrence, ask for alterations. These seek explanations that can identify variables on which we can act to prevent similar incidents from happening again.
  4. Epistemological: explain what happened, causes, and effects. This approach works best when it represents multiple viewpoints to paint the richest picture possible, including even contradictions where a single truth can’t be found.

There’s tension here because the Moral and Existential approaches can clash with others. They may search for transgressions or improper behavior, and this may make the Preventative approach more challenging by obscuring or hindering investigation paths. People are less likely to contribute to investigations when they fear reprimand, for example.

The Epistemological approach can be in tension with the Moral and Existential types for similar reasons, but it can also clash with the Preventative type. The objective of preventing recurrence may force you to put on blinders in cases where everything worked as designed, and where the whole situation might have been an inevitable (or even acceptable) tradeoff in the system. Some fixes, made because you need to fix something, anything, might be ineffective, misleading, or harmful.

The approach I personally favor is always the one that centers on learning (Epistemological), with the belief that when you have good explanations, you can surface preventative approaches as well.

If you find yourself with management, users, customers, or peers looking for a Moral outcome, be prepared for them to consider your reviews a failure for not properly reinforcing expectations around professionalism or ownership. Shaping these expectations is groundwork for properly doing Epistemological or Preventative work, and it differs for internal and external stakeholders.


Deadlines and public relations

Customers sometimes demand quick analyses after incidents: a post-mortem, a root cause analysis (RCA), or another public report. In theory, this aligns well with incident investigations: the longer you wait, the more likely it is that participants will forget key details. Ideally, you want to start as soon as possible. A good in-depth investigation that truly tries to understand what was going on will, however, take far more than two business days: anything you promise within that window is guaranteed to be superficial and not that useful.

Public reports have their own purposes, and distinct audiences. It is quite possible that while you want Epistemological investigations internally, public reports will be Moral by showing you’re taking the situation seriously, or Existential by acknowledging the pain customers feel.

If your public report is also expected to be a source of preventive measures or explanations for users and industry peers, then these objectives might once again clash. A report produced rapidly can do the public relations role of apologizing and appeasing your users, but is unlikely to do a decent job for learning.

These use cases, while conflicting, are not all invalid. In fact, at Honeycomb, we’ve sometimes opted to publish multiple reports. Here are some things we’ve tried for minor incidents and serious outages:

  1. The status page, which describes all public-facing incidents that hit a significant portion of users. For minor incidents, this may be the only report written.
  2. A preliminary report, which is written within a two-to-three-day window after a major incident. It provides a quick description of what we think happened. If the incident is particularly interesting to us—or to our customers, often due to its severity—we note that a follow-up in-depth investigation will take place.
  3. An in-depth internal review (often with its own report), which may take weeks of on-and-off time to prepare and write. 
  4. An in-depth public report, which is based on our internal report. We redact names, implementation details, some bits of history, project roadmaps, social elements, and other similar content. The criterion here is, “Do we think our customers—or people elsewhere in the industry—could learn something useful from this?”
  5. A short blog post, which is a whittled-down version of the aforementioned report.

This distillation of information into multiple formats hits the mark for different stakeholders. We expect this balance to keep shifting as we grow and as our user base gets more diverse.

What we do in the shadows

We try to encourage learning from our incidents. Lots of groundwork (before my time as well—this isn’t something that started with me) was established to make that a possibility. To “protect” that ability, we’ve accepted that we need to write different reports for different audiences, which—luckily—we can alter without losing our internal approach and benefits.

Our focus on learning also has an interesting rule of thumb attached to it: we don’t review all incidents. The guideline is that we prefer a few in-depth reviews to shallow coverage of every incident. Pick and choose the incidents you’re going to dive deeper into:

  • Choose incidents where folks are surprised, or even say out loud “I want to review this” or “this is a really weird one.” They’re strong signals that these incidents are good learning opportunities.
  • Rare occurrences of weird incidents are worth jumping on at a higher priority; common incidents are probably going to happen again, and we can learn from them next time. This is a bit counter-intuitive because we tend to think in terms of clearing up the most common elements first, but aiming for a qualitative deep dive flips this idea around.
  • Large incidents with public-facing impact are generally worth reviewing. If external stakeholders want to know what happened, we should try to learn something from the incident as well.

But what if there are too many to choose from?

Let’s hope you’re never in this situation, but if you have too many incidents to choose from, conduct a meta-review in which you consider all of those incidents to be a single extended outage period:

  • How did this high-intensity period feel for your people? 
  • Are there patterns? 
  • What can you learn from these high-level patterns without necessarily digging deeper into the individual outages?

Answering these questions might help you refine how you handle incidents. 

We’re curious: what’s your current approach like? If you had a magic wand and you could fix one thing immediately, what would it be? Where would you find the most impact? Join the conversation in Pollinators, our Slack community.

Fred Hebert

Staff Site Reliability Engineer

Fred is a Staff Site Reliability Engineer (SRE) who has worked as a software engineer for over a decade and ended up with a healthy dislike of computers and clumsy automation. He’s a published technical author who loves distributed systems, systems engineering, and has a strong interest in resilience engineering and human factors.
