About a year ago, Honeycomb kicked off an internal experiment to structure how we do incident response. We looked at the usual severity-based approach (typically a SEV scale), but decided to adopt an approach based on incident types instead, aiming for quick definitions that work for multiple departments at once. This post is a short report on our experience doing it.
Problem statement
Incident priorities play the role of a boundary object: a concept that is shared across various parts of an organization, but that means different things and is used for distinct purposes by different people and departments.
The classic approach to incidents is to pick incident severity levels, often on a grid a bit like this:
| SEV | Example Description | Example Response |
| --- | --- | --- |
| 1 | Critical issue. Systems down. Impacts a large number of customers, or all of them. SLAs not being met. User data loss. Security issue where customer data is at risk. Organization survival at risk. | Public communications required. Executive teams involved. All hands on deck. Formal incident investigation and public reports expected. |
| 2 | Major issue. Severe impact on a large number of customers. SLAs at risk. Incident response itself is affected. Product is unusable or inaccessible. | Public communications required. Internal status updates required. Formal incident command is activated. Subject matter experts across teams brought in. Formal incident investigation and public reports expected. |
| 3 | Minor issue. Some systems are not working properly, but most are. Some customers are impacted. | Public communications required. Internal status updates required. The team owning the impacted systems is part of the response. Internal incident investigation expected. |
| 4 | Low impact. Informational notice. Performance may be degraded in some services. Some customers may see an impact. | Public communications may be useful if updates can’t be delivered directly to impacted customers. Mitigations can be worked on by on-call people if rollbacks aren’t adequate. |
Scales vary in how many levels they have, in how automated their responses are, and in the criteria they include.
The challenge is that they offer a single linear scale that encompasses many elements:
- Impact: number of services, performance and uptime implications, accessibility and usability criteria, contractual obligations.
- Organizational factors: ownership definition, departments with different compliance requirements, budgets, and schedules.
- Workload: types of communication, incident command structure, number of people involved, expected duration, investigations after the event, type of follow-up and action items (and urgency), types of procedures (such as “break-glass” situations).
Because severities are a single linear scale, they also create fuzzy boundaries and corner cases when elements ranging across severity levels are in a single incident:
- What if a single customer has their data exposed or lost?
- What if there’s performance degradation, but it’s system-wide and strategic customers are complaining and threatening to leave?
- What if the impact is currently not visible, but if nothing is done within a day or two, it will be a SEV-1 incident?
These fuzzy boundaries are a necessary consequence of trading off the complexity of the real situation against straightforward response mechanisms that work across an organization. However, they sure are messy.
Why it’s a mess
Each role brings a distinct perspective, asymmetric information, a different set of priorities, and different tools, yet all of them are brought in line by a single event. Compressing that variety of stakeholders and people into a single scale means it gets used inconsistently across the organization:
- An engineer who believes the impact to be wide but easily fixable may want to underplay the severity to avoid roping in external departments or activating machinery that will slow the response while it ramps up and propagates through the organization. Responders may also try to delay formal response because they want to be very sure before activating it at full scale.
- Someone who knows no customer-facing impact currently exists, but who needs org-wide support to resolve a looming threat may declare a higher severity incident to get the hands they believe they need.
- Someone in a more customer-facing role may feel customers’ pain more directly and argue in favor of a higher severity to properly communicate how serious the organization believes the issue is.
It isn’t necessarily problematic that we use a compressed set of terms to define incidents and their response. The issue is that we use a rather non-descriptive linear scale: a lower number means a more severe incident. The definitions are loose and depend on the stakeholder’s stance within the organization, but the responses are very real.
People perceive the severity as both a norm to respond to and a tool to be wielded, and they have different tolerances for the disruption a chosen severity causes, as well as different thresholds for moving between levels. In ambiguous situations, some will default to noisier responses while others will wait for more validation before making that call, particularly if incident severities feed into metrics and targets the organization tracks for a given reporting period.
Experience shows that the outcome is an ongoing tension between descriptive severity (based on the fuzzy impact and scope) and prescriptive severity (the response we think we need), mixed in with time pressure, incomplete information, and competing objectives. The more we fold into the severity mechanism, the trickier it is to navigate, and the longer the debates.
TL;DR: Response severity is different for every team
The response required of engineers may be more demanding for a non-customer-impacting “ticking time bomb” situation or a critical security vulnerability patch than for a higher-severity incident that requires a rollback.
For customer success teams, the breadth of impacted users and the magnitude (how unusable the product is) will define demand differently.
For incident commanders and managers, it is possible that broad involvement over long periods, at a high pace, will be more challenging than a one- or two-hour-long heavy session with a few subject matter experts.
Severity scales are linear, from lowest to most critical impact, but the responses for each type of stakeholder are not.
What we chose to do
This “tangling” effect is not necessarily avoidable by virtue of severity being a compromise across many stakeholders and viewpoints. I do not believe that approaches like “enforce the rules harder” or “make the guidelines even more complete” are effective. People do what they believe is useful, and that includes taking shortcuts and leveraging norms as tools.
If part of the challenge is that severity scales are inherently linear while the responses to incidents are not, then maybe severities are not a great pattern with which to orchestrate incident response. Instead, we may want to find a better pattern (even if it can never be perfect either) by ditching severities and going for more descriptive types.
We decided to declare incidents using whichever of the following types fits best (a rough sketch of how they might be encoded follows the list):
- Ambiguous: Not fully sure. Things look bad, but not fully broken (this is our default type).
- Internal: Chaos experiments, rollouts, non-customer impacting things, tests.
- Security: Security incidents have distinct workflows and compliance requirements.
- Time bomb: Internal, but we need a lot of hands on deck if we don’t want it to get worse.
- Isolated: One or few customers, over limited components.
- Major: Key features or important customers are definitely impacted. Warrants a big response.
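To make that concrete, here is a minimal sketch in Python of how such types could be encoded, with “ambiguous” as the default when nothing else clearly fits. The enum and the declare_incident helper are illustrative only, not our actual tooling.

```python
# A minimal sketch, not Honeycomb's actual tooling: descriptive incident types
# as an enum, with "ambiguous" as the default when nothing else clearly fits.
from enum import Enum


class IncidentType(Enum):
    AMBIGUOUS = "Not fully sure; things look bad, but not fully broken."
    INTERNAL = "Chaos experiments, rollouts, tests; no customer impact."
    SECURITY = "Distinct workflows and compliance requirements."
    TIME_BOMB = "Internal for now, but needs many hands before it gets worse."
    ISOLATED = "One or a few customers, over limited components."
    MAJOR = "Key features or important customers definitely impacted."


def declare_incident(incident_type: IncidentType = IncidentType.AMBIGUOUS) -> IncidentType:
    """Declare an incident; responders who are unsure simply take the default."""
    return incident_type


if __name__ == "__main__":
    print(declare_incident())                       # IncidentType.AMBIGUOUS
    print(declare_incident(IncidentType.SECURITY))  # IncidentType.SECURITY
```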
Being able to pick words lets people create workflows based on a description that correlates to surface, workload, and magnitude. Having too many words is likely to be as confusing as not having enough, so we wanted to keep the number low. To pick adequate words, we went back through our history of incident response (not just public-facing incidents) to see if we could come up with archetypes that describe most of what we had to deal with; the expectation is that over time, we’ll need to adjust them based on newer or more frequent experiences.
The key point is that while these terms are ambiguous on purpose, they carry meaning. An “ambiguous” incident is hard to describe, but everyone has an idea of what it means, especially when it’s bounded by “isolated” at the lower end and “major” at the upper end. Nobody’s necessarily mad at someone picking “ambiguous” as a value, which is often a useful property: the mess and uncertainty of the description is a feature, not a bug. Clarity comes over time, and people should understand that.
An “isolated” incident could very well be a major outage of a key feature that affects a lot of customers but is fully understood to be a bad deploy, just as it could be something that barely impacts a single customer we have a strong financial incentive to keep happy.
The ongoing outcome
We’ve also used these types as the foundation for workflows built with Jeli, the incident management and analysis software we’ve used for a few years now. With these types and their custom workflows, security incidents get private channels and automatically invite members of the security team into the incident channel, along with one person from platform leadership. Any public-facing (i.e., customer-impacting) type automatically invites representatives from our support team. Any non-internal type brings in incident response folks and appropriate leadership to help support the situation.
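As an illustration of what that routing looks like, here is a hypothetical sketch of the logic described above. It is not Jeli’s actual workflow configuration; the group names and the groupings of which types count as customer-facing or non-internal are assumptions based on this post.

```python
# Hypothetical routing sketch, not Jeli's actual workflow configuration.
# Group names and the customer-facing/non-internal groupings are assumptions.
from dataclasses import dataclass, field


@dataclass
class ResponsePlan:
    private_channel: bool = False
    invite: set[str] = field(default_factory=set)


# Assumed groupings of the types described above.
CUSTOMER_FACING = {"ambiguous", "isolated", "major"}
NON_INTERNAL = CUSTOMER_FACING | {"security", "time bomb"}


def plan_response(incident_type: str) -> ResponsePlan:
    plan = ResponsePlan()
    if incident_type == "security":
        plan.private_channel = True
        plan.invite |= {"security team", "platform leadership"}
    if incident_type in CUSTOMER_FACING:
        plan.invite.add("support")
    if incident_type in NON_INTERNAL:
        plan.invite |= {"incident response", "leadership"}
    return plan


if __name__ == "__main__":
    # A security incident gets a private channel, plus the security team,
    # platform leadership, incident response, and leadership invited.
    print(plan_response("security"))
```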
My point being, these descriptions carry meaning that goes further than a sliding scale, and that lets people subscribe or get involved based on what the term means rather than on how bad it is (where “how bad” is defined differently for every department). We’ve managed to avoid creating more incident types than we started with.
Our goal is to declare an incident as clearly as we can, as fast as possible. The definitions help pick a term based on what we see (or don’t know yet) more than anything else, and they felt useful to most engineers we polled at the end of the experiment.
A year later, we’re still using incident types as a mechanism. Unsurprisingly, most incidents fall into the “ambiguous” category. This is as designed, although our tooling doesn’t let us change the type after an incident has started. That limitation has come up as the biggest hurdle of our current system, restricting how much flexibility and accuracy we get out of it.
We’re always on the lookout for more improvements (e.g., synchronizing PagerDuty schedules to Slack aliases makes automation useful). If you have insights on useful approaches, let us know in Pollinators, our Slack community.
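For readers curious what that PagerDuty-to-Slack glue could look like, here is a rough sketch. It is only an illustration, not our production automation: the schedule ID, user group ID, and error handling are placeholders, and your setup may need different fields.

```python
# Rough sketch: mirror whoever is on call for a PagerDuty schedule into a Slack
# user group (e.g., @oncall). IDs and tokens below are placeholders.
import os

import requests

PD_TOKEN = os.environ["PAGERDUTY_TOKEN"]
SLACK_TOKEN = os.environ["SLACK_TOKEN"]
SCHEDULE_ID = "PXXXXXX"   # placeholder PagerDuty schedule ID
USERGROUP_ID = "SXXXXXX"  # placeholder Slack user group ID

PD_HEADERS = {
    "Authorization": f"Token token={PD_TOKEN}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}


def oncall_user_ids() -> list[str]:
    """Return PagerDuty user IDs currently on call for the schedule."""
    resp = requests.get(
        "https://api.pagerduty.com/oncalls",
        headers=PD_HEADERS,
        params={"schedule_ids[]": SCHEDULE_ID},
        timeout=10,
    )
    resp.raise_for_status()
    return [entry["user"]["id"] for entry in resp.json()["oncalls"]]


def pagerduty_email(user_id: str) -> str:
    """Look up a PagerDuty user's email address."""
    resp = requests.get(
        f"https://api.pagerduty.com/users/{user_id}",
        headers=PD_HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["user"]["email"]


def slack_user_id(email: str) -> str:
    """Map an email address to a Slack user ID via users.lookupByEmail."""
    resp = requests.get(
        "https://slack.com/api/users.lookupByEmail",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        params={"email": email},
        timeout=10,
    )
    data = resp.json()
    if not data.get("ok"):
        raise RuntimeError(f"Slack lookup failed for {email}: {data.get('error')}")
    return data["user"]["id"]


def sync_oncall_alias() -> None:
    """Point the Slack user group at whoever is currently on call."""
    slack_ids = [slack_user_id(pagerduty_email(uid)) for uid in oncall_user_ids()]
    resp = requests.post(
        "https://slack.com/api/usergroups.users.update",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        data={"usergroup": USERGROUP_ID, "users": ",".join(slack_ids)},
        timeout=10,
    )
    data = resp.json()
    if not data.get("ok"):
        raise RuntimeError(f"Slack user group update failed: {data.get('error')}")


if __name__ == "__main__":
    sync_oncall_alias()
```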