Incident Management Steps and Best Practices

By: Valerie Silverthorne | July 20th, 2023

According to the Uptime Institute’s 2022 Outage Analysis report, one out of every five companies has experienced a “serious” or “severe” incident over the past three years, and that share is increasing. Those incidents are expensive: over 60% cost more than $100,000, while 15% set their companies back close to $1 million. To put this in perspective, in 2019 only 39% of incidents cost more than $100,000, so the trend lines aren’t moving in the right direction.

A well-thought-out incident management plan that’s created and practiced before it’s needed can lessen these risks. Here’s everything you need to understand to get the most out of incident management.

What is incident management?

Put simply, incident management is the way an organization reacts to any kind of outage (security, broken code, severe weather, or anything that’s disruptive to customer service). Incidents are inherently fraught, not just because they’re time-consuming and costly, but because they can poison the well with customers, investors, and even partners.

Incident management requires companies to think through even unlikely scenarios and create plans for rapid discovery and resolution, as well as a robust (but nuanced) communications plan.

Solid incident management response should keep the following factors in mind:

Engineering will usually be key to finding and fixing an incident, but they’re not the only group who will be involved. Plan to include many stakeholders, from the C-suite to lawyers, public relations, partner marketing, and more.
Write the plans down and actively practice them.
Don’t forget about compliance requirements and state and federal laws.
Metrics can be invaluable in detecting incidents, so plan to incorporate service level agreements and service level objectives at a minimum.
Time is of the essence when it comes to incident management, so the more a company can build observability into the development process, the faster incidents can be found and resolved.

It’s important to stress that incident management is not a “nice to have” but a total “must have” for organizations of any size. An upfront investment in a comprehensive incident management plan will have a number of concrete benefits.

For starters, an incident management plan will make it easier to handle a small problem before it spirals out of control.
At the heart of any incident management effort is communication, and that can make the difference between keeping customers and losing them, not to mention keeping other stakeholders up to date as well.

Incident Management Lifecycle

A thoughtful incident management response plan doesn’t require organizations to reinvent the wheel. The National Institute of Standards and Technology (NIST) has a four-step incident response plan suitable for companies of all sizes: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity. Although this plan was created with cybersecurity in mind, the basic steps are a perfect starting point for incidents of any type.

Incident Management Process & Best Practices

Incident prevention

Even with all the prevention in the world, incidents will happen. But the more preparation teams undertake in advance, the better the outcome.

Start by building a culture of observability and establishing observability-driven development principles. Observability practices, from distributed tracing to establishing service level objectives and service level indicators, can actually allow teams to find problems before customers do, which is close enough to “incident prevention” to count.
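
To make "finding problems before customers do" a little more concrete, here is a minimal sketch of an error-budget burn-rate check built on a service level indicator and a service level objective. The 99.9% target, the request counts, and the function names are illustrative assumptions, not Honeycomb's implementation or API.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: fraction of events that met the success criteria."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def burn_rate(good_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    1.0 means the budget would be used up exactly at the end of the SLO
    period; values well above 1.0 suggest an incident may be developing.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # allowed failure fraction
    observed_error_rate = 1.0 - sli(good_events, total_events)
    return observed_error_rate / error_budget


# Hypothetical numbers: 10,000 requests in the last hour, 50 of them failed.
rate = burn_rate(good_events=9_950, total_events=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}")
```

With these assumed numbers the budget is burning about five times faster than the objective allows, which is the kind of signal a team could alert on well before customers start reporting problems.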

Incident identification

The other observability superpower is incident identification. Code that’s been instrumented with distributed tracing can send its telemetry to an observability platform like Honeycomb for near-instant data analysis and anomaly detection. Speed is everything when an incident is happening, so the more quickly a team can pinpoint the exact cause of a problem, the more quickly it can be fixed. Also, truly observable code provides context around the data, which means anyone on a team can step into the role of troubleshooter.
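
As a rough illustration of what "context around the data" can look like, the sketch below times a unit of work and records it as one structured event with trace identifiers and business context attached. It uses only the Python standard library. The `span` helper, the `EVENTS` list, and the field names are hypothetical; a real service would more likely use a tracing SDK such as OpenTelemetry and export events to a backend instead of collecting them in memory.

```python
import json
import time
import uuid
from contextlib import contextmanager

EVENTS = []  # stand-in for an exporter that would ship events to a backend


@contextmanager
def span(name, trace_id=None, parent_id=None, **context):
    """Time a unit of work and record it as one structured event."""
    event = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        **context,
    }
    start = time.perf_counter()
    try:
        yield event  # callers can attach more fields while the work runs
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        EVENTS.append(event)


# Hypothetical request: a checkout that calls a payment provider.
with span("checkout", user_id="u-123", cart_items=3) as root:
    with span("charge_card", trace_id=root["trace_id"],
              parent_id=root["span_id"], provider="example-pay") as child:
        child["retries"] = 0

print(json.dumps(EVENTS, indent=2))
```

Because each event carries the user, the provider, and its parent span, someone who did not write the code can still ask which customers or dependencies are involved when something slows down or fails.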

Incident communication

Having clear, dedicated communication channels—not to mention an up-to-date list of people/roles necessary to include—is perhaps the best antidote to incident management chaos and confusion. No one should be surprised by an ongoing incident, but no one should experience pager fatigue either. Organizations need to find the right balance to create the most effective communication possible.
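
One common way to keep people informed without causing pager fatigue is to suppress repeat pages for the same alert inside a quiet window. The sketch below is one possible shape for that idea; the `PagerThrottle` class, the alert keys, and the 15-minute window are assumptions made for illustration, not a feature of any particular paging tool.

```python
from datetime import datetime, timedelta


class PagerThrottle:
    """Suppress repeat pages for the same alert key within a quiet window."""

    def __init__(self, quiet_window: timedelta):
        self.quiet_window = quiet_window
        self._last_paged = {}  # alert key -> time of the last page sent

    def should_page(self, alert_key: str, now: datetime) -> bool:
        last = self._last_paged.get(alert_key)
        if last is not None and now - last < self.quiet_window:
            return False  # already paged recently; don't wake anyone again
        self._last_paged[alert_key] = now
        return True


throttle = PagerThrottle(quiet_window=timedelta(minutes=15))
t0 = datetime(2023, 7, 20, 3, 0)

decisions = [
    throttle.should_page("api-latency-slo", t0),
    throttle.should_page("api-latency-slo", t0 + timedelta(minutes=5)),
    throttle.should_page("db-disk-full", t0 + timedelta(minutes=6)),
    throttle.should_page("api-latency-slo", t0 + timedelta(minutes=20)),
]
print(decisions)  # [True, False, True, True]
```

The repeat alert five minutes later is suppressed, a different alert still pages, and the original alert pages again once the quiet window has passed.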

Incident reporting

Incident reporting and communication are closely related, but there can be significant differences in the “need to know” timing. Those involved with finding and resolving should be immediately looped in, while those who have to deal with potential fallout (customer success, legal, public relations, etc.) are the second tier when it comes to incident reporting. This is another concrete example of why it’s so critical to have an incident management plan.
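
The two tiers described above can be written down as data so that nobody has to remember them mid-incident. This is a minimal sketch: the roles, the channels, and the rule that the second tier joins once customer impact is confirmed are all hypothetical and should be replaced by whatever your own plan specifies.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    RESPONDERS = 1  # people finding and fixing the problem
    FALLOUT = 2     # people handling customer, legal, or PR consequences


@dataclass(frozen=True)
class Contact:
    role: str
    channel: str
    tier: Tier


# Hypothetical contact list; real roles and channels come from your plan.
CONTACTS = [
    Contact("on-call engineer", "pager", Tier.RESPONDERS),
    Contact("incident commander", "pager", Tier.RESPONDERS),
    Contact("customer success", "chat", Tier.FALLOUT),
    Contact("legal", "email", Tier.FALLOUT),
    Contact("public relations", "email", Tier.FALLOUT),
]


def who_to_notify(customer_impact_confirmed: bool) -> list[Contact]:
    """Tier 1 is always notified; tier 2 joins once customer impact is confirmed."""
    max_tier = Tier.FALLOUT if customer_impact_confirmed else Tier.RESPONDERS
    return [c for c in CONTACTS if c.tier <= max_tier]


for contact in who_to_notify(customer_impact_confirmed=False):
    print(f"notify {contact.role} via {contact.channel}")
```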

Incident retrospective

The best way to know if an incident management plan is working is through an incident retrospective. The entire team needs to have a detailed discussion of the successes, the failures, and what might be done differently next time. Take those findings and bake them into the incident management plan. But it’s equally important to be realistic. For many organizations, lack of time or other resources may make it impossible to “retro” everything. If that’s the case, be sure to establish guidelines around which incidents should take priority.

Practice

Even with the best incident management plan in place, incidents can be stressful, and that stress can make remembering the details of the plan difficult. You want your incident responders to execute it automatically, and a great way to make that more likely is by practicing it in advance. You can schedule mock incidents, called Game Days, in which a team responds to a fictional incident using your plan. Not only will this familiarize them with the plan, but it will also help you find the plan’s rough spots and smooth them out before a real incident.

Incident management: can you depend on tools?

Let’s be clear: there is no silver bullet tool for incident management. In fact, the opposite is true: incident management is a tricky mix of observability and communication tools, best practices, and a thoughtful plan that’s rehearsed regularly.

Teams hoping to take incident management to the next level must be sure they can find and fix incidents quickly (that’s where an observability platform like Honeycomb comes in) and have a way to communicate the outage, its resolution, and any possible fallout.

Make sure tools are regularly reevaluated as part of the incident management plan, but don’t rely just on them. We recommend looking into statuspage.io, ServiceNow, other ticketing systems like Jira, and choosing tools that will work for you in the long term.

At Honeycomb, we use PagerDuty to alert on-call engineers and Jeli to streamline the incident process. We recently did a webinar with Jeli; it’s worth a watch if you’re learning about the incident process.

Conclusion

Incident management is a fact of modern software development life. Even so, we like to see the benefit of incidents: they are learning opportunities, and as you employ your incident response plan, they can become less stressful. Refine your plan, embrace a culture of observability, and put it all into practice and under review as needed, and you’ll be able to regain some control in the face of unpredictability. There’s never a downside to being prepared.

Go deeper:

Here’s how we manage incident response at Honeycomb

Get more out of incident retrospectives

Understand what incidents have to teach us
