Against Incident Severities and in Favor of Incident Types
About a year ago, Honeycomb kicked off an internal experiment to structure how we do incident response. We looked at the usual severity-based approach (usually...
Syncing PagerDuty Schedules to Slack Groups
We’ve posted before about how engineers on call at Honeycomb aren’t expected to do project work, and that whenever they’re not dealing with interruptions, they’re...
Making Room for Some Lint
It’s one of my strongly held beliefs that errors are constructed, not discovered. However we frame an incident’s causes, contributing factors, and context ends up...
Negotiating Priorities Around Incident Investigations
There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions...
Alerts Are Fundamentally Messy
Good alerting hygiene consists of a few components: chasing down alert conditions, reflecting on incidents, and thinking of what makes a signal good or bad....
Incident Review: What Comes Up Must First Go Down
On July 25th, 2023, we experienced a total Honeycomb outage. It impacted all user-facing components from 1:40 p.m. UTC to 2:48 p.m. UTC, during which...
Incident Management Steps and Best Practices
Incident management is the way an organization reacts to any kind of outage (security, broken code, severe weather, or anything that’s disruptive to customer service)....
There Are No Repeat Incidents
People seem to struggle with the idea that there are no repeat incidents. It is very easy and natural to see two distinct outages, with...
Should Every Incident Get a Retro?
At a recent training session, Jeli spent a great deal of time covering incident retrospectives and what makes an incident worthy of studying. My colleague...
How We Manage Incident Response at Honeycomb
When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to...
Counting Forest Fires: Incident Response Metrics
There are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know...
Solving a Murder Mystery
Bugs can remain dormant in a system for a long time, until they suddenly manifest themselves in weird and unexpected ways. The deeper in the...
Incident Report: The Missing Trigger Notification Emails
On November 18, between 00:50 and 00:56 UTC, an update was deployed that improved Honeycomb’s business intelligence (BI) telemetry available from our production operations environment....
Incident Report: Investigating an Incident That's Already Resolved
On the 23rd of April, we discovered that an incident had occurred approximately one week earlier. On April 16, for approximately 1.5 hours we...
Incident Report: Running Dry on Memory Without Noticing
On November 6, 2019, we intermittently rejected 1-3% of customer telemetry data at ingest for four periods of 20 minutes each. The trigger of the...