Against Incident Severities and in Favor of Incident Types
About a year ago, Honeycomb kicked off an internal experiment to structure how we do incident response. We looked at the usual severity-based approach (usually...
Syncing PagerDuty Schedules to Slack Groups
We’ve posted before about how engineers on call at Honeycomb aren’t expected to do project work, and that whenever they’re not dealing with interruptions, they’re...
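As a rough illustration of what a schedule-to-group sync like this can look like (a minimal sketch of the general approach, not Honeycomb's actual implementation; the environment variables, schedule ID, and user group ID are placeholders), the idea is to ask PagerDuty who is currently on call for a schedule and then overwrite a Slack user group with those people:

```python
import os
import requests

# Placeholder configuration for illustration only.
PD_TOKEN = os.environ["PAGERDUTY_API_TOKEN"]
SLACK_TOKEN = os.environ["SLACK_BOT_TOKEN"]
SCHEDULE_ID = os.environ["PD_SCHEDULE_ID"]        # the PagerDuty schedule to mirror
USERGROUP_ID = os.environ["SLACK_USERGROUP_ID"]   # the Slack user group to update

def current_oncall_emails():
    """Return the email addresses of users currently on call for the schedule."""
    resp = requests.get(
        "https://api.pagerduty.com/oncalls",
        headers={"Authorization": f"Token token={PD_TOKEN}"},
        params={"schedule_ids[]": SCHEDULE_ID, "include[]": "users"},
    )
    resp.raise_for_status()
    return {oncall["user"]["email"] for oncall in resp.json()["oncalls"]}

def slack_user_ids(emails):
    """Map email addresses to Slack user IDs via users.lookupByEmail."""
    ids = []
    for email in emails:
        resp = requests.get(
            "https://slack.com/api/users.lookupByEmail",
            headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
            params={"email": email},
        )
        data = resp.json()
        if data.get("ok"):
            ids.append(data["user"]["id"])
    return ids

def sync():
    """Replace the Slack user group's membership with the current on-call users."""
    ids = slack_user_ids(current_oncall_emails())
    if not ids:
        return  # never empty out the group if lookups failed
    requests.post(
        "https://slack.com/api/usergroups.users.update",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        data={"usergroup": USERGROUP_ID, "users": ",".join(ids)},
    )

if __name__ == "__main__":
    sync()
```

Run on a schedule (cron, a Lambda, whatever you have), this keeps an @-mentionable group pointed at whoever is on call right now.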
Making Room for Some Lint
It’s one of my strongly held beliefs that errors are constructed, not discovered. However we frame an incident’s causes, contributing factors, and context ends up...
Negotiating Priorities Around Incident Investigations
There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions...
Alerts Are Fundamentally Messy
Good alerting hygiene consists of a few components: chasing down alert conditions, reflecting on incidents, and thinking of what makes a signal good or bad....
From Oops to Ops: SLOs Get Budget Rate Alerts
Having lived the Honeycomb ops life for a while, I've found SLOs to be the bread and butter of our most critical and useful alerting. However,...
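For a sense of what a budget burn rate alert evaluates (a simplified sketch with made-up numbers, not Honeycomb's exact alerting logic): the burn rate compares the observed error rate against the error rate the SLO allows, and a sustained rate well above 1.0 is what typically warrants paging.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to the sustainable rate.

    A burn rate of 1.0, if sustained, exhausts the budget exactly at the end of
    the SLO window; higher values mean the budget runs out proportionally sooner.
    """
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target   # e.g., 0.001 for a 99.9% SLO
    return observed_error_rate / allowed_error_rate

# Example: 50 failed events out of 10,000 against a 99.9% target
# gives a burn rate of 5.0 -- the budget is burning five times too fast.
print(burn_rate(50, 10_000, 0.999))
```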
Incident Review: What Comes Up Must First Go Down
On July 25th, 2023, we experienced a total Honeycomb outage. It impacted all user-facing components from 1:40 p.m. UTC to 2:48 p.m. UTC, during which...
There Are No Repeat Incidents
People seem to struggle with the idea that there are no repeat incidents. It is very easy and natural to see two distinct outages, with...
How We Define SRE Work, as a Team
The SRE team is now four engineers and a manager, and we are involved in all sorts of things across the organization, across all sorts...
How We Manage Incident Response at Honeycomb
When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to...
Counting Forest Fires: Incident Response Metrics
There are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know...
Incident Review: Shepherd Cache Delays
In this incident review, we’ll cover the outage from September 8th, 2022, where our ingest system went down repeatedly and caused interruptions for over eight...
Incident Review: Working as Designed, But Still Failing
A few weeks ago, we had a couple of incidents that ended up impacting query performance and alerting via triggers and SLOs. These incidents were...
On Counting Alerts
A while ago, I wrote about how we track on-call health, and I heard from various people about how “expecting to be woken up” can...
Tracking On-Call Health
If you have an on-call rotation, you want it to be a healthy one. But this is sort of hard to measure because it has...