Syncing PagerDuty Schedules to Slack Groups
We’ve posted before about how engineers on call at Honeycomb aren’t expected to do project work, and how whenever they’re not dealing with interruptions, they’re...
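The title describes the mechanics: look up who is currently on call in PagerDuty and point a Slack user group at them. Here is a minimal sketch of that idea, assuming the public PagerDuty REST API (`/oncalls`) and Slack Web API (`users.lookupByEmail`, `usergroups.users.update`); the environment variables, IDs, and helper names are placeholders, not Honeycomb's actual implementation.

```python
import os
import requests

PD_TOKEN = os.environ["PAGERDUTY_TOKEN"]   # assumed env vars, not from the post
SLACK_TOKEN = os.environ["SLACK_TOKEN"]
SCHEDULE_ID = "PXXXXXX"                    # hypothetical PagerDuty schedule ID
USERGROUP_ID = "SXXXXXX"                   # hypothetical Slack user group ID

def current_oncall_emails(schedule_id):
    """Return email addresses of users currently on call for a schedule."""
    resp = requests.get(
        "https://api.pagerduty.com/oncalls",
        headers={"Authorization": f"Token token={PD_TOKEN}"},
        params={"schedule_ids[]": schedule_id},
    )
    resp.raise_for_status()
    emails = []
    for oncall in resp.json()["oncalls"]:
        # The oncall entry carries a user reference; fetch the full user for its email.
        user = requests.get(
            oncall["user"]["self"],
            headers={"Authorization": f"Token token={PD_TOKEN}"},
        ).json()["user"]
        emails.append(user["email"])
    return list(dict.fromkeys(emails))     # dedupe across escalation levels

def slack_user_ids(emails):
    """Map email addresses to Slack user IDs."""
    ids = []
    for email in emails:
        resp = requests.get(
            "https://slack.com/api/users.lookupByEmail",
            headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
            params={"email": email},
        ).json()
        if resp.get("ok"):
            ids.append(resp["user"]["id"])
    return ids

def update_usergroup(user_ids):
    """Point the Slack user group at the current on-call engineers."""
    requests.post(
        "https://slack.com/api/usergroups.users.update",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        data={"usergroup": USERGROUP_ID, "users": ",".join(user_ids)},
    )

if __name__ == "__main__":
    update_usergroup(slack_user_ids(current_oncall_emails(SCHEDULE_ID)))
```

Run periodically (for example from cron), a script like this keeps an @oncall-style group pointed at whoever currently holds the pager; the Slack token needs the appropriate user-group write scope.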
Making Room for Some Lint
It’s one of my strongly held beliefs that errors are constructed, not discovered. How we frame an incident’s causes, contributing factors, and context ends up...
Negotiating Priorities Around Incident Investigations
There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions...
Alerts Are Fundamentally Messy
Good alerting hygiene consists of a few components: chasing down alert conditions, reflecting on incidents, and thinking of what makes a signal good or bad....
From Oops to Ops: SLOs Get Budget Rate Alerts
As someone who’s lived the Honeycomb ops life for a while, I can say SLOs have been the bread and butter of our most critical and useful alerting. However,...
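As a rough illustration of what a budget rate (burn) alert evaluates, and not Honeycomb's actual implementation: take the recent rate at which the error budget is being consumed, project how soon the remaining budget runs out, and fire if that projection falls under a threshold.

```python
def burn_alert(
    failed_events: int,       # SLO-violating events in the recent window
    window_hours: float,      # length of that window
    budget_remaining: float,  # violating events the SLO can still absorb this period
    exhaustion_threshold_hours: float = 24.0,
) -> bool:
    """Fire if, at the current burn rate, the error budget would be exhausted
    within `exhaustion_threshold_hours`. Illustrative sketch only."""
    if failed_events == 0:
        return False
    burn_rate = failed_events / window_hours    # violations per hour, recently
    hours_left = budget_remaining / burn_rate   # projected time to exhaustion
    return hours_left < exhaustion_threshold_hours
```

For example, `burn_alert(failed_events=120, window_hours=1, budget_remaining=600)` projects exhaustion in five hours and would fire against the 24-hour threshold.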
Incident Review: What Comes Up Must First Go Down
On July 25th, 2023, we experienced a total Honeycomb outage. It impacted all user-facing components from 1:40 p.m. UTC to 2:48 p.m. UTC, during which...
There Are No Repeat Incidents
People seem to struggle with the idea that there are no repeat incidents. It is very easy and natural to see two distinct outages, with...
How We Define SRE Work, as a Team
The SRE team is now four engineers and a manager, and we are involved in all sorts of things across the organization, across all sorts...
How We Manage Incident Response at Honeycomb
When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to...
Counting Forest Fires: Incident Response Metrics
There are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know...
Incident Review: Shepherd Cache Delays
In this incident review, we’ll cover the outage from September 8th, 2022, where our ingest system went down repeatedly and caused interruptions for over eight...
Incident Review: Working as Designed, But Still Failing
A few weeks ago, we had a couple of incidents that ended up impacting query performance and alerting via triggers and SLOs. These incidents were...
On Counting Alerts
A while ago, I wrote about how we track on-call health, and I heard from various people about how “expecting to be woken up” can...
Tracking On-Call Health
If you have an on-call rotation, you want it to be a healthy one. But this is sort of hard to measure because it has...
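One hedged sketch of the kind of signal such tracking could start from (the work hours, weekend rule, and weekly bucketing here are arbitrary assumptions, not the metric described in the post): count how many pages land outside working hours, per ISO week of the rotation.

```python
from collections import Counter
from datetime import datetime, time

def off_hours_pages_per_week(page_times, workday_start=time(9), workday_end=time(17)):
    """Count pages landing outside working hours, bucketed by (year, ISO week).

    `page_times` is an iterable of datetimes, one per page received.
    """
    counts = Counter()
    for ts in page_times:
        weekend = ts.weekday() >= 5                        # Saturday or Sunday
        off_hours = not (workday_start <= ts.time() < workday_end)
        if weekend or off_hours:
            counts[ts.isocalendar()[:2]] += 1              # key: (year, ISO week)
    return counts

# Example: the 3 a.m. page counts against the week; the mid-afternoon one does not.
print(off_hours_pages_per_week([datetime(2023, 7, 25, 3, 0), datetime(2023, 7, 25, 15, 0)]))
```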
OnCallogy Sessions
Being on call is challenging. It’s signing up to operate complex services in a totally interruptible manner, at all hours of the day or...