AIOps: Prove It!
I’ve read a steadily increasing stream of articles about using AI in SRE, and I have yet to find one that inspires my trust. Each...
Always. Enable. Keepalives.
As part of our recent failure testing project, we ran into an interesting failure mode involving the OpenTelemetry SDK for Go. In this post, we’ll...
Destroy on Friday: The Big Day 🧨 A Chaos Engineering Experiment - Part 2
In my last blog post, I explained why we decided to destroy one third of our infrastructure in production just to see what would happen....
Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment - Part 1
We recently took a daring step to test and improve the reliability of the Honeycomb service: we abruptly destroyed one third of the infrastructure in...
Should Every Incident Get a Retro?
At a recent training session, Jeli spent a great deal of time covering incident retrospectives and what makes an incident worthy of studying. My colleague...
The Incident Retrospective Ground Rules
I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was...