OnCallogy Sessions
Being on call is challenging. It means signing up to operate complex services in a totally interruptible manner, at all hours of the day or...
On the Brittleness of Dashboards
Dashboards are one of the most basic and popular tools software engineers use to operate their systems. In this post, I'll make the argument that...
How We Define SRE Work
At the time of writing this post, I have officially been at Honeycomb for one year as a site reliability engineer (SRE). I had shared...
Incident Resolution: Do You Remember, the Twenty Fires of September?
From September to early October, Honeycomb declared five public incidents. Internally, the whole month was part of a broader operational burden, where over 20 different...
Data Availability Isn’t Observability
But it’s better than nothing... Most of the industry is racing to adopt better observability practices, and they’re discovering lots of power in being able...
Lessons Learned From the Migration to Confluent Kafka
Over the last few months, Honeycomb’s platform team migrated to a new iteration of our ingest pipeline for customer events. Our migration to this newer...
On Not Being a Cog in the Machine
This is my first week here as the first dedicated SRE for Honeycomb, and in a welcoming gesture, I was asked if I wanted to...