Blog

Posts by Lex Neva

Lex Neva

Staff Site Reliability Engineer

Lex is interested in making sociotechnical systems as reliable as they can possibly be (and no more). That work takes many forms, from reliable technical designs, to policy and process troubleshooting, to incident response, prevention, and analysis, and Lex wants to dig into all of them. He is the curator of SRE Weekly, a newsletter about all of the above and more.

Software Engineering, Dogfooding, Debugging

Always. Enable. Keepalives.

As part of our recent failure testing project, we ran into an interesting failure mode involving the OpenTelemetry SDK for Go. In this post, we’ll...

Software Engineering, Dogfooding

Destroy on Friday: The Big Day 🧨 A Chaos Engineering Experiment - Part 2

In my last blog post, I explained why we decided to destroy one third of our infrastructure in production just to see what would happen....

Software Engineering, Dogfooding

Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment - Part 1

We recently took a daring step to test and improve the reliability of the Honeycomb service: we abruptly destroyed one third of the infrastructure in...

Incident Response

Should Every Incident Get a Retro?

At a recent training session, Jeli spent a great deal of time covering incident retrospectives and what makes an incident worthy of studying. My colleague...

Software Engineering, Culture

The Incident Retrospective Ground Rules

I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was...