It’s one of my strongly held beliefs that errors are constructed, not discovered. How we frame an incident’s causes, contributing factors, and context ends up influencing the shape of the corrective items (if any) that get created. I’ll cover these ideas using our June 3rd incident, where a database migration caused a large outage by locking up a shared database and making it run out of connections.

The incident—and the potential for blame

From our short public review, this element came out as most significant:

The migration involved was related to modifying an ENUM set on a database table, which unexpectedly caused a full table rewrite. It had previously run without issue on smaller databases, leading to a false sense of security. Additionally, two prior changes to the same ENUM field had not caused any performance issues. After the restart we made sure that data integrity was properly maintained, that all caches were properly aligned, and that the overall migration could safely complete. We are currently looking at strengthening our ability to spot risky migrations ahead of time (regardless of how well they worked on other databases in other environments).

Each Honeycomb engineer who writes a database migration is responsible for shepherding it through the stages. As they write it, the migration gets applied in their own dev environment and in the CI suites that run all the usual tests. It then goes through code review, and once the code is merged, it gets applied to at least four environments automatically: Dogfood (which monitors production) and Kibble (which monitors Dogfood) in both our US and EU regions. Once those have succeeded, the engineer can run a command to apply the migration in the production environments.

Basically, there is a process, there are validation and verification checks, and there’s someone looking at what happens. Yet, things still break from time to time. If we were to look for a quick solution, it would be easy to think that some things were lacking, that someone missed something. To name a few potential diagnoses:

  • The engineer who wrote the migration possibly wasn’t thorough enough or knowledgeable enough
  • The engineers who reviewed the pull request possibly weren’t paying enough attention or knowledgeable enough
  • The engineers who wrote or reviewed the pull request possibly weren’t the right people to do the job
  • The organization as a whole did not provide enough training or guidance

These are classic diagnoses, and some were brought up in the midst of the incident: teams with more SQL expertise weren’t on the reviewer list due to a code owners misconfiguration. The risky pattern itself (editing an ENUM in a non-additive manner) was sort of a surprise to everyone involved, and we could have been satisfied with ending our review there: spread the knowledge about the pattern and enforce reviews by experts.

However, database migrations have been a risk for a long while at Honeycomb, as they have in multiple other organizations. Some database migrations we ran in the past straight up failed, hung, caused outages, or in some cases triggered rare bugs in whatever version of MySQL we were running that created extended performance degradation.

While there are approaches that structurally reduce the likelihood of such issues (“store data differently”) or others that can provide better safety nets (“have a better load simulation environment”), they tend to be rather costly for uncertain benefits—would a new data store have no trade-offs? Would a better test environment actually catch these issues with low maintenance overheads?


On-call week is for ops work

A suggestion made by someone years ago was that we should lint MySQL migrations. Some patterns are bound to be risky, and these could be detected as part of continuous integration to warn people, particularly since for the entire history of Honeycomb, all database schema changes have been numbered and versioned. Linting provides a cheap feedback loop, requires little setup, and can capture many risky patterns even if it doesn’t cover them all.

Maybe the increased confidence of a linter saying “this is okay” could be good enough. There’s always a risk that safety nets provide a false sense of trust, but that false sense of trust is often there anyway: we don’t know what we don’t know, and because not everyone knows the same stuff, we often get surprised either way.

There was an easy way to find out: I was on call the week after the incident and I had this in mind. One of the policies Honeycomb has is that whenever you’re on call, you don’t do project work. The on-call engineer’s job, aside from escalations and incidents, is to make on-call better for everyone. The end result of this is that if you’re on call and have a bit of downtime, you’re free to explore solutions and improvements, pay down forms of tech debt you think impact reliability, write documentation, or improve instrumentation. Anything you can point at and say “this should make on-call easier” may fit in that week.

That let me give myself the mandate of just seeing what could be done and experimenting, regardless of the action items.

Setting up the linter

After researching and eliminating many candidates, we settled on Atlas, which has a full SaaS offering, but also has offline community editions—which we chose to use. They have a full list of checks, some of which are specifically related to the ENUM issue we saw.
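
To give a concrete idea of the setup, here is a minimal sketch of wiring that lint step into CI as a small wrapper script. The script name, migrations directory, dev-database URL, and exit-code handling are illustrative assumptions rather than our exact configuration; the flags mirror Atlas’s documented migrate lint usage.

# ci_lint_migrations.py: illustrative sketch of running Atlas lint in CI.
# The paths and dev database here are assumptions, not our actual setup.
import subprocess
import sys

def lint_latest_migration() -> int:
    # "atlas migrate lint" analyzes the newest migration file(s) against a
    # throwaway dev database (here a dockerized MySQL 8) and reports risky changes.
    result = subprocess.run(
        [
            "atlas", "migrate", "lint",
            "--dir", "file://migrations",         # assumed location of versioned migrations
            "--dev-url", "docker://mysql/8/dev",  # ephemeral dev database used for analysis
            "--latest", "1",                      # only analyze the most recent migration
        ],
        capture_output=True,
        text=True,
    )
    # Surface the linter's findings in the CI log and propagate its exit code
    # so a risky migration fails the build.
    print(result.stdout)
    print(result.stderr, file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(lint_latest_migration())

Only analyzing the latest file keeps the feedback loop short, which is most of the point of linting migrations in the first place.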

The problem upon trying it was that the linter did not detect the issue. It just took the migrations and said they were all fine. That is, a migration like this

ALTER TABLE test
-- assume the enum is currently specified as ENUM('A', 'B', 'C', 'D')
CHANGE enum_col enum_col ENUM('A', 'B', 'C') NOT NULL DEFAULT 'A';

would succeed every time. What I found out is that since Atlas connects to an empty database to run its checks, the risky rewrite goes unnoticed there; we could, however, force it to complain by mandating an algorithm:

ALTER TABLE test
 -- assume the enum is currently specified as ENUM('A', 'B', 'C', 'D')
 CHANGE enum_col enum_col ENUM('A', 'B', 'C') NOT NULL DEFAULT 'A',
ALGORITHM=INPLACE;

With this specification, we then get errors such as:

 Error: executing statement: Error 1846 (0A000): ALGORITHM=INPLACE is
 not supported. Reason: Cannot change column type INPLACE. Try ALGORITHM=COPY.

This is a bit of a cheat that relies on specifying algorithms for DDL operations:

  • INSTANT means the operations only modify metadata in the data dictionary. An exclusive metadata lock on the table may be taken briefly during the execution phase of the operation. Table data is unaffected, making operations instantaneous.
  • INPLACE means the operations avoid copying table data but may rebuild the table in place. An exclusive metadata lock on the table may be taken briefly during preparation and execution phases of the operation. Typically, concurrent operations (INSERT, UPDATE, DELETE) are supported.
  • COPY means operations are performed on a copy of the original table, and table data is copied from the original table to the new table row by row. Concurrent operations (INSERT, UPDATE, DELETE) are not permitted.

Basically, any COPY operation on either a large table or a table used in a lot of queries will likely hang a production database and should be rethought.

So what we did is simply add a little script that checks that any ALTER TABLE statement specifies an algorithm other than COPY. This makes the linter check effective and helps ensure we won’t interrupt production transactions with migrations.
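
For illustration, the check can be as small as a script that scans each migration for ALTER TABLE statements and fails when one doesn’t name a safe algorithm. This is a hedged sketch rather than our actual script; the file name, messages, and the crude semicolon-based statement splitting are all assumptions.

# precheck.py: illustrative sketch of the ALTER TABLE algorithm check,
# not our actual script. Run as: python precheck.py migrations/*.sql
import re
import sys

ALTER_RE = re.compile(r"^\s*ALTER\s+TABLE\b", re.IGNORECASE | re.MULTILINE)
SAFE_ALGO_RE = re.compile(r"ALGORITHM\s*=\s*(INSTANT|INPLACE)\b", re.IGNORECASE)

def check_migration(path: str) -> list[str]:
    # Flag any ALTER TABLE statement that doesn't request a non-COPY algorithm.
    problems = []
    with open(path) as f:
        text = f.read()
    # Splitting on semicolons is crude but good enough for migration files that
    # don't embed semicolons inside string literals.
    for statement in text.split(";"):
        if ALTER_RE.search(statement) and not SAFE_ALGO_RE.search(statement):
            problems.append(
                f"{path}: ALTER TABLE without ALGORITHM=INSTANT or ALGORITHM=INPLACE; "
                "a COPY rewrite could lock up the table in production."
            )
    return problems

if __name__ == "__main__":
    found = [p for path in sys.argv[1:] for p in check_migration(path)]
    print("\n".join(found))
    sys.exit(1 if found else 0)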

Working around it

The thing with automation of this kind is that it can do a great job at spotting common issues, but a terrible one at properly handling subtle elements that can be contextual. 

Most good linters have syntactic rules in place that let you override their values and checks. Atlas has simple ones that disable rules piecemeal when placed right before a statement, and we added a similar way to disable the ALGORITHM check mentioned earlier. Comments like the following can be added to any migration file:

-- precheck:nolint algorithm // can be put anywhere before a statement
-- atlas:nolint CHECK_A [CHECK_B ... CHECK_N] // must be right before a statement

These are “acknowledgements” of whatever the linter told you before you merge your migration. We trust people to make the right calls, and if it turns out to be surprising, we’ll also be around to help fix whatever happens.
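
To continue the hypothetical precheck sketch from earlier, honoring that acknowledgement only takes recognizing the comment ahead of a statement and skipping enforcement when it is present; the names here are again illustrative.

# Sketch: extend the hypothetical precheck script above to honor acknowledgements.
import re

NOLINT_ALGORITHM_RE = re.compile(r"--\s*precheck:nolint\s+algorithm\b")

def is_acknowledged(statement_chunk: str) -> bool:
    # The chunk holds a statement plus the comments preceding it, so an
    # acknowledgement placed anywhere before the statement lands in the same chunk.
    return bool(NOLINT_ALGORITHM_RE.search(statement_chunk))

# In check_migration() above, a flagged statement would simply be skipped
# when is_acknowledged(statement) returns True.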

All this information is in a document right next to the linter’s code. Our biggest problem right now is tying the visible linter error to the relevant information in our internal documentation.

Knowing, as an organization

Since we only see a few migrations a year that cause problems, we haven’t yet had the opportunity to see the new linter capture anything, aside from it asking a few people to explicitly put in their algorithm annotation and to acknowledge some risks.

However, it seems like an interesting balance between the proactive (but prohibitively costly) approach of training everyone to become SQL experts, and an ever-forgiving approach of letting people be surprised by production changes and then doing the repair work. Linting finds a decent spot in the middle—and for cheap—and as such, it may plug some holes more effectively than other methods.

Overall, I still feel that the best driver for improvements like this is having the latitude to take ownership and implement these experiments.

There’s always a little something wrong with systems. Sometimes your on-call shift is spent working on fixing urgent stuff, but some weeks are eerily calm. What are you going to do during your next, calmer on-call shift? Let us know in Pollinators, our Slack community!

Fred Hebert

Staff Site Reliability Engineer

Fred is a Staff Site Reliability Engineer (SRE) who has worked as a software engineer for over a decade and ended up with a healthy dislike of computers and clumsy automation. He’s a published technical author who loves distributed systems, systems engineering, and has a strong interest in resilience engineering and human factors.
