The most successful software development movement of my lifetime is probably test-driven development, or TDD. With TDD, requirements are turned into very specific test cases, and then the code is improved until those tests pass. You know it, you probably use it, and this practice has helped our entire industry level up on code quality.
But it’s time to take a step beyond TDD in order to write better software that actually runs well in production. That step is observability-driven development.
Using TDD to Drive Better Code
TDD has some powerful things going for it. It’s a very pure way of thinking about your software and the problems it’s trying to solve. TDD abstracts away the grimy circus of production and leaves you with deterministic, repeatable bits of code that you can run hundreds of times a day, giving you the warm, fuzzy assurance that your software will continue to work today the same as it worked yesterday and the day before that. But that assurance quickly fades when you start considering whether having passing tests means that your users are actually having a good product experience. Do those passing tests mean that any errors and regressions can be crisply isolated and fixed before your code is released back into the wild?
TDD helps produce better code, but a fundamental limitation of TDD is exactly the thing that makes it most appealing. With TDD, your tests run in a hermetically sealed environment. Everything in that environment is erased and recreated from zero on each run: your data is dropped and seeded afresh, your storage and remote systems are empty mocks. There is no chaotic human element, only wave upon wave of precisely specified bots and mocks performing bounds checks, serializing and deserializing, and checking for expected results, again and again.
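As a concrete illustration, here is a minimal sketch of what that hermetically sealed world tends to look like. The BillingService and its fake payment gateway are hypothetical stand-ins, not code from any real system:

```python
# A hermetically sealed, TDD-style test: fresh fixtures on every run, a fake
# standing in for the remote payment system, and deterministic assertions.
# BillingService and FakePaymentGateway are hypothetical examples.
import unittest


class FakePaymentGateway:
    """In-memory stand-in for a remote payment API. No network, no surprises."""

    def __init__(self):
        self.charges = []

    def charge(self, customer_id, amount_cents):
        self.charges.append((customer_id, amount_cents))
        return {"status": "ok"}  # always succeeds, unlike the real thing


class BillingService:
    def __init__(self, gateway):
        self.gateway = gateway

    def invoice(self, customer_id, amount_cents):
        return self.gateway.charge(customer_id, amount_cents)["status"] == "ok"


class TestBillingService(unittest.TestCase):
    def setUp(self):
        # The environment is erased and recreated from zero before each test.
        self.gateway = FakePaymentGateway()
        self.service = BillingService(self.gateway)

    def test_invoice_charges_the_customer(self):
        self.assertTrue(self.service.invoice("cust_123", 4200))
        self.assertEqual(self.gateway.charges, [("cust_123", 4200)])


if __name__ == "__main__":
    unittest.main()
```

Notice that the fake gateway can never time out, rate-limit you, or hand back a malformed response. That is, of course, the point.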
All of the interesting deviations that your code might encounter out in the wild have been excised from that environment. We remove those deviations in the interest of making your code testable and tractable. There are no charming surprises: the unexpected is violently unwelcome. Any deviation from the spec must be dealt with — immediately.
But just because something about the environment doesn’t go according to plan and gets excluded from TDD doesn’t mean it isn’t valuable. In fact, one might reasonably argue that those deviations are the most valuable parts of your system: the most interesting and worthwhile things to surface, watch, stress, and test. Because it’s all of those things that are really going to shape how your software actually behaves when real people start interacting with it.
If this rings true to you, then you may be interested in another method of validating and gaining confidence in your code. I have been referring to that approach as “observability-driven development”, or ODD. That’s oh-dee-dee, because using real data obtained from operating your software in production to drive better code is an approach that no engineer should find odd.
Using Production to Drive Better Code
“But that’s not how it’s done! We have confidence in our tests!!!”
The tests in your code are still valuable. But there’s an additional step we need to take in order to extend our validation to encompass the reality of production. It requires shifting your mindset, developing a practice, and forming a habit.
Embrace failures. Instead of being afraid of failure and trying desperately to avoid it, try adopting a mindset of cheery fatalism. Everything will fail eventually, usually at the worst possible time, and in a way you failed to predict. The first step is admitting that you cannot possibly predict all the entertainingly disastrous ways that your precious code is going to fail in the real world. All the different scenarios you so painstakingly enumerated and wrote tests for are but grains of sand on a beach. Accepting this might take some time. Go on. I’ll wait.
Instrument as you write. The practice you then develop is one of writing instrumentation alongside your code, so you can see what that code is doing once it ships. Just as you wouldn’t accept a pull request without tests, you should never accept a pull request unless you can answer the question, “How will I know when this isn’t working?”
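To make that concrete, here is a minimal sketch of shipping instrumentation alongside a change, using the OpenTelemetry Python API. The export_report function, the render_report helper, and the app.* attribute names are hypothetical examples, not anyone’s real service; the point is that the pull request carries enough telemetry to tell, in production, whether the new path is running and whether it is failing.

```python
# Sketch: the change ships with its own instrumentation, so "how will I know
# when this isn't working?" has a concrete answer. Assumes the
# opentelemetry-api package; export_report, render_report, and the app.*
# attribute names are hypothetical.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


def render_report(report_id: str, fmt: str) -> bytes:
    # Placeholder for the real rendering logic.
    return f"report {report_id} as {fmt}".encode()


def export_report(report_id: str, fmt: str = "csv") -> bytes:
    with tracer.start_as_current_span("export_report") as span:
        # Context you will want when staring at prod right after the deploy.
        span.set_attribute("app.report_id", report_id)
        span.set_attribute("app.export_format", fmt)
        try:
            data = render_report(report_id, fmt)
            span.set_attribute("app.export_bytes", len(data))
            return data
        except Exception as exc:
            # Failures become queryable events instead of silent mysteries.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```

With any tracing backend wired up, those attributes are what you slice and filter by once the code is live.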
Close the loop. The habit you then form is one of relentlessly circling back to check on your code once it has been released into the wild. It’s a habit of checking up on any code that has just been deployed through the lens of the instrumentation you just wrote. Is it working as intended? Are you sure? Does anything else look… weird? This should be as automatic as muscle memory. Your job is not done when you have merged to master. It is not done until you have watched it run in the wild, kicked the tires, and made sure it is working as intended.
This step, when followed regularly, will catch the overwhelming majority of problems in production before users notice them and before they’re big enough to trigger an alert. It also helps you catch those transient, hard-to-find problems that never produce errors big enough to trigger a monitoring alert. Plus, it catches them at the optimum time: right after you’ve built and shipped the change, while your original intent is still warm and fresh in your mind, before it has had the chance to decay or get paged out by all the other things competing for your attention throughout the day.
You need to follow that step so often that checking if your code is working as intended via instrumentation becomes muscle memory: it becomes a natural part of what happens every time you deploy code. It feels weird to not check how it’s running. You should have a nagging itch in the back of your mind that won’t simmer down until you close the loop on that deployment by checking to see how your code is doing in prod.
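One small thing that makes closing the loop much easier: make each deploy identifiable in your telemetry. Here is a minimal sketch using the OpenTelemetry Python SDK; the service name, the GIT_SHA environment variable, and the console exporter are assumptions for illustration, and your setup will depend on the backend you actually use.

```python
# Sketch: stamp every span with the build that produced it, so closing the
# loop after a deploy is a single filter on service.version.
# Assumes the opentelemetry-sdk package; "report-service" and the GIT_SHA
# environment variable are illustrative assumptions.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

resource = Resource.create(
    {
        "service.name": "report-service",
        # service.version records which build produced this telemetry.
        "service.version": os.environ.get("GIT_SHA", "dev"),
    }
)

provider = TracerProvider(resource=resource)
# ConsoleSpanExporter stands in for whatever backend you actually ship to.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```

With every span stamped this way, “how is the code I just shipped doing?” becomes a breakdown or filter on service.version, compared against the previous build.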
TDD + Prod = ODD
That’s what I’ve been calling observability-driven development. It’s the coding equivalent of wearing a headlamp for a hike in the darkness; wherever you go, it lights up the path at your feet and two steps ahead of you.
With TDD, you rely on automated test suites to raise a hand and object if your code seems to be doing something wrong. All of the tests passed? That’s a green light! Your job is done when the branch is merged and tests have passed; that’s all the confidence you need to move on. Deploying that code is probably someone else’s job. Once it’s in prod, bugs will be surfaced by monitoring software (if you’re lucky) or unhappy users (if you’re not), and eventually make their way back to you or your team in the form of tasks or tickets.
This is a feedback loop that works, more or less, but it is long and slow and leaky. The person peering at your code in prod probably doesn’t know what they’re looking for or looking at, because they don’t have access to your original intent. By the time the bugs wend their way back to you — days, weeks, or months later — you too have probably forgotten a lot of relevant context.
With ODD, you’ve accepted that you can’t enumerate every failure, so you have far less confidence in the ability of any canned tests to surface behavioral anomalies. But you do have the greatest source of chaos and anomalies in the known universe to learn from: live users. Simply running your service with an open port to production invites chaos enough!
Your instrumentation doesn’t exist to serve a set of canned questions; it’s there to unlock your active, curious, novel exploration of the ways users are interacting with your systems: the beating heart of observability. If you make it a daily practice to engage with your code in prod, you will not only better serve your users, you will also hold your systems to a higher standard of cleanliness and understandability. You will develop keen technical instincts and write better code. You will be a better engineer.
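A sketch of what that kind of instrumentation can look like: rather than emitting a log line per canned question, emit one wide, structured event per unit of work with every piece of context you have, and leave the questions for later. The field names and the handle_request function below are hypothetical illustrations using only the Python standard library.

```python
# Sketch: one wide, structured event per request, carrying every dimension
# you might later want to slice by. Field names and handle_request are
# hypothetical; only the Python standard library is used.
import json
import logging
import os
import time

logger = logging.getLogger("wide_events")
logging.basicConfig(level=logging.INFO, format="%(message)s")


def handle_request(user_id: str, endpoint: str, plan: str, region: str) -> None:
    start = time.monotonic()
    event = {
        "user_id": user_id,  # high-cardinality on purpose
        "endpoint": endpoint,
        "plan": plan,
        "region": region,
        "build_id": os.environ.get("GIT_SHA", "dev"),
    }
    try:
        ...  # the actual work of the request would happen here
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        logger.info(json.dumps(event))
```

Because the event carries dimensions you had no specific question about when you wrote it, you can ask the questions later, when something looks weird.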
Start down the path of observability-driven development and follow your curiosity wherever it leads.
Download our Guide to Achieving Observability and learn more about observability-driven development.
This post was originally featured on TheNewStack on 9 June 2020.