“Is observability just monitoring with another name?”
“Observability: we changed the word because developers don’t like monitoring.”
There’s been a lot of hilarious snark about this lately. Which is great, who doesn’t love A+ snark? Figured I’d take the time to answer, at least once.
Yes, in practice, the tools and practices for monitoring vs observability will overlap a whole lot … for now. But philosophically there are some subtle distinctions, and these are only going to grow over time.*
“Monitoring”, to anyone who’s been in the game a while, carries certain connotations that observability repudiates. It suggests that you first build a system, then “monitor” it for known problems. You write Nagios checks to verify that a bunch of things are within known good-ish thresholds. You build dashboards with Graphite or Ganglia to group sets of useful graphs. All of these are terrific tools for understanding the known-unknowns about your system.
But what happens when you’re experiencing a serious problem … but you don’t find out for hours, until it trickles up to you from user reports? What happens when users are complaining, but your dashboards are all green? What happens when something new happens and you don’t know where to start looking? In other words, how do you deal with unknown-unknowns?
Known-unknowns are (relatively) easy (or at least the paths are well-trodden). Unknown-unknowns are hard.
But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers … the majority of your questions trend towards the unknown-unknown.
Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can’t predict them all; you shouldn’t even try. You should focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll back (via automated canaries, gradual rollouts, feature flags, etc.).
The same goes for large apps that have been in production a while. No good engineering team should be getting a sustained barrage of pages for problems they can immediately identify. If you know how to fix something, you should fix it so it doesn’t page you. Fix the bug, auto-remediate the problem, or hell, just disable paging alerts in off-hours and make the system resilient enough to wait ’til morning. (Please!)
In the end, the result is the same: engineering teams should mostly get paged only about novel and undiagnosed issues. Which means debugging unknown-unknowns is more and more critical.
You can’t predict what information you’re going to need to know to answer a question you also couldn’t predict. So you should gather absolutely as much context as possible, all the time. Any API request that enters your system can legitimately generate 50-100 events over its lifetime, so you’ll need to sample heavily. (See our sampling docs for more best practices.)
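To make the sampling idea concrete, here is a minimal sketch in Python. It assumes the simplest possible scheme, uniform random sampling at a fixed rate, and the names (`sample_event`, `estimated_total`, the `sample_rate` field) are made up for illustration rather than taken from any particular tool. The point it shows: if every event you keep remembers the rate it was sampled at, you can still estimate the original volume later.

```python
import random


def sample_event(event, sample_rate, rng=random):
    """Keep roughly 1 in `sample_rate` events.

    Returns a copy of the event tagged with its sample rate,
    or None if the event was dropped.
    """
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    if rng.random() < 1.0 / sample_rate:
        kept = dict(event)
        # Hypothetical field name: each kept event stands in for this many.
        kept["sample_rate"] = sample_rate
        return kept
    return None


def estimated_total(kept_events):
    """Estimate how many events existed before sampling."""
    return sum(e["sample_rate"] for e in kept_events)


if __name__ == "__main__":
    rng = random.Random(0)
    raw = ({"request_id": i, "status": 200} for i in range(10_000))
    kept = [e for e in (sample_event(r, 10, rng) for r in raw) if e is not None]
    print(len(kept), "events kept; estimated original count:", estimated_total(kept))
```

A real setup would probably not sample uniformly (errors and slow requests are usually worth keeping at a higher rate than boring successes), but the bookkeeping stays the same.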
“Observability” is a term that comes from control theory. From Wikipedia:
“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.”
In ordinary English, what this means is that you have the instrumentation you need to understand what’s happening in your software. Observability focuses on the development of the application, and the rich instrumentation you need, not to poll and monitor it for thresholds or defined health checks, but to ask any arbitrary question about how the software works.
An observable system is one you can fully interrogate. Given a pile of millions of needles, one or two of which have problems, can you slice and dice and sort finely enough to quickly locate literally any given needle?
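Here is a toy illustration of what “slice and dice” means in practice, assuming your events are wide, structured records with lots of fields attached. Everything in it is hypothetical: the field names (`customer_id`, `build_id`, `duration_ms`) and the `group_by` helper are invented for the example, and a real event store would do this over millions of rows rather than a Python list. The idea is just that you pick the filter and the grouping at question time, not ahead of time.

```python
from collections import defaultdict


def group_by(events, *fields, where=lambda e: True):
    """Group the events that match `where` by any combination of fields."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            key = tuple(event.get(f) for f in fields)
            groups[key].append(event)
    return dict(groups)


# Hypothetical wide events: one record per request, many dimensions each.
events = [
    {"endpoint": "/export", "customer_id": "c1", "build_id": "b41", "duration_ms": 120, "status": 200},
    {"endpoint": "/export", "customer_id": "c2", "build_id": "b42", "duration_ms": 4300, "status": 200},
    {"endpoint": "/home", "customer_id": "c2", "build_id": "b42", "duration_ms": 95, "status": 200},
    {"endpoint": "/export", "customer_id": "c2", "build_id": "b42", "duration_ms": 5100, "status": 500},
    {"endpoint": "/export", "customer_id": "c3", "build_id": "b41", "duration_ms": 140, "status": 200},
]

# A question nobody wrote a dashboard for: which customer/build pairs are slow?
slow = group_by(
    events,
    "customer_id",
    "build_id",
    where=lambda e: e["duration_ms"] > 1000,
)

if __name__ == "__main__":
    for key, matches in slow.items():
        print(key, len(matches))
```

If the next question is “is it only `/export`?”, you change the arguments and ask again. No new check, no new dashboard.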
Monitoring is great. We’re big fans. But it’s not what we’re trying to build here.
(Historical side note: we first adopted the term because companies like Netflix, Twitter, etc. tend to use “observability” internally. Lots of our users sign up for Honeycomb because they desperately miss the kind of tooling they used to have at their $bigco job, so the association was useful.)
* Could you say that observability is a subset of monitoring? Sure, you could! But what term would you use for older-style thresholds-and-canned-dashboards? I’m stumped on that point, so I’ve been calling it “monitoring”. If you have a better term, please share!