“Observability” is a term that comes from control theory. From Wikipedia:
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.
Observability is what we need in a world where most problems are the convergence of three, five, 10+ different things failing at once. Platforms that incorporate multiple components always produce a long tail of new questions to ask.
We built Honeycomb to answer those questions (to deal with microservices, serverless, distributed systems, polyglot persistence, containers, CI/CD) and to build an understanding of how your systems and software actually work. You can’t debug what you can’t know.
Comparing observability to some classic options (monitoring, metrics, and log aggregation):
“Monitoring” is used as an umbrella term for operational visibility. It generally means you have a set of automated checks that run against systems to ensure none of those things that signify trouble are happening (in any of the ways you predicted).
“Metrics” are streams of datapoints (usually counters and gauges) with optional metadata. These are usually rolled up over intervals, sacrificing precious detail about individual events in exchange for compactness. They are frequently used for monitoring and powering dashboards. They are severely limited in the amount of context you can append, because every distinct combination of tag values creates a new timeseries and the total grows explosively (try making a dimension or tag based on unique IP).
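To make the "explosive nature of timeseries" concrete, here is a rough back-of-the-envelope sketch. The tag names and cardinalities are made up for illustration, and the product is an upper bound: it assumes every combination of tag values actually occurs.

```python
from math import prod
from typing import Dict


def series_count(tag_cardinalities: Dict[str, int]) -> int:
    """Upper bound on distinct timeseries for one metric.

    Each unique combination of tag values is its own series, so the
    bound is the product of the number of distinct values per tag.
    """
    return prod(tag_cardinalities.values())


# Hypothetical tag cardinalities, chosen only for illustration.
low_cardinality = {"host": 50, "status_code": 5, "endpoint": 20}
with_client_ip = {**low_cardinality, "client_ip": 100_000}

print(series_count(low_cardinality))  # 5000
print(series_count(with_client_ip))   # 500000000
```

One extra high-cardinality tag multiplies the series count by the number of values it can take, which is why metrics systems tend to discourage tags like unique IP.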
“Log aggregation” is the most like Honeycomb, because “logs” tell linear stories about events. But log-based systems typically revolve around string processing (not getting any faster), regexps (not getting more maintainable), and the need to predictively index anything you might want to search on (or you’re straight back to distributed grep).
Observability requires rich instrumentation, not to poll and monitor for thresholds or defined health checks, but to ask any arbitrary question about how your software works. An observable system is one you can fully interrogate.
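As a toy illustration of "asking any arbitrary question", suppose each unit of work emits one wide event with many fields. The field names, the sample data, and the `mean_by` helper below are all hypothetical, and this is plain in-memory Python rather than any particular product's query API. The point is only that when raw events are kept, the grouping and the filter can be chosen after the fact.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical wide events: one dict per request, with arbitrary context.
events = [
    {"endpoint": "/cart", "build_id": "b-101", "status": 200, "duration_ms": 40},
    {"endpoint": "/cart", "build_id": "b-102", "status": 500, "duration_ms": 900},
    {"endpoint": "/home", "build_id": "b-101", "status": 200, "duration_ms": 25},
    {"endpoint": "/home", "build_id": "b-102", "status": 200, "duration_ms": 30},
    {"endpoint": "/cart", "build_id": "b-102", "status": 500, "duration_ms": 1100},
]


def mean_by(events, field, value_field="duration_ms", where=None):
    """Average `value_field` grouped by any `field`, with an optional filter."""
    groups = defaultdict(list)
    for event in events:
        if where is None or where(event):
            groups[event[field]].append(event[value_field])
    return {key: mean(values) for key, values in groups.items()}


# Question 1: is latency different per build?
print(mean_by(events, "build_id"))

# Question 2, thought of later: which endpoints are slow when they fail?
print(mean_by(events, "endpoint", where=lambda e: e["status"] >= 500))
```

Neither question had to be anticipated when the events were emitted, which is the contrast with pre-aggregated metrics or predefined health checks.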
Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. We can’t predict them all; we shouldn’t even try. Focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll back (via automated canaries, gradual rollouts, feature flags, etc.).
We can’t predict what information we’re going to need to know to answer a question which also couldn’t be predicted. So gather absolutely as much context as possible, including high-cardinality information like build_id. Any API request can legitimately generate 50-100 events over its lifetime, so sample heavily.
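One common way to "sample heavily" without losing the ability to count is to keep roughly 1 in N events and record N on each kept event, so totals can be estimated by weighting. The sketch below assumes that approach. The `maybe_sample` and `estimated_total` helpers and the `sample_rate` field name are illustrative, not a specific library's API.

```python
import random
from typing import Optional


def maybe_sample(event: dict, sample_rate: int, rng: random.Random) -> Optional[dict]:
    """Keep about 1 in `sample_rate` events; tag kept events with the rate."""
    if rng.randrange(sample_rate) != 0:
        return None  # dropped
    return {**event, "sample_rate": sample_rate}


def estimated_total(kept_events) -> int:
    """Each kept event stands in for `sample_rate` original events."""
    return sum(event["sample_rate"] for event in kept_events)


rng = random.Random(42)  # seeded only so runs are repeatable

# Hypothetical events carrying high-cardinality context such as build_id.
events = [
    {"request_id": f"req-{i}", "build_id": "build-1234", "duration_ms": i % 250}
    for i in range(10_000)
]

kept = [s for s in (maybe_sample(e, 20, rng) for e in events) if s is not None]
print(len(kept), "kept; estimated original count:", estimated_total(kept))
```

Because the rate travels with each event, different kinds of traffic can be sampled at different rates and still be combined in one estimate.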
Instrumentation is just as important as unit tests. Running complex systems means we can’t model the whole thing in our heads, and we shouldn’t even try: holding it all in your head is a crutch, and an impossible one anyway. Instead, focus on making every component consistently understandable.