Is it Time to Version Observability? Signs Point to Yes

In 2016, we at Honeycomb first borrowed the term “observability” from the Wikipedia entry for control systems observability, where it is a measure of how well you can understand a system’s internal states just by observing its outputs. We then spent a couple of years trying to work out how that definition might apply to software systems. Many Twitter threads, podcasts, blog posts, and lengthy laundry lists of technical criteria emerged from that work, including a whole-ass book.

In 2018, Peter Bourgon wrote a blog post proposing that “observability has three pillars: metrics, logs, and traces.” Ben Sigelman did a masterful job of unpacking why metrics, logs, and traces are just telemetry. However, lots of people latched on to the “three pillars” language:

  • Vendors, because they (coincidentally!) had metrics products, logging products, and tracing products to sell. 
  • Engineers, because it described their daily reality.

Since then, the industry has been stuck in a weird space, where the language used to describe the problems and solutions has evolved, but the solutions themselves are largely the same ones as five or ten years ago. They’ve improved, of course—massively improved—but structurally, they’re variations on the same old pre-aggregated metrics.

This is what semantic versioning was made for

Look, I’m not here to be the language police. I stopped correcting people on Twitter back in 2019. We all do observability! One big happy family. 👍

I am here to help engineers think clearly and crisply about the problems in front of them. So here we go. 

Let’s call the metrics, logs, and traces crowd (the “three pillars” generation of tooling) observability 1.0.

Tools like Honeycomb, which are built on arbitrarily wide structured log events as a single source of truth: that’s observability 2.0.

This is literally the problem that semantic versioning was designed to solve, by the way. Major version bumps are reserved for backwards-incompatible breaking changes, and that’s what this is. You cannot store your data across multiple pillars and in a single source of truth at the same time.

Small technical changes can unlock waves of powerful sociotechnical transformation

There are a lot of ramifications and consequences that flow from this one small change in how your data gets stored. I don’t have the time or space to go into all of them here, but I will do a quick overview of the most important ones.

The historical analogue that keeps coming to mind for me is virtualization. VMs are old technology—they’ve been around since the 70s. But it wasn’t until the late 90s that VMware productized it, unlocking wave after wave of change, from cloud computing and SaaS to the very DevOps movement itself.

I believe the shift to observability 2.0 holds similarly massive potential for change, based on what I see happening today with teams who have already made the leap. Why? In a word, precision. O11y 1.0 can only ever give you aggregates and random exemplars. O11y 2.0, on the other hand, can tell you precisely what happened when you flipped a flag, deployed to a canary, or made any other change in production.

Will these waves of sociotechnical transformation ever be realized? Who knows. The changes that get unlocked will depend to some extent on us (Honeycomb), and to an even greater extent, on engineers like you. 

1.0 vs 2.0: How does the data get stored?

1.0 

O11y 1.0 has many sources of truth, in many different formats. Typically, you end up storing your data across metrics, logs, traces, APM, RUM, profiling, and possibly other tools as well. Some folks even find themselves falling back to BI (business intelligence) tools like Tableau in a pinch to understand what’s happening in their systems.

Each of these tools is siloed, with no connective tissue, or at best a few predefined links that tie, for example, a specific metric to a specific log line.

Aggregation is done at write time, so you have to decide upfront which data points to collect and which questions you want to be able to ask. You may find yourself eyeballing graph shapes and assuming they must be the same data, or copy-pasting IDs around from logging to tracing tools and back.
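
To make the write-time problem concrete, here’s a minimal sketch (not any particular vendor’s API, and the field names are made up) of what aggregating at write time does to your data: the questions you can ask later are frozen by the labels you pick now, and everything else about the request is simply gone.

```python
# A toy sketch of write-time aggregation (not any real metrics library).
# The questions you can ask later are fixed by the keys you choose here.
from collections import Counter

request_counts = Counter()   # keyed by (endpoint, status_class) only
latency_sums = Counter()

def record_request(endpoint: str, status: int, duration_ms: float, **context):
    """Aggregate at write time. `context` (user_id, build_id, cart_size, ...)
    is dropped on the floor -- it would blow up cardinality as metric tags."""
    key = (endpoint, f"{status // 100}xx")
    request_counts[key] += 1
    latency_sums[key] += duration_ms

record_request("/checkout", 500, 1240.0, user_id="u-8675309", build_id="abc123")

# Later you can ask "how many 5xx on /checkout?" -- but never
# "which users or builds were affected?", because that data no longer exists.
print(request_counts[("/checkout", "5xx")])
```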

2.0 

Data gets stored in arbitrarily wide structured log events (often called “canonical logs,” or what AWS internally refers to as “service logs”), often with trace and span IDs appended. You can visualize the events over time as a trace, or slice and dice your data to zoom in on individual events or zoom out to a bird’s-eye view. You can interact with your data by grouping, breaking down, and so on.
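
For illustration, a single wide event might look something like the sketch below. The field names are invented, but the shape is the point: one record per unit of work, with request, infrastructure, and business context riding along next to the trace and span IDs.

```python
# A sketch of a single "canonical log" / wide event: one record per request,
# with request, infra, and business context plus trace/span IDs all attached.
# Field names are illustrative, not a required schema.
import json, time

event = {
    "timestamp": time.time(),
    "service": "checkout",
    "endpoint": "/checkout",
    "status": 500,
    "duration_ms": 1240.0,
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "user_id": "u-8675309",          # high cardinality? fine.
    "build_id": "abc123",
    "feature_flags": ["new_cart_flow"],
    "cart_size": 7,
    "db_query_count": 31,
    "error": "payment gateway timeout",
}

print(json.dumps(event))  # one structured line per unit of work
```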

Aggregation is done at read time, preserving the raw events for ad hoc querying. Hopefully, you derive your SLO data from the same data you query! Think of it as BI for systems, application, and business data, all in one place. You can derive metrics, or logs, or traces, but it’s all the same data.
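
And here’s a rough sketch of what read-time aggregation means in practice: the same raw events can be rolled up into a latency summary, an error rate for your SLO, or a breakdown by any field you happened to store, because nothing was aggregated away up front. (In real tooling this is a query against a columnar store, not hand-rolled Python.)

```python
# Read-time aggregation over raw wide events (a toy in-memory stand-in for a
# columnar store). Latency metrics, error rates, and SLO data all derive
# from the same events you can later drill into one by one.
from collections import defaultdict

def p99(values):
    vals = sorted(values)
    return vals[int(0.99 * (len(vals) - 1))] if vals else None

def summarize(events, group_by="endpoint"):
    latencies, errors, totals = defaultdict(list), defaultdict(int), defaultdict(int)
    for e in events:
        key = e.get(group_by)
        latencies[key].append(e["duration_ms"])
        totals[key] += 1
        errors[key] += e["status"] >= 500
    return {
        key: {
            "count": totals[key],
            "p99_ms": p99(latencies[key]),
            "error_rate": errors[key] / totals[key],  # feeds an availability SLO
        }
        for key in totals
    }

# Want a different question? Just change group_by to "build_id", "user_id", ...
# print(summarize(events, group_by="build_id"))
```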

1.0 vs 2.0: On metrics vs logs

1.0 

The workhorse of o11y 1.0 is metrics. RUM tools are built on metrics to understand browser user sessions. APM tools are built using metrics to understand application performance. Long ago, the decision was made to use metrics as the source of truth for telemetry because they are cheap and fast, and hardware used to be incredibly expensive.

The more complex our systems get, the worse this tradeoff becomes. Metrics are a terrible building block for understanding rich data, because you have to discard all that valuable context at write time, and they don’t support high-cardinality data. All you can do to enrich the data is attach tags.

Metrics are a great tool for cheaply summarizing vast quantities of data. They are not equipped to help you introspect or understand complex systems. You will go broke and go mad if you try.

2.0

The building block of o11y 2.0 is wide, structured log events. Logs are infinitely more powerful, useful, and cost-effective than metrics because they preserve context and relationships between data, and data is made valuable by context. Logs also allow you to capture high-cardinality data and data relationships/structures, which give you the ability to compute outliers and identify related events.

1.0 vs 2.0: Who uses it, and how?

1.0  

Observability 1.0 is predominantly about how you operate your code. It centers around errors, incidents, crashes, bugs, user reports, and problems. MTTR, MTTD, and reliability are top concerns.

O11y 1.0 is typically consumed using static dashboards—lots and lots of static dashboards. “Single pane of glass” is often mentioned as a holy grail. It’s easy to find something once you know what you’re looking for, but you need to know to look for it before you can find it.

2.0 

If o11y 1.0 is about how you operate your code, o11y 2.0 is about how you develop your code. O11y 2.0 underpins the entire software development lifecycle, enabling engineers to close feedback loops end to end so they get fast feedback on the changes they make while the work is still fresh in their heads. This is the foundation of your team’s ability to move swiftly, with confidence. It isn’t just about understanding bugs and outages; it’s about proactively understanding your software and how your users are experiencing it.

Thus, o11y 2.0 has a much more exploratory, open-ended interface. Any dashboards should be dynamic, allowing you to drill down into a question or follow a trail of breadcrumbs as part of the debugging/understanding process. The canonical question of o11y 2.0 is, “Here’s a thing I care about… why do I care about it? What are all the ways it’s different from all the other things I don’t care about?”

When it comes to understanding your software, it’s often harder to identify the question than the answer. Once you know what the question is, you probably know the answer too. With o11y 1.0, it’s very easy to find something once you know what you’re looking for. With o11y 2.0, that constraint is removed.

1.0 vs 2.0: How do you interact with production?

1.0

You deploy your code and wait to get paged. Your job as a developer is done when you commit your code and the tests pass.

2.0

You practice observability-driven development: as you write your code, you instrument it. You deploy to production, then inspect your code through the lens of the instrumentation you just wrote. Is it behaving the way you expected it to? Does anything else look weird?
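
As a sketch of what that can look like with OpenTelemetry’s Python API (the attribute names and business logic here are invented for illustration):

```python
# A minimal observability-driven development sketch using OpenTelemetry.
# Attribute names are illustrative; attach whatever context you'll want to
# query by once this code is running in production.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def apply_discount(cart_items: list[dict], user_id: str) -> float:
    """Business logic instrumented as it's written, not bolted on later."""
    with tracer.start_as_current_span("apply_discount") as span:
        span.set_attribute("user.id", user_id)                # high cardinality, on purpose
        span.set_attribute("cart.item_count", len(cart_items))
        discount = sum(i["price"] for i in cart_items if i.get("on_sale")) * 0.10
        span.set_attribute("discount.amount", discount)
        return discount
```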

Your job as a developer isn’t done until you know it’s working in production. Deploying to production is the beginning of gaining confidence in your code, not the denouement.

1.0 vs 2.0: How do you debug?

1.0 

You flip from dashboard to dashboard, pattern-matching and looking for similar shapes with your eyeballs.

You lean heavily on intuition, educated guesses, past experience, and a mental model of the system. This means that the best debuggers are always the engineers who have been there the longest and seen the most.

Your debugging sessions are search-first: you start by searching for something you know should exist.

2.0 

You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.
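
Here’s a toy illustration of the “what do all the mysterious events have in common?” move. In a real tool this is a query or an automated comparison, not hand-rolled code, but the shape of the question is the same: compare how attribute values are distributed in the bad population versus the baseline.

```python
# A toy version of "what do the weird events have in common?": compare the
# frequency of each attribute value among failing events vs. everything else.
from collections import Counter

def suspicious_attributes(events, is_bad, fields):
    bad = [e for e in events if is_bad(e)]
    rest = [e for e in events if not is_bad(e)]
    findings = []
    for field in fields:
        bad_counts = Counter(e.get(field) for e in bad)
        rest_counts = Counter(e.get(field) for e in rest)
        for value, count in bad_counts.most_common(3):
            bad_share = count / len(bad)
            rest_share = rest_counts[value] / max(len(rest), 1)
            if bad_share > 3 * rest_share:      # crude "this stands out" threshold
                findings.append((field, value, round(bad_share, 2), round(rest_share, 2)))
    return findings

# e.g. suspicious_attributes(events, lambda e: e["status"] >= 500,
#                            ["build_id", "region", "endpoint", "user_id"])
# might surface: ("build_id", "abc123", 0.97, 0.12) -- the canary build.
```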

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.

1.0 vs 2.0: The cost model

1.0 

You pay to store your data again and again, multiplied across all the different formats and tool types you’re paying to store it in. Costs go up as a multiple of your traffic growth. I wrote a whole piece earlier this year on the cost crisis in observability tooling, so I won’t go into it in depth here.

As your costs increase, the value you get out of your tools actually decreases.

If you use metrics-based products, your costs go up with cardinality. “Custom metrics” is a euphemism for cardinality: “100 free custom metrics” really means 100 free unique combinations of metric names and tag values.
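
A quick back-of-the-envelope sketch, with made-up numbers, shows why this bites: the number of billable time series is roughly the product of each tag’s distinct values, so a single high-cardinality tag swamps everything else.

```python
# Rough arithmetic on metric cardinality: each unique combination of tag
# values is its own time series, and billing generally follows time series.
tag_cardinalities = {
    "endpoint": 200,
    "status_class": 5,
    "region": 20,
}
series = 1
for n in tag_cardinalities.values():
    series *= n
print(series)               # 20,000 time series for ONE metric name

# Now add a single user_id tag with 50,000 active users:
print(series * 50_000)      # 1,000,000,000 -- which is why vendors forbid it
```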


Read the whitepaper: The Cost Crisis in Metrics Tooling.


2.0 

You pay to store your data once. As your costs go up, the value you get out goes up too. You have powerful, surgical options for controlling costs via head-based or tail-based dynamic sampling.
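
As a sketch of what “surgical” can mean here (the policy below is illustrative, not a description of any particular product’s implementation): keep everything interesting, sample the boring high-volume traffic, and record the sample rate on the event so the math can be re-weighted at query time.

```python
# An illustrative head/tail-style sampling policy for wide events: keep all
# errors and slow requests, keep 1-in-N healthy requests, and stash the
# sample rate on the event so counts can be re-weighted at query time.
import random

def sample_decision(event, healthy_rate=20):
    if event["status"] >= 500 or event["duration_ms"] > 2000:
        return True, 1                      # always keep, weight 1
    if random.random() < 1 / healthy_rate:
        return True, healthy_rate           # kept event represents N events
    return False, None

def maybe_emit(event, emit):
    keep, sample_rate = sample_decision(event)
    if keep:
        event["sample_rate"] = sample_rate
        emit(event)

# At read time, SUM(sample_rate) rather than COUNT(*) restores true totals.
```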

You can have infinite cardinality. You are encouraged to pack hundreds or thousands of dimensions into each event, and any or all of those dimensions can be any data type you want. This luxurious approach to cardinality and data is one of the least well-understood aspects of the switch from o11y 1.0 to 2.0.

Many observability engineering teams have spent their entire careers massaging cardinality to control costs. What if you just… didn’t have to do that? What would you do with your lives, if you could just store and query on all the crazy strings you want? 

Metrics are a bridge to our past

Why are observability 1.0 tools so unbelievably, eye-bleedingly expensive? As anyone who works with data can tell you, this is always what happens when you use the wrong tool for the job. Once again, metrics are a great tool for summarizing vast quantities of data. When it comes to understanding complex systems, they flail.

I wrote a whole whitepaper earlier this year that did a deep dive into exactly why tools built on top of metrics are so unavoidably costly. If you want the gnarly details, download that.

The TL;DR is this: tools built on metrics—whether RUM, APM, dashboards, etc—are a bridge to our past. If there’s one thing I’m certain of, it’s that tools built on top of wide, structured log events are the bridge to our future.

Wide, structured log events are the bridge to our future

Five years from now, I predict that the center of gravity will have swung dramatically; all modern engineering teams will be powering their telemetry off of tools backed by wide, structured log events, not metrics. 

It’s getting harder and harder and harder to try and wring relevant insights out of metrics-based observability tools. The end of the ZIRP era is bringing unprecedented cost pressure to bear, and it’s simply a matter of time.

The future belongs to tools built on wide, structured log events—a single source of truth that you can trace over time, or zoom in, zoom out, derive SLOs from, etc.

It’s the only way to understand our systems in all their skyrocketing complexity. This constant dance with cost vs. cardinality consumes entire teams’ worth of engineers and adds zero value. Worse, it adds negative value.

And here’s the weirdest part. The main thing holding most teams back psychologically from embracing o11y 2.0 seems to be the entrenched difficulties they have grappling with o11y 1.0, and their sense that they can’t adopt 2.0 until they get a handle on 1.0. Which gets things exactly backwards. Because observability 2.0 is so much easier, simpler, and more cost effective than 1.0.

Observability 1.0 is the hard way

We’ve been doing it so long that we are blind to just how hard it is. But trying to teach teams of engineers to wrangle metrics, to squeeze the questions they want to ask into multiple abstract formats scattered across many different tools, with no visibility into what they’re doing until it comes out eventually in the form of a giant bill… It’s hard.

Observability 2.0 is so much simpler: 

  • You want data, you just toss it in. Format? Don’t care. Cardinality? Don’t care. 
  • You want to ask a question, you just ask it. Format? Don’t care.

Teams are beating themselves up trying to master an archaic, unmasterable set of technical tradeoffs based on data types from the 80s. It’s an unwinnable war. We can’t understand today’s complex systems without context-rich, explorable data.

This is happening now—no matter what we call it

Technical language is powerful. It affects how we understand the world, how we think about technical tools, how companies position their offerings vs each other. Engineers tend to be pretty cynical about technical marketing terms, and for good reason. Because technical language is so powerful, lots of companies spend lots of time and money trying to use language to differentiate products and give them an edge in the market.

But there’s a big difference between what I think of as language-driven development vs development-driven language. Language-driven development is when you have marketing and exec teams sitting around trying to come up with branding terms for their secret sauce, or trying to create a new category that their offering can dominate. We’ve all seen companies trying to do this, and it’s always a little bit annoying (“Stop trying to make fetch happen!”).

Development-driven language is when the wave of technical change is already well underway, and we in the field—engineers, marketing teams, and execs—are all grasping for the right evocative words and phrases to describe what’s happening.

Do not dismiss the difficulty of this, nor the importance of doing it well. Naming things is hard. And how we name and describe things has enduring ripple effects in the real world, when it comes to the decisions people make about what to build, what to buy, and where to invest.

The observability 2.0 wave is here, and it needs you

Whether or not the semantic versioning terminology takes off, this shift is happening. It has been underway for years, and has recently begun to pick up serious momentum.

I’ve heard from people who say that even 10 years ago, they were already importing their structured logging data into everything from Tableau to Excel spreadsheets in order to do this kind of analysis and outlier detection. More recently, I hear from people who talk excitedly about reducing and widening their logs and shipping them into Snowflake or ClickHouse, or post-processing their logs into wide log events using something like Cribl.
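
To make “reducing and widening” concrete, here’s a rough sketch (keys and fields invented) of collapsing the many log lines a single request emits into one wide event per request ID.

```python
# A sketch of post-processing ordinary log lines into one wide event per
# request: group by request_id, then fold each line's fields into a single
# record instead of storing dozens of skinny lines.
from collections import defaultdict

def widen(log_lines):
    per_request = defaultdict(dict)
    for line in log_lines:                    # each line is already-parsed JSON
        rid = line.pop("request_id")
        per_request[rid].update(line)         # last write wins for repeated keys
        per_request[rid]["log_line_count"] = per_request[rid].get("log_line_count", 0) + 1
    return list(per_request.values())

lines = [
    {"request_id": "r1", "endpoint": "/checkout", "user_id": "u-8675309"},
    {"request_id": "r1", "db_query_count": 31, "cache_hit": False},
    {"request_id": "r1", "status": 500, "duration_ms": 1240.0},
]
print(widen(lines))   # -> one wide event for r1 instead of three narrow lines
```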

My hope is that carefully unpacking the generational differences between the “three pillars” world, powered predominantly by metrics, and the more recent wave of tooling, with its unified source of truth and rich contextual data, will help technical practitioners and decision-makers understand where to best invest their scarce time and treasure. I care 0% about whether “observability 2.0” takes off as an industry term. I care 100% about helping people sharpen their thinking and their criteria for tools.

My other hope is that people will stop building new observability startups on top of metrics. Y’all, Datadog and Prometheus are the last, best metrics-backed tools that will ever be built. You can’t catch up to them or beat them at that game; no one can.

Do something different. Build for the next generation of software problems, not the last generation.

And if you’re out there building something different, with a unified source of truth, high cardinality, and high dimensionality… please share your work. Write a blog post, write a Twitter thread. Give back to OpenTelemetry. There’s a ton of innovation and interesting work being done in the industry right now, and the real winners have yet to emerge. Your contribution can help.

<3 charity

P.S. Here’s a great piece written by Ivan Burmistrov on his experience using observability 2.0 type tooling at Facebook—namely Scuba, which was the original inspiration for Honeycomb. It’s a terrific piece and you should read it.


Charity Majors

CTO

Charity is an ops engineer and accidental startup founder at honeycomb.io. Before this she worked at Parse, Facebook, and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of O’Reilly’s Database Reliability Engineering, and loves free speech, free software, and single malt scotch.
