Why Observability 2.0 Is Such a Gamechanger


One of the hardest parts of my job is getting people to appreciate just how different Honeycomb and observability 2.0 are from their current way of working. It’s not a small step up or a linear improvement. It’s a step change in the way you write, deploy, and operate software for your customers.

What is observability 2.0, and how is it different?

Charity wrote an amazing blog post recently about the key difference between observability 1.0 and 2.0. The TL;DR is this nugget:

  1. Observability 1.0 has three pillars and many sources of truth, scattered across disparate tools and formats. 
  2. Observability 2.0 has one source of truth, wide structured log events, from which you can derive all the other data types.

Wide structured events allow you to include all of the context in an event—like userId, enabled feature flags, or cart contents.
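To make that concrete, here is a minimal sketch of what adding that context can look like, using the OpenTelemetry Python SDK (one common way to produce these events). The function and the attribute names are hypothetical examples, not a prescribed schema:

```python
from opentelemetry import trace

def checkout(user_id: str, cart_items: list[str], flags: dict[str, bool]) -> None:
    # Attach business context to the current span so the resulting wide event
    # carries every dimension you might later want to query on.
    span = trace.get_current_span()
    span.set_attribute("app.user_id", user_id)
    span.set_attribute("app.cart.item_count", len(cart_items))
    span.set_attribute("app.feature_flag.new_checkout", flags.get("new_checkout", False))
    # ... actual checkout logic goes here ...
```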

With observability 2.0, you can derive any metric across any (number of) dimensions. You can calculate the P90 latency of a particular HTTP endpoint, for a particular mobile client version in a particular country. Or the P95, or a heatmap—whatever is more useful to you.
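As an illustration of what “deriving a metric” from raw events means, here is a small sketch in plain Python. The field names (endpoint, client_version, country, duration_ms) are made-up examples of event attributes; in practice your tool runs the equivalent query over the stored events for you:

```python
from collections import defaultdict
from statistics import quantiles

def p90_latency_by_country(events: list[dict]) -> dict[str, float]:
    # Filter to one endpoint and client version, then group by country.
    durations = defaultdict(list)
    for e in events:
        if e["endpoint"] == "/api/cart" and e["client_version"] == "3.14.0":
            durations[e["country"]].append(e["duration_ms"])
    # quantiles(..., n=10) returns nine cut points; the last one is the P90.
    return {country: quantiles(d, n=10)[-1]
            for country, d in durations.items() if len(d) >= 2}
```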

That, by itself, is profound. But it’s what that capability enables that makes such a difference.


Read Charity’s whitepaper: The Bridge From Observability 1.0 to Observability 2.0.


Debug in minutes instead of hours or days

One of my favorite quotes about Honeycomb came from an engineer at a large bank: “In our first week of using Honeycomb, we found the cause of four bugs that had been annoying us for months.”

This is something we hear over and over again.

One of the reasons why debugging is such a different experience is because in a 1.0 world, where you have multiple sources of truth, finding the cause of an issue is very much like playing Clue, the murder mystery game. You get a bunch of clues, but your job as a detective is to figure out which clues are relevant and carefully stitch together the timeline of the murder to figure out who killed our victim, where, and with what weapon.

In this murder mystery analogy, the observability 2.0 equivalent is trying to stitch together the timeline of the event, but this time, you have access to all the relevant security camera footage.

The act of debugging is to find out how and why the application got to an undesirable state. And with metrics and logs, it is up to you, the detective engineer, to take those clues and try to replicate that state.

This is why only your most senior and tenured engineers can effectively debug non-trivial problems: Along with debugging skills, you also need an (accurate) mental model of the systems you are debugging to put everything into context.

Don’t believe me? When was the last time you found the cause of a problem by realizing something was not logged? Spotting that kind of absence requires knowing exactly what should have been there.

When traces are your source of truth, things become trivial very quickly. All of the relevant context is right there, along with every step the request took to get into that state.

But wait, there’s more! In an observability 1.0 world, you can only generate hypotheses. To validate, you either “fix” what you thought was the problem and hope that it actually fixes it, or you add more logging, deploy to production, and wait until the problem happens again.

With observability 2.0 and the ability to query your data along any dimension, you can often validate a hypothesis with another query. If my hypothesis is that a particular condition leads to an error, do all instances of that particular condition lead to the error? And is that condition the only one that leads to the error?
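As a sketch of what that looks like against raw events (plain Python, with hypothetical field names), the two follow-up questions are just two more filters:

```python
def check_hypothesis(events: list[dict]) -> None:
    # Hypothesis: checkouts that include a gift card are the ones that fail.
    with_condition = [e for e in events if e.get("cart.has_gift_card")]
    errors = [e for e in events if e.get("error") == "CheckoutFailed"]

    always_fails = all(e.get("error") == "CheckoutFailed" for e in with_condition)
    only_cause = all(e.get("cart.has_gift_card") for e in errors)

    print(f"Every gift-card checkout failed: {always_fails}")
    print(f"Every failure involved a gift card: {only_cause}")
```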

Debugging with wide structured events often feels like cheating.

Deploy and run software with confidence

Confidence is not a word often associated with deploying and running software, because there’s always a decent amount of risk involved when making changes.

A staging environment is fine, but it is not the same as production and does not have production’s data. To top that off, your users are going to hit many more edge cases than you can think of.

Most engineers, after deploying a change to production, will either stare at a dashboard for a while to see if anything looks out of the ordinary, or glance at their phone every 30 seconds to check they didn’t miss an alert. In an observability 2.0 world, just as you can validate a hypothesis with a query, you can validate that your change was correct with a query.

If you tried to improve the performance of operations with a lot of subtasks, you don’t need to tease out the effect of your change from the latency of all operations combined. You can graph latency for operations with more than n subtasks to see if you decreased it, and also check the operations with fewer than n subtasks to make sure you didn’t increase it.
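A rough sketch of that check, again in plain Python over raw events (field names and the threshold are hypothetical):

```python
from statistics import quantiles

def p95_by_subtask_bucket(events: list[dict], n: int = 10) -> dict[str, float]:
    # Split latency by how many subtasks the operation ran, so a change aimed
    # at many-subtask operations can be checked on both sides of the threshold.
    buckets: dict[str, list[float]] = {"many_subtasks": [], "few_subtasks": []}
    for e in events:
        key = "many_subtasks" if e["subtask_count"] > n else "few_subtasks"
        buckets[key].append(e["duration_ms"])
    # quantiles(..., n=20) returns nineteen cut points; the last one is the P95.
    return {k: quantiles(v, n=20)[-1] for k, v in buckets.items() if len(v) >= 2}
```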

The same goes for bugs. Because you have rich context in every event, you can query for the type of events that should now behave differently. When I ran my own SaaS company, most of our PRs included the query we would run after deployment to check if the change was successful.

If a change is too risky to roll out to production straight away, you can release that change with a feature flag (or in a canary!) and compare flag/canary behavior against current behavior to double-check that your change had the intended effect.
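That comparison can again be expressed as a grouping over the events, sketched here in plain Python with a hypothetical feature-flag attribute:

```python
def error_rate_by_flag(events: list[dict],
                       flag: str = "feature_flag.new_checkout") -> dict[bool, float]:
    # Group events by whether the new code path ran, then compare error rates.
    outcomes: dict[bool, list[bool]] = {True: [], False: []}
    for e in events:
        outcomes[bool(e.get(flag))].append(e.get("error") is not None)
    return {flag_on: sum(errs) / len(errs)
            for flag_on, errs in outcomes.items() if errs}
```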

Alerts have entered the chat

But what if something goes wrong at any other time? This is where alerting comes in: how do we get notified?

With traditional alerting, you often have what is called “good minute/bad minute” alerts. Based on metrics, your tool of choice figures out whether that minute was good or bad, and if you have enough bad minutes in a row, you get alerted. Sounds great, until you get paged at 3 a.m. because two out of four requests failed and a 50% failure rate was something you thought you wanted to be alerted on.

So now you have to combine multiple metrics in your alerts to try to filter out some of the obvious false positives. And because all of those metrics are independent of each other, you probably still can’t answer a simple question, like “What percentage of requests to this endpoint were either an error or too slow?” Let alone something more complex, like “What was the P99 latency to this endpoint for our enterprise customers?”

This is where observability 2.0 and Service Level Objectives (SLOs) come in.

Having access to those wide events means we can answer exactly those questions, and we can alert on them over a much longer period of days or even weeks.

It’s impossible to design a system that never fails—and even if you could, it would be astronomically expensive to do so. You always have to trade off reliability against features. Thus, you have to accept that some percentage of your workload will go wrong; the only question is how much. That acceptable amount of failure is your error budget.

SLOs alert you based on user-centric events over longer periods of time, making sure you stay within that error budget.
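To make the mechanics concrete, here is a minimal sketch of an SLO check over raw events (plain Python; the 99.9% target, 500 ms threshold, and field names are hypothetical):

```python
def slo_status(events: list[dict],
               target: float = 0.999, latency_ms: float = 500.0) -> dict:
    # An event is "good" if it succeeded and was fast enough for the user.
    good = sum(1 for e in events
               if e.get("error") is None and e["duration_ms"] <= latency_ms)
    total = len(events)
    sli = good / total if total else 1.0
    allowed_bad = (1 - target) * total          # error budget, in events
    budget_left = allowed_bad - (total - good)  # how much budget remains
    return {"sli": sli, "target": target, "error_budget_remaining": budget_left}
```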

Imagine being confident everything is going to be ok when you deploy your change. Or being confident that everything is ok with your service without having to scan through multiple dashboards.

Control costs

Our CTO and co-founder, Charity Majors, wrote a lengthy whitepaper on the cost crisis in metrics tooling, but the TL;DR is that for metrics, every unique combination of tags, or context, is a separate “metric.” So if you want to record the latency of your public API, and you want max, average, P90, P95, and P99 for every host, endpoint, HTTP status code, etc., your cost multiplies with every unique value of every tag. The only way to manage costs is to reduce what you measure, or to reduce the context that comes with it.

But when your source of truth is individual events, the way to manage your cost is by sampling those events. Instead of sending all events, you only send a subset and have the backend adjust for it when creating graphs.
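To recap the metrics side with a quick back-of-the-envelope illustration (the cardinalities here are made up):

```python
# Hypothetical cardinalities for a metrics setup: each unique tag combination
# (and each aggregate) becomes its own time series that you store and pay for.
aggregates = 5      # max, average, P90, P95, P99
hosts = 50
endpoints = 100
status_codes = 10

time_series = aggregates * hosts * endpoints * status_codes
print(time_series)  # 250000 separate time series, before you add a single new tag
```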

Obviously, sampling discards some data, but the cost is paid in statistical precision rather than in context. You can measure everything you like and include all the context you need, and you can leverage different sampling strategies to mitigate the loss of precision as well.

Heads or tails?

There are two ways to sample events: 

  • Head sampling, which decides at the start of a trace whether to keep that trace.
  • Tail sampling, where the entire trace is analyzed and then the sampler makes a decision on whether or not to keep the trace.

You can define rules to have different sampling strategies for different traces.

Head sampling is very easy to set up and run, but you have to make the decision with very little information. Tail sampling is harder to set up and more expensive to run, but all of the information is available when the sampling decision is made. Because fast, successful events are (hopefully) the vast majority of your total event data, tail sampling in particular allows you to radically reduce the number of events you collect without meaningfully reducing your statistical precision.
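For head sampling, one common way to set this up is with the OpenTelemetry Python SDK; a minimal sketch (the 10% ratio is an arbitrary example) could look like the following. Tail sampling typically happens later in the pipeline, for example in a collector, once the whole trace is visible.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of traces, deciding up front based on the trace ID, and let
# child spans follow whatever their parent decided.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("example-service")
```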

As an analogy, let’s take security cameras in an office building, where it’s expensive to upload recorded video. In the metrics scenario, you pay a flat fee for every camera that you have, based on its resolution and frame rate.

In contrast, the observability 2.0 security cameras have sensors and only upload video when they detect a human in frame. You don’t really care about things like resolution or frame rate, or even how many cameras you have; you only pay for the footage you upload.

And if the bill is ever too much, the equivalent of head sampling is deciding not to upload certain videos if the person in frame is, for example, a staff member. The tail sampling equivalent is deciding whether to upload a clip based on whether anything suspicious happened in it.

In the first scenario, you have to be very careful where you put your cameras and tweak the resolution and frame rate to balance cost and usefulness. That’s a stark contrast to the second scenario, where you can put high-resolution cameras everywhere and focus on filtering out low-information clips.

This means that you will never be penalized for adding more context.

Conclusion

Observability 2.0, where the source of truth for your observability data is in wide structured events and you have the ability to query and graph along any number of dimensions, is not just a small improvement. It is an absolute gamechanger when it comes to developing and running your applications.

It gives everyone in your engineering organization the power to quickly debug problems, not just your most senior and tenured engineers. Additionally, it can give you real confidence that your systems are working as intended, and alert you only when user-impacting events are happening.

Most importantly, it can do that while giving you much more control over your costs, without forcing you to decide ahead of time what you need to measure. This is crucial in a world where complex, distributed systems make issues unknown-unknowns.


Erwin van der Koogh

Customer Architect

Erwin, a long-time Honeycomb customer, recently joined the company as our first Customer Architect in the APAC region. He resides in Australia with his family, dog, and two cats (one of which is convinced she is a dog), and can often be found speaking at conferences and meetups around the country.
