What Is Observability Engineering?
Table of contents
- What is observability?
- How to determine if a system is observable?
- Observability vs. monitoring
- Why is observability engineering important?
- What does an observability engineer do?
- What is observability-driven development?
- Observability engineering best practices
- Observability engineering at Honeycomb
- Yes, observability enables high-performance engineering
Although it may seem like something out of fantasy, it is possible to find out what you don’t know you don’t know through the magical powers of observability engineering.
Most organizations have monitoring in place, which is fine as far as it goes, but completely useless when it comes to finding out why a random outage has taken place. Observability engineering offers a fast way into the why by letting teams ask questions of the data, visualize anomalies, and pursue possibilities—especially if they’re far-fetched, random, and never-seen-before. In fact, it is exactly that concept of ‘one-off novelty outages’ that observability engineering was created to address.
Here’s what you need to know.
What is observability?
Observability is not a term unique to software development. It was first coined in the 1960s by engineer and mathematician Rudolf Kálmán during his research into control theory. The idea of observability took hold in the software space during the late 2010s, when complicated cloud-native systems needed a higher level of diagnostics than ever before.
Observability is how modern software development teams discover and understand a problem within a service. It provides teams with a way to ask fact-finding questions of their data, pursue leads, and generally explore anything and everything that’s occurring during an incident. An observable system allows engineers to look past pre-defined monitoring alerts and dive deep into all areas of the system to pursue answers they’d never thought of before. Arbitrarily-wide structured events are at the heart of observability (in our opinion) because each event can contain hundreds of fields that can be dissected individually or correlated together to surface anomalous patterns.
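To make that concrete, here is a minimal sketch of what a single arbitrarily-wide structured event might look like when emitted as one JSON log line. The field names are illustrative assumptions, not a required schema, and real events often carry hundreds of such fields.

```python
import json
import time

# A hypothetical wide structured event: one record per unit of work, carrying
# request, user, infrastructure, and business context together in flat fields.
event = {
    "timestamp": time.time(),
    "service.name": "checkout",            # illustrative names, not a required schema
    "trace.trace_id": "abc123",
    "http.method": "POST",
    "http.route": "/cart/checkout",
    "http.status_code": 500,
    "duration_ms": 2317,
    "user.id": "user-48213",               # high-cardinality fields are welcome
    "user.plan": "enterprise",
    "cart.item_count": 14,
    "db.query_count": 37,
    "feature_flag.new_payment_flow": True,
    "error.message": "payment gateway timeout",
    # ...in practice, hundreds more fields can be appended as needed
}

# Emit the event as one structured log line that a query engine can slice by any field.
print(json.dumps(event))
```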
What observability isn’t, however, is ‘three pillars.’ If you’ve never heard of the three pillars before, it’s the concept that observability telemetry is divided into three separate buckets: metrics, logs, and traces. The keywords here are divided and separate. We won’t bore you too much on why that’s unequivocally wrong (we’ve written about it enough), but the CliffsNotes version is that observability should provide you with a complete picture of your system—not only parts of it that you must manually stitch together. So why separate them to begin with? The three pillars also can’t contain all the data required for true observability: business metrics, customer feedback, CI/CD pipeline performance, and many other steps in the SDLC can provide valuable clues and context along the journey.
How to determine if a system is observable?
Ask the following questions to determine if a system is truly observable:
- Is it possible to ask an unlimited number of questions about how your system works without running into roadblocks?
- Can the team get a view into a single user’s experience?
- Is it possible to quickly see a cross-section of system data in any configuration?
- Once a problem is identified, is it possible to find similar experiences across the system?
- Can the team quickly find the users generating the most load, hidden timeouts and faults, or the one random user complaining about timeouts?
- Can these questions be asked even if they’ve never been imagined before?
- Once these questions have been asked, can they be iterated on, leading the team down new rabbit holes of data and exploration?
- If you have to query your current system, are you able to include and group an endless number of dimensions, regardless of their importance? And do the query responses come back quickly?
- And finally, do debugging journeys normally end with surprising—or even shocking—results?
An answer of “yes” to all of the above means a system is observable, and also illustrates the observability journey.
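For a toy illustration of what the “cross-section in any configuration” questions above mean in practice, the sketch below groups a handful of wide events by an arbitrary combination of fields and summarizes each group. The field names and data are invented for the example; a real observability tool does this over billions of events.

```python
from collections import defaultdict

# A few hypothetical wide events (in reality these come from your telemetry store).
events = [
    {"route": "/checkout", "region": "eu-west", "user.plan": "free",       "status": 200, "duration_ms": 80},
    {"route": "/checkout", "region": "eu-west", "user.plan": "enterprise", "status": 500, "duration_ms": 2300},
    {"route": "/checkout", "region": "us-east", "user.plan": "enterprise", "status": 200, "duration_ms": 95},
    {"route": "/search",   "region": "eu-west", "user.plan": "free",       "status": 200, "duration_ms": 40},
]

def slice_by(events, *dimensions):
    """Group events by any combination of fields and summarize each group."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e.get(d) for d in dimensions)].append(e)
    for key, group in sorted(groups.items(), key=lambda kv: str(kv[0])):
        errors = sum(1 for e in group if e["status"] >= 500)
        avg_ms = sum(e["duration_ms"] for e in group) / len(group)
        print(dict(zip(dimensions, key)), f"count={len(group)} errors={errors} avg_ms={avg_ms:.0f}")

# Ask a question nobody planned for: how does each plan behave per region?
slice_by(events, "user.plan", "region")
```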
Observability engineering requires tools, of course, to allow for data exploration. However, it also requires a curiosity culture where the question “why?” is prevalent. In an ideal world, observability is baked into a software service from the beginning, and organizational enthusiasm for problem-solving is also baked in.
Observability vs. monitoring
Observability and monitoring are often mentioned in the same breath, but they are in fact distinct entities and can be boiled down to “known” vs. “unknown.”
Monitoring was originally built for predictable monolithic architectures, so it’s firmly planted in the “known” realm, where engineers set up alerts based on their system knowledge of what might fail. The alerts, in turn, tell engineers where the problem is, but not why it’s happening. Monitoring’s other serious limitation is that it can’t handle a “never seen that before” situation, simply because monitoring is set up to only alert on known problems that have been predefined as “important” by APM vendors for decades.
Observability, on the other hand, was created for modern distributed systems. It isn’t about alerting once something is already broken and impacting users; it’s the ability to examine the entire system and user experience in real time, surface anomalies, and answer why something is happening before users feel the impact. Observability engineers approach fact-finding without any preconceived notions and let the data tell them where to look and what to ask next.
Why is observability engineering important?
At a time when consumer tolerance for outages has all but disappeared, the importance of observability engineering can’t be overstated. Teams using an observability strategy can find and fix incidents more quickly than those relying solely on monitoring. Observability engineering is also important because it offers tools and culture changes that support modern software development, including:
- A more robust incident response process.
- Increased transparency.
- A broader understanding of how users interact with the product.
- The opportunity to build observability into software as it’s being created, rather than after the fact.
- Improved workflows and feedback loops.
- True visibility into production environments, which creates the opportunity for tweaks and improvements.
- Better understanding of business needs/requirements.
- The ability to create a culture of problem-solvers.
Benefits of observability engineering
Teams building complex modern software can appreciate the full benefits of observability engineering, starting with the speed of incident resolution. The faster a problem is found, the faster it can be fixed, saving organizations time, money, and concerns about reputation damage. The money saved is potentially substantial: software downtime can cost $9,000 per minute, according to research from the Ponemon Institute.
Observability engineering has other benefits as well. Without the need to spend endless hours sorting through logs to resolve issues, teams can work on higher-value projects like developing new features or increasing reliability. Many organizations suffer from a “too much information” problem; observability engineering manages all that data, extracting the relevant details that help resolve an outage, and corrals data from disparate systems so teams aren’t overwhelmed by the volume of information they have to process. And there’s a side benefit to surfacing all that data: teams can be more transparent about all aspects of the product, and transparency is key to efficient software development.
And finally, when developers roll out distributed tracing as part of an observability engineering effort, it’s easy for them to visualize the concrete benefits of their coding. They can uniquely leverage rich data such as individual user requests as traces flow through specific infrastructure components. That can lead to better, more efficient application development.
Challenges of observability engineering
Observability is a shift in culture, process, and tools—and that comes with understandable apprehension. When monitoring is all you’ve known for years and it’s worked decently enough, it can be hard to justify a change—not only to business stakeholders, but to engineers who are used to a certain way of doing things. But observability champions who successfully bring observability into their organizations often rise through the ranks quickly as their teams become more efficient and drive greater impact.
Another barrier to observability can be instrumentation. In our Ask Miss O11y series, we’ve received questions from engineers trying to get their manager on board with spending the time to add instrumentation. This can feel daunting, so find a vendor with comprehensive documentation and thought leadership around best practices.
That’s all a long way of saying that the biggest challenges of observability engineering revolve around getting the business side to buy in for a tool purchase (i.e., making the case for improved ROI), as well as nurturing a culture of observability.
The business case should be a straightforward one: observability engineering will save the company from lengthy, costly outages and unhappy users.
What does an observability engineer do?
An observability engineer by any other name could be called an SRE, platform engineer, system architect, any type of DevOps engineer, a tooling admin, or… ?
The term “observability engineer” is a relatively new moniker for team members charged with building data pipelines, monitoring, working with time series data, and maybe even distributed tracing and security. While an observability engineer doesn’t necessarily need highly specialized training, the role does require someone who is comfortable with all that data, is curious, likes to solve problems, and has strong communication skills.
The ideal observability engineer would be the organization’s observability champion, choosing platforms and tooling and cross-training key members of the team. They would have a strong grasp of the business needs, customer experience, and product goals. This role would stay up to date with the latest trends in observability, and could help create and lead an incident response team through the observability wilderness.
What is observability-driven development?
We’ll take your test-driven development and go one further: observability-driven development (ODD) is a superpower your team can use locally to identify potential issues before they’re actually out in the wild.
We’re not the only ones excited about this: Gartner listed observability-driven development as on the rise in its 2022 Hype Cycle for Emerging Technologies.
Just as test-driven development shifted testing left (and has been a tremendously popular and successful strategy for DevOps teams everywhere), it is possible to shift observability left so that more of it is in the hands of a developer while the code is being written. This is what we like to call “tracing during development” and it has a number of key advantages:
- Developers are the logical folks to tackle this during coding, rather than having to go back later.
- ODD means less context switching, and also eliminates the need to attach debuggers and the tediousness of hitting API calls one by one.
- The process of observability-driven development is going to result in better and cleaner code. Devs can see if the code is “well-behaved” before it gets into production.
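As a hedged sketch of what “tracing during development” can look like, the snippet below uses the OpenTelemetry Python SDK to print spans to the console while code is being written. The function, tracer, and attribute names are illustrative; in a real setup you would swap the console exporter for your backend’s OTLP exporter.

```python
# pip install opentelemetry-sdk  (standard OpenTelemetry Python SDK packages)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# During local development, print every span to the console instead of attaching a debugger.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-dev")  # illustrative tracer name

def checkout(cart_id: str, item_count: int) -> None:
    # Wrap the unit of work in a span and annotate it with the context you care about.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("cart.id", cart_id)             # hypothetical attributes
        span.set_attribute("cart.item_count", item_count)
        with tracer.start_as_current_span("charge_payment"):
            pass  # the real payment call would go here

checkout("cart-123", 3)
```

Running this locally shows exactly which spans a change produces and how long they take, before the code ever reaches production.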
That said, we know it might be hard to get developers excited about a big change like ODD. Our best advice: start slowly, make a big deal of the wins, and add instrumentation as incidents happen.
Observability engineering best practices
There are a number of best practices teams can employ to get the most out of observability engineering, including:
- Choose the right tooling: Teams need to be able to see how users experience their code, in real time, even in complicated environments. Tools need to be able to analyze high-cardinality and high-dimensionality data, and do it quickly. Observability platforms must never aggregate or discard telemetry, and should be able to work directly with databases.
- Understand what observability may cost: Expect to spend roughly 20% to 30% of what you invest in infrastructure. Look for tools that have predictable pricing to avoid surprise overage bills.
- But don’t overdo it: Auto-instrument what you can at first to get insights quickly, but take the time to add custom tags through manual instrumentation to truly leverage the power of high-cardinality data (see the sketch after this list).
- Speed up the feedback loop: When trying to resolve an incident, observability means speed—and speed is everything. Ensure the team is structured to get the most out of fast feedback loops.
- Look at end-to-end requests: Remember that context is vital and so is ordering.
- Raw data is king (and you can’t have too much of it!): Don’t settle for less. Any and all types of data are welcome because more data means more context and thus faster problem solving.
- Structured data and wide events are also king: Make sure logs and events are structured so you can maximize the power of your query engine.
- For every hop (or service or query), generate a single event: This is industry best practice.
- Learn to love context: When you don’t even know what you’re looking for, context is what can help guide the process. Everyone on the team should be encouraged to always look for the extra details.
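Here is the sketch referenced in the manual-instrumentation practice above: a hedged example of enriching the single per-request span (one wide event for this hop) with custom, high-cardinality attributes. The attribute names and the request/user objects are assumptions for illustration; the OpenTelemetry calls are the standard Python API.

```python
from opentelemetry import trace

def handle_request(request, current_user):
    # Assume auto-instrumentation already created one span (event) for this hop;
    # manual instrumentation layers the custom, high-cardinality context on top.
    span = trace.get_current_span()
    span.set_attribute("user.id", current_user.id)           # hypothetical fields
    span.set_attribute("user.plan", current_user.plan)
    span.set_attribute("cart.item_count", len(request.items))
    span.set_attribute("build.id", "2024-05-01.3")           # illustrative deploy marker
    # The rest of the handler runs as usual; no additional events are emitted for this hop.
```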
And perhaps most importantly, observability can’t happen without a robust, supportive, and inherently curious culture in place. We know a culture play can be challenging in some organizations, but observability needs to be a team effort in order to get the most out of it. It starts with developers: they need to instrument their code, and they may need to be convinced of the value of that effort. It’s empowering for devs to not only write the code but literally “own it” in production (though we acknowledge this can be a big change in some organizations). But service ownership is truly the most straightforward way to build and sustain a culture of observability.
Also, don’t forget that observability is all about asking questions that haven’t been asked before, so keep reminding teams they’re creating a process for future incidents and their future selves. This is easier than it sounds, because o11y tools hang on to the query history so teams can learn from each other when familiar situations arise.
Observability engineering at Honeycomb
Honeycomb’s approach is fundamentally different from other tools that claim observability, and is built to help teams answer novel questions about their ever-evolving cloud applications.
Other tools silo your data across disjointed pillars (logs, metrics, and traces), are too slow, and constrain teams to only answering predetermined questions. Honeycomb unifies all data sources in a single type, returning queries in seconds—not minutes—and revealing critical issues that logs and metrics alone can’t see. Using the power of distributed tracing and a query engine designed for highly-contextual telemetry data, Honeycomb reveals both why a problem is happening and who specifically is impacted.
Every interface is interactive, enabling any engineer—no matter how tenured—to ask questions on the fly, drill down by any dimension and solve issues before customers notice. Here’s a more in-depth look at what makes Honeycomb different, and why it’s such a profound change from traditional monitoring tools:
- See what’s happening and who’s impacted: Alert investigations in other tools generally start with an engineer viewing an impenetrable chart, followed by hopping between disjointed trace views and log analysis tools, leaving them guessing at the correlations between all three. Instead of this fragmented ‘three pillar’ approach to observability, Honeycomb’s unified data type and trace-aware query engine let a single investigation reveal both why a problem is happening and who specifically is impacted.
- Consolidate your logs and metrics workflows in one tool: Other vendors treat traces as a discrete complement to logs and metrics. Honeycomb’s approach is fundamentally different: wide events make it possible to rely on Honeycomb’s traces as your only debugging tool, consolidating logs and metrics use cases into one workflow. Honeycomb’s traces stitch together events to illuminate what happened within the flow of system interactions. And unlike metrics, which provide indirect signals about user experience, tracing in Honeycomb models how your users are actually interacting with your system, surfacing relevant events by comparing across all columns. Also unlike metrics-based tools, Honeycomb’s traces never break when you need to analyze highly contextual data within your system.
- Dramatically speed up debugging: Speed up debugging by automatically detecting hidden patterns with BubbleUp. Highlight anomalies on any heatmap or query result, and BubbleUp will reveal the hidden attributes that are statistically unique to your selection, making it easy to determine what context matters across millions of fields and values. Because BubbleUp is an easy-to-grasp visualization tool, any team member can quickly identify outliers for further investigation.
- Get the full context on incident severity: Other solutions provide metric-based SLOs, meaning they simply check a count (good minute or bad minute?) with no context on severity (how bad was it?). Honeycomb’s alerts are directly tied to the reality that people are experiencing, so you can better understand severity and meet users’ high performance expectations. Honeycomb’s SLOs are event based, enabling higher-fidelity alerts that give teams insight into the underlying “why.” When errors begin, Honeycomb SLOs can ping your engineers in an escalating series of alerts. Unlike other vendors, Honeycomb SLOs reveal the underlying event data, so anyone can quickly see how to improve performance against a particular objective.
- Avoid lock-in with best-in-class OpenTelemetry (OTel) support: Honeycomb supports and contributes to OpenTelemetry, a vendor-agnostic observability framework that enables teams to instrument, collect, and export rich telemetry data. Prior to OTel, teams were stuck using vendors’ proprietary SDKs; with OTel, you can instrument once and send to multiple tools if needed, avoiding lock-in. Using OTel’s automatic instrumentation for popular languages, teams can receive tracing instrumentation data with only a few hours’ work. Or, instrument manually to get even richer data and more flexible configuration. Engineers can also attach their existing logs to traces. (A configuration sketch follows this list.)
- Make costs predictable, without sacrificing system visibility: With Honeycomb, you simply pay by event volume—not by seats, servers, or fields—solving the tradeoff between system visibility and cost. Unlike legacy metrics and monitoring tools, Honeycomb enables engineers to capture unlimited custom attributes for debugging, with no impact on your spend. Honeycomb charges by number of events, not how much data each event contains or the way you analyze that data. There’s no penalty to instrument rich high-dimensionality telemetry or analyze high-cardinality fields.
- You can consolidate your metrics and logs analysis tools into a single line item (and single workflow), because Honeycomb’s traces contain wide events with all of your important debugging context. Add as many team members as you like—your costs won’t change.
- Reduce spend further without missing out on critical debugging data with Refinery, Honeycomb’s intelligent sampling proxy tool. Unlike legacy ‘blunt force’ sampling methods that can miss important context, Refinery examines whole traces and intelligently keeps the ones that matter to you, sampling the rest.
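The configuration sketch referenced in the OpenTelemetry bullet above: a hedged example of pointing the OpenTelemetry Python SDK at an OTLP endpoint. The endpoint and header shown reflect Honeycomb’s commonly documented OTLP ingest settings, but treat them as assumptions and check the current docs; the same pattern works for any OTLP-compatible backend.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp  (standard OTel packages)
import os
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/gRPC; the endpoint and header are assumptions based on
# Honeycomb's documented ingest settings. Verify against current documentation.
exporter = OTLPSpanExporter(
    endpoint="api.honeycomb.io:443",
    headers=(("x-honeycomb-team", os.environ["HONEYCOMB_API_KEY"]),),
)

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("smoke-test"):
    pass  # instrument once; swap the endpoint later to send the same data elsewhere
```

Because the exporter is the only vendor-specific piece, switching backends is a configuration change rather than a re-instrumentation project.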
Yes, observability enables high-performance engineering
Engineering teams continually strive for faster release times, but what happens when there is an outage? Time is of the essence, which is why observability engineering needs to be part of any modern software development effort.
Observability engineering surfaces data and anomalies, allowing for faster diagnosis and resolution. Tooling and a culture committed to answering the question “why?” are both vital for successful observability engineering, and luckily, they fit seamlessly into a modern DevOps practice.