Observability in the Age of AI

This post was written by Charity Majors and Phillip Carter.

In May of 2023, we released the Honeycomb Query Assistant, an LLM-backed feature that lets engineers use natural language to generate and execute queries against their telemetry data. Instead of having to master a domain-specific query language, you can simply type in things like “slow endpoints by status code” and the Query Assistant will generate a relevant Honeycomb query for you to iterate on.
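
For readers curious about the general shape of a feature like this, here is a minimal sketch of the natural-language-to-query pattern in Python. The complete() helper, the prompt, and the exact query shape are illustrative assumptions, not Honeycomb’s actual implementation.

```python
import json

# Hypothetical helper that wraps whatever LLM API you use; not a real library call.
def complete(system_prompt: str, user_input: str) -> str:
    raise NotImplementedError("call your LLM provider here")

SYSTEM_PROMPT = (
    "You translate natural-language questions about telemetry into a JSON "
    "query object with the keys: calculations, filters, breakdowns, and "
    "time_range. Respond with JSON only."
)

def natural_language_to_query(question: str) -> dict:
    raw = complete(SYSTEM_PROMPT, question)
    query = json.loads(raw)          # fails loudly if the model returned non-JSON
    if "calculations" not in query:  # cheap structural check before executing anything
        raise ValueError(f"model returned an unexpected shape: {query}")
    return query

# natural_language_to_query("slow endpoints by status code") might yield something like:
# {"calculations": [{"op": "P99", "column": "duration_ms"}],
#  "breakdowns": ["http.route", "http.status_code"],
#  "time_range": 7200}
```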

When we launched Query Assistant, six months after the splashy release of ChatGPT, it was the first of its kind to hit the market. It was also a bit of a trial balloon for us as a company. We wanted to get a firsthand sense of what it was like to build tools using generative AI, and then to understand the challenges of maintaining those tools in the face of a user base with widely mixed familiarity with AI patterns. Who would end up actually using these tools, and how would they impact the business?

From May until November of 2023, we published a steady stream of material discussing our experiences building Query Assistant (which took just six weeks to write and ship, from start to finish!) and all the things we learned about instrumenting and understanding software with LLMs along the way.

Honeycomb engineers were amongst the earliest adopters of this technology. Not in the widely parodied top-down, VP-mandated, “go be AI leaders nao plz” kind of way, but in a bottom-up, experimental kind of way, driven by curiosity and fascination.

2025 is going to be an exciting year when it comes to Honeycomb and AI; we can’t wait to show you what we’ve got cooking.

The emerging Honeycomb perspective on AI in tech

Is there an AI bubble? Yes, almost certainly. However, in technology, the size of the bubble often correlates with the magnitude of its ultimate impact. AI is not magic, but it is a tool with many powerful applications.

We think there is a massive gap between the capabilities enabled by AI technologies and the realization of those capabilities on the market today. We are excited about the potential to relieve labor-intensive toil, especially around instrumentation, and massively accelerate time-to-value and time-to-insights across the board. The Honeycomb AI strategy has two prongs:

  • For people building with AI: helping them understand the performance, quality, and customer experience of their models and of the software they build with LLMs
  • For every Honeycomb user: creating smart, intuitive workflows based on the depth of our data and our insights into how experts interact with their systems

Honeycomb’s superpower as a company has always been a one-two punch: first, build world-class teams and technology; second, bring the world along with us by showing our work.

One disappointing aspect of the current boom is how many companies are being incredibly tight-lipped about the practical aspects of developing with LLMs. Most leading AI companies seem reluctant to show their work or talk about how they resolve the contradictions of applying software engineering best practices to nondeterministic systems, or how AI is changing the way they develop software and collaborate with each other. They act like this is part of their secret sauce, or their competitive advantage.

I think this is really unfortunate, and it places even more of an onus on the rest of us to share our work and advance the industry together as we learn. Towards that end, we thought it might be helpful to share our read on the AI landscape with the world, along with some of the diagnoses that are guiding our product development plans.

If you can compute the answer, you probably should compute the answer

Right now, we see a lot of companies selling this idea that any product or feature that uses AI will be better than a product or feature that doesn’t, simply because it has AI—which is just ridiculous. There are plenty of products and features that have been made worse, not better, by using AI. 

In the observability market in particular, we’re seeing a lot of companies use AI as a hack under the hood to try and guess reasonable answers for questions that they can’t compute, because they didn’t gather their data in a way that preserves enough context to let them compute it. If they did collect the data in a way that preserved rich context and relationships, using structured data instead of metrics, they could deliver much better anomaly detection and outlier correlations (ironically, their AI-powered features would be better then, too). But they didn’t, so all they can do is guess.

If the question is a mathematical one, computing or calculating the answer will always be faster, cheaper, and more accurate than using AI to derive an answer. If the answer can be computed, in other words, it probably should be.
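
To make that concrete, here is a minimal sketch of answering “what is the p99 latency per endpoint?” by computing it directly from structured events. The event shape and field names are illustrative; the point is that this is ordinary arithmetic over raw data, with no model in the loop.

```python
from collections import defaultdict
from statistics import quantiles

# Assumed shape: wide, structured events that keep high-cardinality fields intact.
events = [
    {"endpoint": "/export", "status": 200, "duration_ms": 1840.0},
    {"endpoint": "/export", "status": 200, "duration_ms": 95.0},
    {"endpoint": "/login",  "status": 500, "duration_ms": 12.0},
    # ... millions more in practice
]

def p99_by_endpoint(events: list[dict]) -> dict[str, float]:
    """Exact per-endpoint p99, computed from the raw events themselves."""
    durations = defaultdict(list)
    for event in events:
        durations[event["endpoint"]].append(event["duration_ms"])
    return {
        endpoint: quantiles(values, n=100)[98] if len(values) > 1 else values[0]
        for endpoint, values in durations.items()
    }
```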

Sometimes all you can do is guess, and then yeah sure, a guess is probably better than nothing. But don’t fall for the lie that anything gets better if you sprinkle AI on it.

Generative AI is evolving from guessing to reasoning

Amid all the hype and hoopla and seeming feats of magic, one fundamental truth seems to be getting lost: early forms of generative AI are a form of guessing. They don’t reason or calculate or compute—they are more like a very fancy auto-complete.

This will not be true forever. Indeed, newer generations of LLMs and GenAI are developing reasoning capabilities that are coalescing into a new, weird, and sometimes janky kind of virtual computer. This “computer” is slow and wobbly (compared to traditional computers) at deterministic tasks like adding two numbers together, but quite fast at e.g. translating a question from Japanese into a JSON object that we can run as a Honeycomb query. As an industry, we are gradually feeling our way around this, because it can get stuff wrong a lot, but the kinds of things it can do were fundamentally impossible a few years ago.

Applying this tech to anomaly detection might turn out to be useful, insofar as it can theoretically help discover more patterns and offer hypotheses about which patterns might mean something. But that means you need a different computing system to actually do the math and stats and data diffing and what have you to make it ultimately reliable. Figuring out which types of work should be done respectively by humans, computers, or genAI models is going to make the next few years very interesting.

The other thing we believe is that we need to stack an LLM on top of, underneath, or between other computer systems that we’re more familiar with. Don’t make an LLM add two numbers together, but maybe consider passing numbers it generates into a calculator so you can calculate and verify a final output.
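
As a concrete illustration of that pattern, here is a minimal sketch in Python, assuming a hypothetical ask_llm() wrapper around whatever model you use: the LLM turns prose into numbers, and plain deterministic code does the arithmetic and the sanity checks.

```python
import json

# Hypothetical wrapper around your LLM of choice; stands in for a real API call.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def estimate_monthly_cost(description: str) -> float:
    """Let the model extract the numbers; let ordinary code do the math."""
    raw = ask_llm(
        "Extract unit_price and monthly_units from the text below as JSON "
        "with exactly those two numeric keys.\n\n" + description
    )
    parsed = json.loads(raw)
    unit_price = float(parsed["unit_price"])
    monthly_units = float(parsed["monthly_units"])

    # Verify the extraction is at least plausible before trusting it.
    if unit_price < 0 or monthly_units < 0:
        raise ValueError(f"implausible extraction: {parsed}")

    # The multiplication happens here, deterministically, not inside the model.
    return unit_price * monthly_units
```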


Read Phillip Carter’s O’Reilly book: Observability for Large Language Models


AI can augment, not replace, your engineers

A lot of people are raising eye-watering boatloads of cash on the idea that you can replace your engineers and support agents with AI. We think this is the wrong mental model (as well as being deeply problematic and possibly immoral). But how should you reason about AI and LLMs when you’re thinking about productivity and workflows?

This might be betraying my own background as a database engineer, but for me, the most useful way to conceptualize AI has been to think of it like a weird new kind of storage engine. In the 2010s, a wave of NoSQL databases changed the kinds of workloads we were able to model and run. In some ways, generative AI is another kind of storage engine, capable of applying the revolutionary powers of automation and computation to natural language and truly unstructured data, instead of just numbers and structured data.

Newer models are developing reasoning capabilities, which is moving them in the direction of being more of a “weird new kind of computer” than a “weird new kind of storage engine,” but I still find the analogy useful. Instead of grandly envisioning AI as replacing a person or joining your team, think about which workloads are currently being done by humans (or computers) that might be really well suited to the emerging strengths of generative AI.

As Fred Hebert says, “There is a long history around automation showing that best results are obtained when we use automation to second-guess or support people rather than either a) taking them out of the loop or b) when humans are in the role of validators.” One good example comes from radiology. If you hand the diagnosis over to AI, your actual radiologists swiftly grow less skilled and tend to concur with whatever the AI said. But if you have people do the diagnostics and have AI sanity-check their results and flag when it reaches conflicting conclusions, you get the best of both worlds.

We’re all gradually feeling our way towards a range of use cases, whether that is accelerating high-value work, replacing YAML-generation and other low-value work, or stuff like transcription, etc. There are a lot of unknowns here, but the rule of thumb of automating, enhancing, and second-guessing your engineers rather than replacing them seems to serve us fairly well.

Disposable software vs the software that runs the world

We expect software will bifurcate into two categories. We think there is going to be a dramatic increase in use cases for disposable code, where the stakes are relatively low and nobody ever really needs to understand it. If your code isn’t working, just regenerate it repeatedly until it does a better job of passing your tests and the criteria you’ve defined for it. 

Then there is the code that is not disposable; the code that runs the world—banks, delivery companies, commerce, etc. This is code where individual outcomes translate into real monetary results, where code accrues value and trust over time by running and being stable over multiple rounds of iterating and revising, where you know it works and does the job because it has been working and doing the job for a long time. At the end of the day, someone, somewhere is going to need to understand that code. Always.

AI can assist with this in a number of ways: bootstrapping your instrumentation code, providing nudges and hints about how to improve instrumentation, learning from how experts interact with their corners of the system and surfacing those patterns to everyone, offering more sophisticated anomaly detection from analyzing common datasets, and so on. These are the areas Honeycomb is starting to seriously invest in from a product perspective, especially around making it easier to get your data in and accelerating time-to-value.

Great AI observability must be grounded in great software observability

A lot of folks are raising buckets of cash to develop “AI observability” tools, but in our experience, you can’t evolve the model independently of the context of the rest of the software system it’s embedded in. You can’t understand your models in isolation. Context is everything when it comes to LLMs. You can’t have great observability for AI unless you start with great observability for the rest of your software.

We’ve talked a lot about observability 2.0 in recent months—the idea that the “three pillars” model for observability characterizes the previous generation of tools. Those tools were largely based on metrics, scattered telemetry across disparate logging, tracing, metrics, dashboards, profiling, APM, RUM, and other tools, and tightly limited your ability to capture and preserve high-cardinality or relational data. In comparison, observability 2.0 tools use a single, unified data source and let you store a rich web of context and data relationships.

All of these aspects become even more important when applying observability to AI models or developing with LLMs. You can’t possibly hope to understand and improve your LLM code or models if all you have are aggregates and random exemplars (i.e., your observability 1.0 toolset). You need the ability to inspect outliers and preserve a chain of events, including entire feedback loops. Modern AI apps usually involve retrieval steps, which is a tracing problem by another name. An LLM-as-router decides which subsystem receives information for further processing based on the input it was given, and when that goes wrong, it’s a classic high-cardinality problem.

AI agents often gather context from many sources, and knowing why an agent finished (or didn’t) is a classic high-dimensionality problem. You need high cardinality, high dimensionality, and the connective glue of traces to ensure your AI apps do the job they’re supposed to do.
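
To ground that in code, here is a minimal sketch using OpenTelemetry’s Python API. The retrieve_documents() and call_llm() helpers are hypothetical stand-ins for your vector store and model client, and the span and attribute names are illustrative rather than a prescribed convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("ai-app")

# Hypothetical stand-ins for your vector store and LLM client.
def retrieve_documents(question: str) -> list[str]:
    raise NotImplementedError("query your vector store here")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def answer(question: str, user_id: str) -> str:
    # One trace ties the retrieval step and the model call together.
    with tracer.start_as_current_span("rag.answer") as span:
        # High-cardinality context: raw user input, per-user IDs, and so on.
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.question", question)

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            documents = retrieve_documents(question)
            retrieve_span.set_attribute("rag.documents_returned", len(documents))

        with tracer.start_as_current_span("llm.call") as llm_span:
            response = call_llm("\n".join(documents) + "\n\n" + question)
            llm_span.set_attribute("llm.response_length", len(response))

        return response
```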

The three intersection points of AI and observability

There are three main areas where observability intersects with AI: 

  • Building and improving models. 
  • Developing software using LLMs. 
  • Helping teams grapple with the influx of software of, shall we say, “unknown provenance.” 

It used to be that you could rely on the fact that someone, somewhere wrote the code and understood the code they were merging to production—but that is no longer the case.

A lot of engineers confuse reading code, and understanding what the code is meant to do, with understanding what the code actually does. The latter cannot be done without instrumentation and without running it in production. You don’t know—you can’t know—what it actually does until you run it.

As writing code gets easier and easier, understanding code gets harder and harder. Writing code has never been the hardest part of software development; the hardest part has always been operating, maintaining, and iterating on it. In order to truly boost productivity, AI will have to get better at helping us with these things.

We have customers using Honeycomb for all three of these use cases today; however, the third is almost less of a use case and more a universal fact of life if you happen to be developing software in 2024 and 2025. As we look to sprinkle AI pixie dust on our own product offerings, it’s where we are most excited to bring innovative thinking that makes life materially better for our customers.

Building trust and showing our work

Our customers come to us for lots of reasons, but one of the biggest is trust: they trust what we have to say about the future of technology. We have been early champions for a lot of sociotechnical changes and transformations that have since gone mainstream—putting developers on call for their work, testing in production, making the world safe to deploy on Fridays, platform engineering, observability, and much more.

We take this seriously; your trust means a lot to us. There are a lot of companies out there spinning hypotheses and fantasies about artificial intelligence that might someday come true. There’s nothing wrong with that. But this is not the role we seek to play in the ecosystem. We stay grounded in the reality of code as it meets production. We care about building humane, high-performing organizations, and helping teams understand their software in the language of the business.

We aren’t investing in AI because everyone else is doing it. We’re doing it because we believe it will unlock outsized value for our customers. And we will continue to do our work and share our learnings out in the open, to help bring the industry along with us as we learn and grow.


Charity Majors

CTO

Charity Majors is the co-founder and CTO of honeycomb.io. She pioneered the concept of modern Observability, drawing on her years of experience building and managing massive distributed systems at Parse (acquired by Facebook), Facebook, and Linden Lab, building Second Life. She is the co-author of Observability Engineering and Database Reliability Engineering (O’Reilly). She loves free speech, free software, and single malt scotch.
