How Does ‘Vibe Coding’ Work With Observability?

You can’t throw a rock without hitting an online discussion about ‘vibe coding,’ so I figured I’d add some signal to the noise and discuss how I’ve been using AI-driven coding tools with observability platforms like Honeycomb over the past six months. This isn’t an exhaustive guide, and not everything I say is going to be useful to everyone—but hopefully it will clear up some common misconceptions and help folks out.

Demystifying vibe coding

So, what is ‘vibe coding’ anyway? Like I intimated above, it’s really about using AI-driven coding tools as the key part of a software development workflow. I’m not sure it’s describing anything that new, to be quite honest (old heads will recall when things like IntelliSense, and IDEs in general, were supposedly dumbing down an entire generation of developers), but it’s the term we’ve got. I honestly don’t think it’s worth getting bent out of shape over! The phrase also overlaps with a lot of neighboring terms, especially when it gets mixed into the general morass of ‘AI development’ that’s so popular right now.

What’s clear is that if you’re not extremely plugged into the space, it can be really challenging to build a model of what people are talking about when they’re talking about ‘AI development.’ I’m going to introduce you to the taxonomy that I use, which will hopefully help clear things up.

  • ‘Training AI’ is a broad category, but I mostly use it for people developing machine learning models, or primarily ML-driven applications, whose main purpose is to provide a model or model-oriented functions (such as classification, vision, etc.) to other developers or applications. I also tend to group a lot of secondary or downstream model-related work here, such as refining models through embeddings and fine-tunes.
  • ‘Building AI’ is an even broader category, but the distinction between this and training has a lot more to do with inputs and outputs than anything else. Training is, fundamentally, an ML task: you take in data and create a model that does something. Building is about taking those models and turning them into applications that do things for people. ‘Agentic systems’ are a good example, as is integrating AI functionality into existing line-of-business applications.
  • ‘Building with AI’ is the broadest of these categories, and honestly can encompass most productive work done by software developers. This is the land of leveraging LLMs such as Claude as a ‘programming partner,’ either through advanced autocomplete functionality, chat interactions, or agent-based workflows that can do everything from making Git commits to creating new files on your behalf.

One thing that unites all of these concepts is that observability is foundational to each, just applied in different ways. Training models and building AI features into existing applications obviously benefit from observability (in many ways, we’re watching the ML space speedrun the hard-won observability lessons that SREs have spent the past twenty years or so figuring out), but it’s less clear what observability means when it comes to building with AI, and that’s what I want to discuss.

Where does AI excel?

There’s a popular misconception I see regarding AI-assisted development; you can think of it as a category error. The thought process is that if the AI is bad at one unexpected thing, it will also be bad at the things you’d expect it to handle, or, alternately, that because the AI is good at an expected thing, it will be good at unexpected things too. Both of these trains of thought miss the mark. Like any other tool, AI obeys the principle of garbage in, garbage out: there’s a skill to using it, and the delta between skilled and unskilled use is pretty severe.

I’ve found the following three questions to be a pretty reliable test of whether a task is a good fit:

  • Do I have good documentation for what I’m trying to get the AI to do?
  • Do I understand what ‘good’ looks like for the task I’m about to propose?
  • Do I have a way to create a fast feedback loop between myself, my code, and the AI?

If you can answer yes to all three, your odds of getting good code out of AI models go up quite a bit. Let’s touch on each briefly.

What does good AI documentation look like?

Documentation written for an LLM is, unfortunately, kind of different from documentation written for a human. The biggest reason is the information architecture of human-designed docs: humans like to use links, and to split things up conceptually across multiple pages. That isn’t useful when feeding documentation into an LLM context window. It’s more valuable to create a single documentation file that includes everything the model might need to know (code snippets, method declarations, API definitions) and have it at hand to paste into a prompt. If your project is too complex for a single file, make sure each documentation file is more or less self-contained to a service or domain, and choose the appropriate one based on what you’re working on. For a practical example of this principle, check out llms.txt.
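To make that concrete, here’s a minimal sketch of the ‘one self-contained file’ idea: a small script that stitches every markdown doc for a service into a single file you can paste into a prompt. The directory layout and service name are hypothetical, not taken from any particular project.

```python
# A minimal sketch: collapse per-topic markdown docs into one self-contained
# context file per service. Paths and names here are hypothetical.
from pathlib import Path


def build_context_file(service: str, docs_dir: Path = Path("docs")) -> Path:
    """Concatenate every markdown doc for a service into a single file."""
    parts = []
    for doc in sorted((docs_dir / service).glob("*.md")):
        # Keep each source doc identifiable, but flatten everything into one file.
        parts.append(f"## {doc.stem}\n\n{doc.read_text()}")
    out = docs_dir / f"{service}-llm-context.md"
    out.write_text(f"# {service}\n\n" + "\n\n".join(parts))
    return out


if __name__ == "__main__":
    print(build_context_file("checkout"))
```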

I’ve also found that AI is remarkably good at reading structured documents like UML in order to interpret relationships between components. Try that out as a ‘base truth’ for your design.

Finally, it’s good to have an ‘observability’ documentation file that you can throw into the context window. This can be as simple or complex as you want, but it should contain basic rules about the logs, metrics, and traces you’d like to emit, along with any patterns you want to use. As more tools support things like .cursorrules, it’s probably a good idea to create a rules file that you can share between multiple projects. In addition, take advantage of OpenTelemetry Semantic Conventions and their documentation: the semantic-conventions repository contains dozens of markdown files that exhaustively document telemetry conventions. They’re a great thing to have on hand and feed into your context window if you’re trying to ensure quality instrumentation.
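To give a sense of what ‘quality instrumentation’ along those conventions can look like in generated code, here’s a small illustrative example using the OpenTelemetry Python API. The handler and attribute values are made up; the attribute keys follow the HTTP semantic conventions mentioned above.

```python
# Illustrative only: a hand-rolled span whose attribute keys follow the
# OpenTelemetry HTTP semantic conventions. The handler itself is hypothetical.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")


def handle_checkout(cart_id: str) -> int:
    with tracer.start_as_current_span("POST /checkout") as span:
        span.set_attribute("http.request.method", "POST")
        span.set_attribute("url.path", "/checkout")
        span.set_attribute("app.cart.id", cart_id)  # app-specific, namespaced attribute
        # ... do the actual work here ...
        status_code = 200
        span.set_attribute("http.response.status_code", status_code)
        return status_code
```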

What does ‘good’ look like as a result?

Unless you have a good idea of what success looks like, you’re gonna have a rough time building with AI. This is an area where observability is also crucial, because it allows you to visualize what’s happening in your code. 

I’ve found that using the vision capabilities of models is actually an amazing hack here: ‘show, don’t tell’ works surprisingly well. Beyond the UML diagrams I mentioned earlier, giving the model a wireframe of what your UI should look like is remarkably effective, and showing it a trace waterfall or a profile flamegraph is a striking way to pass in context about the current state of a service. If you want to get creative, write some tools to do this yourself, or leverage things like Model Context Protocol to allow the model to call an endpoint that returns an image.
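As a rough illustration of that last idea, here’s what a tiny MCP tool that returns an image might look like, assuming the Python MCP SDK’s FastMCP and Image helpers. The span data is hard-coded for the sake of the example; a real tool would pull it from your tracing backend.

```python
# A sketch of an MCP tool that renders a trace waterfall as a PNG so a
# vision-capable model can "see" it. Span data is hard-coded for illustration.
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
from mcp.server.fastmcp import FastMCP, Image

mcp = FastMCP("trace-viewer")


@mcp.tool()
def render_trace_waterfall(trace_id: str) -> Image:
    """Render a trace as a waterfall image the model can inspect."""
    # Hypothetical spans: (name, start offset in ms, duration in ms).
    spans = [("api-gateway", 0, 120), ("auth", 5, 30), ("checkout", 40, 70), ("db.query", 60, 45)]
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.barh([s[0] for s in spans], [s[2] for s in spans], left=[s[1] for s in spans])
    ax.invert_yaxis()
    ax.set_xlabel("ms")
    ax.set_title(f"trace {trace_id}")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return Image(data=buf.getvalue(), format="png")


if __name__ == "__main__":
    mcp.run()
```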

Beyond this, you also need to know what you want the model to do. Having a solid grasp of design fundamentals (both software engineering and UX) is incredibly important. It’s not enough to just tell the model “go off and do x”; being able to tell it how to do x, or to inform it about important constraints, matters a great deal for successful task completion.

I want to dwell on this a bit in light of a blog post I read about teaching Claude to play chess. It’s a great read, so go check it out, but one thing I took away is that the more novel your ‘world state,’ the more difficulty a model will have making good decisions about it. The LLM, like any computer, will happily do exactly what you tell it to. You’re in the driver’s seat: you have to know what you want and how to get there. You might not need to know the details (that’s what the LLM is pretty good at figuring out!), but you have to be able to point the model towards viable solution paths. Have a good grasp of the domain you’re working in, have a really crisp north star for what you’re trying to achieve, and practice really clean contracts between services. You’ll find success more often than not, in my experience.

A final note here: smaller, iterative changes are almost always better than large ones. Design small, stateless, self-contained units of functionality and stitch them together.

How do I build fast feedback loops?

I’ve found that managing feedback loops is the biggest key to being productive with AI-assisted development. You want to be able to run things locally, ideally with as little abstraction as possible between the model interface and the running code. Detailed logging for local development is also really important, and I’ve found that typed languages tend to accelerate my productivity as well, since they turn runtime issues into compile-time ones.
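As a trivial sketch of the ‘detailed local logging’ point, something like this keeps local runs noisy enough for the model to read useful detail back out of them, without shipping that verbosity to production. The APP_ENV convention is an assumption; use whatever your project already does.

```python
# A minimal sketch: verbose logging for local development only.
# APP_ENV is a hypothetical convention, not a standard.
import logging
import os

if os.getenv("APP_ENV", "local") == "local":
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s %(name)s %(funcName)s: %(message)s",
    )

log = logging.getLogger("checkout")
log.debug("loaded cart %s with %d items", "abc123", 3)
```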

Where things get trickier is when you move off your laptop and into production. This is an area that OpenTelemetry really helps with, because you can get quite a bit of baseline instrumentation and telemetry for free. Additional documentation about your telemetry setup will allow the model to add as much detail as you need to your spans, metrics, and logs. As I mentioned earlier, the other advantage is that you can feed the output of this telemetry back into the model through vision, in order to show the model where problems are occurring. It’s a really neat trick, and it speeds up the model’s ability to detect problems without requiring highly verbose logging in production.
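For reference, a baseline setup that ships traces from a Python service to Honeycomb over OTLP might look roughly like this. The endpoint and header shown are the commonly documented Honeycomb values, but treat the specifics as assumptions and check the current docs; in practice, the OTEL_EXPORTER_OTLP_* environment variables plus auto-instrumentation get you the same thing with less code.

```python
# A rough sketch (not necessarily the author's setup): exporting traces to
# Honeycomb over OTLP/gRPC with the OpenTelemetry SDK. Endpoint and header
# values are assumptions to verify against current Honeycomb docs.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="api.honeycomb.io:443",
            headers=(("x-honeycomb-team", os.environ["HONEYCOMB_API_KEY"]),),
        )
    )
)
trace.set_tracer_provider(provider)
```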

If you’re feeling froggy, this is another area where tool use becomes crucial. I built an MCP server for Honeycomb that exposes a lot of functionality for querying my telemetry data. It’s been really helpful for testing stuff out in production, or for validating changes locally against what’s happening on the server. I’ve even been surprised at times: Claude Code can figure out when it should use a tool it knows about, and will happily wander off to query Honeycomb for details about a service. Being able to get insights into what’s happening on the server without jumping into the Honeycomb UI is rather nice! I suspect there’s also a lot of room to do things like task an agent with addressing SLO burn or trigger state, but I haven’t explored that as much yet.
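To sketch the shape of that, a single query tool might look something like the following. This is an illustrative sketch rather than the actual server: it assumes the FastMCP helper and Honeycomb’s Query Data API (create a query, then create and poll a query result), so verify the endpoints and payloads against Honeycomb’s current API docs before relying on it.

```python
# A hedged sketch of one tool a Honeycomb MCP server might expose. The Query
# Data API endpoints and payloads below are assumptions; check Honeycomb's docs.
import os
import time

import requests
from mcp.server.fastmcp import FastMCP

API = "https://api.honeycomb.io"
HEADERS = {"X-Honeycomb-Team": os.environ["HONEYCOMB_API_KEY"]}

mcp = FastMCP("honeycomb")


@mcp.tool()
def count_events(dataset: str, minutes: int = 60) -> dict:
    """Run a simple COUNT query over the last N minutes of a dataset."""
    query = {"time_range": minutes * 60, "calculations": [{"op": "COUNT"}]}
    query_id = requests.post(
        f"{API}/1/queries/{dataset}", json=query, headers=HEADERS
    ).json()["id"]
    result = requests.post(
        f"{API}/1/query_results/{dataset}", json={"query_id": query_id}, headers=HEADERS
    ).json()
    while not result.get("complete"):  # poll until the query result is ready
        time.sleep(1)
        result = requests.get(
            f"{API}/1/query_results/{dataset}/{result['id']}", headers=HEADERS
        ).json()
    return result["data"]


if __name__ == "__main__":
    mcp.run()
```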

What’s my stack?

I’ve been using a mix of tools over the past six months: Windsurf, Cursor, Claude Code, Zed Assistant, Copilot, and even local models. I haven’t found anything that’s head and shoulders above the rest (well, Claude Code is very good), but what I have found is that as I’ve gotten better at interacting with the models, I’ve become more effective.
I think we’re going to be in a period of rapid development and iteration in this space for another year or so at least, so I’m excited to see what’s next. If you’d like to share what’s working for you with AI-assisted development, let me know on Bluesky!

Austin Parker
Open Source Director