Generative AI is having a bit of a moment—well, maybe more than just a bit. It’s an exciting time to be alive for a lot of people. But what if you read about a six-month-old AI firm with no revenue seeking a $2 billion valuation and feel something other than excitement in the pit of your stomach?
Phillip Carter has an answer for you in his recent talk at Monitorama 2024. As he puts it, “you can keep being a hater, but you can also be super useful, too!”
Augmentation, not automation
The rapid growth in generative AI has drawn comparisons to the crypto bubble. Generative AI may be riding the Gartner hype cycle because of how it’s being sold, but the underlying technology is genuinely useful across many domains, including data analysis and code generation.
“A lot of this stuff is being sold as automation. This tech is nowhere near good enough to actually automate legitimate things that people do,” Phillip says, “but it is good enough to augment a lot of stuff that people do.”
New large language models (LLMs) do represent a step change in capability. But that doesn’t mean they’re able to replace the human in the equation. As anyone who has seen an LLM tell a user they should eat one rock a day for their health can attest, having AI integrations reliably produce the correct results for unconstrained problems is practically impossible.
Machine learning and reliability
There’s a surprising amount of overlap between the reliability engineering and machine learning communities. As Phillip points out, they’re often interested in the same thing: building a system that works the way you expect it to.
Trying to make an AI model that can do a million possible tasks is a recipe for disaster (mmm, tasty rocks), but if that same model is applied to a problem with a narrow but clearly-defined scope, it can work exceptionally well. “If you make it a small enough scope, the models are really good at doing specific tasks,” Phillip says. To cross the barrier between a fun demo and a reliable product, instrumentation is vital. “If you don’t instrument that, your app is going to suck. Straight up.”
In the ML engineering space, it’s common knowledge that you need real data on how a model actually responds to inputs in order to make progress. That data is collected and annotated to form what’s called an evaluation set, and it’s continually updated over time. Making a tweak to your prompt or retrieval pipeline? Start with evaluations, and keep measuring against those evaluations once you’re in production. ML engineers talk about “getting a data flywheel going”: teams aren’t applying guesswork, they’re intentionally experimenting in production and using data to drive those experiments.
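As a rough illustration only (this is not Honeycomb’s code; the example cases, scoring rule, and helper names are made up), an evaluation set can be as simple as a list of annotated real inputs plus a scoring function you run before and after every prompt or pipeline change:

```python
# Hypothetical, minimal evaluation harness. `generate_query` stands in for
# whatever calls your model; the cases and scoring rule are illustrative.
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str               # natural-language input collected from real usage
    expected_columns: set[str]  # human-annotated "good" answer for this input


# The evaluation set: real inputs, annotated by humans, grown over time.
EVAL_SET = [
    EvalCase("show me errors by service", {"error", "service.name"}),
    EvalCase("p99 latency for checkout", {"duration_ms", "service.name"}),
]


def generate_query(question: str) -> set[str]:
    """Placeholder for the prompt + model + post-processing pipeline under test."""
    raise NotImplementedError


def run_evals() -> float:
    """Score the current pipeline against the annotated set and return accuracy."""
    passed = 0
    for case in EVAL_SET:
        produced = generate_query(case.question)
        if case.expected_columns <= produced:  # does the output cover the expected columns?
            passed += 1
    return passed / len(EVAL_SET)


if __name__ == "__main__":
    print(f"eval accuracy: {run_evals():.0%}")
```

The same score you check before shipping a prompt tweak is the one you keep watching in production, which is what turns guesswork into deliberate experimentation.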
Machine learning engineers are often trying to solve the same problems with different, and sometimes worse, tools than the observability space has. If you ask an ML engineer to help capture useful information from the system, Phillip says, they are more likely to be relieved than dismissive: “Yes, please, finally, I’ve been waiting for someone to care about this with me.”
Even with AI systems, you have to understand how the users are, inevitably, breaking the product: “That’s not a machine learning problem. That’s a software reliability problem.”
This isn’t an abstract observation; it comes from Phillip’s own experience building Honeycomb’s Query Assistant.
Honeycomb’s Query Assistant
Observability tools can be overwhelming for new users. The Query Assistant is a natural language AI tool that helps users turn their questions into query specifications on Honeycomb. Phillip was there building it from the ground up.
He points out that the LLM itself is not where most of the work happens when building a tool like this. Refining the data upstream of the LLM is where Phillip’s team spent most of their time. “If we don’t give [the model] good enough context in the first place, it has no hope of producing a good output that’s a useful query for someone.”
For example, a Honeycomb user may have multiple definitions of an error (such as the columns error and app.error, each of which tracks something different). A user who isn’t aware of that distinction may just ask about errors and expect a reasonable result. Honeycomb lets you designate a field of your choice as the canonical representation of an error, but no large language model was trained on what a specific team configured. So the question “what do we say is the error field?” is answered by context, not by a more powerful LLM.
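To make that concrete, here’s a hypothetical sketch (the function names, column names, and prompt format are invented, not Honeycomb’s implementation) of how a team’s own configuration, rather than a bigger model, supplies the missing knowledge:

```python
# Hypothetical sketch of "context, not a bigger model": the team's configuration
# says which column is the canonical error field, and that fact is injected into
# the prompt rather than being something the model could ever have learned.
from __future__ import annotations


def build_schema_context(columns: list[str], canonical_error_field: str | None) -> str:
    """Render dataset details the model was never trained on into prompt context."""
    lines = ["Dataset columns: " + ", ".join(columns)]
    if canonical_error_field:
        lines.append(
            f"When the user asks about errors, use the column `{canonical_error_field}`."
        )
    return "\n".join(lines)


context = build_schema_context(
    columns=["error", "app.error", "duration_ms", "service.name"],
    canonical_error_field="app.error",  # set by this team's configuration
)
prompt = f"{context}\n\nUser question: show me recent errors by service"
print(prompt)
```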
To make verifiable progress, the team at Honeycomb implemented tracing at every step of the Query Assistant feature. Every context-gathering operation, from simple things like the name of a dataset to more complex things like the vector and keyword-based search pipeline, gets a span showing what was selected; the call to the LLM is a span; and post-processing of the results gets spans of its own. The end-to-end trace tells the complete story of what actually happened when a user asked a question and got a query back.
Phillip explains, “If you are calling a database or if you’re working against a set of documents, what are the ones that were actually selected to pull information out of? You can’t just pull down a language model and debug it yourself step-by-step.” Tracing this information is vital to improving GenAI applications over time.
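A minimal sketch of that pattern using the OpenTelemetry Python API might look like the following; the span names, attributes, and stubbed helpers are illustrative assumptions, not Query Assistant’s actual internals:

```python
# Sketch of per-step tracing for an LLM feature using the OpenTelemetry Python
# API (opentelemetry-api). Every pipeline step becomes a span in a single trace.
from __future__ import annotations

import json

from opentelemetry import trace

tracer = trace.get_tracer("query-assistant-example")


def gather_context(dataset: str, question: str) -> dict:
    """Stub for schema lookup plus vector/keyword search."""
    return {"columns": ["app.error", "duration_ms"], "docs": []}


def call_model(question: str, context: dict) -> str:
    """Stub for the actual LLM request."""
    return '{"calculations": [{"op": "COUNT"}]}'


def parse_and_validate(raw_output: str) -> dict | None:
    """Stub for turning model output into a query specification."""
    try:
        return json.loads(raw_output)
    except ValueError:
        return None


def answer_question(question: str, dataset: str) -> dict | None:
    with tracer.start_as_current_span("query_assistant.request") as root:
        root.set_attribute("app.dataset", dataset)
        root.set_attribute("app.user_question", question)

        with tracer.start_as_current_span("gather_context") as span:
            context = gather_context(dataset, question)
            span.set_attribute("app.context.num_columns", len(context["columns"]))

        with tracer.start_as_current_span("llm.call") as span:
            raw_output = call_model(question, context)
            span.set_attribute("app.llm.output_length", len(raw_output))

        with tracer.start_as_current_span("postprocess") as span:
            query_spec = parse_and_validate(raw_output)
            span.set_attribute("app.query.valid", query_spec is not None)

        return query_spec
```

Because every step hangs off one root span, a single trace can answer “what context did we feed the model, what did it say, and did the result validate?” for each user question.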
Don’t just hate generative AI—make it suck less
Ultimately, Phillip is not out to convince the AI haters. He just wants them to work to improve the things they don’t like about generative AI. There’s a growing movement toward AI observability, he says, and the people who are naturally skeptical about software have an important part to play. “You can keep being an AI hater. But you can also be really useful, too. You have a lot to bring to the table to make these systems a lot better.”
Are you interested in seeing how an AI tool that doesn’t suck works? Learn more about Honeycomb’s Query Assistant. If you’re still on the fence, you can even chat with one of our own developers about their experience. And be sure to check out Phillip’s full talk for all the juicy details this overview didn’t go into.