Today at Google Next, Charity Majors demonstrated how to use Honeycomb to find unexpected problems in our generative AI integration.
Software components that integrate with AI products like Google’s Gemini are powerful in their ability to surprise us. Nondeterministic behavior means there is no such thing as “fully tested.” Never has there been more of a need for testing in production!
Honeycomb is all about helping you find problems you couldn’t before, problems you didn’t imagine could exist—the unknown unknowns. Charity showed how this applies to integrating with Gemini.
What’s the problem?
In the Honeycomb product, people can ask questions in natural language and get a Honeycomb query and result. We call this feature Query Assistant.
People type in stuff like “Which users have the highest latency?” and get a graph and table. This comes with a Honeycomb query they can modify.
Query Assistant turns an engineer’s question into a valid Honeycomb query—most of the time. Is that good enough?
How good is the solution?
With Honeycomb, we can measure whether the feature is working. We have instrumented the entire operation, all the way from the moment the user hits enter: the RAG pipeline, programmatic prompt construction, the AI model call, response parsing and validation, query construction, and submission to our query engine. There’s a lot that can go wrong, even without the nondeterminism of AI!
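To make that concrete, here is a rough sketch of what wrapping that pipeline in a trace might look like, using the OpenTelemetry Go SDK. Everything here is illustrative: the span names, attribute names, helper functions (retrieveContext, buildPrompt, callGemini, and so on), and return types are placeholders, not Honeycomb’s actual instrumentation.

```go
package queryassistant

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

var tracer = otel.Tracer("query-assistant")

func handleNLQRequest(ctx context.Context, question string) (any, error) {
	// One parent span covers the whole operation, from the moment the user hits enter.
	// In practice, each stage below would also get its own child span.
	ctx, span := tracer.Start(ctx, "query_assistant.request")
	defer span.End()
	span.SetAttributes(attribute.String("app.nlq.user_input", question))

	docs := retrieveContext(ctx, question) // RAG pipeline
	prompt := buildPrompt(question, docs)  // programmatic prompt construction

	raw, err := callGemini(ctx, prompt) // AI model call
	if err != nil {
		span.SetAttributes(attribute.String("error", err.Error()))
		return nil, err
	}
	span.SetAttributes(attribute.String("app.nlq.response", raw))

	query, err := parseAndValidate(raw) // response parsing and validation
	if err != nil {
		span.SetAttributes(attribute.String("error", err.Error()))
		return nil, err
	}

	return runQuery(ctx, query) // query construction and submission to the query engine
}

// Stubs so the sketch compiles; the real implementations are the interesting part.
func retrieveContext(ctx context.Context, q string) []string        { return nil }
func buildPrompt(q string, docs []string) string                    { return q }
func callGemini(ctx context.Context, prompt string) (string, error) { return "", nil }
func parseAndValidate(raw string) (map[string]any, error)           { return nil, nil }
func runQuery(ctx context.Context, q map[string]any) (any, error)   { return nil, nil }
```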
All of this rolls up into a single Service Level Objective (SLO) we call “Query Assistant Availability.” It tracks whether Query Assistant produces a valid query at least 95% of the time.
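Under the hood, an SLO like this is a target percentage over a per-event success indicator. Continuing the hypothetical sketch above (the app.nlq.query_valid field name is made up), each request might record that indicator like so:

```go
// Hypothetical success indicator: the "Query Assistant Availability" SLO
// would target app.nlq.query_valid == true for at least 95% of requests.
query, err := parseAndValidate(raw)
span.SetAttributes(attribute.Bool("app.nlq.query_valid", err == nil))
```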
But what about the other 5%?
Where does it go wrong?
Charity scrolls down on the SLO screen, to the view where the magic really happens. We call this the SLO BubbleUp View.
BubbleUp looks at every Query Assistant request, compares every dimension of the requests that violate the SLO against the baseline of events that satisfy it, and then diffs and sorts the results so the biggest differences rise to the top.
This answers the question: “What is different about the ones I most care about?”
Charity sees that “Error” is one of the most different fields, and there are several different values. She mouses over one of them. It says “unexpected end of JSON input.”
Charity clicks to add this field as a filter, and now she has a graph of only requests with this error.
Now, she’d like to see the end of that JSON input. She types something like “also show the user input and response” into Query Assistant.
Query Assistant adds the right fields to the GROUP BY, as it usually does, and they now appear in the table. There’s something suspicious about the end of that app.nlq.response value.
That response is truncated! Indeed, the JSON has ended unexpectedly.
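That error message is what Go’s encoding/json produces when a document gets cut off partway through. Here’s a minimal reproduction with a made-up response, shaped roughly like a Honeycomb query, that stops mid-string:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A response cut off mid-object, as if the model hit its output token limit.
	truncated := `{"calculations":[{"op":"HEATMAP","column":"duration_ms"}],"filters":[{"column":"`

	var query map[string]any
	if err := json.Unmarshal([]byte(truncated), &query); err != nil {
		fmt.Println(err) // unexpected end of JSON input
	}
}
```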
What’s our limit on output tokens? Charity asks Query Assistant “also group by the output config.”
Looks like our configuration allowed only 150 output tokens from the LLM, and clearly that’s not enough!
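For reference, here is roughly where a limit like that lives when calling Gemini from Go with the github.com/google/generative-ai-go/genai SDK, continuing the hypothetical sketch above. Recording the output config on the span (the attribute name is made up) is what makes it possible to group by it later:

```go
// Inside the hypothetical callGemini helper; client is a *genai.Client and
// the model name is just an example.
model := client.GenerativeModel("gemini-1.5-pro")
model.SetMaxOutputTokens(150) // the limit that turned out to be too small

// Record the output config on the span so Honeycomb can group by it.
span.SetAttributes(attribute.Int("app.nlq.max_output_tokens", 150))

resp, err := model.GenerateContent(ctx, genai.Text(prompt))
// ... extract the generated text from resp.Candidates and return it along with err.
```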
That seems like an issue Charity should take to the team.
This is observability.
This flow shows just how important observability is when you’re building with generative AI. We found this particular problem and can fix it, but new ones will surely come up.
When you can use SLOs to identify something you find interesting, then slice and dice and explore your data, you can quickly get to the bottom of any issue—even if it’s totally new and unknown.
With Honeycomb, you can develop AI applications with confidence, in Google Cloud or any other environment. Try it today for free.