When you’re just getting started with observability, a proof of concept (POC) can be exactly what you need to see its positive impact right away. Coveo, an intelligent search platform that uses AI to personalize customer interactions, used a successful POC to jumpstart its Honeycomb observability journey, which has grown to include 10,000+ machine learning models in production at any one time.
Wondering how Coveo got there? So were we. That’s why we recently sat down for a technical session with Coveo’s Mathieu Fortier, Director of Machine Learning, and Andy Edmond, Principal Software Developer. They shared their Honeycomb journey, from POC to performance and productivity improvements.
The power of distributed tracing
“I was very interested in the idea of using distributed tracing and structured logging instead of raw logs and normal telemetry metrics,” said Andy. It was during Coveo’s annual coding competition that Andy saw a chance to demonstrate the value of observability to the rest of the company. Coveo uses a huge platform to handle all the competitors and their code, and many aspects of it need monitoring. “But it’s hard to diagnose with logs and metrics. So, I integrated Honeycomb into the platform, and once I tried it, I knew there was no way I’d be debugging any distributed application without that kind of tool anymore. I saw how powerful distributed tracing could be.”
Rolling off logs with Honeycomb Trace View
Andy shared the results of the POC with the rest of the team, and right away, Mathieu saw the potential of distributed tracing for the Machine Learning (ML) group. “Given the number of models we have, the amount of stuff we need to keep track of, it was really an aha moment for me. The scale at which we have to monitor is just way too much to see clearly without a tool like Honeycomb,” said Mathieu.
With Honeycomb’s Trace View, Andy and Mathieu have visibility into their platform in a way that’s impossible with just logs. “When you have raw logs, you have all these things that are displayed at the same time, whether it’s different threads or different queries. It’s really hard to follow what’s going on,” Andy explained. Distributed tracing in Honeycomb instead organizes that work into spans, which makes the cause-and-effect relationships between operations clear, even for those who don’t know the system.
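To make the idea of spans concrete, here is a minimal sketch in plain Python. It is not Honeycomb's API or Coveo's code: the `span` helper, the `FINISHED_SPANS` list, and the `handle_query` function are all hypothetical names invented for illustration. The point is only that each span records which trace it belongs to and which span started it, so the parent-child structure survives even when many requests run at once.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory collector; a real setup would export spans to a backend.
FINISHED_SPANS = []
_current_span = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(name, **attributes):
    """Record one unit of work and link it to whichever span is currently open."""
    parent = _current_span.get()
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        # Children inherit the trace ID; a span with no parent starts a new trace.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        **attributes,
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        FINISHED_SPANS.append(record)


def handle_query(text):
    # Hypothetical request handler with two child operations.
    with span("handle_query", query=text):
        with span("authenticate"):
            pass
        with span("rank_results"):
            pass


handle_query("shoes")
for s in FINISHED_SPANS:
    print(s["name"], "parent:", s["parent_id"])
```

Because every record carries `trace_id` and `parent_id`, a viewer can rebuild the tree for one request instead of leaving the reader to untangle interleaved log lines.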
Mathieu added, “When you have structured logging and tracing, you’re able to easily pinpoint what the issue is or instrument your code to say, ‘okay, this is what’s going on.’ It’s a magnitude faster to answer questions with these tools than it would be with logging or plain metrics.” Thanks to Honeycomb, Coveo eliminated the ping-ponging of problems between teams, saving hours that used to be spent on finding and remediating issues such as latency.
A spoonful of honey makes the latency go down
Many of the issues that Coveo found were causing user-facing latency, and fixing them brought dramatic improvements. For example, Andy’s team uses instrumentation in Spring to time methods. In one instance, the instrumented call appeared to be very fast, but the overall response time was slow. With Honeycomb, the team discovered that an authentication system was being called first. When they looked at the trace, they saw that the authentication server took 800 milliseconds to respond. “That’s a lifetime for a search,” Andy said. “And while it was for a very specific use case, you don’t get that kind of information from other tools. I didn’t get it from metrics, and logs would’ve been a pain.”
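The situation Andy describes can be sketched as follows. This is an illustrative Python stand-in, not the team's Spring instrumentation: the `timed` decorator, the `TIMINGS` store, and the three functions are hypothetical, and the sleeps are scaled-down placeholders for real work. Timing only the search method makes it look fast; timing the whole request as well exposes the step that runs before it.

```python
import functools
import time

# Hypothetical store of measurements: name -> list of durations in milliseconds.
TIMINGS = {}


def timed(name):
    """Decorator that records how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                TIMINGS.setdefault(name, []).append(elapsed_ms)
        return wrapper
    return decorator


@timed("authenticate")
def authenticate(token):
    time.sleep(0.08)  # placeholder for a slow authentication server
    return True


@timed("search")
def search(query):
    time.sleep(0.005)  # placeholder for the fast call that was already timed
    return [query.upper()]


@timed("request")
def handle_request(token, query):
    authenticate(token)
    return search(query)


handle_request("example-token", "shoes")
for name, durations in TIMINGS.items():
    print(f"{name}: {durations[0]:.1f} ms")
```

Looking at `search` alone would suggest everything is healthy. Seeing `authenticate` and `search` side by side inside `request` shows where the time actually goes.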
Using Honeycomb, the team also found a hidden DNS issue that they weren’t even aware of and, more importantly, found it before their customers became aware. By fixing the issue, Coveo was able to greatly improve call times. Before Honeycomb, achieving that improvement would have taken Andy’s team six months of optimization work. “And it was on nobody’s radar,” he added.
Getting a granular view of global performance with high-cardinality observability
More than 10,000 ML models running in production means more than 10,000 model IDs. With Honeycomb, Coveo can query its high-cardinality data to quickly zero in on specific issues that would be nearly impossible to find otherwise. “You can’t do that with logs unless you want to use Excel and a month of your life. And you can’t do that with metrics, unless you want to spend a lot of money,” said Andy. “Using Honeycomb, you send your span with your attributes with all the cardinality you want.”
Because of Honeycomb’s ability to parse massive volumes of high-cardinality and high-dimensionality data, users can pinpoint issues at a granular level very quickly.
“In one region, there’s one big client and many smaller clients. We care about all of them. But because the big client is so big, when we look at global metrics of how fast we’re answering query suggestions, that client dwarfs everything else,” Andy explained. “By using a tool like Honeycomb that provides me with the capacity of adding every client and grouping them by organization ID, I’m able to spot issues with specific clients right away.”
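A small worked example shows why grouping by organization ID matters. The numbers and organization names below are made up for illustration, and `mean_latency_by` is a hypothetical helper, not a Honeycomb query. One large client with healthy latency dominates the global average, while a small client with a serious problem only becomes visible once the events are grouped.

```python
from collections import defaultdict
from statistics import mean

# Made-up events: one big client with many fast requests, two small clients.
events = (
    [{"organization_id": "big-client", "duration_ms": 40}] * 980
    + [{"organization_id": "small-client-a", "duration_ms": 45}] * 10
    + [{"organization_id": "small-client-b", "duration_ms": 900}] * 10
)


def mean_latency_by(events, key):
    """Group events by the given attribute and average their durations."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event["duration_ms"])
    return {value: mean(durations) for value, durations in groups.items()}


global_mean = mean(event["duration_ms"] for event in events)
per_org = mean_latency_by(events, "organization_id")
worst = max(per_org, key=per_org.get)

print(f"global mean: {global_mean:.2f} ms")  # 48.65 ms, which looks healthy
print(f"slowest organization: {worst} at {per_org[worst]} ms")
```

The global mean sits under 50 ms because 98% of the traffic comes from the big client. The per-organization view immediately surfaces the one client seeing 900 ms responses.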
Less pain, more time for developers? Sweet.
Reduced user latency is great for Coveo’s clients. But what’s the benefit of using Honeycomb for Coveo’s developers? We asked Mathieu.
“I think it’s unequivocal for everybody. Every single person without exception has… recouped time that was previously lost investigating. I think it’s night and day.” He added, “If you’ve been out there for long enough, you know the pain of investigating very complex systems. I see how much time Honeycomb can save us.”
Want to learn more about how other leaders have incorporated distributed tracing into their orgs? Check out our blog, How Three Companies Implemented Distributed Tracing. If you’re ready to give Honeycomb a try, sign up for free.