Student Beans Solves Issues, Levels Up Immediately With Honeycomb
With Honeycomb, teams find and solve problems fast, at a fraction of the cost of running an ELK stack
Student Beans partners with over 650 of the world’s biggest brands across fashion, technology, food, entertainment and more, and powers a global network of students in over 150 countries, enabling brands to ensure their student discounts are available only to verified students.
Environment
- ELK
- Ruby/Rails
- Go
- gRPC
- Kubernetes
- Some microservices, but mostly apps
Goals
The Engineering team at Student Beans works hard to keep up with the growth of their business, but as is often the case with the rapid scaling that comes with such success, not everything has been smooth sailing. At the beginning of this year, end users and the brand partners they connect with began reporting an increase in timeout errors on the website, and the team turned its full attention to understanding and resolving them.
At first, they tried to investigate using their existing tooling, an internally managed Elastic instance that cost on the order of $1,500 a month to run, with no real success. Elasticsearch would frequently spike to 200+ CPU and fail to return useful information for troubleshooting. As a result, engineers didn’t know whether the returned results were accurate or complete, which decreased faith in the system overall.
Because the logs weren’t always reliable in production, we weren’t using them consistently. We fell into the trap of not trusting the data and only logging what we knew about: the known unknowns at best.
They then looked into a suitably sized, managed, hosted ELK solution, but found it would cost on the order of $15,000 a month: prohibitively expensive, and not guaranteed to resolve their issues.
If we were going to spend the money, we wanted to actually get the value.
What They Needed
- A way to collect the right data needed to investigate complex issues without breaking the bank
- The ability to trust the data they were collecting and query it successfully to identify the source of the errors being reported
Honeycomb @ Student Beans
Once they’d signed up for the free Honeycomb trial and installed the Honeycomb Beeline for Ruby, they found and fixed the problem quickly. A Rails upgrade several months earlier had the side effect of removing some automatic loading of a whitelist that was continually being added to, which slowed the entire lookup process used to determine what content to serve. Over the intervening months, this slowdown became significant enough to affect the user experience. Before Honeycomb, they didn’t even know where to begin looking.
We solved the issue within 2 days of having Honeycomb installed, using the traces that come with the Beeline.
(Trace diagram showing long database load times)
Using Honeycomb allowed us to identify the exact line of code that was causing the long database load times. Honeycomb has already saved us enough money that we’re looking into doing much more with it.
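To illustrate why a trace points so directly at a slow step, here is a minimal sketch of per-step span timing. It uses only the Ruby standard library and is not the Beeline's API or Student Beans' code: the `Span` and `Tracer` names, the step labels, and the simulated delays are all hypothetical.

```ruby
# A stdlib-only stand-in for the timing data a tracing library records.
# Span, Tracer, and the step names below are hypothetical illustrations.
Span = Struct.new(:name, :duration_ms)

class Tracer
  attr_reader :spans

  def initialize
    @spans = []
  end

  # Time a block with a monotonic clock and record it as a named span.
  def span(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    finished = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @spans << Span.new(name, (finished - started) * 1000.0)
    result
  end

  # The span with the largest duration is the first place to look.
  def slowest
    @spans.max_by(&:duration_ms)
  end
end

tracer = Tracer.new
tracer.span("render_page")    { sleep 0.001 }
tracer.span("load_whitelist") { sleep 0.02 }  # simulated slow database load
tracer.span("serialize")      { sleep 0.001 }

slowest = tracer.slowest
puts "slowest span: #{slowest.name} (#{slowest.duration_ms.round(1)} ms)"
```

With every step timed and named, the slow one stands out without anyone having to guess in advance which part of the request to log.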
Within a few more days, more engineers began to play with Honeycomb and reported finding and resolving all kinds of issues they couldn’t before, such as shaving 200 milliseconds off a call by fixing 2 lines of code, or instantly determining whether an issue was the result of bad code, a bad actor, or simply lack of capacity.
Honeycomb is building our trust in the data back up. And in terms of time saved—it’s not even a matter of time, hours vs literally minutes.
Now, the Engineering team at Student Beans plans to roll out Honeycomb usage to all their services, and has published their plans to the rest of the organization, stating in an internal report:
Honeycomb allows us to gain deep insight into how our products are performing technically. This insight includes information about how long requests for data take across the system. Using this data, we are able to set up alerting so we can more quickly take action if/when requests for data start taking a long time and ultimately result in unhappy users.
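The alerting idea in the report can be sketched as a simple threshold check on request durations. This is an illustration only, not Honeycomb's alerting feature or Student Beans' configuration: the function names, the 95th percentile, the 500 ms threshold, and the sample durations are all assumptions.

```ruby
# Nearest-rank percentile over a list of request durations in milliseconds.
def percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

# True when the chosen percentile of durations exceeds the threshold.
# The default percentile and the threshold are assumed values.
def slow_requests?(durations_ms, threshold_ms:, pct: 95)
  percentile(durations_ms, pct) > threshold_ms
end

# Made-up sample data: a healthy window and one with a few very slow requests.
healthy  = [80, 95, 110, 120, 90, 100, 105, 85, 115, 130]
degraded = healthy + [900, 1200, 1500]

puts "healthy window alerts?  #{slow_requests?(healthy, threshold_ms: 500)}"
puts "degraded window alerts? #{slow_requests?(degraded, threshold_ms: 500)}"
```

Checking a high percentile instead of the average means a handful of very slow requests can raise the alert even when most requests are still fast.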