Growing pains can be a natural consequence of meteoric success. We were reminded of that in our recent panel discussion with SumUp’s observability engineering lead, Blake Irvin, and senior software engineer Matouš Dzivjak. They shared how SumUp’s rapid growth compelled them to change their incident resolution process—both logistically and culturally—to ensure a level of service quality that reflects their customer obsession.
SumUp is a financial services company that enables businesses to process orders, collect payments, and manage money through card readers, point-of-sale (POS) systems, business accounts, and invoices. A few years ago, SumUp started building POS software. As its customer base and the services it provides grew, so did the complexity of its systems. An uptick in customer support contacts—and eventually, churn—signaled service quality issues like latency. But the high cardinality of its data, combined with the sheer volume of service alerts, made it difficult to pinpoint where individual problems stemmed from.
Looking back on Q2 and Q3 of 2022, SumUp said it hit a point where it had acquired too many customers for its current infrastructure and incident response methodology. This growing pain is what led Blake’s team to experiment with Honeycomb, along with OpenTelemetry, for the duration of Q3.
And so began their journey to observability.
Untangling alerts… in the dark
When SumUp started building POS software, a massive number of alerts deterred team members from on-call duty. No one wanted to take the pager—there were simply too many calls. The company knew it had critical service issues that weren’t only affecting customers, but also putting a tremendous strain on its engineering resources. Searching through the logs and metrics provided by outdated solutions took far too long and, in the end, didn’t always yield actionable information.
“We wanted better correlations between (our) data and what our systems were doing,” Blake explained. “The old question, ‘Is our site up or down,’ that’s not good enough anymore. We needed to understand how error rates were affecting things and how request latency was affecting customer experience. We were really struggling to do that with only metrics and logs, and we started looking for a tracing solution.”
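To make that concrete, here is a minimal, illustrative sketch (not SumUp’s actual code) of how a request handler can emit spans with OpenTelemetry’s Go SDK, so that latency and error rate can later be sliced by attributes like merchant or endpoint. The service, span, and attribute names are hypothetical.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// handleCheckout is a hypothetical POS request handler wrapped in a span so
// that its latency and errors can be analyzed per merchant, per endpoint, etc.
func handleCheckout(ctx context.Context, merchantID string) error {
	tracer := otel.Tracer("pos-checkout") // hypothetical instrumentation name
	ctx, span := tracer.Start(ctx, "HandleCheckout")
	defer span.End()

	span.SetAttributes(attribute.String("merchant.id", merchantID))

	// ...business logic would run here, passing ctx to downstream calls so
	// their spans join the same trace...
	return nil
}

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC to whatever collector endpoint is configured,
	// e.g. an OpenTelemetry Collector that forwards to a tracing backend.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	_ = handleCheckout(ctx, "merchant-123")
}
```

Once spans like these flow to a backend, “how is latency affecting this customer’s experience” becomes a query rather than a log-grepping exercise.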
Further complicating the alert, service, and site functionality situation was the company’s expanding footprint. Globally distributed data centers made it harder to see what was happening, and a culture of siloed teams owning different systems and using different tooling made it even more difficult to get a comprehensive view. On top of that, a single SRE team was tasked with managing every issue that arose, which became increasingly unsustainable.
Along with tracing, Blake’s team decided to turn to service level objectives (SLOs) to address SumUp’s alert fatigue. The goal was to get ahead of issues before they became noticeable to customers. However, implementation would also require a cultural shift in which each team would have the autonomy to resolve issues in its domain.
Blake’s team ran some experiments with Honeycomb to see if specialized SLOs could help them, and to determine if an error budget-based alerting system could improve the on-call picture for engineers—regardless of whether or not they had deep operations experience. This wasn’t just about making their lives easier, but also about controlling the error burn rate.
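If you haven’t worked with error budget-based alerting before, the math behind it is simple. Here’s a minimal sketch (illustrative numbers, not SumUp’s): the SLO target implies an allowed failure rate, and the burn rate measures how fast observed failures are consuming that allowance.

```go
package main

import "fmt"

// burnRate compares the observed failure rate against the failure rate the
// SLO allows. A value of 1.0 means the error budget is being consumed exactly
// on pace for the SLO window; anything above 1.0 means it will run out early
// and is worth alerting on. The 99.9% target below is illustrative.
func burnRate(good, total, target float64) float64 {
	allowed := 1 - target              // e.g. 0.001 for a 99.9% SLO
	observed := (total - good) / total // actual failure fraction
	return observed / allowed
}

func main() {
	// Example: 1,000,000 requests in the last hour, 2,500 of them failed.
	rate := burnRate(997_500, 1_000_000, 0.999)
	fmt.Printf("burn rate: %.1fx\n", rate) // 2.5x: a 30-day budget lasts ~12 days
}
```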
Turning on the lights to SLO the burn
The work of one team could affect others. For example, the identity services team could impact the POS domain—and vice versa. With controlling the error burn rate in mind, Blake tested Honeycomb’s ability to connect and show different teams the data they needed, even if the domains were only loosely coupled.
The focus shifted from a single service to the whole user experience. A consistent burn, therefore, indicated tech debt. “SLOs become a negotiating and prioritization tool for your engineering teams… It forces discussions that, without the SLO, you wouldn’t have until a customer complained,” Matouš said.
Using Honeycomb allowed the SumUp team to fully utilize its high-cardinality data, understand what was consuming its error budget, and answer questions in a way that facilitated action. Development teams could do this without having to understand another team’s systems or services. This laid the groundwork for a more collaborative culture. With this successful cultural shift, suddenly product was talking to engineering and other business units, and it became easier for teams to plan ahead.
“It was like lighting up the Christmas tree, bringing Honeycomb in and having it provide a full view into what’s happening,” Matouš said.
Putting a spotlight on latency
To improve performance and reduce customer churn, Blake’s team needed to connect the dots across SumUp’s departments and services to zero in on the root of the problem. The team turned to Honeycomb’s heatmaps, a visual tool that shows the statistical distribution of the values in a dataset column over time. Taking months’ worth of data, the heatmap details the duration of events in milliseconds across the SLO window. The goal was to track latency for a suite of services that touched other SumUp products against a “latency budget,” measuring performance against a threshold.
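A latency SLI like this is often expressed as a simple pass/fail per request, which the SLO then aggregates over its window. Here’s a hedged sketch of what that can look like at instrumentation time; the 300ms threshold and attribute names are hypothetical, not SumUp’s actual definition.

```go
package sli

import (
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordLatencySLI marks a span as passing or failing a latency threshold so
// an SLO can track the fraction of passing requests over its window.
func recordLatencySLI(span trace.Span, duration time.Duration) {
	const threshold = 300 * time.Millisecond // hypothetical latency budget threshold
	span.SetAttributes(
		attribute.Bool("sli.latency_ok", duration <= threshold),
		attribute.Int64("duration_ms", duration.Milliseconds()),
	)
}
```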
Cue the spotlight. The team had trouble maintaining its latency budget and pinpointed an area where the heatmap was flatlining. This was a sign that the system was hitting some kind of saturation. After investigating further with BubbleUp, Blake’s team discovered that one system was feeding requests to another system. For security reasons, that system was timing out. The team addressed the issue by redesigning the database query.
“What was really great about using Honeycomb was the BubbleUp feature, which lets you select a band of events and then see which attributes are common to that selection,” Blake explained. Only approximately 0.001% of SumUp’s customers were experiencing latency problems caused by the database query issue. Fixing it at that point avoided impact to millions of customers.
As SumUp grew from serving smaller customers, like single-person businesses, to larger enterprise customers, widening cardinality was not a deterrent. Teams could filter their queries and examinations by use case, maintaining their ability to find root causes fast.
“We couldn’t have pulled that off with more traditional signaling, like logs and metrics. We would’ve had to do too much DevOps training to make that work,” Blake said. By Q4, the engineering teams were fully instrumented with Honeycomb—and the changes were, as Blake described them, “huge.”
Change that pays off
In keeping with SumUp’s goal to spot issues before its customers do, the company periodically compiles an internal customer satisfaction report to catch churn-causing trends early on.
Prior to introducing Honeycomb (and OTel) for observability, SumUp was relying heavily on legacy logging and metrics solutions. “I think the biggest difference, from my point of view, is that Honeycomb doesn’t spoon feed you as much as New Relic does. You might have a steeper learning curve, but you have more flexibility,” explained Blake. “New Relic saved my life back in the day, but I don’t want an opinionated tool. I’m happy with the open-ended, very professional tool, which is Honeycomb.”
“We were focusing on service quality and getting Honeycomb and OTel everywhere,” Blake described. “We did see a huge change between Q3 and Q4 last year. I don’t think that was just because of Honeycomb, but Honeycomb helped make that happen for sure.”
Now, “Honeycomb is kind of my first stop,” Blake said. “It’s how I orient myself around the problem, how I figure out who to involve, and how bad the merchant impact is.”
Looking for tips on how to make SLOs work for your organization? Check out our blog on employing actionable SLOs based on what matters most. After that, sign up and give Honeycomb a spin.