Instrumentation: Worst case performance matters

This is the fifth in a series of guest posts about instrumentation. Like it? Check out the other posts in this series. Ping Julia or Charity with feedback!

BrightRoll’s Realtime team is responsible for a service, Vulcan, which provides a set of developer-friendly APIs for interacting with the data involved in deciding whether to serve an ad. It is a threaded, event-driven application written in C. As Vulcan is in the critical path of ad serving, it has a very tight SLA: <5 ms response time and 100% availability.[1]

One otherwise fine day, the ad serving team noticed that they were seeing a high timeout rate for all queries to Vulcan. So, we checked our performance metrics: everything looked pretty solid, with 95%ile figures well within SLA.

After a few false starts and blaming the network as per SOP[2], we decided to take a look at sample-based CPU profiling, which confirmed that 15% of time was going to user data record decompression – largely in line with our past experience. User data records include a long list of “segments.” These are used for targeted advertising, like “people who want to buy a planet ticket to Europe soon;” this adds up to lot of data, so it’s stored in a proprietary compressed format.

On a hunch, we extended the performance monitoring system to start tracking worst-case performance. That got our attention very quickly: the maximum times were over 15 ms! The latency distribution was awful:

%ile   Response time
95      2.5 ms
99      4.0 ms
99.9    5.9 ms
99.99   8.0 ms
100    16.4 ms

We quickly figured out that this response time was dominated by user data record decompression, and that decompression time was linear with the number of segments a user was associated with: the 95%ile users were only in 1,400 segments, but the worst-case users were in over 10,000! Clearly these users were either very strange (or associated with fraudulent behavior).

However, even accounting for slow decompression operations, we couldn’t account for the magnitude of SLA violation reported by the ad decisioning team.

The answer turned out to be simple once we saw it: this is an event-driven architecture. While a given thread is occupied decompressing a huge user record, new inbound requests queue up. They are not marked as being “started” until they are parsed and routed to the right endpoint, so even though from Vulcan’s point of view the transaction only took 1ms, it might have been sitting in queue for 10+ms!

Now that we knew where the problem was, one engineer knocked out a quick fix: we capped the number of segments per user at 2048, which cut the worst-case latency back to sane levels and largely eliminated the appearance of the thread stall problem.

We also implemented a longer-term way of monitoring SLA compliance, but that is a story for another day.

I think there are four good lessons here:

  1. Start paying attention to worst-case performance. Maybe you don’t have a system where 0.1 ms matters, but you’ll still find broken stuff.
  2. Make investments in tooling up front and follow a scientific method – hypothesis-test-repeat – in debugging performance problems. Chasing down rabbit holes isn’t an efficient way to fix your business.
  3. Performance measurement that occurs entirely within your application has serious limitations that you need to be aware of. (GC pauses, anyone?)
  4. Thread busy stalls in event-loop driven applications suck super bad.

[1] If Vulcan stops responding, business performance degrades dramatically. We really do aim for 100% availability.

[2] #protip It’s never the network. It wasn’t this time either.

Thanks again to Gian-Paolo Musumeci for their contribution to this instrumentation series!

Have thoughts on this post? Let us know via Twitter @honeycombio.