Guest Post: Web Latency Spikes

Intro:

Our friend Issac Kelly has been working part-time with us on marketing and copy. The first thing he did was set up Honeycomb for his own business. In less than 20 minutes, he had web logs streaming into Honeycomb, and he promptly discovered two problems he didn't know he had.

We were delighted, because surfacing unknown-unknowns is near and dear to our hearts. If you already know what’s wrong, the hardest part is often over! We asked him to write up what happened.—charity

Discovering and Exploring Web Latency Spikes

Last week I got a signup invite for Honeycomb. My consulting business has a website management platform with modest but steady traffic, about 30k req/day. I was interested in using Honeycomb to watch the day-to-day heartbeat of my application, and ideally to spot anomalies before they become problems. When I immediately found two actionable problems, I was floored.

Signing Up

I ran through the Quickstart, using curl and Python to send my first few events. Then I found honeytail and saw that it had an nginx log parser which automatically infers any custom log format from the config file. That is neat. I installed the package, started the agent, and waited an hour for data to flow in.[1]
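
The Quickstart's curl step boils down to a single POST against Honeycomb's Events API. Roughly, it looks like this (a sketch only; the dataset name, write key, and event fields below are made up):

$> curl https://api.honeycomb.io/1/events/quickstart-test \
    -X POST \
    -H "X-Honeycomb-Team: YOUR_WRITE_KEY" \
    -H "Content-Type: application/json" \
    -d '{"path": "/", "status": 200, "duration_ms": 52}'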

When I got back to the Honeycomb console, I didn't really know what to do at first, but the query builder had a pretty familiar SQL-like structure and pre-populated all the field names and calculations based on my data. Honeycomb actually saves the structure and results of all my queries, so I was able to go back later, retrace my path, and walk you through my thought process.

Exploring

I started with the default query hint: a COUNT(*) that displayed total requests per second over time. The default nginx logs are rather terse; I decided to add more useful information to my logs. I added $request_time to the log format, and restarted both nginx and honeytail.

Here is the nginx log format I ended up with:

log_format honey '$remote_addr - $remote_user'
            '[$time_local] "$request" $status'
            ' $bytes_sent $http_x_forwarded_for'
            ' $host $request_time $upstream_response_time'
            ' $pipe : "$http_user_agent" $server_port';
access_log /var/log/nginx/access.log honey;
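
Picking up the new format just meant checking and reloading nginx, then restarting honeytail with the same invocation shown below. On the nginx side that is roughly (assuming a standard install with sudo access):

$> sudo nginx -t && sudo nginx -s reload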

Some events had a value for request_time and some had nils, but that was fine. I tried calculating AVG(request_time) in my query and noticed there were spikes where the average went well over a few seconds! This is a bad sign for a lightweight web site that isn't serving a high volume of requests.

My honeytail invocation looked something like this. (The nginx.format argument is just the name of the log_format directive, seen above.)

$> ./honeytail/honeytail \
    -k 10xhunter2a0 \
    -d 'API nginx' \
    -p nginx \
    -f /var/log/nginx/access.log \
    --nginx.conf /etc/nginx/nginx.conf \
    --nginx.format honey

I'm a pretty mediocre sysadmin; while writing this post I learned that AVG is not as meaningful an aggregate for latency as P95, P99, etc. Still, from this query I could see two spikes: one corresponded to a spike in the number of requests (at 3 PM) and one did not (at 4:15 PM).
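
If you want a rough offline feel for the difference, something like this works (a sketch; it assumes you've already pulled the $request_time values out of the access log into durations.txt, one number per line):

$> sort -n durations.txt | awk '
    { v[NR] = $1; sum += $1 }
    END { printf "avg=%.3fs  p95=%.3fs  p99=%.3fs\n",
          sum / NR, v[int(NR * 0.95)], v[int(NR * 0.99)] }'

Honeycomb has percentile calculations built in, so this is just for intuition.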

Screenshot: Found a Spike!

The latency spike that didn't correlate with a traffic spike happened again at 5:45 PM, so I zoomed in to look more closely. The graph let me click and drag on the area around a spike to quickly zero in on that time range. I switched to the raw data inspector mode and saw that one page was taking about 5 seconds to render, which is how I discovered that a customer of mine had written a poorly formed template that was causing slow queries and a lot of rendering work. I fixed the issue and moved on.

I added a filter for the problem URL to demonstrate how request duration dropped, while request rate stayed steady.

Screenshot: Response Time Drops

I also zoomed in on the 3 PM spike and found that those requests were intended for somebody else's server: their DNS was pointing at my IP, so I was getting a steady stream of their traffic. That burst of requests was causing a latency spike for me, so I used nginx to block requests from the originating IPs so they couldn't hit my backend Django services.
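
One way to do that kind of block in nginx is with a couple of directives from the access module. Here's a sketch, with a documentation address range standing in for the real source IPs:

# Inside the existing server block that proxies to the Django backend:
# drop the misdirected traffic before it does any real work.
deny  192.0.2.0/24;    # placeholder range; substitute the offending addresses
allow all;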

Conclusion

I solved two issues in one hour, neither of which I knew existed beforehand, and I now have a better feel for my baseline query latency and request rate. As I build out more features, I can pay attention to how incremental changes affect performance.


  [1] In retrospect, I didn't actually have to wait an hour for data to populate my dataset; I could have replayed older logs into it using the same honeytail utility.
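
  A backfill run looks almost the same as the normal invocation; here's a sketch, assuming a rotated log file and honeytail's --backfill mode, which reads the file from the beginning instead of tailing it:

  $> ./honeytail/honeytail \
      -k 10xhunter2a0 \
      -d 'API nginx' \
      -p nginx \
      -f /var/log/nginx/access.log.1 \
      --nginx.conf /etc/nginx/nginx.conf \
      --nginx.format honey \
      --backfill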