Summary

Note: all times are UTC unless otherwise noted.

On Thursday, May 3, starting at 00:39:08 UTC (Wednesday 17:39 PDT), we experienced a nearly complete outage of the Honeycomb service lasting approximately 24 minutes. Most services came back online at 2018-05-03 01:02:49, and all customer-facing services were fully restored by 01:07:00.

Impact

  • During the outage, customers were not able to log in or view graphs on the Honeycomb service.
  • Events sent to the Honeycomb API during the outage were refused at elevated rates; approximately 86% of API submissions during the outage were not accepted, and approximately 81% of events that should have been submitted were not recorded during the outage. (The percentages differ because a single API submission can carry multiple events and the outage did not affect all datasets equally.)
  • Since the Honeycomb API did not report success, some events remained stored on customer servers and were successfully (re)submitted after the Honeycomb API became fully available again.
  • Approximately 15% of events were successfully stored during the outage.

We apologize for this outage, and to every customer that was affected. We take our responsibility for data stewardship very seriously and are making multiple improvements to our systems to ensure this kind of outage does not cause data loss in the future, and to ensure that we can restore service more quickly in the event of a similar failure.

What happened?

In retrospect, the failure chain had just 4 links:

  1. The RDS MySQL database instance backing our production environment experienced a sudden and massive performance degradation; P95(query_time) went from 11 milliseconds to >1000 milliseconds, while write-operation throughput dropped from 780/second to 5/second, in just 20 seconds.
  2. RDS did not identify a failure, so the Multi-AZ feature did not activate to fail over to the hot spare.
  3. Because of the delays from the increased query_time, the Go MySQL client’s connection pool filled up with connections waiting on slow query results and opened more connections to compensate.
  4. This exceeded the max_connections setting on the MySQL server, leading to cron jobs and newly-started daemons being unable to connect to the database and triggering many “Error 1040: Too many connections” log messages.
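
For context on links 3 and 4: Go’s database/sql connection pool opens additional connections whenever the existing ones are all busy, so a sudden jump in query time translates directly into more open connections. Below is a minimal sketch, assuming the go-sql-driver/mysql driver, of the pool settings that bound this behavior; the specific numbers are illustrative, not our production values.

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func openDB(dsn string) (*sql.DB, error) {
    db, err := sql.Open("mysql", dsn)
    if err != nil {
        return nil, err
    }
    // With no cap, database/sql keeps opening connections while queries are slow.
    // Capping MaxOpenConns keeps one client from eating the server's
    // max_connections budget, at the cost of queueing inside the app instead.
    db.SetMaxOpenConns(50)                 // illustrative cap
    db.SetMaxIdleConns(10)                 // keep a small idle pool around
    db.SetConnMaxLifetime(5 * time.Minute) // recycle connections periodically
    return db, nil
}

func main() {
    db, err := openDB("user:pass@tcp(db.example.internal:3306)/app")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}

Capping the pool trades connection exhaustion on the server for queueing in the client, which is generally the easier failure mode to reason about.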

The resolution had equally few moving parts:

  1. The RDS database suddenly experienced a massive improvement in latency and throughput; in the span of 20 seconds P95(query_time) for writes dropped from >600 ms to <200 ms, while total throughput rose from <500 OPS to >2500 OPS.
  2. Queued events completed rapidly, saturating the database at >3000 OPS for 70 seconds, after which the system returned to normal behavior: 300-500 OPS, <10 ms write times.

The story of how we got that answer (and identified a theory for the underlying failure and resolution) is an interesting example of using Honeycomb to debug a production system:

Retrospective Root Cause Analysis

The next morning during our post-mortem meeting, two theories were put forward: Either

  1. the elevated connection count caused the outage, somehow causing the database to run slowly, or
  2. the database was running slow for some other reason, causing elevated connection counts.

We worried that there might be a bug hiding in our application (or in one of the Go libraries we use) that caused the app to “pound down the door” when it couldn’t connect to the database, and that if the situation happened again we might trigger the same outage. Everybody agreed that an underlying database issue (storage, CPU, or connectivity) was the more likely root cause, but we also agreed it was riskier to dismiss a potential application-stack bug with a *waves hands* “BLAME THE NETWORK!!!”

Rule of thumb as a developer: it’s probably not the database, it’s probably your code.

To reduce the uncertainty and risk, we decided to reproduce the Error 1040 scenario in a controlled environment and observe the behavior of the system. Reproducing the connection overflow on our dogfood cluster took only a shell loop and a few MySQL CLI processes, and it clearly demonstrated that while running out of connections does impact the app and cause cron jobs and similar tasks to fail, it does not lead to runaway CPU use or elevated latencies.
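
For the record, the reproduction really was just a shell loop holding open MySQL CLI sessions; a rough Go equivalent of the same experiment looks like the sketch below (the connection count and DSN are placeholders, not what we actually ran against dogfood).

package main

import (
    "context"
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Placeholder: pick a number above the server's max_connections setting.
    const hold = 200

    db, err := sql.Open("mysql", "user:pass@tcp(dogfood-db:3306)/test")
    if err != nil {
        log.Fatal(err)
    }
    db.SetMaxOpenConns(hold)

    ctx := context.Background()
    var conns []*sql.Conn
    for i := 0; i < hold; i++ {
        c, err := db.Conn(ctx) // each *sql.Conn pins a dedicated server connection
        if err != nil {
            // Once the server is full we expect "Error 1040: Too many connections" here.
            log.Printf("connection %d failed: %v", i, err)
            break
        }
        conns = append(conns, c)
    }

    log.Printf("holding %d connections open; observe the database and the app now", len(conns))
    time.Sleep(10 * time.Minute)

    for _, c := range conns {
        c.Close()
    }
}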

We now had two datasets: the outage on production, and the experiment on dogfood. Since we use Honeycomb to observe Honeycomb, it’s easy for us to compare A versus B, in this case the dogfood experiment versus the production outage:

side-by-side graphs: dogfood experiment (left) and production outage (right)

On the left, the dogfood experiment ran from 12:30 to 13:23 (it’s hard to see evidence of it other than the missing cron-job spikes), while on the right, the production outage is clearly visible. The experiment has a null result: we have not shown that Error 1040 caused the outage. More likely, then, something underlying the system is at fault.

With this information in hand, we can dig into the production data more deeply. Because Honeycomb datasets are full-fidelity (we don’t do any pre-aggregation or pre-computation) it’s possible to zoom in down to the single-second level and pivot on arbitrary keys to extract patterns. Here’s the onset of the performance anomaly as recorded via rdslogs:

graph of rdslog performance showing onset of the anomaly

The 15-ish seconds of no activity, followed by a bunch of completed actions with query_time values up to 15 seconds, looks fairly significant. The end of the performance anomaly also has an interesting heatmap pattern:

heatmap of rdslog performance showing end of the anomaly

To summarize: the database exhibited 23+ minutes of consistent low-throughput, high-latency behavior, bookended by dramatic transition zones lasting less than 30 seconds. That window was preceded by normal high-throughput, low-latency, spiky application-driven behavior, and followed by a large volume of “catch-up” events before the system finally transitioned back to normal high-throughput mode. While this doesn’t constitute a full root cause analysis, the pattern is definitive enough for us to conclude that the root cause lies within the database system and not in our application code; our app appears to have done the right thing. We further pivoted on the SQL normalized_query strings to validate that our app code behaved as expected during the recovery. Armed with this information, we re-investigated our RDS configuration and confirmed that

  • we have Multi-AZ turned on.
  • the instance dashboard did not indicate any health events.
  • AWS documentation states that Multi-AZ does not take performance into account for health checks.

Lessons Learned

  • Managed databases are great, except when they’re not.
  • We will prioritize a “game day” fail-over of our dogfood RDS to revalidate our understanding of how our system performs when (inevitable) hardware failures happen. While we don’t expect more RDS hardware failures, it will be empowering to have a well documented playbook to execute when we do need to handle an RDS AZ failover.
  • We are improving our ingest pipeline to ensure that we buffer rather than fail to accept customer events when our infrastructure is unable to access the database. Hardware failures happen, that’s a fact of life; losing data doesn’t have to happen.

Timeline (2018-05-03 UTC)

00:39 - outage starts

00:40 - first alert

00:42 - engineers start investigation

00:50 - escalation, multiple investigators looking at several potential avenues

00:55 - status.honeycomb.io updated to keep our customers informed

00:58 - first engineering interventions to attempt to heal system, minimal impact

01:03 - outage resolution starts

01:04:23 - resolution completes, system stabilized

01:15 - engineers conclude that outage is resolved, update status.honeycomb.io

Thanks for following along. We’re grateful to you for your patience, and to our colleagues for their help during this outage.

You’ve always been able to get observability for your Ruby apps by instrumenting them with our SDK, affectionately known as libhoney. Unfortunately, instrumenting code you’ve already written is nobody’s favourite job. If only there were some way to automate the repetitive parts, so you could get instant insight into what your app is doing in production, and then focus your effort on augmenting that insight with the information that’s unique to your app!

gif of mechanical honeycomb press

That’s why we’re releasing our Rails integration. Out of the box, Rails web apps provide a rich stream of data about what they’re doing and how they’re performing. By connecting that stream to Honeycomb, you get the realtime observability, insight, and analytics that you’ve always wanted.

Get the data flowing with just a few clicks

Getting started is easy. Just add gem 'honeycomb-rails' to your app’s Gemfile, and a short config snippet to identify your Honeycomb account:

HoneycombRails.configure do |conf|
  conf.writekey = 'YOUR-HONEYCOMB-WRITEKEY'
end

That’s all you need to instrument your app’s requests and ActiveRecord queries. Now you can start asking questions like:

  • Which controllers and actions are the slowest to return in my app?
  • Which routes spend the most time making database calls? What are the commonalities among the slowest-rendering pages in my app?
  • Which users have seen that “Failed to save” flash error message the most lately? What other errors have they run into?

graph showing count and p95 of controller action durations

Honeycomb lets you ask these questions of your app as it’s running in production. Once you’ve got an answer, you can refine the question: e.g. try filtering to only the most common error, or breaking down by user.

Add your own custom fields for even greater insight

However, the real power of observability comes when you augment your instrumentation with context that’s unique to your app. For example, if your app deals with posts, you might want to understand the performance profile or error distribution associated with displaying a particular post. You can do that by adding fields to the honeycomb_metadata hash in any controller method:

class PostsController < ApplicationController
  # in any controller method
  def show
    @post = Post.find(params[:id])

    # The request event will be augmented with post ID
    honeycomb_metadata[:post_id] = @post.id
    # ...
  end
end

Want observability for your Rails apps? Try out our Rails integration and get started for free.

About

FENDER MUSICAL INSTRUMENTS CORPORATION: Since 1946, Fender has revolutionized music and culture as one of the world’s leading musical instrument manufacturers, marketers and distributors. Fender Musical Instruments Corporation (FMIC), whose portfolio of brands includes Fender®, Squier®, Gretsch® guitars, Jackson®, EVH® and Charvel®, follows a player-centric approach to crafting the highest quality instruments and musical solutions across genres. FMIC is dedicated to unlocking the power of music through electric and acoustic guitars, amplifiers, pro audio, accessories and digital products that inspire and enable musical expression at every stage, from beginners to history-making legends.

Environment

  • AWS Lambda microservices written in Go
  • AWS Lambda sending Cloudwatch logs to Honeycomb

Goals

“Why would you enter into an online space as a guitar company? It’s to know more about players: understand who they are, talk to them, and guide them along the way.” - Ethan Kaplan, Digital GM, Fender (http://fortune.com/2017/11/19/fender-guitar-apps/)

Fender’s customer focus makes observability a key requirement: to understand and support your users, you have to have visibility into their experience of your service.

“The goal of the platform team is to create and support services that support our web and mobile applications and provide data and analytics to the business. In that way we are able to understand and support players.” - Michael Garski, Director of Software Engineering, Fender

What they needed:

  • A powerful and intuitive interface for debugging and troubleshooting problems
  • Fast search results across high-cardinality fields such as unique customer IDs

Honeycomb @ Fender

The Fender Platform team maintains their own ELK environment to aggregate logs and investigate problems, but has found the platform slow and difficult for team members to learn to use.

“We have an ELK stack, but the query language is cumbersome and the UI is not as easy to use as Honeycomb.” - Michael Garski

They installed Honeycomb specifically to get the benefit of searching across all their Cloudwatch logs from Lambda at once, and have been especially pleased with the speed at which they can zero in on an issue.

“Honeycomb and Lambda work very well together. Without Honeycomb, it was very difficult to get all the Cloudwatch logs correlated, but now we can just do a quick search by request or userID with no difficulties.” - Michael Garski

Recently, the Fender Platform team deployed an update to the subscription management service for Fender Play users. Soon afterward, they noticed errors related to the billing systems, which started to spike over the course of the morning.

Using Honeycomb, they determined that the issue was related to differences between their test and production environments and could confirm, thanks to the Honeycomb Marker they’d set, that this problem was definitely caused by their recent deployment. They were able to roll back to a stable environment within five minutes.

graph showing rollout errors

This graph shows the Lambda function status codes during the timeframe of the deploy. The spike in 500s occurred after the first Marker, then dropped after the second, which was the rollback. Subsequent Markers show that a third deploy still had the issue, and that it was fixed in the fourth.

Even more critically, they were able to drill into their data and identify the exact users impacted by the issue, allowing their Customer Support team to contact those users before the users themselves reported the problem.

And all of this would have taken significantly longer using their existing ELK setup.

“If we did not have Honeycomb it may have taken an additional 30 minutes or so to determine what the issue was by poking around in ELK or Cloudwatch logs. Honeycomb’s visualization and honeymarkers made it obvious that the issue was related to a recent release. On top of that, Honeycomb allowed us to determine the affected users and pass that information on to our support team.” - Michael Garski

The team finds it easier and faster to get results with Honeycomb.

“Now that we have Honeycomb, we just use the ELK stack as the archive.” - Michael Garski

If you think about it, mobile apps are among the production services most in need of real observability: nearly countless hardware platforms and operating systems in combination with your app’s code result in a dizzying matrix of possible sources for any given issue, which means you need the power of true high-cardinality search to solve your problems.

dizzying matrix gif

Have no fear, though. Here are some tips for thinking about observing a mobile app:

Take advantage of Honeycomb’s support for high-dimensionality + high-cardinality

In our own production services, we have a standard set of metadata that we know will be useful when debugging production problems / understanding the production “normal,” so we make sure to send them on every event. (You can see some of them listed out in the first section of this dogfooding blog post.) Going through this exercise early on and standardizing among your various mobile clients will make your life easier later on by establishing a set of common field names.

Commonly useful metadata fields for slicing and dicing mobile traffic include: device type, OS version, app version, the SDK versions of any crucial third-party SDKs, and customer ID. The question to ask yourself at this stage is: what are the attributes / measurements that tend to correlate with customer pain? For an API we measure latency of all sorts; for client-side JS, we measure a whole host of painting/navigation timers; for mobile… what matters most to you and your use case?
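
However your clients are written, what reaches Honeycomb is just a flat set of fields on each event. Purely as an illustration (in Go, for consistency with the other sketches here; the field names are suggestions, not a required schema), that standard set might look like:

package main

import "fmt"

// commonFields returns metadata worth attaching to every mobile event.
// The values are hard-coded placeholders; a real client would pull them
// from the OS, the build system, and the auth layer.
func commonFields() map[string]interface{} {
    return map[string]interface{}{
        "device_type":  "Pixel 2",
        "os_version":   "Android 8.1.0",
        "app_version":  "1.4.2",
        "payments_sdk": "2.0.1", // version of a crucial third-party SDK
        "customer_id":  "cus_1234",
    }
}

func main() {
    fmt.Println(commonFields())
}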

(Important note: anytime you’re writing directly to Honeycomb from a client-side application, be it browser JS or mobile apps, we recommend creating a client-specific write key to retain the ability to rotate a key for a subset of your traffic — just in case a key gets compromised.)

Think through where and when you want to capture each event

We tend to think of Honeycomb events as each capturing a single “unit of work.” On an HTTP server, that often translates naively to one event = one HTTP request served. For mobile, a common pattern (e.g. the Google Analytics SDK relies on this approach) is to trigger an event per screen load — and this is a fine starting point. (Especially if you’re also able to capture timers like “how long did viewDidLoad actually take to run?”)

You may also find yourself wanting to track individual, detailed user actions like:

// User just sent a message!
event.send({
  event_type: "message_sent",
  roundtrip_dur_ms: now - start_ms, // latency as the user actually experienced it
  sender: sender.name,
  payload: message.length,          // size of the message body, not the body itself
})

… But wait! There are a couple of different angles to think about here. On one hand, tracking an event like this directly from the client side lets you measure latencies as the user perceives them — but if this particular “interesting unit of work” already triggers a round-trip HTTP request to your servers, is it worth adding another HTTP call to Honeycomb’s API? What would it look like if you just sent all the metadata along with the request to your server, then extracted the message_sent data in your server’s logic (roughly sketched after the tradeoffs below)?

Upside: fewer HTTP requests from the client.

Downside: you lose the ability to factor in network latency, and it’s a little more complicated to reason about when you have “mobile app” events being emitted from your servers and your mobile clients into the same dataset.
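
As a sketch of what the server-side option could look like, here is a hypothetical Go handler that extracts the message_sent data from the request and emits it with the libhoney-go SDK. The handler shape, field names, and dataset are assumptions for illustration only; check the current SDK docs before borrowing any of it.

package main

import (
    "encoding/json"
    "log"
    "net/http"
    "time"

    libhoney "github.com/honeycombio/libhoney-go"
)

// messageRequest is a hypothetical payload the mobile client already sends.
type messageRequest struct {
    Sender     string `json:"sender"`
    Body       string `json:"body"`
    AppVersion string `json:"app_version"` // client-supplied metadata rides along
    DeviceType string `json:"device_type"`
}

func sendMessage(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    var req messageRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    // ... actually deliver the message here ...

    // Emit the "message_sent" event from the server instead of the client.
    ev := libhoney.NewEvent()
    ev.AddField("event_type", "message_sent")
    ev.AddField("sender", req.Sender)
    ev.AddField("payload_length", len(req.Body))
    ev.AddField("app_version", req.AppVersion)
    ev.AddField("device_type", req.DeviceType)
    ev.AddField("server_dur_ms", float64(time.Since(start))/float64(time.Millisecond))
    ev.Send()

    w.WriteHeader(http.StatusNoContent)
}

func main() {
    // WriteKey and Dataset are placeholders.
    if err := libhoney.Init(libhoney.Config{WriteKey: "YOUR-WRITEKEY", Dataset: "mobile-app"}); err != nil {
        log.Fatal(err)
    }
    defer libhoney.Close()

    http.HandleFunc("/messages", sendMessage)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Note that server_dur_ms here only measures server time; the client-side roundtrip_dur_ms from the earlier snippet is the one that captures latency as the user perceives it, network included.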

While you’re still fairly low volume, I would recommend being aware of the tradeoffs :) and then blithely going ahead with the simplest solution for you at this time.

Honeycomb is in the sweet spot

When it comes to mobile, Honeycomb is a great middle ground between specialized behavioral analytics tools and specialized operations tools. We believe that life is better (and simpler) when both sides (mobile devs and ops) can view the same set of data as a single source of truth, and we feel pretty strongly that — until you need the sorts of more advanced funneling / retention graphs that mature BI tools provide you — Honeycomb provides enough of the flexibility / ad-hoc investigation that you need to understand what’s going on with your app.

Let us know if you’re thinking about using Honeycomb for mobile — we’d love to hear about your use case and point you at some helpful libraries we’ve got in the works. Try us out for free!