How Honeycomb Uses Honeycomb, Part 7: Measure twice, cut once: How we made our queries 50% faster...with data

This post continues our dogfooding series from How Honeycomb Uses Honeycomb, Part 6: Instrumenting a Production Service.

The entire value proposition of Honeycomb’s columnar store is the speed of its queries. For instance:

query timing graph

Examining over 1.2 million rows, returning 50 time series, and aggregating the results into 240 buckets per series (1-minute granularity over 4 hours for this query), all in 330ms. Nice.

The Rock, nice

This is just the time spent on our storage nodes, though. It doesn’t include the amount of time it takes for the browser to download and render the results, and (more importantly for this post) it doesn’t include the time it takes before the browser asks for the results.

The result-fetching code is some of the oldest in our javascript. Very early on we settled on the simplest solution - polling. While we show a loading spinner on the page, the browser is repeatedly (on a timeout) asking our app if the query is done. The app responds with “no” until the results have been persisted to S3 at which point it says “yes” and returns the result JSON blob which gets rendered on the page.

The code (up until a couple weeks ago) had a constant poll interval of 1 second. So in that query up there? We’re wasting upwards of half a second that the graph could be in front of the user. Why did we expend all that effort to make our queries so fast, if the browser is just going to idle all the time gains away?

Still, it’s a system that works, it’s fast enough for many users (yes, some queries do take longer than a second :) and we also have some other big fish to fry. So we asked ourselves “what’s the minimal amount of work we can do to get queries in front of users faster?”

The answer, perhaps the most boring answer ever, is: don’t wait as long before polling. But how long should we wait? To figure this out we looked at query times across all our customers (in Honeycomb of course):

query timing graph

These particular results show percentiles over a derived column that sums two timers: the time spent waiting on the storage node to perform the query, and the time spent persisting the results to S3. They show that 10% of queries are stupid fast, completing and persisting to S3 in under 110ms, and 50% of queries complete in under 412ms.

We wanted a nice round number for the initial poll interval, so we chose 250ms. In order to not completely spam our app server, we only do two polls at 250ms, after which we switch to 500ms. Poll times of 1 second are completely banished. We also added instrumentation to see how many poll iterations it took before our app sent data down to the browser. How did things change?
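Before getting to the numbers: the real code lives in our frontend javascript, but the new schedule is simple enough to sketch in a few lines of Python, where poll_once stands in for a hypothetical “is the query done yet?” request to the app server.

import time

def wait_for_results(poll_once):
    # Two polls at 250ms, then every 500ms until the results show up.
    delays = [0.25, 0.25]
    while True:
        time.sleep(delays.pop(0) if delays else 0.5)
        result = poll_once()
        if result is not None:
            return result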

Using a pretty representative sample of ~5300 queries over the past 24 hours:

query timing graph

91% were serviced in one poll iteration - 250ms from the javascript requesting results until the browser started downloading those results. 94% were serviced within two poll iterations, or 500ms.

94% of the sampled queries run over that 24 hour period completed end-to-end a half second faster than before.

Eagle-eyed readers might ask: “why did decreasing the poll time to cover the P25 end up covering 91% of the sampled queries?” The biggest reason currently is additional time wasted in the frontend (causing lag between the user action of running a query and the start of the poll loop). More low-hanging fruit :)

This isn’t a bad start for such a simple change. There’s still a lot of frontend/transport work left to do to lower the remaining overhead in other parts of the system - some easy, some not so easy. If optimizations that increase user delight, spanning browsers, app servers, storage systems, and the transport mechanisms between them, are something you’re interested in, come work with us!

This is another installation in a series of dogfooding posts showing how and why we think Honeycomb is the future of systems observability. Stay tuned for more!

Bitten by a Kafka Bug - Postmortem

Dearest honeycombers,

Yesterday, Tuesday, Oct 17th, we experienced a partial service outage for some customers, and a small amount of data was dropped during ingestion (no previously stored data was affected).

In terms of impact, 33% of users actively sending data during the incident window experienced partial loss of writes between 6:03 AM and 10:45 AM PDT. Of those, most lost less than half of their writes. In addition to this, a majority of users experienced a 30 minute period of partial read availability between 10:50 and 11:20 AM PDT.

The type of outage we experienced was a new one for this team. But it was also a particularly constructive outage, because it helped several members of the team level up substantially on their kafka knowledge and their mental model of how to recover from any operationally degraded kafka scenario.

We use Honeycomb (in a wholly separate environment) to understand and debug the Honeycomb that serves our customers, so we figured we would walk you through what happened and how we debugged it.

At 6:15 am, PagerDuty started alerting that our end-to-end production checks were failing on a few partitions. But data and storage nodes all seemed healthy, so we checked kafka. As you can see, one of the kafka nodes had effectively no network traffic flowing into it.

Below, a graph showing inbound network traffic sharply decreasing for a suspect kafka node:

We checked kafka output, and sure enough, that node was marked as ISR for the partitions (5, 6, 32, 33) that corresponded to the alerting data nodes. At 7:07 we noticed that the “bad” brokers didn’t recognize any of the other brokers as part of their cluster, while the “good” brokers thought all brokers were healthy and all partitions had offsets that were increasing (except for 5, 6, 32, 33).

Below, a graph tracking our end-to-end successes and failures. dataset values here map to individual Kafka partitions, so we had a clear picture of the ongoing failures as we tried to resolve the issue:

We tried bringing up a new kafka node, but it didn’t know about any of the “good” brokers… only the “bad” brokers. We tried restarting a node or two, while carefully preserving the balance of replicas and leaving all partitions with a quorum… which turned out to be important, because after a restart, the nodes were only aware of the “bad” brokers.

Much debugging and googling of error messages ensued, with several red herrings, including some flights down zookeeper paths of fancy that were ultimately useless.

Eventually we realized we had a split brain between the nodes that had been restarted (not serving traffic, could see each other) and the ones that had not (were serving traffic, could not see the restarted nodes). And at 9:37, we found a jira ticket that was three years old but sounded very familiar. We realized that the “bad” brokers were actually good ones, and the good ones were actually bad. We needed to do a rolling restart of the entire world.

After restarting all kafka nodes (which we did with perhaps too much caution and care… it was a first for the engineers working the outage), it turned out that the data nodes had an offset far ahead of the acknowledged offset, because kafka had kept accepting writes even though zookeeper didn’t acknowledge them as being part of the world.

So we had to manually reset the offset on the data nodes and restart or bootstrap them as well. By 12:03pm the world was restored to order.

Below, a stacked graph showing successful writes to our storage nodes broken down by kafka partition. You can see the initial drop shortly after 6am of those 4 failing partitions, some blips around 9:15am as our engineers tried restarting the thought-were-bad-but-actually-good brokers, and some broader ingest problems between 10:45-11:20 as we restarted the world:

What we learned

Well… we learned a LOT. Frankly, I’m really happy that so much of our team got to experience their first big kafka incident, and leveled up commensurately at their kafka skills. I think the entire team now has a much broader skill set and richer mental model for how kafka really works. In some ways this was inevitable, and it’s a relief that it happened sooner rather than later.

In retrospect, the incident was actually seeded the night before, when our zookeeper cluster experienced a network partition. We just didn’t know about it until the controller thread did Something™ around 6 am, which caused it to finally become symptomatic.

This is apparently all due to a number of issues that have been fixed by kafka 0.10.2.1, so we need to upgrade our cluster ASAP. Other action items on our plates are:

  • Get better instrumentation for kafka and zookeeper into honeycomb
  • Get the retriever kafka partition into ec2 instance tags, write shell functions for printing retriever/partition mappings
  • Instrumentation for failed writes to kafka (consume sarama response queue?)
  • Change the way data nodes handle invariants, to avoid manual intervention when a data node ends up with a kafka offset ahead of the broker’s
  • Document the bash snippets, command lines, and other bits we used to debug retriever and kafka and create a production runbook for them

We apologize for the outage, and to the customers who were affected. We know you rely on us, and we take that very seriously.

We were heartened to see all the traffic going to our status.honeycomb.io page, and thought it was lovely that our customers were watching our shiny new status page instead of pinging us on intercom to ask what’s up. We have wonderful customers. :)

Thanks for hanging in there. Til next time, the honeybees.

P.S. A huge thanks to John Moore of Eventador for helping confirm some of our suspicions, providing pointers to known kafka issues, and suggesting fixes along the way.

You Could Have Invented Structured Logging

Sometimes we hear from folks who are a little bit intimidated by the notion of structured logging. Some common issues:

  • There’s no approachable library for structured logging in my language.
  • My logging today is messy, and changing it all is a daunting project.

These are legitimate concerns! But I have some opinions:

  • You might not need a fancy-pants library.
  • You can make incremental changes to your logging setup.

Structured logging is really all about giving yourself — and your team — a logging API to help you provide consistent context in events. An unstructured logger accepts strings. A structured logger accepts a map, hash, or dictionary that describes all the attributes you can think of for an event:

  • the function name and line number that the log line came from
  • the server’s host name
  • the application’s build ID or git SHA
  • information about the client or user issuing a request
  • timing information
  • etc.

The format and transport details — whether you choose JSON or something else, whether you log to a file or stdout or straight to a network API — might just be less important!

Structured logging basically just means you make a map and print it out or shove it in a queue

Here, let’s write a structured logging library!

import json

def log(**data):
    print(json.dumps(data))

All right, maybe we’re not quite ready for open-source fame yet. But there are two nice things about this “library”: first, it doesn’t even let you pass a bare string message! You have to pass a dictionary of key-value pairs. Second, it produces structured, self-describing output that can be consumed by humans and machines alike.

For example, this log line is not self-describing:

127.0.0.1 - - [12/Oct/2017 17:36:36] "GET / HTTP/1.1" 200 -

I mean, we can kinda guess at what it’s about. But if we start adding more data, it’s going to be really hard to remember which dash means what. In contrast, we can write code such as:

from datetime import datetime

log(upstream_address="127.0.0.1",
    hostname="my-awesome-appserver",
    date=datetime.now().isoformat(),
    request_method="GET",
    request_path="/",
    status=200)

and get this output:

{
    "upstream_address": "127.0.0.1",
    "hostname": "my-awesome-appserver",
    "date": "2017-10-21T17:36:36",
    "request_method": "GET",
    "request_path": "/"
    "status": 200
}

That’s comprehensible both to machines that might parse these logs later, and to our fellow humans. Awesome.

There are a few more features worth adding. For example, we might want to add the same context at different places in our code. So let’s wrap the logger in a class that we can bind context to:

class Logger(object):
    def __init__(self):
        self.context = {}

    def log(self, **data):
        # Bound context wins over per-call fields with the same name.
        data.update(self.context)
        print(json.dumps(data))

    def bind(self, key, value):
        self.context[key] = value

Now we can write

logger = Logger()
logger.bind("hostname", "my-awesome-appserver")
logger.bind("build_id", 2309)

And all calls to logger.log will automatically include hostname and build_id.

We also might want more flexible control over the output:

class Logger(object):
    def __init__(self, writer=None):
        self.context = {}
        self.writer = writer or jsonwriter

    def log(self, **data):
        data.update(self.context)
        self.writer(data)

    # ....

def jsonwriter(data):
    print(json.dumps(data))

Now when our colleagues complain that JSON hurts their eyes, we can substitute a different writer for their local development. We can also test our logging by using a writer that accumulates records in memory, or introduce some buffering for performance if needed.
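For example, here is one rough sketch of such writers; the names prettywriter and ListWriter are our own, not part of the snippet above:

def prettywriter(data):
    # Friendlier output for local development: sorted key=value pairs.
    print(", ".join("%s=%r" % (k, v) for k, v in sorted(data.items())))

class ListWriter(object):
    # Accumulates records in memory so tests can assert on them.
    def __init__(self):
        self.records = []

    def __call__(self, data):
        self.records.append(dict(data))

writer = ListWriter()
logger = Logger(writer=writer)
logger.log(event="signup", plan="free")
assert writer.records[0]["event"] == "signup"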

Finally, we want to do some post-processing of events: e.g., automatically add timestamps, or redact certain fields:

import json
import sys
from datetime import datetime

class Logger(object):
    def __init__(self, writer=None):
        self.context = {}
        self.processors = []
        self.writer = writer or jsonwriter

    def add_processor(self, processor):
        if processor not in self.processors:
            self.processors.append(processor)

    def log(self, **data):
        data.update(self.context)
        # Let each processor enrich (or scrub) the event before it's written.
        for p in self.processors:
            p(data)
        self.writer(data)

    # ...

def timestamper(data):
    data["timestamp"] = datetime.now().isoformat()

def caller_recorder(data):
    # Python magic, don't worry about it: walk two frames up the stack
    # (past Logger.log) to find the function that called the logger.
    frame = sys._getframe()
    caller_frame = frame.f_back.f_back
    data["caller"] = caller_frame.f_code.co_name

Now if we instantiate our logger like so:

logger = Logger()
logger.add_processor(timestamper)
logger.add_processor(caller_recorder)

Then all our logs automatically include the timestamp and call site:

# Application code
def square(arg):
    result = arg**2
    logger.log(arg=arg, result=result)
    return result

square(22)
# produces the log:
#{
#    "caller": "square",
#    "timestamp": "2017-10-21T17:36:36",
#    "arg": 22,
#    "result": 484
#}
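The field redaction mentioned earlier fits the same processor hook. A minimal sketch, with hypothetical field names:

SENSITIVE_FIELDS = {"password", "credit_card"}  # hypothetical field names

def redactor(data):
    # Overwrite sensitive values before any writer sees them.
    for key in SENSITIVE_FIELDS & set(data):
        data[key] = "[REDACTED]"

logger.add_processor(redactor)
logger.log(user="alice", password="hunter2")
# the "password" field comes out as "[REDACTED]"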

All right. So with fewer than 30 lines of code, we have testable structured logging, we can choose how to transport those logs, and we have composable output transformation. Not bad.

janeway says not bad!

Some suggestions for effective structured logging

Don’t be afraid to develop your own idioms

It’s totally reasonable for a mature project or organization to maintain its own module of logging conveniences. You can have some startup configuration that outputs nice colors and pretty formatting when you’re developing locally, but just emits JSON when your service runs in production. You can have helpers for adding domain-specific context to all your requests (e.g. customer name, ID, and pricing plan). You can be creative.
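As one possible sketch of that startup idiom (the APP_ENV variable and make_logger helper are our own, not a prescription), you might pick the writer based on where the service is running:

import os
import socket

def make_logger():
    # JSON in production, the prettier writer sketched earlier for local dev.
    if os.environ.get("APP_ENV") == "production":
        logger = Logger(writer=jsonwriter)
    else:
        logger = Logger(writer=prettywriter)
    logger.add_processor(timestamper)
    logger.bind("hostname", socket.gethostname())
    return logger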

Don’t be afraid to make incremental changes

You might have lots of log statements all over your code base. You might have some gnarly logging pipeline. Don’t worry about that for now. First, identify your service’s core unit of work: is it serving an HTTP request? Is it handling a job in a job queue?

Then, write one structured event for that unit of work. “Every time we’re done handling a request, we record its status code, duration, client ID, and whatever else we can think of.” Don’t sink weeks into changing all of your logging at once. If structured logs for even a subset of your code base help you better understand what’s happening in production (and they probably will), then you can invest in standardization. Don’t be daunted!
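For an HTTP service, that one event per unit of work might look something like this sketch (handle_request and the field names are hypothetical; adapt them to your framework):

import time

def handle_request_with_logging(request):
    start = time.time()
    response = handle_request(request)  # your existing handler
    logger.log(
        request_path=request.path,
        status=response.status_code,
        duration_ms=(time.time() - start) * 1000,
        client_id=request.headers.get("X-Client-ID"),
    )
    return response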

Nylas Makes Their API Buzz with Honeycomb

At Honeycomb, we <3 Nylas, so we were super pleased to come across this short talk by Evan Morikawa (aka @e0m on Twitter):

link to video

Evan does a great job of explaining the goals Nylas set for speeding up their API, and breaks down the process they went through into easy-to-understand steps, including a quick demo of our shiny new heatmap feature (which they got a little early, because we love them :D :D :D).

And we’re of course super pleased to hear that Nylas loves us right back!

doggos are pals.gif

Evan goes on to talk about another cool approach to gaining observability into distributed production systems: building tools to profile services as they run, and shows some glorious flame graphs:

glorious flame graph

The talk ends with a beautiful “Yosemite-like drop-off” in API roundtrip time and the ever-important call to instrument your code so you can gain observability into what it is doing.

Thanks for sharing this part of the Nylas journey with us, Evan!

Here’s a direct link to the video: https://www.youtube.com/watch?v=16gX5CpQVY8