Ben Explains Things: how your nginx logs can magically turn query strings into columns

If you’ve ever corresponded with us over email or Intercom, chances are you’ve chatted with Ben Hartshorne. Sometimes we see him send off pearls of wisdom and we think “damn! why is this buried in email, it should really be a blog post!”. Now it is.

SCENE: did you know that if you JUST stream your logs into Honeycomb from whatever load balancer you are running at the edge (haproxy/nginx/etc), and pass some timing information back from your app via HTTP headers, you basically have a functional full stack observability view with very little work??!? IT’S TRUE.

But most people don’t know the crazy fun tricks you can do with edge proxies like nginx. For example, a poor man’s structured data pipeline looks like this:

To: Celebrated Honeycomb Customer
From: Ben
Subject: Re: Your nginx config

Ooh, if I may suggest, as a first step, add some values to the RequestQueryKeys option in the config file (or --request_query_keys flag) to enumerate some of the query parameters in your URL structure. Or, if you're brave, change the RequestParseQuery (--request_parse_query flag) from whitelist to all (only do this on a dataset you can delete - it can quickly blow up the number of columns to something unusable).

This change will pull fields out of the query strings in your URL structure and create columns for them, allowing you to do all the filters, breakdowns, and calculations on them as you would other fields. For example, adding lang to the RequestQueryKeys parameter would create a column named request_query_lang with the value 'en' when it sees a URL like /foo/bar?lang=en. Once there, you could break down by language.

This is a super cheap way to get all sorts of information about your application into Honeycomb without making any changes to your existing configuration. For more details on the URL parsing bits of honeytail, see https://honeycomb.io/docs/connect/agent/#parsing-url-patterns

Let us know how it goes!
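(For reference, enabling that looks roughly like the following honeytail invocation. This is a sketch: the dataset name and log path are placeholders, the exact flag syntax is in the docs linked above, and depending on your setup you may also need honeytail's nginx-format options.)

honeytail -p nginx -d "nginx-edge" -k $WRITEKEY \
  -f /var/log/nginx/access.log \
  --request_query_keys=lang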

… and a followup, explaining more of the landscape:

To: Most Celebrated Honeycomb Customer #1
From: Ben
Subject: Re: Re: Your nginx config

 Hooray for good timing!! Yeah, I was really hopeful for the 'all' setting that just says "take every query parameter you see and make it a column" until the terrible chaff of the internet hits your website with every webserver exploit known in hopes that one sticks and suddenly you have a hundred extra columns with names like ';cat../../../../../../etc/passwd'. That doesn't help anybody. So the alternative is to just list out the query parameter keys that you want and go from there. More annoying to config, but protects your data. I will say though, a really easy way to get a list of the most common query parameter fields is to enable all and send the data to a temporary dataset, look at the data, choose the right keys, and then delete the temp dataset.

Thanks Ben!!

Sign up and try Honeycomb today!

Save Useful Queries with Honeycomb Playlists

Here at the Hive, we’ve been hard at work on a better way for folks to collaborate while debugging their services. We’re on a mission to help teams not just move faster with better data, but also make the knowledge they build up along the way accessible to everyone, knowledge like:

  • How did Ben figure out which customers would be most impacted by yesterday’s unrolling-nested-JSON fix?
  • How is Eben tracking performance for that one customer struggling with Logstash delays?
  • What queries did Charity run last week to dig into that sudden increase in MySQL latency?

To that end, we’re introducing Playlists: a way to create collections of useful Honeycomb queries for both your own use and your teammates’ benefit.

[Image: new home page, with playlists and recent team activity]

We think Playlists are particularly helpful for two use cases:

  1. Post-mortems: capturing the trail produced by an active investigation
  2. Entry points: capturing a set of high-level “how’s my system doing?” query results.

Queries can be added to a playlist within the query interface itself, or by editing a playlist directly. Playlist permalinks are eminently shareable, just like query permalinks themselves!

[Image: "add to playlist" options: the sandbox and the edit playlist UI]

Create ‘em, send ‘em around, dig in and get your hands dirty. Honeycomb isn’t for passive consumption :) so use these playlists to capture interesting jumping-off points and dive into queries to iterate and ask new questions.

Sign up and try Honeycomb today!

Instrumenting High Volume Services: Part 3

This is the last of three posts focusing on sampling as a part of your toolbox for handling services that generate large amounts of instrumentation data. The first one was an introduction to sampling and the second described simple methods to explore dynamic sampling.

In part 2, we explored partitioning events based on HTTP response codes, and assigning sample rates to each response code. That worked because of the small key space of HTTP status codes and because it’s known that errors are less frequent than successes. What do you do when the key space is too large to easily enumerate, or varies in a way you can’t predict ahead of time? The final step in discussing dynamic sample rates is to build in server logic to identify a key for each incoming event, then dynamically adjust the sample rate based on the volume of traffic for that key.

In all the following examples, the key used to determine the sample rate can be as simple (e.g. HTTP status code or customer ID) or as complicated (e.g. a concatenation of the HTTP method, status code, and user-agent) as is appropriate for selecting the samples that give you the most useful view into the traffic. For example, here at Honeycomb we want to make sure that, despite a small set of customers sending us enormous volumes of traffic, we’re still able to see our long-tail customers’ traffic in our status graphs. We use a combination of the Dataset ID (to differentiate between customers) and the HTTP method, URL, and status code (to identify the different types of traffic they send).
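As a sketch of what assembling such a composite key might look like in Go (the variable names here are illustrative, not from any particular implementation):

// build a sampling key from the fields described above
key := fmt.Sprintf("%s:%s:%s:%d", datasetID, r.Method, r.URL.Path, status)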

The following methods all work by looking at historical traffic for a key and using that historical pattern to calculate the sample rate for that key. Specifically, the library linked at the end of this post implements all of these examples—it takes snapshots of current traffic for a short time period (say, 30 seconds) and uses the pattern of that traffic to determine the sample rates for the next period.

Enough with the preamble. Let’s sample!

Constant Throughput

For the constant throughput method, you specify the maximum number of events per time period you want to send for analysis. The algorithm then looks at all the keys detected over the snapshot and gives each key an equal portion of the throughput limit. It sets a minimum sample rate of 1, so that no key is completely ignored.

Example: for a throughput limit of 100 and 3 keys, each key gets to send 33 samples. Based on the level of traffic, the sample rate is calculated to try to get each key as close as possible to sending 33 events. For key a, 900 events divided by 33 gives a sample rate of 27; for b, 90 divided by 33 rounds to a sample rate of 3; and for c, 10 divided by 33 is less than one, so it rounds up to 1. During the next cycle, assuming the incoming event numbers are the same, these sample rates are used: 900 events at a sample rate of 27 will actually send 33 events, 90 events at a sample rate of 3 will send 30 events, and 10 events at a sample rate of 1 will send 10 events.

I’ll use a table like this in all the following examples. The key and traffic are used to calculate the sample rate. During the following iteration, the key, traffic, and sample rate will determine the number of actual events sent.

key   traffic   sample rate   events sent
a     900       27            33
b     90        3             30
c     10        1             10
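In Go, the rate calculation for this method might look something like the following minimal sketch (this is not the dynsampler-go implementation linked at the end of the post; counts is assumed to be the per-key tally from the last snapshot):

func computeConstantThroughputRates(counts map[string]int, limit int) map[string]int {
  rates := make(map[string]int, len(counts))
  if len(counts) == 0 || limit <= 0 {
    return rates
  }
  perKey := float64(limit) / float64(len(counts)) // each key's equal share of the budget
  for key, seen := range counts {
    rate := int(float64(seen)/perKey + 0.5) // round to the nearest whole sample rate
    if rate < 1 {
      rate = 1 // never drop a key entirely
    }
    rates[key] = rate
  }
  return rates
}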

Advantages: If you know you have a relatively even split of traffic among your keys, and that you have fewer keys than your desired throughput rate, this method does a great job of capping the amount of resources you will spend sending data to your analytics.

Disadvantages: This approach doesn’t scale at all. As your traffic increases, the number of events you’re sending to your analytics doesn’t, so your view into the system gets more and more coarse, to the point where it will barely be useful. If you have keys with very little traffic (key c in the chart above), you wind up under-sending the allotted samples for those keys and wasting some of your throughput limit. If your keyspace is very wide, you’ll end up sending more than the allotted throughput due to the minimum sample rate for each key.

Overall, this method can be useful as a slight improvement over the static map method because you don’t need to enumerate the sample rate for each key. It lets you contain your costs by sacrificing resolution into your data. It breaks down as traffic scales in volume or in the size of the key space.

Constant Throughput Per Key

This is a minor tweak on the previous method to let it scale a bit more smoothly as the size of the key space increases (though not as volume increases). Instead of defining a limit on the total number of events to be sent, this algorithm’s goal is a maximum number of events sent per key. If there are more events than the desired number, the sample rate will be set to correctly collapse the actual traffic into the fixed volume.

Example: set the desired throughput per key to 50. Each key will send up to 50 events per time cycle, with the sample rate set to approximate the actual amount of traffic. The chart here is the same as before—key and traffic are used to compute the sample rate, then the traffic and sample rate are used to show how many events would be sent during the next iteration (assuming the incoming traffic is the same):

key   traffic   sample rate   events sent
a     900       18            50
b     90        2             45
c     10        1             10
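A sketch of the per-key variant, under the same assumptions as the previous snippet (again, not the library implementation):

func computePerKeyThroughputRates(counts map[string]int, perKeyLimit int) map[string]int {
  rates := make(map[string]int, len(counts))
  for key, seen := range counts {
    rate := int(float64(seen)/float64(perKeyLimit) + 0.5) // round to nearest
    if rate < 1 {
      rate = 1 // low-volume keys send everything
    }
    rates[key] = rate
  }
  return rates
}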

Advantages: Because the sample rate is fixed per key, you retain detail per key as the key space grows. When it’s simply important to get a minimum number of samples for every key, this is a good method to ensure that requirement.

Disadvantages: In order to avoid blowing out your metrics as your keyspace grows, you may need to set the per key limit relatively low, which gives you very poor resolution into the high volume keys. And as traffic grows within an individual key, you lose visibility into the details for that key.

This would be a good algorithm for something like an exception tracker, where more copies of the same exception don’t give you additional information (except that it’s still happening), but you want to make sure that you catch each different type of exception. When the presence of each key is the most important aspect, this works well.

Average Sample Rate

With this method, we’re starting to get fancier. The goal for this strategy is to achieve a given overall sample rate across all traffic. However, we want to capture more of the infrequent traffic to retain high-fidelity visibility. We accomplish both these goals by increasing the sample rate on high volume traffic and decreasing it on low volume traffic such that the overall sample rate remains constant. This gets us the best of both worlds: we catch rare events and still get a good picture of the shape of frequent events.

Here’s how the sample rate is calculated for each key: we count the total number of events that came in and divide by the sample rate to get the total number of events to send along to the analytics system. We then give each key an equal portion of the total number of events to send, and work backwards to determine what the sample rate should be.

Sticking with the same example traffic as the previous two methods, we have keys a, b, and c with traffic of 900, 90, and 10 events coming in. Let’s use a goal sample rate of 20. (900 + 90 + 10) / 20 = 50, so our goal for the total number of events to send into Honeycomb is 50 events. We have 3 keys, so each key should get 50 / 3 ≈ 17 events. What sample rate would key a need in order to send 17 events? 900 / 17 rounds down to 52. For key b, 90 / 17 rounds to 5, and for key c, 10 / 17 is less than one, so it rounds up to a sample rate of 1. We now have our sample rates.

Here is the same table we used in the two previous examples:

key   traffic   sample rate   events sent
a     900       52            17
b     90        5             18
c     10        1             10
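Here’s a sketch of that calculation (not the library implementation, which also applies the logarithmic weighting described below; the rounding here also differs slightly from the worked example above, which rounds the per-key goal to a whole number first):

func computeAverageSampleRates(counts map[string]int, targetRate int) map[string]int {
  rates := make(map[string]int, len(counts))
  if len(counts) == 0 || targetRate <= 0 {
    return rates
  }
  total := 0
  for _, seen := range counts {
    total += seen
  }
  // events the target rate lets us send overall, split evenly across keys
  perKeyGoal := float64(total) / float64(targetRate) / float64(len(counts))
  for key, seen := range counts {
    rate := int(float64(seen)/perKeyGoal + 0.5) // round to nearest
    if rate < 1 {
      rate = 1 // always keep at least some events for every key
    }
    rates[key] = rate
  }
  return rates
}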

Advantages: When rare events are more interesting than common events, and the volume of incoming events across the key spectrum is wildly different, the average sample rate is an excellent choice. Picking just one number (the target sample rate) is as easy as constant sampling but you magically get wonderful resolution into the long tail of your traffic while still keeping your overall traffic volume manageable.

Disadvantages: high volume traffic is sampled very aggressively.

At Honeycomb (and in the library below) we actually apply one additional twist to the Average Sample Rate method. The description above weights all keys equally. But shouldn’t high volume keys actually have more representation than low volume keys? We choose a middle ground by using the logarithm of the count per key to influence how much of the total number of events sent into Honeycomb are assigned to each key—a key with 10^x the volume of incoming traffic will have x times the representation in the sampled traffic. For more details, take a look at the implementation for the average sample rate method linked below.

Average Sample Rate with Minimum Per Key

To really mix things up, let’s combine two methods! Since we’re choosing the sample rate for each key dynamically, there’s no reason why we can’t also choose which method we use to determine that sample rate dynamically!

One disadvantage of the average sample rate method is that if you set a high target sample rate but have very little traffic, you will wind up over-sampling traffic you could actually send with a lower sample rate. For example, consider setting a target sample rate of 50 but then only actually having 30 events total! Clearly there’s no need to sample so heavily when you have very little traffic. So what should you do when your traffic patterns are such that one method doesn’t always fit? Use two!

When your traffic is below 1,000 events per 30 seconds, don’t sample. When you exceed 1,000 events during your 30 second sample window, switch to average sample rate with a target sample rate of 100.
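Continuing the sketches above, the switch-over could be as simple as this (the threshold and target rate follow the example; computeAverageSampleRates is the sketch from the previous section):

func computeCombinedRates(counts map[string]int, threshold, targetRate int) map[string]int {
  total := 0
  for _, seen := range counts {
    total += seen
  }
  if total <= threshold {
    rates := make(map[string]int, len(counts))
    for key := range counts {
      rates[key] = 1 // low traffic: send every event for every key
    }
    return rates
  }
  // above the threshold, fall back to the average sample rate calculation
  return computeAverageSampleRates(counts, targetRate)
}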

By combining different methods together, you mitigate each of their disadvantages and keep full detail when you have the capacity, and gradually drop more of your traffic as volume grows.

Conclusion

Sampling is great. It is the only reasonable way to keep high value, contextually aware information about your service while still being able to scale to a high volume. As your service grows, you’ll find yourself sampling at 100:1, 1000:1, 50000:1. At these volumes, statistics will let you be sure that any problem will eventually make it through your sample selection, and using a dynamic sampling method will make sure the odds are in your favor.

We’ve implemented the sampling methods mentioned here as a go library and released them at https://github.com/honeycombio/dynsampler-go. We would love additional methods of sampling as contributions!

Instrument your service to create wide, contextual events. Sample them in a way that lets you get good visibility into the areas of your service that need the most introspection, and sign up for Honeycomb to slice up your data today!

Instrumenting High Volume Services: Part 2

This is the second of three posts focusing on sampling as a part of your toolbox for handling services that generate large amounts of instrumentation data. The first one was an introduction to sampling.

Sampling is a simple concept for capturing useful information about a large quantity of data, but it can manifest in many different ways, varying widely in complexity. Here in Part 2, we’ll explore techniques to handle simple variations in your data, introduce the concept of dynamic sampling, and begin addressing some of the harder questions raised in Part 1.

Constant Sampling

This code should look familiar from Part 1 and is the foundation upon which more advanced techniques will be built:

func handleRequest(w http.ResponseWriter, r *http.Request) {
  // do work
  if rand.Intn(4) == 0 { // send a randomly-selected 25% of requests
    logRequest(r, 4)     // make sure to track a sample rate of 4
  }
  w.WriteHeader(http.StatusOK)
}

Constant sampling is just the idea that you will submit one event for every n events you wish to represent. In the above example, we randomly choose one out of every four events. We call it constant sampling because you’re submitting 25% of all your events—a constant sample rate of 4. Your underlying analytics system can then deal with this kind of sampling very easily: multiply all counts or sums by 4. (Averages and percentiles are unchanged.)

The advantage of this approach is that it is simple and easy to implement. You can easily reduce the load on your analytics system by only sending one event to represent many, whether that be one in every four, hundred, or ten thousand events.

The disadvantage of constant sampling is its lack of flexibility. Once you’ve chosen your sample rate, it is fixed. If your traffic patterns change or your load fluctuates, the sample rate may be too high for some parts of your system (missing out on important, low-frequency events) and too low for others (sending lots of homogenous, extraneous data).

Constant sampling is the best choice when your traffic patterns are homogeneous and constant. If every event provides equal insight into your system, then any event is as good as any other to use as a representative event. The simplicity allows you to easily cut your volume.

Dynamic Sampling

The rest of the sampling methodologies in this series are dynamic. When thinking about a dynamic sample rate, we need to differentiate something we’ve glossed over so far. Here are two ways of talking about sample rate:

  • This dataset is sampled four to one
  • This specific event represents four similar events

Until now we’ve used the two ways of looking at sampling interchangeably, but each contains a different assumption—the first is only true if all the data in the dataset is sampled at the same rate. When all data is sampled at the same rate, it makes calculations based on the stored data very simple.

The second way of talking about sample rates has a different implication—some other event might represent a different number of similar events. If so, it brings up some severe complications—how do you do math on a body of events that all represent a different portion of the total corpus?

When calculating an average based on events that all have different sample rates, you actually need to walk through each event and weight it accordingly. For example, with three events with sample rates of 2, 5, and 10, you need to multiply the value in the first event by 2, the value in the second by 5, and the value in the third by 10, add up those values, then divide by the total number of events represented (17).

value   sample rate   contribution to the average
205     2             410
317     5             1585
281     10            2810
-----   -----------   ---------------------------
        17            4805

Average: 4805 / 17 = 282.65
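As a small Go sketch, the same calculation looks like this (the sampledEvent type here is just for illustration):

// sampledEvent is an illustrative stand-in for a stored event: a value plus
// the sample rate it was recorded with (how many real events it represents).
type sampledEvent struct {
  value      float64
  sampleRate int
}

// weightedAverage weights each stored event by its sample rate, as in the
// table above, then divides by the total number of events represented.
func weightedAverage(events []sampledEvent) float64 {
  var weightedSum, represented float64
  for _, ev := range events {
    weightedSum += ev.value * float64(ev.sampleRate) // contribution to the average
    represented += float64(ev.sampleRate)            // real events this one stands in for
  }
  if represented == 0 {
    return 0
  }
  return weightedSum / represented
}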

Within Honeycomb, every event has an associated sample rate and all the calculations we do to visualize your data correctly handle these variable sample rates. This feature lets us start to do some really interesting things with sampling within our own systems: reducing our overall volume while still retaining different levels of fidelity depending on the type of traffic.

Introducing dynamic sampling lets us address some of the harder questions posed in the first post of this series.

Static Map of Sample Rates

Building a static map of traffic type to sample rate is our first method for doing dynamic sampling. We’re going to enumerate a few different types of traffic and encode different sample rates for each type in our code. By doing so, we can represent that we value different types of traffic differently!

When recording HTTP events, we may care about seeing every server error but need less resolution when looking at successful requests. We can then set the sample rate for successful requests to 100 (storing one in a hundred successful events). We include the sample rate along with each event—100 for successful events and 1 for error events.

func handleRequest(w http.ResponseWriter, r *http.Request) {
  // do work; assume it leaves the response code in `status`
  if status == http.StatusBadRequest || status == http.StatusInternalServerError {
    logRequest(r, 1)              // track every single bad or errored request
  } else if rand.Intn(100) == 0 {
    logRequest(r, 100)            // track 1 out of every 100 successful requests
  }
  w.WriteHeader(status)
}

By choosing the sample rate based on an aspect of the data that we care about, we’re able to gain more flexible control over both the total volume of data we send to our analytics system and the resolution we get looking into interesting times in our service history.

The advantage of this method of sampling is that you gain flexibility in determining which types of events are more important for you to examine later, while retaining an accurate view into the overall operation of the system. When errors are more important than successes, or newly placed orders are more important than checking on order status, or slow queries are more important than fast queries, or paying customers are more important than those on the free tier, you now have a method to manage the volume of data you send in to your analytics system while still gaining detailed insight into the portions of the traffic you really care about.

The disadvantage of this method of sampling is that if there are too many different types of traffic, enumerating them all to set a specific sample rate for each can be difficult. You also have to decide up front what makes traffic interesting when creating the map of traffic type to sample rate. The static map provides more flexibility than a constant rate, but if you don’t know ahead of time which types will be important, it can be difficult to build, and if traffic types change in importance over time, it can’t easily adapt.

This method is the best choice when your traffic has a few well known characteristics that define a limited set of types, and some types are obviously more interesting for debugging than others. Some common patterns for using a static map of sample rates are HTTP status codes, error status, top tier customer status, and known traffic volume.

For an example of the static map, take a look at the logstash documentation on our web site and a sample logstash config that includes dynamic sampling based on HTTP status code.

Up Next - My traffic is a KALEIDOSCOPE of change!

This entry in the series introduced the idea of dynamic sampling. By varying the sample rate based on characteristics of the incoming traffic, you gain tremendous flexibility in how you choose the individual events that will be representative of the entire stream of traffic.

In Part 3 we take the idea of dynamic sampling and apply realtime analysis to dynamically adjust the sample rate! Why stop at a person identifying interesting traffic—let’s let our applications choose how to sample traffic based on realtime analysis of its importance and volume! But until it’s published, try signing up for an account on Honeycomb and sending some data of your own with variable sample rates.

Instrumenting High Volume Services: Part 1

This is the first of three posts focusing on sampling as a part of your toolbox for handling services that generate large amounts of instrumentation data.

Recording tons of data about every request coming in to your service is easy when you have very little traffic. As your service scales, the impact of measuring its performance can cause its own problems. There are three main ways to mitigate this problem:

  • measure fewer things
  • aggregate your measurements before submitting them
  • measure a representative portion of your traffic

Each method has its place; this series of posts focuses on the third: various techniques to sample your traffic in order to reduce your overall volume of instrumentation, while retaining useful information about individual requests.

An Introduction to Sampling

Sampling is the idea that you can select a few elements from a large collection and learn about the entire collection by looking at them closely. It is widely used throughout the world whenever trying to tackle a problem of scale. For example, a survey assumes that by asking a small group of people a set of questions, you can learn something about the opinions of the entire populace.

Sampling as a basic technique for instrumentation is no different—by recording information about a representative subset of requests flowing through a system, you can learn about the overall performance of the system. And as with surveys, the way you choose your representative set (the sample set) can greatly influence the accuracy of your results.

This series will explore various methods appropriate for various situations.

A naive approach to sampling an HTTP handler might look something like this:

func handleRequest(w http.ResponseWriter, r *http.Request) {
  // do work
  if rand.Intn(4) == 0 { // send a randomly-selected 25% of requests
    logRequest(r)
  }
  w.WriteHeader(http.StatusOK)
}

By sampling with this naive method, however, we lose the ability to easily pull metrics about our overall traffic: any graphs or analytics that this method produces would only show around 25% of our actual, real-world traffic.

the non-negotiable: capturing the sample rate

Our first step, then, is capturing some metadata along with this sample datapoint. Specifically, when capturing this request, we’d want to know that this sampled request represents 4 (presumably similar) requests processed by the system. (Or, in other words, the sample rate for this data point is 4.)

func handleRequest(w http.ResponseWriter, r *http.Request) {
  // do work
  if rand.Intn(4) == 0 { // send a randomly-selected 25% of requests
    logRequest(r, 4)     // make sure to track a sample rate of 4, as well
  }
  w.WriteHeader(http.StatusOK)
}

Capturing the sample rate will allow our analytics backend to understand that each stored datapoint represents 4 requests in the real world, and return analytics that reflect that reality. (Note: If you’re using any of our SDKs to sample requests, this is taken care of for you.)

OK, but my traffic isn’t ever that simple:

Next, we’re ready to tackle some harder problems:

  • What if we care a lot about error cases (as in, we want to capture all of them) and not very much about success cases?
  • What if some customers send an order of magnitude more traffic than others—but we want all customers to have a good experience?
  • What if we want to make sure that a huge increase in traffic on our servers can’t also overwhelm our analytics backend?

Coming up in parts 2 and 3, we’ll discuss different methods for sampling traffic more actively than the naive approach shown in this post. Stay tuned, and in the meantime, sign up for Honeycomb and experiment with sampling in your own traffic!

Event-Driven Instrumentation in Go is Easy and Fun

One of many things I like about Go is how easy it is to instrument code. The built-in expvar package and third-party libraries such as rcrowley/go-metrics are delightfully simple to use.

But metrics aren’t quite enough! We’re here to encourage you to structure your instrumentation not just around metrics, but around events.

Let’s make that idea concrete with an example. Imagine a frontend API for a queue service. It accepts user input, and writes it into a Kafka backing store. Something like this:

func (a *App) handle(r *http.Request, w http.ResponseWriter) {
    userInfo := a.fetchUserInfo(r.Header)
    unmarshalledBody := a.unmarshalInput(r.Body)
    kafkaWrite := a.buildKafkaWrite(userInfo, unmarshalledBody)
    a.writeToKafka(kafkaWrite)
}

With a timing helper, we can track, in aggregate, how long it takes to handle a request, and how long each individual step takes:

func (a *App) timeOp(metricName string, start time.Time) {
    a.timers[metricName].UpdateSince(start)
}

func (a *App) handle(r *http.Request, w http.ResponseWriter) {
    defer a.timeOp("request_dur_ms", time.Now())
    // ...
}

func (a *App) fetchUserInfo(h http.Header) userInfo {
    defer a.timeOp("fetch_userinfo_dur_ms", time.Now())
    // ...
}

// ...

The limitation of this approach is that global timers can show you when an anomaly occurs, but not necessarily why. Let’s say we observe occasional spikes in overall measured request latency:

If all we have is sitewide metrics, it’s going to take a lot more poking to figure out what’s going on. In fact, our other timing metrics might not have any correlation with those spikes at all:

Contrast this with event-driven instrumentation. It’s not that complicated! We just build up an event object with all the data we want to record: our original timers, plus request URL, user ID, and whatever other data we can think of. Once we’ve handled the request, we ship off the event.

func (a *App) handle(r *http.Request, w http.ResponseWriter) {
    start := time.Now()
    ev := libhoney.NewEvent()
    ev.AddField("url", r.URL.Path)

    userInfo := a.fetchUserInfo(r.Header, ev)
    ev.AddField("user_id", userInfo.UserID)

    unmarshalledBody := a.unmarshalInput(r.Body, ev)
    ev.AddField("is_batch", unmarshalledBody.IsBatchRequest)

    kafkaWrite := a.buildKafkaWrite(userInfo, unmarshalledBody, ev)
    a.writeToKafka(kafkaWrite)

    ev.AddField("request_dur_ns", time.Since(start))
    ev.Send()
}

Here we’re using libhoney, Honeycomb’s lightweight library for Go. Honeycomb is built to handle exactly this type of wide, flexibly typed event data. Send it all! The more fields you can think of, the better.

The usefulness of this approach becomes clear when tracking down actual problems. Let’s go back to that latency spike:

Zooming in on one of those spikes, 99th-percentile latency on a smaller time interval looks pretty erratic:

Let’s try breaking down the data by URL:

Aha! Looks like it’s requests to /1/batch that are slow. Zooming back out in time, but breaking down response times by URL, here’s what’s really going on. We have a pretty bimodal latency distribution — slow batch requests and fast normal requests — and sometimes this skews the P99 metric:

Of course, we could have figured this out by adding separate timing metrics for each API endpoint. But if you want to slice and dice by user ID, request length, server host, or any other criteria you can think of, you’ll quickly find yourself facing a combinatorial explosion of metrics.

So please, instrument in terms of events, not just metrics!

Head on over to https://ui.honeycomb.io/signup to get started, run go get github.com/honeycombio/libhoney-go, and add a couple lines of instrumentation to your code. Do it right now! Thank yourself later!

Tell me more, nginx

When the cool kids talk about interesting log data, no one seems to want to talk about nginx. Web servers are the workhorses of the internet, reliable and battle-hardened, but their logs are standard and never contain any of the interesting application metadata anyway.

But wait! Web logs are also often one of the easiest, broadest sources of data for systems at a high level—and are killer for answering questions like “Oops—I forgot to instrument my new feature, how much is its corresponding endpoint actually being hit?” or “Uh… how many req/sec is this service serving, really?”

There’s an almost infinite number of possible customizations given enough time and patience, but here are a couple quick wins we’ve found particularly essential:

  • Request time (docs): Did you know that nginx can calculate start-to-finish times per request, right out of the box? Don’t leave the house without this one, folks: just add $request_time anywhere in your log_format directive.
  • Upstream response headers (docs): Tuck useful metadata from your application into your HTTP response headers, and tell nginx to pull X-Whatever-You-Want out into your logs.

Voilà! User IDs, database timers, and server metadata—right alongside high-level, granular data surrounding each request.
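As a rough sketch, a log_format along these lines captures both; the X-User-ID header name is just an example of something your app might set, and you’ll want to adjust the rest of the format to taste:

log_format honeycomb '$remote_addr - [$time_local] "$request" $status '
                     '$body_bytes_sent $request_time '
                     '"$upstream_http_x_user_id"';

access_log /var/log/nginx/access.log honeycomb;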

Let’s take a look at what that can tell us about a service:

[Image: better nginx exploration, manipulating request_time and endpoint_shape]

With just request_time and a header value generalizing the shape of the endpoint path (e.g. /products/foobarbaz becomes /products/:prod), we can see, incredibly quickly, which endpoints (not specific URLs!) are the slowest, the most numerous, or causing the most load on the server.

Guess what else we can do with nginx?

Oops! You’ll have to wait for the next installment.

The sky’s the limit when it comes to slicing this bird’s-eye view of your web traffic by arbitrary app metadata: ID the users seeing the most 404s, or graph load by availability zone, hostname, or database cluster! Read more about working with response headers here or sign up to give it a whirl today :)

Note: we talk about upstream response headers as a way to get normalized endpoint shapes, but our nginx connector can do that for you, too!

Is Honeycomb a monitoring tool?

You may notice that we don’t talk about “monitoring” much, and that’s because we don’t really think of monitoring as what we do, even though it kind of is.

Traditional monitoring relies heavily on predicting how a system may fail and checking for those failures. Traditional graphing involves generating big grids of dashboards that sit on your desktop or your wall, and give you a sense of the health of your system.

That’s not what we do.

Honeycomb is what you do when your monitoring ends.

You still need some simple end-to-end checks for your KPIs, and monitoring for key code paths, but don’t try to “monitor everything” because it’s noisy and impossible. One of two things will happen:

  1. Your e2e checks will tell you that one of your KPIs is not within acceptable range. (So you jump into Honeycomb and start asking questions.)
  2. A user tells you something is off, but your KPIs are all within acceptable ranges. (So you jump into Honeycomb and start asking questions.)

You can start at the edge and drill down, or start at the end and trace back; either way, within a half dozen clicks you can usually identify the source of the anomaly.

Honeycomb is a debugger for live production systems. Honeycomb is to your systems like your IDE is to your code. For sufficiently complex systems, you should probably spend roughly equal time in both.

Honeycomb lets you ask questions of your live systems, swiftly and interactively. Often you will spot an outlier or correlation that you weren’t expecting – something that never happens when you’re doing intuition-based debugging instead of data-driven debugging. Validate your guesses swiftly and move on.

This will make you a better engineer. :)

Systems are getting more complex every day, outstripping our ability to predict failure conditions or alert on suspicious symptoms. Stop trying to debug via intuition or what you can remember about past outages. Just keep it simple:

  1. Alert on KPIs
  2. Instrument your systems
  3. Ask lots of questions

Introducing Honeycomb's TCP Agent for MongoDB

We’re excited to release honeycomb-tcpagent, an efficient way to get query-level visibility into your MongoDB deployment. honeycomb-tcpagent parses TCP traffic between MongoDB clients and servers, and reconstructs queries in a friendly JSON format. Honeycomb helps you explore this data to quickly uncover anomalies.

Get started with Honeycomb and run the agent, or keep reading for more background and examples.

Are you running a database that’s not MongoDB? Let us know! Support for MySQL is already in the works.

Database Observability Means Lots of Questions About Lots of Data

For any serious database performance work, the ability to fully capture a workload is invaluable. The power to slice and dice by any criteria you want, doubly so. What’s the read vs write breakdown? Okay, how about for a specific collection? What’s the 99th percentile latency for a particular query family? Maybe network throughput seems high compared to query volume — are some queries returning too much data? Is a given server handling abnormally many queries? Is a particular client sending too many queries?

And so on.

Unfortunately, the aggregate statistics exposed by db.serverStatus() and the like can only take you so far in answering these types of complex questions. Slow query logs are very useful, but often hide half the story: performance problems can be caused by many relatively fast queries, rather than a few slow ones. But on an I/O-bound database, full query logging tends to limit database throughput; enabling full query logging on an already-struggling database is akin to trying to put out a grease fire with water.

There Must Be Another Way

Another approach is to analyze actual network traffic between database clients and servers. By using a packet capturing mechanism to passively inspect MongoDB TCP traffic, it’s possible to reconstruct every request and response in real time. This strategy is very efficient, and doesn’t require any database reconfiguration or instrumentation in the application layer. honeycomb-tcpagent does exactly this, and pipes structured JSON to stdout for our honeytail connector to forward on:

honeycomb-tcpagent | honeytail -d "MongoDB" -p json -k $WRITEKEY -f -

Zero to Root Cause in 60,000 Milliseconds

For an example of what you might do with this data, imagine that MongoDB CPU usage is creeping upward, as shown in the graph below. Let’s find out why before the database runs into serious trouble.

We hypothesize that we’re handling more queries, or that individual queries are becoming more expensive — perhaps because of a missing index. So let’s look at overall query count and latency.

There are no clear trends in these aggregates. But it could be that while the bulk of our query workload is stable, a few queries are behaving badly—yet not enough to skew the aggregated metrics. Let’s break down latency by collection and normalized query family:

There’s a lot of data here — can we discern any trends in individual series?

There we go. Looks like we have a new query pattern that’s slow and getting slower. Now we can go add the right index.

But Don’t Forget About Logs and Metrics Just Yet

TCP-level analysis is not a panacea. You still need system statistics and slow query logs for high-level instance metrics and for database-internal data (such as lock contention) that can’t be extracted from the TCP stream. Honeycomb helps you ingest server logs and statistics to get a complete picture of what’s going on.

Try it out

Sign up to use Honeycomb with your own data, check out the agent’s source code if you’re curious, and don’t hesitate to get in touch!

System calls: an amazing interface for instrumentation

When you’re debugging, there are two basic ways you can poke at something. You can:

  • create new instrumentation (like “adding print statements”)
  • use existing instrumentation (“look at print statements you already added”, “use Wireshark”)

When your program is already running and already doing some TERRIBLE THING YOU DO NOT WANT, it is very nice to be able to ask questions of it (“dear sir, what ARE you doing”) without having to recompile it or restart or anything.

I think about asking questions of a program in terms of “what interfaces does it have that I can observe?”. Can I tell which system calls that program is using? Can I look at its network requests? Can I easily see database queries?

interfaces

In this post, we’ll talk about my very favourite interface: system calls

what’s a system call?

Your program does not know how to open files: it doesn’t know what a hard drive is, or how to read a filesystem, or any of that. However, your operating system does know all that stuff!

System calls are the interface that your program (no matter what programming language it’s in) uses to ask the operating system to do that kind of work on its behalf.

why system calls are an amazing interface

System calls are my favorite interface to debug because every program uses system calls.

A few examples of questions that you can ask using system calls:

  • Which files is my program opening or writing to right now? This is surprisingly useful – it has uncovered hidden configuration files and log files for me many times.
  • Which other programs is my program executing? You can see every program that’s being started, and all its arguments
  • Which hosts is my program connecting to (via the connect system call)? This can also help you see timeouts.

These are all pretty simple questions, but without being able to observe which system calls your program is using, they can be very hard to answer!

How do I see which system calls my program is using?

You use a program called strace on Linux! Or dtruss on OS X.
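A few illustrative invocations (the PID here is a placeholder, and the exact system call names can vary a bit across platforms and strace versions):

strace -f -p 12345                          # attach to a running process and follow its children
strace -f -e trace=open,openat -p 12345     # which files is it opening?
strace -f -e trace=execve -p 12345          # which programs is it executing?
strace -f -e trace=connect -p 12345         # which hosts is it connecting to?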

I have a fanzine about how to use strace, if you are new to it and would like to learn more!