Filtering in Context: Get Your Investigation On

File under: little things that go a long way. By popular demand, right-click and filter!

Stay in context

Filtering via right-click keeps you in the context of your investigation. For example: when looking at the summary table, if something looks like an avenue for further investigation, you should be able to dig in without losing your place. (There’s a reason, after all, that the right-click menu is called a “context menu”!)

Specifically, in the Honeycomb UI, it keeps you from having to scroll around (and either remember values or cut & paste them into the query builder).

Click to filter

It’s simple: right-click on any cell in the summary table below the graph or in the data mode table:

As you can see above, right-clicking on a cell containing a value gives you the ability to include or exclude that value, and to break down by that particular column.

And right-clicking on a cell that doesn’t contain a value gives you the ability to include/exclude events based on whether the column has a value.

If you’re using the summary table below the graph, and your query has breakdowns, another operation is available.

“Only show me events in this group” adds filters for every column/value in the breakdown, then removes the breakdown.
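
For example (with column names invented for illustration), if your query breaks down by status and endpoint and you pick that option on the row where status is 500 and endpoint is /export, the result is roughly:

    Before:  breakdown on status, endpoint; no filters
    After:   no breakdown; filters status = 500 and endpoint = /export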

If you’re already in edit mode in the query builder, adding filter clauses doesn’t run a new query. It simply adds the clauses to the builder. You can add multiple new filter clauses (maybe along with switching the filter match to Any), then run the query when you’re done (using shift-enter to keep from having to mouse back up to Run Query). If you aren’t in edit mode, we’ll run the new query immediately.

Stacked Graphs in Honeycomb!

The most common visualization for time series data is the line graph. Seeing each group as an independent line can make it very easy to see what’s going on relative to other lines, but line graphs develop problems when there are many similar lines. Honeycomb highlights the line in the graph when you mouse over the summary table entry for that group, which can help distinguish the lines from one another, but sometimes there are still too many lines to really make sense of things.

There are other visualization types that can offer a clearer view of what’s going on for this type of data. Perhaps the most common alternative (and the most commonly requested by users!) is the stacked graph, otherwise known as an area chart, stacked area chart, stacked area graph, or half a dozen other names. Here’s an example of what it looks like today in Honeycomb:

The interesting thing about this graph is that there’s a pretty clear discontinuity at 8:50pm. The line graph is so busy (given how many breakdowns are similarly valued) that the change is a challenge to see:

Switching between line and stacked graphs is as easy as a toggle in the gear menu:

The order from top to bottom in the area graph mirrors that of the summary table below the graph. So purple in the above example is the first row in the summary table, with orange below, then olive. Changing the sort order in the table also updates the graph so that this always holds:

Changing sort order, as is plainly visible here, can obscure or illuminate interesting events (watch the graph at 8:50pm, 3:10am and 10:20am), so play around and change the sort of each of the aggregates to see if anything pops out.

What other visualization types are you most excited about? Let us know at support@honeycomb.io or @honeycombio

Dynamic Sampling in Honeytail

A while ago I wrote a three-part series on sampling, covering an introduction, some simple, straightforward ways to do it, and some ideas for fancy implementations. I’m happy to say that that work has made its way into Honeytail, our log-tailing agent.

Dynamic sampling in Honeytail works with a two-phase algorithm: it measures the frequency of values in one or more columns for 30 seconds, computes an appropriate sample rate for each value by fitting a logarithmic curve to the traffic, then uses those rates for the following 30 seconds. While it’s using those rates it is, of course, measuring the traffic again to compute updated rates for the next 30-second window. In this way it continuously adapts to the shape of your traffic, applying the best sample rate to each event as it goes by.
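
To make the shape of that algorithm concrete, here’s a minimal sketch of a two-phase dynamic sampler in Go. It’s illustrative only: the log-based rule below is one plausible way to “fit a logarithmic curve” and is not necessarily the exact math Honeytail uses, and the status-code keys in main are invented for the example.

    package main

    import (
        "fmt"
        "math"
        "sync"
        "time"
    )

    // dynSampler is a two-phase sampler sketch: it counts how often each key
    // is seen during one window, then uses those counts to pick per-key
    // sample rates for the next window. Rare keys stay at rate 1 (always
    // kept); chatty keys get sampled more heavily.
    type dynSampler struct {
        mu       sync.Mutex
        goalRate float64        // overall target, e.g. 10 means keep roughly 1 in 10
        counts   map[string]int // observations in the current window
        rates    map[string]int // rates computed from the previous window
    }

    func newDynSampler(goalRate float64, window time.Duration) *dynSampler {
        s := &dynSampler{goalRate: goalRate, counts: map[string]int{}, rates: map[string]int{}}
        go func() {
            for range time.Tick(window) {
                s.rotate()
            }
        }()
        return s
    }

    // sampleRate records one observation of key and returns the rate to
    // apply to this event (1 means keep it).
    func (s *dynSampler) sampleRate(key string) int {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.counts[key]++
        if r, ok := s.rates[key]; ok {
            return r
        }
        return 1 // keys not seen in the last window are always kept
    }

    // rotate closes out a window: the per-window keep budget (total/goalRate)
    // is apportioned to each key in proportion to the log of its count, which
    // is what keeps infrequent values visible.
    func (s *dynSampler) rotate() {
        s.mu.Lock()
        defer s.mu.Unlock()
        var total, logSum float64
        for _, c := range s.counts {
            total += float64(c)
            logSum += math.Log10(float64(c) + 1)
        }
        newRates := make(map[string]int, len(s.counts))
        if logSum > 0 {
            perLogUnit := total / s.goalRate / logSum
            for k, c := range s.counts {
                keep := math.Max(1, math.Log10(float64(c)+1)*perLogUnit)
                newRates[k] = int(math.Max(1, math.Round(float64(c)/keep)))
            }
        }
        s.rates, s.counts = newRates, map[string]int{}
    }

    func main() {
        s := newDynSampler(10, 30*time.Second)
        // Simulate one window: one chatty status code and two quiet ones.
        for i := 0; i < 100000; i++ {
            s.sampleRate("200")
        }
        for i := 0; i < 500; i++ {
            s.sampleRate("404")
        }
        for i := 0; i < 50; i++ {
            s.sampleRate("500")
        }
        s.rotate()           // normally the background ticker does this
        fmt.Println(s.rates) // something like map[200:19 404:1 500:1]
    }

Running the sketch on that simulated window, the dominant “200” traffic ends up heavily sampled while the rare “404” and “500” events are all kept, which is the behavior described above.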

After downloading the latest release (version 1.411 or newer), you can use this feature by updating your config file (in /etc/honeytail/honeytail.conf by default):

  • setting a sample rate with the SampleRate config entry (--samplerate command line flag)
  • specifying which fields should have their value distribution measured with one or more DynSample config entries (--dynsampling command line flag)
  • optionally adjusting the 30-second window to something more appropriate for your environment with the DynWindowSec config entry (--dynsample_window command line flag)
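
For example, a hypothetical invocation for web traffic keyed on an HTTP status field and a customer ID field might look like the following (the field names status and customer_id are placeholders for whatever your parser actually emits, and your usual parser, dataset, and write key options still apply):

    honeytail --samplerate 20 \
              --dynsampling status \
              --dynsampling customer_id \
              --dynsample_window 30 \
              ...your usual honeytail options...

The same settings can live in honeytail.conf via the SampleRate, DynSample, and DynWindowSec entries described above.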

Honeytail’s implementation of dynamic sampling is tuned to ensure infrequent events are seen and frequent events are more heavily sampled. This is just what you want when, for example, it’s important to see some of every customer’s traffic instead of having high-volume customers drown out low-volume customers’ traffic. It works great when someone starts sending you huge volumes of the same event and you still want to see what everybody else is sending.

To decide whether a field will make a good candidate for the dynamic sampler, try doing a COUNT of your traffic with that field as a BREAKDOWN in Honeycomb. Having an order of magnitude or two between the most frequent events and the least frequent events will give you good results. Here’s an example graph with that property, with good coverage across 4 orders of magnitude (note the log scale on the Y axis, which you can enable from the Gear menu):

[graph showing a good distribution of traffic]

Be aware, though, that if you try to use a field where some values aren’t notably more frequent than others, the sampler will do a poor job of deciding how to sample your traffic. For example, if you tried to use a unique request ID as the DynSample field, it would effectively turn off sampling entirely, since every value is equally rare and gets a sample rate of 1. I’d be happy to talk with you about the shape of your data if you’d like help choosing appropriate settings.

Give it a try! For web traffic, try it out using the HTTP status field as your key. For application traffic, try using a customer ID. (Or both! Then individual errors will still appear even for high-volume customers!)

Reflections on Monitorama 2017: From the Metrics We Love to the Events We Need

There were a bunch of talks at Monitorama 2017 that could be summed up as “Let me show you how I built this behemoth of a metrics system, so I could safely handle billions of metrics.” I saw them, and they were impressive creations, but they still made me a little sad inside.

The truth is that most of us don’t actually need billions of metrics. Sure, there are the Googles and Facebooks (and legit - one of these presentations was from Netflix, which actually does need billions of metrics), but the rest of us really don’t. And I’m coming from a place of love - I also built a behemoth of a metrics system, with multiple tiers of aggregation and mirroring and high availability and fancy dashboards. And it was beautiful. But the real truth is that most of the metrics shoved into that system would have been better served by something different. Something I didn’t know about back then. Something that exists now.

The problem crept up on me. I didn’t see my few precious numbers multiplying so horrendously until it was too late. I started by watching the core metrics on all my servers - CPU, memory, disk utilization and capacity, network throughput. But then I had questions. This network throughput (I’m on a webserver)… what does it look like? It was a small leap to build a web log tailer and start building metrics about the HTTP status codes of the web traffic flowing through the machine. And it worked! I had graphs of HTTP status so I could see the success and failure rates of my web traffic.

Soon, though, more questions came in, and I expanded the log tailer to capture more nuance in the traffic. It would generate multiple metrics that would then be summed and aggregated by the metrics infrastructure, giving both overall numbers and the ability to dive into specific questions. How many failures are from GETs vs. POSTs? How many are from which webserver? What’s the 90th percentile of the response time instead of just the average? What had started as a few (fewer than 30) metrics per host soon became 500 (and ultimately closer to 1,500) per server. 500 metrics times 200 hosts for just the web tier and we’re at 100,000 metrics?! No wonder people are trying to build such amazing systems to handle the load. (And this was still just the beginning.)

But here’s the secret. Those 470 out of 500 “metrics” per server that I was trying to push? They are not system metrics, which is what this metrics system was designed for. They’re much closer to application metrics. And are they even metrics? The questions I’m asking are things like “Which requests to my webserver failed? Why? Who were they from? What customers did they impact?” These are not questions a metrics system can answer, because the answers revolve around keeping high-cardinality contextual data. Distilling those events into the few metrics I had originally chosen lost all the context necessary to answer those questions.

The key to solving this problem was also interspersed in many of the Monitorama talks. They called it by many different names. Betsy Nichols talked about adding context to your metrics system. Bryan Liles and a few others talked about structured logging. There were many people mentioning tracing.

Metrics are here to stay - they’re an effective way of condensing information about the state of your system into numbers you can put up on a graph, giving wonderful visualizations of how your infrastructure has changed over time. The (relatively) recent addition of tags to metrics has allowed even better visualizations, though underneath it still suffers from the problem of metric volume explosion. Several talks mentioned how you should be adding a myriad of tags to your metrics… but not IP address! Not customer ID! Those are too high cardinality and will blow out your storage.

As Roy Rapoport pointed out:

[photo of a slide from Roy's talk]

The shift from using metrics for everything to an awareness of the importance of context marks our next evolution as an industry. We’re less interested in the distilled numbers representing a state and more interested in being able to pick that apart and track it down to individual events, customers, servers, stack traces, or states. The path to this kind of analysis is through recording wide events that carry all those high-cardinality keys alongside the rest of the data that gives your events context.
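
For the sake of illustration, a single wide event for one HTTP request might look something like this (the field names are hypothetical; the point is that high-cardinality identifiers like customer and request IDs ride along with everything else):

    {
      "timestamp": "2017-05-26T20:50:01Z",
      "service": "api",
      "hostname": "web-14",
      "endpoint": "/export",
      "status": 500,
      "duration_ms": 412,
      "customer_id": "cust_8471",
      "request_id": "9f3c2a10-77be-4d2d-a5b1-0c4a5d9e21f7",
      "build_id": "v2017.05.26-3",
      "error": "upstream timeout"
    }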

But there’s another reason people like metrics, besides their (mostly) being easy to reason about: they’re cheap. Collecting, transmitting, storing, and analyzing every event your service creates requires an infrastructure as large as the one serving your primary business. Who has the budget for an analytics platform as large as your production infrastructure? (CERN excepted.) Only by intelligently reducing the total data set while retaining visibility into the parts of your traffic that yield the most insight can you hope to manage costs while still gaining the benefit of a modern approach to observability. Not all events are of equal interest - the customer generating 20k events per second may care less about each individual event than the one calling your service 10 times per day.

Through a good understanding of the important aspects of the business, you can safely discard 99% of events collected, and through a good understanding of your application combined with good tools, you can throw away 99% of the metrics you’re collecting. This is the direction we need to go, and we need to take our services with us, kicking and screaming.

Happy observability.