Metrics: not the observability droids you're looking for

I went to Monitorama last year for the first time. It was great; I had a terrific time. But I couldn’t help but notice how speaker after speaker in talk after talk spent time either complaining about the limitations of their solutions, or proudly/sadly showing off whatever terrible hacks they had done to get around the limitations of how events were being stored on disk.

I went to Strange Loop a couple of weeks ago, and the same thing happened in all the talks I saw or heard of that were about monitoring- or analytics-related topics. People were saying things like “it would sure be nice to be able to group by high-cardinality dimensions, but that’s impossible.”

High-cardinality explained (and why you want it)

For those who don’t spend their days immersed in this shit, cardinality is the number of unique values in a dimension. So for example, if you have 10 million users, your highest possible cardinality is something like a UUID per user. Last names will be lower-cardinality than unique identifiers. Gender will be a low-cardinality dimension, while species will have the lowest cardinality of all: {species = human}.
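To make the definition concrete, here is a minimal sketch that counts the distinct values per field across a handful of events. The field names and sample data are made up for illustration.

```python
from collections import defaultdict


def cardinality(events):
    """Return the number of distinct values seen for each field."""
    seen = defaultdict(set)
    for event in events:
        for field, value in event.items():
            seen[field].add(value)
    return {field: len(values) for field, values in seen.items()}


# Hypothetical sample events; the field names are illustrative only.
events = [
    {"user_id": "u-001", "last_name": "Smith", "species": "human"},
    {"user_id": "u-002", "last_name": "Smith", "species": "human"},
    {"user_id": "u-003", "last_name": "Nguyen", "species": "human"},
]

print(cardinality(events))
# {'user_id': 3, 'last_name': 2, 'species': 1}
```

Scale the same three fields up to 10 million users and user_id climbs to 10 million, while species stays at 1.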

When you think about useful fields you might want to break down or group by…surprise, surprise: all of the most useful fields are usually high-cardinality fields, because they do the best job of uniquely identifying your requests. Consider: uuid, app name, group name, shopping cart id, unique request id, build id. All incredibly, unbelievably useful. All very high-cardinality.

And yet you can’t group by them in typical time series databases or metrics stores. Grouping is typically done using tags, and the number of tags you can create has a hard upper limit.

[Image: droid running into a wall]

People hack around this in all kinds of terrible ways, because that data is so desperately valuable. I remember auto-generating dashboards for certain important users, or for the top ten users according to some column. Ben used to generate a tag for each new build id, then reap old tags continuously so we could stay within the tags limit. Etc. But why so much pain and suffering?

Metrics means tags, and lots of tags means expensive

It boils down, essentially, to the metric. The metric is a dot of data, a single number with a name and some identifying tags. All of the context you can get has to be stuffed into those tags. But writing all those tags causes a write explosion, and that is expensive because of how metrics are stored on disk. Storing a metric is dirt cheap, but storing a tag is expensive; and storing lots of tags per metric will bring your storage engine to a halt fast.
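A rough back-of-the-envelope sketch of why this blows up. It assumes a store that keeps one time series per unique combination of tag values for a given metric name. That is an assumption for illustration, not a description of any particular product, and the tag names and counts are invented (apart from the 10 million users from earlier).

```python
from math import prod


def worst_case_series(tag_cardinalities):
    """Upper bound on distinct time series for one metric name,
    assuming one series per unique combination of tag values."""
    return prod(tag_cardinalities.values())


# Hypothetical low-cardinality tags on a single metric.
low = {"region": 4, "status_code": 5, "host": 100}

# The same metric with one high-cardinality tag added.
high = dict(low, user_id=10_000_000)

print(worst_case_series(low))   # 2000
print(worst_case_series(high))  # 20000000000
```

This is a worst case, since not every combination will actually occur, but it shows how a single high-cardinality tag multiplies everything else.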

This isn’t the fault of metrics. Metrics are what they are: a cheap, fast way to aggregate lots of details about your system at write time. Metrics are very good at that. What they’re not good at is: correlating lots of things together, providing context, drilling down into a unique request, a unique user, etc.

Metrics are terrific for describing the health of the system. Metrics are critical to monitoring the known unknowns of the system.

Nines don’t matter if users aren’t happy.

I’m going to make a controversial and only partly-true statement: the health of the system…basically doesn’t matter. When you’re in unknown-unknown territory, what actually matters is the health of the request.

Consider a scenario where one of your four AWS availability zones is down. Your system is 25% down! But you architected it well, and no users have been impacted. Do you care? Should you get paged?

On the flip side, consider a scenario where you have 99.999% uptime…but your users are complaining constantly. They’re hitting edge cases, timeouts in the client code, unpredictable behavior. Do you care?

I sure do. Nines don’t matter if users aren’t happy.
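Here is a small sketch of how a healthy-looking aggregate can hide a miserable user. The traffic is invented: 999 users with 10 successful requests each, plus one hypothetical user whose 10 requests all fail.

```python
from collections import Counter


def success_rate(requests):
    """Fraction of all requests that succeeded (the system-wide view)."""
    ok = sum(1 for r in requests if r["ok"])
    return ok / len(requests)


def success_rate_by(requests, field):
    """Success rate broken down by the value of one field."""
    totals, oks = Counter(), Counter()
    for r in requests:
        totals[r[field]] += 1
        oks[r[field]] += r["ok"]  # True counts as 1, False as 0
    return {key: oks[key] / totals[key] for key in totals}


# Hypothetical traffic: 9,990 good requests spread over 999 users...
requests = [{"user_id": f"u-{i % 999}", "ok": True} for i in range(9990)]
# ...and one user for whom every request fails.
requests += [{"user_id": "u-unlucky", "ok": False} for _ in range(10)]

print(f"{success_rate(requests):.3%}")  # 99.900%

per_user = success_rate_by(requests, "user_id")
worst = min(per_user.items(), key=lambda kv: kv[1])
print(worst)  # ('u-unlucky', 0.0)
```

The system-wide number looks great. You only find the unlucky user by breaking the data down by user_id, which is exactly the high-cardinality group-by that tags make painful.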

[Image: droid having feelings]

If the health of the request is key, you need a different set of tools. You need high-cardinality, dynamically sampled events that help you trace a request as it hops between services and storage systems, and that help you slice and dice by unique user ids or app ids combined with any or all other dimensions. You need to be able to select individual database queries and then fetch all the context of the entire request through all other services to figure out where it came from and why.
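A minimal sketch of what slicing sampled events can look like. The assumptions here are mine: each stored event carries a sample_rate field saying how many original requests it stands in for, and the field names and values are invented. It is one common way to handle sampling, not a description of any specific tool.

```python
from collections import defaultdict


def count_by(events, *fields):
    """Estimate request counts grouped by any combination of fields.

    Each stored event stands in for `sample_rate` original requests,
    so it contributes that many to the estimated count.
    """
    counts = defaultdict(int)
    for event in events:
        key = tuple(event.get(f) for f in fields)
        counts[key] += event.get("sample_rate", 1)
    return dict(counts)


# Hypothetical events: boring successes are sampled 1-in-20,
# errors are kept every time.
events = [
    {"user_id": "u-17", "app_id": "checkout", "status": 200, "sample_rate": 20},
    {"user_id": "u-17", "app_id": "checkout", "status": 500, "sample_rate": 1},
    {"user_id": "u-42", "app_id": "search", "status": 200, "sample_rate": 20},
    {"user_id": "u-42", "app_id": "checkout", "status": 500, "sample_rate": 1},
]

print(count_by(events, "user_id"))
print(count_by(events, "app_id", "status"))
```

Nothing about the grouping was decided at write time. Any field on the event, or any combination of fields, can be the group-by key after the fact.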

At Honeycomb, we started out by writing a storage engine: a distributed, write-optimized columnar store that does aggregation at read time. People told us we were nuts. We knew we were nuts! You should NEVER write a database! But we also knew there was not a single other db system out there that offered the storage and retrieval characteristics that we knew were both possible and non-negotiable.
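To give a feel for the general idea, here is a toy, single-process sketch of a columnar layout with append-only writes and aggregation at read time. It is emphatically not Honeycomb's implementation: the class, its methods, and the sample fields are all hypothetical, and it ignores distribution, persistence, and everything else that makes the real problem hard.

```python
from collections import defaultdict


class ToyColumnStore:
    """Append-only store that keeps one list of values per field."""

    def __init__(self):
        self.columns = {}  # field name -> list with one slot per row
        self.rows = 0

    def append(self, event):
        # A field seen for the first time gets backfilled with None,
        # so every column always has exactly one slot per row.
        for field in event:
            if field not in self.columns:
                self.columns[field] = [None] * self.rows
        # Writes only append; nothing is aggregated here.
        for field, column in self.columns.items():
            column.append(event.get(field))
        self.rows += 1

    def avg_by(self, group_field, value_field):
        """Aggregate at read time, touching only the two columns needed."""
        sums, counts = defaultdict(float), defaultdict(int)
        groups = self.columns.get(group_field, [])
        values = self.columns.get(value_field, [])
        for group, value in zip(groups, values):
            if group is None or value is None:
                continue
            sums[group] += value
            counts[group] += 1
        return {group: sums[group] / counts[group] for group in sums}


store = ToyColumnStore()
store.append({"build_id": "b-101", "duration_ms": 10})
store.append({"build_id": "b-101", "duration_ms": 30})
store.append({"build_id": "b-102", "duration_ms": 50, "user_id": "u-7"})

print(store.avg_by("build_id", "duration_ms"))
# {'b-101': 20.0, 'b-102': 50.0}
```

Because raw values are kept and the math happens at query time, a new high-cardinality field such as build_id costs one more column rather than a new pre-aggregated series per value.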

Stop saying it’s not possible: look, we did it.

I would never knock the metric; it’s too valuable, too central to monitoring. But increasingly, monitoring alone isn’t enough. More and more of us need not just monitoring for known problems, but true observability: the building blocks of introspection that help us explore the unknown unknowns. You can’t run a complex distributed system without instrumentation and observability.

If you have these problems, give Honeycomb a try. You might be surprised how easy things are with the right tool…

Have thoughts on this post? Let us know via Twitter @honeycombio.