Dear Miss O11y,
I remember reading quite interesting opinions from you about usage of metrics and traces in an application. Did you elaborate on those points in a blog post somewhere, so I can read your arguments to forge some advice for myself? I must admit that I was quite puzzled by your stance regarding the (un)usefulness of metrics compared to traces in apps in some contexts (debugging).
Engineer in Travel Technology via Mastodon
This is a truly excellent question, and I haven’t really put my thoughts in text form. This feels like as good a place as any!
My main gripe with metrics is that they lose all the interesting context in both directions:
- You lose dimensionality. The more labels you add to a metric, the harder the backend is to maintain, because metrics systems aren’t built for that many label combinations. As a result, engineers tend to stick with a few basic labels.
- You lose granularity. Because metrics aggregate over time, you can no longer see what individual requests or processing contexts are doing.
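To put a rough number on the first point, here is a minimal sketch of how label combinations grow. The label names and the number of distinct values for each are made up for illustration; real systems will differ.

```python
from math import prod

# Hypothetical labels and how many distinct values each one can take.
base_labels = {"service": 20, "endpoint": 50, "status_code": 5}
extra_labels = {"user_agent": 200, "language": 30}


def series_count(labels: dict) -> int:
    """Upper bound on distinct time series for a single metric name."""
    return prod(labels.values())


base = series_count(base_labels)
wide = series_count({**base_labels, **extra_labels})

print(f"basic labels: {base:,} possible series")
print(f"with two extra labels: {wide:,} possible series")
```

Two innocent-looking labels turn thousands of possible series into tens of millions, which is why teams tend to stop at the basics.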
What are metrics?
This might sound obvious, but metrics are not graphs. Metrics are time-series data, aggregated by time windows, stored based on a small, predefined set of labels (context). A lot of the people I’ve talked to over the years conflate graphs on dashboards with metrics. You can, however, generate a graph based on tracing data… if you have systems that are built for running those queries fast.
The primary benefit of metrics over traces is that they are cheaper to store. With metrics, you’ve removed and aggregated all that context, so you’re only storing numbers. We’ll get into traces later.
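As a rough illustration of what "removed and aggregated all that context" means, here is a small sketch that turns raw request events into a windowed counter. The event shape, the 60-second window, and the choice of `status` as the only label are all assumptions for the example.

```python
from collections import Counter

# Hypothetical raw request events: (unix_timestamp, user_id, url, status).
events = [
    (1000, "alice", "/search", 200),
    (1012, "bob", "/book", 500),
    (1047, "carol", "/book", 500),
    (1075, "alice", "/search", 200),
]

WINDOW_SECONDS = 60


def to_metric(events, window=WINDOW_SECONDS):
    """Aggregate events into a counter keyed by (window_start, status)."""
    counts = Counter()
    for ts, _user, _url, status in events:
        window_start = ts - ts % window  # align to the start of the window
        counts[(window_start, status)] += 1
    return dict(counts)


metric = to_metric(events)
for (window_start, status), count in sorted(metric.items()):
    print(window_start, status, count)
```

What is left is cheap to store, but the user and URL are gone: nothing in `metric` can tell you who hit the 500s or where.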
When are metrics good?
Metrics are great when you don’t have context. This means things like CPU Time, Memory used, Queue Length, etc. These don’t have a start point, or a “context” that we can hang things off. They’re also what I would consider useful, as you can use them to understand when you need to scale up your system because you’re using too much CPU/Memory, or scale out because your queue is getting too long. You’re not, however, using them to understand your system and users: you’re using them to understand the infrastructure that supports them.
When are metrics bad?
Metrics aren’t necessarily ever “bad.” It’s more a case of them being a lot less useful compared to the raw data from tracing.
Metrics are based on you defining the dimensions you want to query ahead of time. To put that another way, you need to know what questions you want to ask when things go wrong while you’re writing the code (known-knowns). I’m not sure about you, but my precognitive abilities are unfortunately still a little primitive.
The difference is that when querying raw data, you can slice and query it in any way you want, instead of having to define those queries up front. Want to know how many users are getting errors by their user-agent, language pack, and URL? With metrics, you’re out of luck as you need a new release to add those labels. If you already had those, either your foresight is amazing, or you knew that was going to be a problem.
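Here is a minimal sketch of that kind of query-time slicing over raw span data. The field names (`user_id`, `user_agent`, `language`, `url`, `error`) and the records are hypothetical; the point is only that the grouping dimensions are chosen when you ask the question, not when you ship the code.

```python
# Hypothetical span records; the field names are illustrative only.
spans = [
    {"user_id": "u1", "user_agent": "Firefox", "language": "en", "url": "/book", "error": True},
    {"user_id": "u1", "user_agent": "Firefox", "language": "en", "url": "/book", "error": True},
    {"user_id": "u2", "user_agent": "Firefox", "language": "en", "url": "/book", "error": True},
    {"user_id": "u3", "user_agent": "Safari", "language": "fr", "url": "/search", "error": True},
    {"user_id": "u4", "user_agent": "Safari", "language": "fr", "url": "/search", "error": False},
]


def users_with_errors(spans, *dimensions):
    """Count distinct users who hit errors, grouped by any fields picked at query time."""
    users_by_key = {}
    for span in spans:
        if span["error"]:
            key = tuple(span[d] for d in dimensions)
            users_by_key.setdefault(key, set()).add(span["user_id"])
    return {key: len(users) for key, users in users_by_key.items()}


by_all_three = users_with_errors(spans, "user_agent", "language", "url")
by_language = users_with_errors(spans, "language")
print(by_all_three)
print(by_language)
```

Asking a different question is just a different argument list. With a metric, each of those groupings would have needed its own labels defined in advance.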
But traces are expensive and metrics are cheap, so I can have more!
This is the most common argument I hear for metrics over traces. It’s true, and there’s no denying it: storing the trace data for every request is expensive, which is why you shouldn’t. Sampling traces is far superior to aggregating into metrics, because the traces you keep retain their full context and granularity. Provided your observability backend supports sampled data (for example, by weighting each kept trace by its sample rate), you don’t lose visibility of the overall volumes either.
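A rough sketch of how sampling can keep context while still recovering volumes: each kept event records the rate it was sampled at, and totals are estimated by summing those rates. The function names, the event shape, and the 1-in-10 rate are assumptions for illustration, not any particular backend's behavior.

```python
import random


def sample(events, rate, rng):
    """Keep roughly 1 in `rate` events, stamping the rate on each kept event."""
    kept = []
    for event in events:
        if rng.random() < 1 / rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_total(sampled):
    """Each kept event stands in for `sample_rate` original events."""
    return sum(event["sample_rate"] for event in sampled)


rng = random.Random(42)  # seeded so the example is repeatable
events = [{"id": i, "url": "/book", "user_id": f"u{i % 100}"} for i in range(10_000)]

kept = sample(events, rate=10, rng=rng)
estimate = estimated_total(kept)
print(f"stored {len(kept)} of {len(events)} events, estimated total {estimate}")
```

The stored events are a fraction of the original volume, yet every one of them still carries its full set of fields, and the estimate stays close to the true count.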
Where do logs fit in?
My honest opinion is that they don’t. Logs relate to point-in-time data, and as such, lack the context we get from traces. In almost any use case for logs, a span would be a better construct as it has more data and more utility.
If you do have point-in-time events, these are better represented as span events than logs. This means that they’re part of the trace, and inherently, you have all the surrounding context. A good example of this is exceptions in code, as they happen at a point inside a span, and have contextual data like the message, stack trace, etc. Keeping exceptions inside the trace, as events that are queryable, means that when you’re looking at them you have all the surrounding context to understand the entire flow that led to—and followed—the error.
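To make the shape concrete, here is a toy sketch of an exception recorded as an event on a span. The `Span` class below is a hypothetical stand-in written for this example. It loosely mirrors the idea of span events in tracing APIs such as OpenTelemetry, but it is not that library's API.

```python
import time
import traceback
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, minimal span: a named unit of work with context and events."""
    name: str
    trace_id: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def add_event(self, name, **attributes):
        # A span event is a point-in-time record that lives inside the span.
        self.events.append({"name": name, "timestamp": time.time(), **attributes})

    def record_exception(self, exc):
        self.add_event(
            "exception",
            type=type(exc).__name__,
            message=str(exc),
            stacktrace="".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        )


span = Span("charge_card", trace_id="abc123", attributes={"user_id": "u1", "url": "/book"})
try:
    raise ValueError("card declined")
except ValueError as exc:
    span.record_exception(exc)

event = span.events[0]
print(event["type"], event["message"], span.attributes)
```

The exception is not a free-floating log line here: it sits next to the span's attributes and trace ID, so anything you can query about the request you can also query about the error.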
Essentially, logs are just spans without context—or said another way, spans are just fancy logs.
The other area where logs are useful is in audit platforms and in sources like firewall logs. These aren’t attached to a trace context, so structured logs from them can be useful events to query independently.
TL;DR: The answer is to trace
When it comes to observable systems, there’s just no competition for traces. Without traces, we lose precious context—arguably one of the most important things. Use metrics where you have no context, use them where they provide benefit, and correlate from traces to metrics when that contextless data has relevance, not the other way around.
Forget about application logs and upgrade to spans.