The Cost Crisis in Metrics Tooling: Whitepaper Excerpt

The following is an excerpt from Charity Majors’ new whitepaper, The Cost Crisis in Metrics Tooling.

In my February 2024 piece The Cost Crisis in Observability Tooling, I explained why the cost of tools built atop the three pillars of metrics, logs, and traces—observability 1.0 tooling—is not only soaring at a rate many times higher than your traffic increases, but has also become radically disconnected from the value those tools can deliver. Too often, as costs go up, the value you derive from these tools declines.

This blog post struck a nerve. I heard from observability teams who spent the last year doing nothing but grappling with cost containment. I heard from engineers who scoffed at my anecdote about individual metrics that cost $30,000 per month and relayed hair-raising tales of metrics they shipped that each cost tens of thousands of dollars over the weekend. 

I also received many questions. This material is dense, and not widely understood. In this companion piece I will take a slower, deeper dive into the cost models and tradeoffs involved with metrics-backed tooling, since that remains the load-bearing pillar of most teams’ toolkits. I’ll also make the argument that metrics are a niche power tool in our arsenal.

Metrics are a (very) mature technology

When it comes to monitoring software, metrics-backed dashboards have long been the state of the art. Metrics are cheap, fast, and sparse, and the technology is decades old, so the tooling (and integrations) are extremely mature. Most APM and RUM tools are built using metrics primitives. Tools like Datadog, Prometheus, and Chronosphere are what most engineers reach for to understand their systems. The mental model for using metrics is not especially intuitive, but for historical reasons it is by far the most widely understood.

Teams are accustomed to instrumenting their software with metrics, deriving alerts from metrics, and using metrics-backed dashboards to debug their code. Logs get used for debugging, but they’re too unruly and expensive to use as a jumping-off point, and traces are perceived as too niche, expensive, or heavily sampled. At most, teams have learned to jump from SLOs to dashboards, to logs, to traces, visually correlating data by timestamps and the shape of spikes or copy-pasting IDs from tool to tool.

It strikes me as odd that metrics are still the dominant data type for systems and application data. In every other part of the business, our tools are backed by columnar stores or other relational databases, because we understand that data is made valuable by context.

In my opinion, the present state is a holdover from the days when engineering was seen as a cost center. I believe a shift is already well underway, bringing systems and application data into the relational fold—and I believe it is picking up speed thanks to soaring costs and the relentless explosion in underlying system complexity.

What exactly is a metric?

The term “metric” is commonly used in two different contexts: 

  • The “metric” is a data type, a number with some tags (or “labels”) appended, which can be stored in a variety of formats, such as counters, gauges, and histograms. 
  • “Metrics” is also often colloquially used as a generic synonym for telemetry data.

Metrics (the data type) are traditionally stored in a time-series database (TSDB), which is a collection of data points (numbers) gathered by time. The only type of index a TSDB has is an index by time, and the only type of queries you can run are point queries by time and range queries over time. It stores no relational or contextual data whatsoever.
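
To make that concrete, here is a minimal, purely illustrative Python sketch (not any vendor's actual storage format) of the shape of time-series data: a name, some tags, a timestamp, and a number, plus the only kind of query a TSDB really supports.

  from dataclasses import dataclass

  # Illustrative only: the rough shape of a single time-series data point.
  @dataclass(frozen=True)
  class Point:
      metric: str        # e.g. "request_latency"
      tags: tuple        # e.g. ("endpoint:/home", "status:200")
      timestamp: int     # unix seconds -- the only thing that gets indexed
      value: float       # one number, with no surrounding event context

  series = [
      Point("request_latency", ("endpoint:/home", "status:200"), 1_700_000_000, 0.12),
      Point("request_latency", ("endpoint:/home", "status:200"), 1_700_000_010, 0.31),
  ]

  # Point and range queries by time are all you get; there is nothing relational to join on.
  def range_query(points, start, end):
      return [p for p in points if start <= p.timestamp <= end]

  print(range_query(series, 1_700_000_000, 1_700_000_005))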

All aggregation is performed at write time, not query time, including aggregates like averages and percentile buckets (95th, 99th, 99.99th, and so on). If you query for the 99th percentile latency across your fleet of app services, you get an aggregate of aggregates: the locally computed 99th percentile over a rolling window, rolled up across all instances. If you want to query for the 99.95th percentile, the 85th percentile, or any other percentile you did not compute at write time, you cannot.
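
Here is a minimal sketch of why that matters, using made-up latency numbers and no metrics library at all: averaging locally computed p99s is not the same thing as the real fleet-wide p99.

  import random
  import statistics

  # Purely illustrative: simulate raw request latencies (ms) on ten app instances.
  random.seed(42)
  hosts = [[random.lognormvariate(3, 1) for _ in range(1_000)] for _ in range(10)]

  def p99(samples):
      """Nearest-rank 99th percentile."""
      ordered = sorted(samples)
      return ordered[int(len(ordered) * 0.99) - 1]

  # What the metrics pipeline stores: one locally computed p99 per host per window.
  per_host_p99 = [p99(h) for h in hosts]

  # What the dashboard can show you later: an aggregate of those aggregates.
  avg_of_p99s = statistics.mean(per_host_p99)

  # What you actually wanted: the p99 over all of the raw events.
  true_p99 = p99([x for h in hosts for x in h])

  print(f"average of per-host p99s: {avg_of_p99s:.1f} ms")
  print(f"true fleet-wide p99:      {true_p99:.1f} ms")
  # The two numbers generally differ, and once only the per-host aggregates
  # have been stored, no query can recover the true value.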

When you install an agent, like StatsD or DogStatsD, it automatically ingests many system stats and churns out pretty graphs of CPU, memory, disk space, and the like with very little manual work. But nearly all the practical value you derive from these tools will come from instrumenting your code with custom metrics.

What is a custom metric?

Much like “metric,” the term “custom metric” has a colloquial meaning as well as a specific technical meaning. When an engineer talks about adding custom metrics to their code, they are typically conceptualizing each line of instrumentation as a custom metric. That’s why, when a metrics provider says you get a couple hundred custom metrics for free, it sounds like a lot!

Unfortunately, that’s not how time-series data works. When metrics are stored in TSDBs, every unique combination of metric name and tag values generates another distinct time-series, also known as a custom metric. This gets a little complicated, and can vary by implementation or backing store. 

Let’s take a simple example from the Datadog custom metrics billing page.

Calculating the footprint of an example metric

  statsd.increment('request_latency.increment', tags=[f'endpoint:{endpoint}', f'status:{code}'])

Let’s say you submit a metric, request_latency, from five hosts with two tag keys, endpoint and status. You only monitor four endpoints on this tiny application and track two status codes, 200 and 500, and you decide to submit it as a count metric.

That comes out to a footprint of 40 custom metrics for this metric: 5 hosts * 4 endpoints * 2 status codes.

Now, let’s say you operate at a moderate scale. Your app runs on about 1000 hosts, and you monitor 100 endpoints, or 5 methods and 20 handlers. There are 63 HTTP status codes in active use.

We’re already up to 6.3 million custom metrics (1000 hosts * 5 methods * 20 handlers * 63 status codes) and the only thing we can do is a simple count of requests broken down by host/endpoint/status code. Oof. Let’s keep going. Counts are nice, but latency is what we’re trying to measure.

  statsd.histogram('request_latency.histogram', random.randint(0, 20), tags=[f'endpoint:{endpoint}', f'status:{code}'])

If you submit request_latency as a histogram or distribution using nothing but the default aggregations max, median, avg, 95pc, and count, that’s 31.5 million custom metrics. You also want to compute the 99th, 99.5th, 99.9th, and 99.99th percentiles, right? Well, for every percentile bucket you want to compute at write time, you add another multiplier. If you want to store 10 buckets instead of 5, that’s a footprint of 63 million custom metrics.

A Datadog account comes with 100-200 custom metrics per host, depending on your plan. For every 100 ingested custom metrics over the allotment, you pay ten cents. That means you’d pay $63,000 per month just to collect barebones HTTP latency statistics. Keep in mind, we haven’t even tried to tag our metrics with anything really useful yet, like build ID or user ID.
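
To make the arithmetic in this section concrete, here is a small Python sketch that reproduces the numbers above. The 200-per-host allotment and the ten-cents-per-100 price are the figures quoted in this example, not a statement of anyone's current list pricing.

  def footprint(hosts, endpoints, statuses, aggregations=1):
      """Distinct time series (custom metrics) produced by one metric name."""
      return hosts * endpoints * statuses * aggregations

  def monthly_cost(series, hosts, free_per_host=200, price_per_100=0.10):
      """Illustrative overage cost, using the allotment and price quoted above."""
      billable = max(series - free_per_host * hosts, 0)
      return billable / 100 * price_per_100

  print(footprint(5, 4, 2))                            # 40: the tiny example
  print(footprint(1_000, 100, 63))                     # 6,300,000: count metric at moderate scale
  print(footprint(1_000, 100, 63, aggregations=5))     # 31,500,000: histogram, default aggregations
  series = footprint(1_000, 100, 63, aggregations=10)  # 63,000,000: ten percentile buckets
  print(f"${monthly_cost(series, hosts=1_000):,.0f}")  # ~$62,800/month after the free allotment

At this scale, the free allotment only knocks a couple hundred dollars off the roughly $63,000 bill.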

Costs are hard to predict, and harder to connect to value 

One of the challenges with metrics is that calculating your metrics footprint is hard to do in advance, and may change out from under you. Engineering teams rely on policy documents, best practices, and expert code reviews to control costs, only to get bitten by seemingly unrelated changes made by infrastructure teams—or even autoscaling.

In my previous example, you have 1000 hosts and 20 handlers. Think about what happens to your bill when:

  1. Your infrastructure team moves your app tier from 1000 xlarge EC2 instances to 4000 on-demand containers
  2. Your on-call needs to roll your entire app tier a few times inside of an hour, causing several thousand EC2 instances to spin up briefly before dying
  3. You deploy some new code that adds versioning for each handler
  4. You change the value for histogram_aggregates or histogram_percentiles in your YAML config file, not realizing it will apply to ALL histograms
  5. You auto-generate a tag based off an AWS instance tag, which changes overnight to a different string format

In scenario #1, your bill quadruples to $252,000/month without a single line of application code changing, and without any change in server-side capacity. How can it be so easy to accidentally quadruple your bill while making your systems 0% easier to understand or debug?
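
Here is scenario #1 run through the same back-of-the-envelope arithmetic (illustrative numbers only):

  # Scenario 1: the app tier moves from 1,000 hosts to 4,000 containers.
  # Same metric, same tags, same ten write-time aggregations; only the host count changes.
  hosts, endpoints, statuses, aggregations = 4_000, 100, 63, 10
  series = hosts * endpoints * statuses * aggregations     # 252,000,000 custom metrics
  print(f"${series / 100 * 0.10:,.0f} per month")          # $252,000, before any free allotment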

High costs are a problem, yes, and so is unpredictability. But the worst part is when costs are so untethered from value. When your bill goes up, it should be a function of scaling up capacity and/or making your software and systems easier to understand.

Get the free whitepaper

There’s still a ton of ground to cover on this topic. In the full whitepaper, we go over:

  • How experienced teams control costs
  • Using metrics for their intended purpose
  • The observability 2.0 cost model is very different
  • Structured logs are the bridge to observability 2.0
  • High cardinality is not the enemy! It’s your friend and ally
  • The sociotechnical consequences of better tooling
  • Further reading on the topic: articles and guides to increase your knowledge on costs



Charity Majors

CTO

Charity Majors is the co-founder and CTO of honeycomb.io. She pioneered the concept of modern observability, drawing on her years of experience building and managing massive distributed systems at Parse (acquired by Facebook), at Facebook, and at Linden Lab, makers of Second Life. She is the co-author of Observability Engineering and Database Reliability Engineering (O’Reilly). She loves free speech, free software, and single malt scotch.
