Part 2: Observability cost drivers and levers of control
I recently wrote an update to my old piece on the cost of observability and how much you should spend on observability tooling. The answer, of course, is “it’s complicated.” Really, really complicated. Some observability platforms are approaching AWS levels of pricing complexity these days.
In last week’s piece, we talked about some of the factors that are driving costs up, both good and bad, and about whether your observability bill is (or should be) more of a cost center or an investment. In this piece, I’m going to talk more in depth about cost drivers and levers of control.
Business cost drivers vs technical cost drivers
The cost drivers we talked about last week, and the cost drivers as Gartner frames them, are very much oriented around the business case. (All Gartner data in this piece was pulled from this webinar on cost control; slides here.)
In short, observability costs are spiking because we’re gathering more signals and more data to describe our increasingly complex systems, and the telemetry data itself has gone from being an operational concern that only a few people care about to being an integral part of the development process—something everyone has to care about. So far, so good. I think we’re all pretty aligned on this. It’s a bit tautological (companies are spending more because there is more data and people are using it more), but it’s a good place to start.
But when we descend into the weeds of implementation, I think that alignment starts to fray. There are technical cost drivers that differ massively from implementation to implementation.
Executives may not need to understand the technical details of the implementation decisions that roll up to them, but observability engineering teams sure as hell do. If there’s one thing we know about data problems, it’s that cost is always a first class citizen. There is real money at stake here, and the decisions you make today may reverberate far into the future.
Model-specific cost drivers
The pillars model vs consolidated storage model (“observability 2.0”)
Most of the levers that I am going to talk about are vendor-neutral and apply across the tooling landscape. But before we get to those, let’s briefly talk about the ones that aren’t.
The past few years have seen a generational change in the way instrumentation is collected and stored. All of the observability companies founded pre-2020 (except for Honeycomb and, it seems, New Relic?) were built using the multiple pillars model, where each signal type gets collected and stored separately. All of the observability companies founded post-2020 have been built using a very different approach: a single consolidated storage engine, backed by a columnar store.
In the past, I have referred to these models as observability 1.0 and observability 2.0. But companies built using the multiple pillars model have bristled at being referred to as 1.0 (understandably). Therefore, as I recently wrote elsewhere, I will refer to them as the “multiple pillars model” and the “unified or consolidated storage model, also called observability 2.0” moving forward.
Why bother differentiating? Because the cost drivers of the multiple pillars model and unified storage model are very different. It’s hard to compare their pricing models side by side.
Controlling costs under the multiple pillars model
When you are using a traditional platform that collects and stores every signal type in a separate location, your technical cost drivers are:
- How many different tools you use (this is your cost multiplier)
- Cardinality (how detailed your data is)
- Dimensionality (how rich the context is)
Your number one cost driver is the number of times you store data about every incoming request. Gartner tells us that their customers use on average 10-20 tools apiece, which means they have a cost multiplier of 10-20x. Their observability bill is going up 10-20x as fast as their business is growing! This alone explains so much of the exponential cost growth so many businesses have experienced over the past several years.
Metrics-heavy shops are used to blaming custom metrics for their cost spikes, and for good reason. The “solution” to these billing spikes is to delete or stop capturing any high-cardinality data.
Unfortunately, high-quality observability is a function of detail (cardinality) and context (dimensionality). It’s a direct trade-off: more detail and context == better observability; less detail and context == worse observability.
The original sin of the multiple pillars model is that two of the primary drivers of cost are the very things that make observability valuable. Leaning on these levers to rein in costs cannot help but result in materially worse observability.
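To make that concrete, here’s some back-of-the-envelope math (the metric and tag counts below are made up, though not unrealistic) showing how one innocent custom metric fans out into time series, and what happens when somebody adds a genuinely high-cardinality field:

```python
# Back-of-the-envelope math: under a metrics store, every unique combination
# of tag values becomes its own time series, and you pay per series.
# All names and counts below are hypothetical.

from math import prod

base_tags = {
    "endpoint": 50,      # distinct API endpoints
    "status_code": 10,   # distinct HTTP status codes
    "region": 6,         # deployment regions
}

series = prod(base_tags.values())
print(f"http_request_duration with modest tags: {series:,} time series")
# -> 3,000 time series for a single "custom metric"

# Now someone adds a high-cardinality tag to debug a customer issue:
series_with_user_id = series * 200_000   # distinct user IDs
print(f"...after adding user_id: {series_with_user_id:,} time series")
# -> 600,000,000 time series. This is the "cardinality bomb" that metrics
# vendors bill you for, and why the standard advice is to strip out the
# very detail that makes the data useful.
```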
Controlling costs under the unified storage model
When you are using a platform with a unified, consolidated storage model (also known as observability 2.0), your cost drivers look very different. Your bill increases in line with:
- Traffic volume
- Instrumentation density
Instrumentation density is partly a function of architecture (a system with hundreds of microservices is going to generate a lot more spans than a monolith will) and partly a function of engineering intent.
Areas that you need to understand on a more granular level will generate more instrumentation. This might be because they are revenue-generating, or under active development, or because they have been a source of fragility or problems in your stack.
Your primary levers for controlling these cost drivers are consequently:
- Sampling
- Aligning instrumentation density with business value
- Some amount of filtering/aggregation, which I will sum up as “instrumenting with intent”
Modern sampling is a precision power tool—nothing like the blunt force trauma you may remember from decades past. The workhorse of most modern sampling strategies is tail-based sampling, where you don’t make a decision about whether to keep the event or not until after the request is complete. This allows you to retain all slow requests, errors, and outliers. Tuning sampling rules is, of course, a skill set in its own right, and getting the settings wrong can be costly.
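If you’ve never seen tail sampling up close, here’s a minimal sketch of the decision logic. The thresholds, rates, and field names are all hypothetical, and a real implementation (in a collector or pipeline) also has to worry about buffering spans per trace, timeouts, and memory:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False

@dataclass
class Trace:
    spans: list = field(default_factory=list)

# Hypothetical tuning knobs.
SLOW_THRESHOLD_MS = 1000   # always keep traces slower than this
ERROR_KEEP_RATE = 1.0      # keep every trace that contains an error
BASELINE_KEEP_RATE = 0.01  # keep 1 in 100 of the boring, healthy traces

def keep_trace(trace: Trace) -> bool:
    """Tail-sampling decision: made only after the whole trace has arrived."""
    if any(s.is_error for s in trace.spans):
        return random.random() < ERROR_KEEP_RATE
    if max(s.duration_ms for s in trace.spans) >= SLOW_THRESHOLD_MS:
        return True
    return random.random() < BASELINE_KEEP_RATE
```

The point is that the decision is made against the completed trace, so errors and outliers never get thrown away; only the healthy, repetitive traffic gets thinned out.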
It is a simplification, but not an unreasonable one, to say that under the multiple pillars model you throw away the most important data (context, cardinality) to control costs, and with observability 2.0, you throw away the least important data (instrumentation around health checks, non-customer-facing services) to control costs.
There’s a perception in the world that observability 2.0 is expensive, but in our experience, customers actually save money—a lot of money—as a side effect of doing things this way. If you use a lot of custom metrics, switching to the 2.0 way of doing things may save you 50% or more off the top, and the rate of increase should slow to something more aligned with the growth of your business.
Model-agnostic levers for cost control begin with vendor consolidation
Ok, enough about model-specific cost drivers. Let’s switch gears and talk about cost control from a vendor-agnostic, model-agnostic point of view. This is a conversation that always starts in the same place: vendor consolidation.
Vendor consolidation is a necessity and an inevitability for so many companies. I have talked to so many engineers at large enterprises who are like, “name literally any tool, we probably run it somewhere around here.” To some extent, this is a legacy of the 2010s, when observability companies were a lot more differentiated than they are now and lots of customers were pursuing a “best of breed” strategy where they looked around to adopt the best metrics tool, best tracing tool, etc.
Since then, all the big observability platforms have acquired or built out similar offerings to round out their services. Every multiple pillars platform can handle your metrics, logs, traces, errors, etc., which has made them less differentiated.
There can be good reasons for governing devtool sprawl with a light touch—developer autonomy, experimentation, etc. But running many different observability tools tends to get expensive. Not only in terms of money, but also in terms of fragmentation, instrumentation overhead, and cognitive carrying costs. It’s no wonder that most observability engineering teams have been tasked with vendor consolidation as a major priority.
Vendor consolidation can be done in a way that cuts your costs or unlocks value
There are two basic approaches to vendor consolidation, and these loosely line up with the “investment” vs “cost center” categories we discussed earlier.
Companies that make their money off of software are more likely to treat consolidation as a developer experience problem. They see how much time gets lost and how many cognitive packets get dropped as engineers jump frantically between several different tools, trying to hold the whole world in their head. Having everything in one place is a better experience, which helps developers ship faster and devote more of their cognitive cycles to moving the product forward.
Companies where software is a means to an end, or where the observability budget rolls up to IT or the CIO, are more likely to treat observability as a cost center. They are more likely to treat all vendors as interchangeable, and focus on consolidation as a pure cost play. The more they buy from a single vendor, the more levers they have to negotiate with that vendor.
Right now, I see a lot of companies out there using vendor consolidation as a slash and burn technique, where they simply make one top-down decision about which vendor they are going to go with, and give all engineering teams a time window in which to comply. This decision increasingly seems to take place at the exec level rather than at the engineering level, sometimes even CEO to CEO.
I think this is unfortunate (if understandable, given the sums at play). I think that vendor consolidation can be done in a way that unlocks a ton of value. I also think that in order to unlock that value, the decision really needs to be owned and thoroughly understood by a platform or observability engineering team who will be responsible for unlocking that value over the next year or two.
Telemetry pipelines as a way to orchestrate streams and manage costs
Telemetry pipelines have been around for a while, but they’ve really picked up steam lately. I think they’re going to be a key pillar of every observability strategy at scale.
Telemetry pipelines often start off as a way to route and manage streams of data at a higher level of abstraction, but they also show a ton of promise in the realm of cost containment. Just a few of the many capabilities they unlock:
- Make it easier to define tiers of instrumentation, and assign services to each tier
- Make it easier for observability engineering teams to practice good governance at scale
- Make it easier to visualize and reason about where costs are coming from
- Get your signals into an OTel-compatible format without having to rewrite all the instrumentation
- Make decisions earlier in the pipeline about what data you can discard, aggregate, sample, etc.
- Offload raw data to a cheap storage location and “rehydrate” segments on demand
- Leverage AI at the source to help identify outliers and capture more telemetry about them
- Create feedback loops to train and improve your instrumentation based on how it’s actually getting consumed in production
This doesn’t have to be an all or nothing choice, between stripping all the context and detail at the source (like metrics do) or storing all the details about everything (structured log events/traces). Pipelines bridge this gap. There’s a lot of activity going on in this space, and I think it shows a ton of promise.
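To make “decide earlier in the pipeline” a little more concrete, here’s a toy sketch of a pipeline stage that drops health-check noise (keeping only an aggregate count), strips bulky fields from internal-only services, and forwards everything else. The field names and rules are hypothetical, and in real life this logic would live in something like an OpenTelemetry Collector or a managed pipeline product, not a hand-rolled function:

```python
from collections import Counter

# Hypothetical routing/aggregation stage for a telemetry pipeline.
health_check_counts = Counter()

def process(event: dict) -> dict | None:
    """Return the event to forward downstream, or None to drop it."""
    route = event.get("http.route", "")

    # Health checks: don't pay to store every single one; keep an aggregate
    # count per service instead.
    if route == "/healthz":
        health_check_counts[event.get("service.name", "unknown")] += 1
        return None

    # Internal-only services: strip bulky payload fields before forwarding.
    if not event.get("customer_facing", False):
        event.pop("request.body", None)
        event.pop("response.body", None)

    return event
```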
These are going to require us all to learn some slightly different skills, and to think about data management in ways that look more like how business analytics teams are accustomed to managing their data than the way ops teams do. Telemetry pipelines are going to emerge as a place where a lot of decisions get made.
Over the long run, I think observability is moving towards a “data lakehouse” type model. Instead of scattering our telemetry across dozens of different isolated signal types and custom storage formats, we’ll store the data once in a unified location, preserving rich context and connective tissue between signal types.
What role does OpenTelemetry play in cost management?
In a word: optionality. Historically, the cost of ripping one vendor out and replacing it with another was so massive and frustrating that it kept people locked into vendor relationships they weren’t that happy with, at price points that became increasingly outrageous.
If you invest in OpenTelemetry, you force vendors to compete for your business based on being awesome and delivering value, not keeping you trapped behind their walls against your will.
That’s mainly it. But I think that’s a pretty big reason.
Note that OpenTelemetry does not solve the problem of data gravity, because observability is about much more than just instrumentation. Changing vendors will also involve changing alerts, dashboards, bookmarks, runbooks, documentation, workflows, API calls, mental models, expertise, and more. It’s not as hard as changing your cloud provider, but it’s not as easy as switching API endpoints. There are things you can do to ameliorate this problem, but not solve it. (This stickiness, I hypothesize, is one of the less-savory reasons that bills have risen so far, so fast.)
As time goes on and the world adjusts to OpenTelemetry as lingua franca, my hope is that more of the sticky bits will unstick. Decoupling custom vendor auto-instrumentation from their custom generated dashboards will help, as will moving from dashboards to workflows.
Tiered instrumentation
If you’re in charge of observability at a large, sprawling enterprise, you’re going to want to define tiers of service. This is actually Gartner’s top recommendation for controlling costs: “Align to business priorities.” They suggest breaking down your services into groups according to how much observability each service needs. Their example looks like this:
- Top tier (5%): External-facing, revenue-generating applications requiring “full” observability: metrics, logs, tracing, profiling, synthetics, etc.
- Mid tier (35%): Important internal applications needing infrastructure, logging, metrics, and synthetics
- Low tier (65%): Internal-only applications requiring just synthetics and metrics
I have no idea where they got those percentages from—the distribution strikes me as bizarre, and I’m not big on synthetics beyond simple end-to-end health checks—but the concept is sound.
Here’s how I would think of it. You need rich observability in higher fidelity for services or endpoints that are:
- Under active development or collaboration
- Customer-facing
- Sensitive to latency or user experience
- Revenue-generating
- Prone to breaking, or changing frequently
Services that are stable, internal-facing, offline processing, etc., don’t need the works. Maybe SLOs, monitoring checks, or (in an observability 2.0 world) a single, arbitrarily-wide structured log event, per request, per service.
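If it helps, here’s a sketch of what such a tiering policy might look like once an observability engineering team writes it down. Every tier name, service, and setting here is invented for illustration:

```python
# Hypothetical tiering policy: which telemetry each class of service gets,
# and how aggressively it is sampled and retained.

TIERS = {
    "top": {        # customer-facing, revenue-generating, under active development
        "tracing": True,
        "profiling": True,
        "keep_rate": 0.25,        # tail-sample, always keeping errors/outliers
        "retention_days": 60,
    },
    "mid": {        # important internal services
        "tracing": True,
        "profiling": False,
        "keep_rate": 0.05,
        "retention_days": 30,
    },
    "low": {        # stable, internal-only, offline processing
        "tracing": False,         # one wide structured event per request instead
        "profiling": False,
        "keep_rate": 0.01,
        "retention_days": 14,
    },
}

SERVICE_TIERS = {
    "checkout": "top",
    "search": "top",
    "billing-reconciler": "mid",
    "nightly-report-job": "low",
}

def telemetry_config(service: str) -> dict:
    """Look up a service's telemetry settings, defaulting to the low tier."""
    return TIERS[SERVICE_TIERS.get(service, "low")]
```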
Other Gartner recommendations
Gartner made three more recommendations in this webinar, which I will pass along to you here:
- Audit your telemetry
- Implement vendor-provided cost analysis tools and access controls
- Rationalize and consolidate tools
They suggest you can save 10-30% in telemetry costs through regular audits and cost management practices, which sounds about right to me.
I don’t love relying on ACLs to control costs, because I’m such a believer in giving everyone access to telemetry tooling, but I recognize that this is the world we live in.
Is open source the future?
I recently wrote the foreword to the upcoming O’Reilly book on Open Source Observability. In it, I wrote:
A company is just a company. If your ideas stay locked up within your walls, they will only ever be a niche solution, a power tool for those who can afford it. If you want your ideas to go mainstream, you need open source.
People need options. People need composable, flexible toolkits. They need libraries they can tinker with and take apart, code snippets and examples to tweak and play with, metered storage that scales up as they grow, and more. There is no such thing as a one-size-fits-all solution, and people need to be able to cobble together something that meets their specific needs.
There seems to be a recent uptick in the number of companies thinking about bringing observability in-house. Why? I think it’s partly due to the flowering of options. When we started building Honeycomb in 2016, we built our own columnar storage engine out of necessity. Now, people have options when it comes to columnar stores, including Clickhouse, Snowflake, and DuckDB.
I think the cost multiplier effect puts the whole multiple pillars model on an unsustainable cost trajectory. Not only is it intrinsically, catastrophically expensive, but as the number of tools proliferates, the developer experience deteriorates. I think a lot of people are catching on to the fact that logs—wide, structured log events, organized around a unit of work—are the bridge between the tools they have and the observability 2.0-shaped tools they need. And running your own logging analytics just doesn’t sound that hard, does it?
However, I predict that most larger enterprises will ultimately steer away from building their own. Why? Because once you count engineering cycles, it stops being a bargain. My rule of thumb says it will cost you $1m/year for every three to five engineers you hire. There are a limited number of experts in the underlying technologies, and the operational threshold is higher than people think. Sure, it’s not that hard to spin up and benevolently ignore an ELK stack… but if your reliability, scalability, or availability needs are world-class, that’s not good enough.
On the other hand… Infrastructure is cheap, and salaries are predictable. So maybe open source is the glorious future we’ve all been waiting for. Either way, I think open source observability has a bright, bright future, and I’m excited to welcome more tools to the table. Having more options is good for everyone.
Observability engineering teams, meet data problems
One prediction I feel absolutely confident in making is that observability engineering teams are poised to thrive over the next few years.
Observability engineering has emerged as perhaps the purest emanation of the platform engineering ethos, bringing a product-driven, design-savvy approach to the problems of developer experience. Their customers are internal developers, and their stakeholders include finance, execs, SRE, frontend, mobile, and everyone else.
As the scope, mandate, budget, and impact of observability engineering teams continue to surge, I think these teams are also going to need to skill up on practices traditionally associated with data engineering.
These are, after all, data problems. And the cheapest, fastest, simplest way to solve any number of data woes is to fix them at the source, i.e. emit better data. Which runs headfirst into most software organizations’ deeply ingrained desire to leave working code alone, lean on magic auto-instrumentation, and just generally think about their telemetry as little as possible.
Observability engineering teams thus sit between a rock and a hard place. But I think — I hope! — that clever teams will creatively leverage tools like AI and telemetry pipelines to identify ways to bridge this gap, to lower the time commitment, risk, and cognitive costs of instrumentation, so that telemetry becomes both easier and more intentional. Good observability engineering teams will accrue significant political capital in the course of their labor, and they will need every speck of it to guide the org towards richer data.
Earlier I mentioned that there is a perception that o11y 2.0 is particularly expensive. I find this frustrating, because it doesn’t match my experience. But cost is a first class consideration of every data problem, always. There is no such thing as a “best” or “cheapest” data model, any more than you can say that Postgres is “better” or “cheaper” than Redis, or DynamoDB, or CockroachDB; only ones that are more or less suited for the workload you have (and better or worse implementations).
A few caveats and cautionary tales
Be wary of any pricing model that distorts your architecture decisions. If you find yourself making stupid architecture decisions in order to save money on your observability bill, this is a smell. One classic example: choosing massive EC2 instances because you get billed by the number of instances.
Be wary of any pricing model that charges you for performing actions that you want to encourage. Like paying for per-seat licenses (when you want these tools to be broadly adopted), or for running queries (when you want people to engage more with their telemetry in production).
Be mindful of what happens when you hit your limits, or what happens to your bill when things change under the hood. Make sure you understand what happens with burst protection and overage fees, and be wary of things like cardinality bombs, where you can go to bed on a Friday night feeling good about your bill and wake up Monday owing 10x as much without anyone having shipped a diff or intentionally changed a thing.
Be skeptical of cost models where the vendor converts prices into some opaque, bespoke system of units that mere mortals do not understand (“we charge you based on our ‘Compute Consumption Unit’”).
Simpler pricing is not always better pricing. More complicated pricing schemes can actually be better for the vendor and the customer by letting you align with what you actually use and what it costs the vendor to serve you.
Be wary of price quotes pulled from the website. Any website. Engineers have a tendency to treat website quotes like gospel, but nobody pays what it says on the website. Everybody negotiates deals.
People are fond of pulling up metrics price quotes and saying, “But I get hundreds of metrics per month for just a few cents!” No, the number of metrics per month gets layered on top of ingest costs, bandwidth, retention, data volume, and half a dozen other pricing levers. (And the “number of metrics” is actually referring to the number of time series values.)
As Sam Dwyer says: “Beware vendor double-dipping where they charge you for multiple types of things—for example, charging for both license per user and data consumption.” If you are using a traditional vendor, they are probably charging you for many different dimensions at once, all stacked.
So about that rule of thumb
In 2018, I wrote a quick and dirty post where I shared my observation that teams with good observability seemed to spend 20-30% of their infra bill on observability tools.
This entire 7000-word monstrosity on observability costs started off as an attempt to figure out how much (or if) the world has changed since then. After 2.5 months of writing, researching, and talking to folks, I have arrived at this dramatic update:
Teams with good observability seem to spend 15-25% of their infra bill on their observability tools.
I have heard at least one analyst (who I respect) and two or three Twitter randos (who I do not) state they believe a number like 10% should be achievable.
I am not so sure. For now, I stand by my observation that companies with good observability tooling seem to spend somewhere between 15-25% of their infra bill to get it. I don’t think this rule of thumb should scale up linearly over $10m/year or so in infra bills, but I honestly have no solid data either way.
(If you work at a large enterprise and would like to show me what it looks like to spend 10% and get great observability, please send me an email or DM! I would love to learn how you did this!)
Maybe it’s not (primarily) about the money
If I had to guess, I’d say the absolute cost is less of a big deal to large, profitable enterprises than the seemingly unbounded cost growth. Finance types are annoyed that costs keep going up at an escalating rate, while engineering types are more irritated by the fact that the value they get from their tools is not keeping pace, or does not seem worth it.
With the multiple pillars model, the developer experience may even go down as your costs go up. You pay more and more money, but the value you get from your tools declines.
What people really need are predictable costs that go up or down in alignment with the value you are getting out of them. Then we can start having a real conversation about observability as investment vs observability as cost center, and the (hidden) costs of poor observability.
In the long run, I don’t think we’re trending towards dramatically cheaper observability bills. But the rate of growth should ease up, and we should be wringing a lot more value out of the data we are gathering. These are entirely reasonable requests.
You can’t buy your way to great observability
It doesn’t matter how great the tool is, or how much you’re shelling out for it; if your engineers don’t look at it, it will be of limited value. If it doesn’t change the way you build and ship software, then observability is a cost for you, not an investment.
If you weave observability into your systems and practices and use it to dramatically decrease deploy time, test safely in production, collaborate across teams, enable developers to own their code, and give your engineers a richer understanding of user experience, these investments will pay handsome returns over time.
Your engineers will be happier and higher performing, covering more surface area per team. They will spend more time delighting customers and moving the business forward, and less time debugging, recovering from outages and coordinating/waiting on each other.
But buying a tool is not magic. You don’t get great observability by signing a check, even a really big one, any more than you can improve your reliability just by hiring SREs. Turning the upfront cost of observability tooling into an investment that pays off takes vision and work. But it is work worth doing.