With more and more people adopting OpenTelemetry, and specifically the tracing signal, I’ve seen an uptick in people wanting to add the entire request and response body as an attribute. This isn’t ideal, just as it wasn’t when people were logging the body in text logs. In this blog post, I’ll explain why this is a bad idea, what the pitfalls are, and, more importantly, what you should do instead.
Choosing the right signal for the job
The first thing I want to address is that traces (or spans, as the data object) don’t have to be the only signal you use when understanding your system. Telemetry signals are designed (in OpenTelemetry) to be correlated; your job is to choose the right signal for the job. For example:
- If it’s about understanding the system, traces/spans are likely the best signal since they have more context that’s then queryable and graphable.
- If you know what visualization you want to create ahead of time (like a count of requests by HTTP route template), then metrics might be a better option; see the sketch just after this list.
- If you have no calling context, wide structured logs would be better.
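To make that metrics point concrete, here’s a minimal sketch using the OpenTelemetry API from Kotlin. The meter name, counter name, and the /users/{id} route template are illustrative assumptions, not anything prescribed by OpenTelemetry or this post.

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.api.common.Attributes

// Hypothetical instrumentation scope name; any stable name works.
val meter = GlobalOpenTelemetry.getMeter("example.http.server")

// A counter keyed by the low-cardinality route template you know you want to graph.
val requestCounter = meter.counterBuilder("example.http.server.requests").build()

fun recordRequest(routeTemplate: String) {
    // e.g. routeTemplate = "/users/{id}" rather than the concrete URL.
    requestCounter.add(1L, Attributes.of(AttributeKey.stringKey("http.route"), routeTemplate))
}
```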
However, all of these signals have something called resource context, which allows you to correlate the different signals around the thing that served them. Additionally, for logs and traces, there’s a further correlation you can use in traceId and spanId.
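As a rough illustration of that log-to-trace correlation (many OpenTelemetry logging integrations do this for you automatically), here’s a hand-rolled sketch in Kotlin using the OpenTelemetry API and SLF4J’s MDC. The trace_id and span_id key names and the checkout logger are just conventions I’ve picked for the example.

```kotlin
import io.opentelemetry.api.trace.Span
import org.slf4j.LoggerFactory
import org.slf4j.MDC

private val logger = LoggerFactory.getLogger("checkout")

fun logWithTraceContext(message: String) {
    val spanContext = Span.current().spanContext
    if (spanContext.isValid) {
        // Stamp the active trace/span IDs onto the log record so the backend
        // can join this log line to the span that emitted it.
        MDC.put("trace_id", spanContext.traceId)
        MDC.put("span_id", spanContext.spanId)
    }
    try {
        logger.info(message)
    } finally {
        MDC.remove("trace_id")
        MDC.remove("span_id")
    }
}
```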
There is one big advantage of storing all your data in a single signal type: arbitrary investigation—or, put simply, debugging production. When all your data is in a single signal and that signal is a single event, you can do way more interesting investigations to find correlations and causations in order to ultimately find anomalies.
We’ll keep that in mind as we talk about intentionality in building our software.
In my experience, logging the request body is basically a catch-all. It’s a way to access the request data without doing any work in the code to understand what the system is actually working with. It’s a way of saying, “Look, you have the data, go work it out.”
Why is it bad?
In my opinion, there are three reasons this is a bad idea. I’ll caveat this with the fact that it’s possible, just mostly inefficient.
More data than you need
You’re likely sending far more data than you actually need. Allocating that data in memory twice (once for the application and once to send it onwards) is inefficient from both an overall memory perspective and from a response time/latency perspective. This could be seen as necessary overhead, but there are better ways to achieve that goal.
Sending personal data
You’ll likely end up sending data that shouldn’t be persisted anywhere. The main example of this is Personally Identifiable Information (PII). As soon as you start adding the full request (or response) body, you have no ability to control whether your observability backend is in scope for GDPR, CPRA, etc. Worse still, you could reveal and store sensitive data such as plaintext passwords from form POSTs, or financial data. Even if you limit this to certain inputs, you’re one misconfiguration away from storing all of that data.
You may come to the conclusion that the answer is to restrict access to that observability backend. However, that simply creates a bottleneck for resolving issues, since only a few users will now have access. You’ll also still have the issue that your observability backend is now in scope for things like PCI DSS if you take payments.
It’s always about money
Think about the cost multiplier Charity spoke about in her cost crisis piece. This comes in multiple forms, from the cost of storage in your observability backend to egress costs—and a potential increase in compute costs caused by the issues outlined in the first point.
So, what should we do?
Observability is about understanding and answering questions about how your production system is functioning. The important thing to consider is who can ask those questions, from both an access side (do they even have access to the telemetry data?) and an ability side. Labeling the data well, and working on the understanding that the whole team has of it, is crucial.
So what does that mean in practice? It means intentionally looking at your code and adding attributes with explicit names when they’re useful and risk-free.
Add extension methods or helper classes that take your request and response body and explicitly extract just the data that’s important, giving each piece a name. Writing code at this level lets you hand the knowledge you have of the system at the time you’re writing it to your future debugging self. You can provide everything from consistent naming to filtering the available data to ensure you’re not exposing too much. All of this will help you (and anyone else) who needs to debug the system in the future.
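Here’s a rough sketch of what such a helper can look like in Kotlin. The CheckoutRequest type, the app.checkout.* attribute names, and the fields extracted are all hypothetical, chosen only to show the shape of the idea: an explicit, reviewable allowlist of attributes rather than whatever happens to be in the body.

```kotlin
import io.opentelemetry.api.trace.Span

// Hypothetical request body type for illustration.
data class CheckoutRequest(
    val items: List<String>,
    val currency: String,
    val couponCode: String?,
    val cardNumber: String, // deliberately never extracted into telemetry
)

// Extension method that pulls out only the fields that are useful and safe,
// under explicit, consistent attribute names.
fun Span.setCheckoutAttributes(request: CheckoutRequest) {
    setAttribute("app.checkout.item_count", request.items.size.toLong())
    setAttribute("app.checkout.currency", request.currency)
    setAttribute("app.checkout.has_coupon", request.couponCode != null)
}

fun handleCheckout(request: CheckoutRequest) {
    // At the point where you already understand the data, record just what matters.
    Span.current().setCheckoutAttributes(request)
    // ... process the checkout ...
}
```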