Best Practices for Observability

Observability has been getting a lot of attention recently. What started out as a fairly obscure technical term, dragged from the dusty annals of control theory, has been generating attention for one simple reason: it describes a set of problems that more and more people are having, and that set of problems isn’t well-addressed by our robust and mature ecosystem of monitoring tools and best practices.

In a prime example of “this may be frustrating and irritating, but this is how language works” — observability, despite arriving on the computer architecture scene much later than monitoring, turns out to actually be a superset of monitoring, not a subset.

Monitoring is heavily biased towards actionable alerts and black box service checks — which is not to deny the existence of a long tradition of diagnostics or white box monitoring, some of which turn out to fit better underneath the emerging definition of observability, some of which do not.

Observability, on the other hand, is about asking questions. Any questions. Open-ended questions. Frustrating questions. Half-assed descriptions of a vague behavior from a sleepy user half a world away types of questions. Do you have the data at your fingertips that will let you dive deep and observe the actual behavior the user is reporting, from her perspective, and draw a valid conclusion about what is happening and why? Then your observability is excellent, at least for that question.

Typically in the past we have not had access to that type of information. Answering specific questions has been effectively impossible. For most of our history, this has frankly been due to storage costs. The latest generations of observability tooling have not been made possible by the discovery of some fantastic new computer science; they have been made possible by cheaper storage and made necessary by the escalating complexity and feature sets (and architectural decoupling) of the services we observe. Those trends were also made possible by the cheapening of hardware, so it’s Moore’s law all the way down.

In the past, all we could afford to look at and care about was the health of a system. And we bundled all our complexity into a monolith, to which we could attach a debugger as a last resort. Now we have to hop the network between functions, not just between us and third parties.

The time has come for a more open-ended set of mandates and practices. Monitoring has provided us with a rich set of starting points to mine for inspiration.


Observability requires instrumentation

A system is “observable” to the extent that you can explain what is happening on the inside just from observing it on the outside.

Observability is absolutely, utterly about instrumentation. The delta between observability and monitoring is absolutely the parts that are software engineering-focused. The easiest and most dependable way to get there is to instrument your own damn code.

(Some people will get all high and mighty here and say the only TRUE observability consists of sniffing the network. These people all seem to be network-sniffing observability vendors, but there’s probably no correlation there. IMO sniffing can be super awesome but tcpdump output is hard to wrangle, and the highest signal-to-noise ratio right now comes from developers pointing to a value and saying, “that one, print that one.” Obviously one should remember that this process is inherently imperfect too, but in my experience it’s the best we’ve got. Not EVERYTHING goes over the network, ffs.)

What? You didn’t write your own database?

You can’t instrument everything, though. Presumably most of us don’t write our own databases (cough), although databases do tend to be well-instrumented. Pulling the data out can be nontrivial, but it’s incredibly worth the effort. More on this in the best practices list below.

So yeah–you can’t instrument everything. You shouldn’t try. That’s a better game for cheap metrics and monitoring techniques. You should try to instrument the most useful and relevant stuff, the stuff that will empower you to ask rich, relevant questions. Like, instead of adding a counter or tick or gauge for everything in /proc (lol), focus on the high-cardinality information, timing around network hops, queries.
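As a rough illustration of "timing around network hops, queries" plus high-cardinality fields, here is a minimal Python sketch. It is not Honeycomb's SDK: the `timed` helper, the `fake_db_query` stand-in, and the field names (`user_id`, `region`, `build_id`, `cart_items`) are all hypothetical, chosen only to show one wide event carrying both context and a timing.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def timed(event, name):
    """Store how long the wrapped block took, in milliseconds, on the event."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        event[f"{name}_dur_ms"] = round(elapsed_ms, 3)


def fake_db_query(user_id):
    # Hypothetical stand-in for a real query that crosses the network.
    time.sleep(0.01)
    return {"user_id": user_id, "cart_items": 3}


def handle_request(user_id, region, build_id):
    # One wide event per request, built from high-cardinality fields
    # rather than a pile of system-level counters and gauges.
    event = {"user_id": user_id, "region": region, "build_id": build_id}
    with timed(event, "db_query"):
        row = fake_db_query(user_id)
    event["cart_items"] = row["cart_items"]
    return event


if __name__ == "__main__":
    # Emit the event as one structured (JSON) log line.
    print(json.dumps(handle_request("user-8675309", "eu-west-1", "a1b2c3d")))
```

The point of the sketch is the shape of the output: a single record per request that can later be filtered or grouped by any of its fields.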

(A little of this will be a bit Honeycomb-specific (also somewhat Facebook-specific, with a splash of distributed tracing-specific), because these are the tools we have. Much like early monitoring manifestos annoyingly refer to “tags” and other graphite or time-series implementation specifics. Sorry!)

Guiding principles for observability

  • The health of each end-to-end request is of primary importance. You’re looking for any needle or group of needles in the haystack of needles. Context is critically important, because it provides you with more and more ways to see what else might be affected, or what the things going wrong have in common. Ordering is also important. Services will diverge in their opinion of where the time went.
  • The health of each high-cardinality slice is of next-order importance (for each user, each shopping cart, each region, each instance ID, each firmware version, each device ID, and any of them combined with any of the others.)
  • The health of the system doesn’t really matter. Leave that to the metrics and monitoring tools.
  • You don’t know what questions you’re going to have. Think about future you, not current you.
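To make the "any of them combined with any of the others" idea concrete, here is a small Python sketch that groups raw events by an arbitrary combination of dimensions. The `error_rate_by` function, the sample events, and the rule that a status of 500 or above counts as an error are assumptions made for illustration; they are not from any particular tool.

```python
from collections import defaultdict


def error_rate_by(events, *dims):
    """Group raw events by any combination of dimensions.

    Returns a dict mapping each tuple of dimension values to its error rate.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ev in events:
        key = tuple(ev.get(d) for d in dims)
        totals[key] += 1
        if ev["status"] >= 500:  # assumption: 5xx means the request failed
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical raw events, one per request.
events = [
    {"user_id": "u1", "region": "us-east", "firmware": "1.2", "status": 200},
    {"user_id": "u2", "region": "us-east", "firmware": "1.3", "status": 500},
    {"user_id": "u2", "region": "eu-west", "firmware": "1.3", "status": 500},
    {"user_id": "u3", "region": "eu-west", "firmware": "1.2", "status": 200},
]

if __name__ == "__main__":
    print(error_rate_by(events, "region"))    # both regions show 0.5
    print(error_rate_by(events, "firmware"))  # 1.2 shows 0.0, 1.3 shows 1.0
    print(error_rate_by(events, "region", "firmware"))
```

Sliced by region, the failures look evenly spread; sliced by firmware version, every failure lands on one version. This only works because the raw events were kept, so the grouping can be chosen after the fact.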

Best practices for observability

  • You must have access to raw events. Any aggregation that is performed at write time is actively harmful to your ability to understand the health and experience of each request.
  • Structure your logs/events. I tend to use “event” to refer to a structured log line. It can be either submitted directly to Honeycomb via SDK or API, or you can write it out to a log and tail it/stream it to us. Unstructured logs should be structured before ingestion.
  • Generate unique request IDs at the edge, and propagate them through the entire request lifecycle (including to your databases, in the comments field).
  • Generate one event per service/hop/query/etc. A single API request should generate, for example, a log line or event at the edge (ELB/ALB), the load balancer (nginx), the API service, each microservice it gets passed off to, and for each query it generates on each storage layer. There are other sources of information and events that may be relevant when debugging (e.g. your DB likely generates a bunch of events that report queue length and internal statistics, and you may have a bunch of system stats stuff), but one event per hop is currently the easiest and best practice.
  • Wrap any call out to any other service/data store as a timing event. In Honeycomb, stash that value in a header as well as a key/value pair in your service event. Finding where the system has gotten slow will usually involve either distributed tracing or comparing the view from multiple directions. For example, a DB may report that a query took 100ms, but the service may argue that it actually took 10 seconds. They can both be right ... if the DB doesn’t start counting time until it begins executing the query, and it has a large queue.
  • Incentivize the collection of lots of context, because context is king. Each event should be as wide as possible, with as many high-cardinality dimensions as possible, because this gives you as many ways to identify or drill down and group the events and other similar events as possible. Anything that puts pressure on your developer to collect less detail or select only a limited set of attributes to index or group by, is the devil. You’re hunting for unknown-unknowns here, so who knows which unknown will turn out to be the key?
  • Adopt dynamic sampling … from day one. To control costs, and prevent system degradation, and to encourage right-thinking in your developers. All operational data should be treated as though it’s sampled and best-effort, not like it’s a billing system in terms of its fidelity. This trains you to think about which bits of data are actually important, not just important-ish. Curating sample rates is to observability as curating paging alerts is to monitoring — an ongoing work of art that never quite ends.
  • When you can’t instrument the code, look for the instrumentation provided, and find ways of extracting it. For example, for mysql, we usually stream events off the wire, heavily sampled, AND tail the slow query log, AND run mysql command line commands to dump innodb stats and queue length. Shove ’em all into a dataset. Same for mongodb: at Parse we printed out all mongodb queries with debugLevel = 0 to a log on a separate block device, rotated it hourly, sampled heavily and streamed off to the aggregator… and we ran mongo from the command line and printed out storage engine statistics, queue length, etc., and injected those into the same dataset for context.
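The sampling practice in the list above can be sketched in a few lines of Python. This is a deliberately simple version in which the sample rate depends only on the content of the event; a real dynamic sampler would also adjust rates as traffic changes. The policy (keep every error, keep roughly 1 in 20 of everything else) and the names `sample_rate_for`, `maybe_keep`, and `estimated_total` are hypothetical.

```python
import random


def sample_rate_for(event):
    """Hypothetical policy: keep every error, about 1 in 20 of the rest."""
    if event["status"] >= 500:
        return 1
    return 20


def maybe_keep(event, rng=random):
    """Return the event tagged with its sample rate, or None if it is dropped."""
    rate = sample_rate_for(event)
    if rng.random() < 1.0 / rate:
        # Recording the rate lets readers of the data weight this event
        # as standing in for `rate` original events.
        return {**event, "sample_rate": rate}
    return None


def estimated_total(kept_events):
    """Estimate how many events were seen before sampling."""
    return sum(ev["sample_rate"] for ev in kept_events)


if __name__ == "__main__":
    rng = random.Random(0)
    traffic = [{"status": 200}] * 1000 + [{"status": 500}] * 10
    kept = [k for k in (maybe_keep(ev, rng) for ev in traffic) if k is not None]
    print(len(kept), "events kept; estimated original count:", estimated_total(kept))
```

Storing the sample rate on each kept event is what makes the data best-effort but still usable: counts and rates can be approximated by weighting, while the rare, interesting events are retained in full.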

Coming soon, the logical part 2: “Monitoring’s best practices are today’s observability anti-patterns.” Until then, give Honeycomb a try!
