This guest post from Matt Klein of Lyft is the fourth in our series on the how, why, and what of events.
Event based tracing, logging, and debugging are very powerful tools for distributed systems development and operations. However, without putting careful thought into the emitted events and the form they take, it’s easily possible (and all too frequent) that the output ends up producing an incomprehensible morass with little value.
These few small tips, in my experience, go a long way towards preventing poor outcomes and realizing maximal value.
Less Is More
Emitting too many events is just as bad as too few.
Too many events will ultimately end up overwhelming the ingestion system (no matter how good it is) and will lead to cognitive overload when debugging. The best way to accomplish this is to output events at various verbosity levels:
- During development it is natural to run at TRACE or DEBUG level
- In production it is more common to run at INFO or WARN level
- Be very liberal with TRACE and DEBUG events, but extremely judicious with events at INFO or higher level
Each one should tell a story and be useful for post-mortem analysis. A good way of checking this is to locally run a project’s unit/integration tests at production verbosity level to make sure the outputted events are individually actionable. E.g. seeing “entering UserLogin::doLogin()” in production logs adds no value, but seeing “user logged in [uid=123, type=oauth]” does add value.
Link/Combine Events When Possible
At even medium scale, event logs become too large and distributed to look at without some type of distributed analysis system. Design events for this eventuality.
Optimally, events should be self describing and contain all of the pieces of data to make them actionable. If that is not possible, they should have cross linking IDs embedded in them that make joining possible. Examples of join IDs are things like thread ID, connection ID, user ID, etc. It is a common mistake for events to be emitted in close proximity that could have been combined into a single event. Take every opportunity to reduce complexity and cognitive load.
Use Metrics When Counting in One Dimension
Metrics are much less expressive but substantially cheaper than events and logs. If data is being emitted for the sole purpose of counting across a single dimension, it should be in the form of a metric (counter). For example, the total number of users that log in per unit of time is a good candidate for a metric. Telling a story about specific users is a good candidate for an events.
Following this advice will make whoever runs your observability systems very happy.
Structure Messages for Searching Without Complex Regular Expressions Whenever Possible
I know we’re on the Honeycomb blog, but most people are not yet fortunate enough to work with an event analysis system that is structured by nature.
Dealing with events inevitably results in complicated searches. Limiting the complexity as much as possible up front will save time later. Consider the following three event logs:
user 123 logged in
user logged in [uid=123]
{'msg': 'user logged in', 'id': 123}
These three event logs tell the same story and take about about the same amount of space in number of characters. However, they have hugely different implications in terms of post processing. The first one is the worst. It’s hard to search for because many other logs might start with “user”. Extracting the user ID will require a complicated regex capture. The second is better. It’s still textual, but the message is the prefix and the associated data is the postfix. This message is easier to search for and easier to extract data out of. The third is JSON. It’s structured by nature. Not only is it easier to search with regular text processing tools, the data can be loaded into substantially more capable analysis systems for in depth debugging and introspection.