This guest post from Colin Curtin of Good Eggs is the third in our series on the how, why, and what of events.
On Event Construction
I like to think of event construction as switching to my Historian hat: from this mountain of data, I need to construct a coherent narrative.
One of the most important attributes to think about is what I’ve been calling “through lines”: unique keys that let you trace linear sequences of behavior through a system.
For example, in warehouse land, the unit of work “a particular item needs to be picked from a location into a container” is what we call a “pick instruction”. Each pick instruction has a UUID that, when searched for, yields its various touch points from creation to termination (the item was picked).
Almost every operation that references or modifies a pick instruction logs an event with:
- A description of what it is about to do or has done (“Calculating inventory assignment”, “Saved inventory assignment”)
- The context surrounding it
Context here is:
- Variables passed to the method; most (all?) of our methods “Doing Work” take only simple, JSON-able values
- Counts of collections we have in scope
- UUIDs of relevant objects
The UUID is key (yep, it’s a pun). It doesn’t have to be a literal UUID: it could be a tagged id à la Stripe, or a request id that gets assigned to every HTTP request and passed through your layers. What matters is that when you search for it, you find only events related to it.
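To make that concrete, here’s a minimal sketch of what logging events with a through-line might look like. The logger, event names, and field names are hypothetical, not our actual implementation:

```typescript
// A minimal sketch: structured events carrying a through-line id plus
// simple, JSON-able context (ids and counts, not whole objects).

interface PickInstruction {
  id: string;        // the through-line: a UUID, tagged id, or request id
  itemId: string;
  locationId: string;
}

// Any sink will do: stdout, a log aggregator, an events table.
function logEvent(message: string, context: Record<string, unknown>): void {
  console.log(JSON.stringify({ message, ...context, at: new Date().toISOString() }));
}

function assignInventory(instruction: PickInstruction, candidateLocations: string[]): void {
  // Announce what we're about to do, with the through-line in the context.
  logEvent("Calculating inventory assignment", {
    pickInstructionId: instruction.id,
    itemId: instruction.itemId,
    candidateLocationCount: candidateLocations.length,
  });

  const locationId = candidateLocations[0]; // placeholder for the real assignment logic

  // ...and what we did, under the same key, so a search for the
  // pick instruction's id yields the whole story in order.
  logEvent("Saved inventory assignment", {
    pickInstructionId: instruction.id,
    locationId,
  });
}
```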
On Process
It is hard to choose a granularity for logging events. At one extreme, you could log after every line. (AOP makes this easy! /s) So we have established an upper bound.
And why not log after every line? Here is where we get into signal/noise, effort/reward.
Let’s say you find yourself with a log aggregator good enough to let you log as many damn variables as you please (we’ve got the widest logs, whoa), and you’re looking at a code base with little logging, or logging of questionable virtue… where do you start?
Well, let’s take your very most recent bug.
Why was it a bug? And don’t say human error (cf. Understanding Human Error). The system didn’t perform to an expectation of its behavior. Great.
How long did it take to:
- Get the first report of the bug (counting from when it was released to production)
- Reproduce the behavior
- Trace its behavior to its cause
- Resolve or mitigate the cause
With event logs, we’re looking to decrease the time from step 3 to step 4. (Some might also posit that, with enough event logs, you could decrease all the steps in the future…) This is your Time To Resolution (TTR).
Given what you know now about this bug, and about how long it took to track down, let’s work backwards.
If you could log anything at all, in any line of code, to help you find what was causing this issue, what line or lines would it be? What would you have sent your past self? What search terms would you have used?
This is a variant of SEO. For these data, what should the search terms be for our coherent narrative? What is your ideal “SERP” here?
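Here’s a sketch of what that retrieval looks like, assuming your events land somewhere you can filter them (an in-memory array stands in for the aggregator here, and the field names are hypothetical):

```typescript
// The through-line id is the "search term" that pulls the narrative back out.

interface LoggedEvent {
  message: string;
  at: string;                   // ISO timestamp
  pickInstructionId?: string;   // the through-line, when relevant
  [key: string]: unknown;
}

function narrativeFor(events: LoggedEvent[], pickInstructionId: string): LoggedEvent[] {
  return events
    .filter((e) => e.pickInstructionId === pickInstructionId)
    .sort((a, b) => a.at.localeCompare(b.at));
}

// Usage: everything that touched this pick instruction, in order.
// const story = narrativeFor(allEvents, "c0ffee-...");
// story.forEach((e) => console.log(e.at, e.message));
```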
Around this time you should be wondering, okay, but this is just one bug. What about the next one?
Thankfully, the time you invest doing Root Cause Analysis that feeds back into your debugging processes will pay huge dividends. Each time you do this, you will grow and prune your way toward logging the right events; or at least the most troublesome ones.
So do it! You can go back in time, assuming you have an issue tracker (!). More recent is better, since you’ll still have more context surrounding the bug in your brain.
And as for future bugs, a lot of building more resilient systems is about extracting as much information as you can about failure, and using it to shape your current human processes. Not just “how do we prevent this?” but: “how do we fix this (and its class) as fast as possible?” and “how do we find out about it before anyone else?”.
Build TTR minimization into your outage retrospectives and root cause/five whys template. Track M(ean)TTR and report on it as part of your team’s metrics to show that you’re getting better at it.
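The arithmetic is simple enough to automate. A sketch, assuming your issue tracker can give you reported and resolved timestamps (the incident shape here is hypothetical):

```typescript
// Mean time to resolution over a set of incidents, in hours.

interface Incident {
  reportedAt: Date;   // step 1: first report
  resolvedAt: Date;   // step 4: resolved or mitigated
}

function meanTimeToResolutionHours(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.resolvedAt.getTime() - i.reportedAt.getTime()),
    0
  );
  return totalMs / incidents.length / (1000 * 60 * 60);
}

// Report this alongside your other team metrics, per retrospective period.
```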