Event Foo: Moar Context Better Events

This guest post from Mark McBride of Turbine Labs is the fifth in our series on the how, why, and what of events.

As a systems engineer, an undervalued part of your job is storytelling. Every time you wire up an event to be dispatched to your logging system, it’s a note to your future self. Unfortunately these aren’t stories you read bundled up all cozy like with a cocktail. You read them as you’re frantically trying to put together a plan to get a broken system running again.

Generating better events can dramatically improve your ability to respond to system incidents. As with any good piece of journalism, your story should cover the Five Ws. Most event generation frameworks will cover When (this thing happened yesterday at 3:23) and where (in the hash_blorping module). Some will even cover who (during a session for user dogpics99). What and why are nearly always exercises left to the engineer.

Let’s start with a simple event, in JSON, because why not.

{
“time”: “2017-06-14T20:44:04+00:00”,
 “location”: “hash_blorp.go:68”,
  “Message”: “oops”
}

I don’t have any great ideas on how to refine the concept of time in an event. The rest can all be improved upon pretty dramatically, and we’ll start with Where. File name/line number is a decent start. But that code is being executed on a machine somewhere, by a software build that came from somewhere, possibly handling a request for a specific bit of information. Capturing identifying process information makes your story richer. Capturing details about the process is even better, e.g. “this thing happened in line 43 of the hash_blorping module running in the dogpics service v1.1 in us-east-1c”.

Every new field you capture lets you play a little game of Guess Who with your service. Does this incident happen in all zones? For all running software versions?

{
  “time”: “2017-06-14T20:44:04+00:00”,
  “Location”: {
    “Region”: “us-east-1”,
    “Zone”: “us-east-1a”,
    “Host”: “10.0.0.4”,
    “Service”: “dogpics”,
    “Version”: “1.0.0”,
    “Endpoint”: “/dogs/search”
    “Params”: “good dogs”
  }
  “Message”: “oops”
}

Who is another concept that can be refined. A userid is a decent start. But like file name, it’s a very coarse notion of an actor. The same user may access your service from several devices. In a service oriented architecture, you may get calls on behalf of a user from a wide range of services. Capturing the originating address of the request, the user ID it’s on behalf of, and some description of the agent acting on behalf of that user (in HTTP-land, the User-Agent) gives you a richer picture.

{
  “time”: “2017-06-14T20:44:04+00:00”,
  “Location”: {
    “Region”: “us-east-1”,
    “Zone”: “us-east-1a”,
    “Host”: “10.0.0.4”,
    “Service”: “dogpics”,
    “Version”: “1.0.0”,
    “Endpoint”: “/dogs/search”,
    “Params”: “good dogs”
  },
  “User”: {
    “User_id”: “some_guid”,
    “Originating_Host”: “94.100.180.199”,
    “Remote_Host”: “10.0.0.35”,
    “User_Agent”: “dogsview”
  }
  “Message”: “oops”
}

What and Why are the hardest items to draw broad guidelines around. A lot of this is going to be system dependent. But the goal of emitting these messages in the first place is to provide useful clues to your future self. “Oops” doesn’t cut it. “Request failed” is only slightly better. HTTP does a reasonable job of taxonomizing the space here. The 2xx (this worked!), 4xx (this didn’t work, and it’s likely your fault!), 5xx (deepest apologies, but I have failed you!) etc series are already a (very coarse) breakdown of what happened. The specific errors try to get closer to why that thing happened (you aren’t authorized, I’m overloaded, slow down).

As a service engineer you can probably do even better. The goal is to write a message that gives you a clear path to taking action to fix the root cause. Note that latency is critical here. “I got a response!” may or may not equate to a happy client. “I got a response in 35ms!” is probably a better indicator. If you have an idea of who might have caused you to fail, call them out here!

{
  “time”: “2017-06-14T20:44:04+00:00”,
  “Location”: {
    “Region”: “us-east-1”,
    “Zone”: “us-east-1a”,
    “Host”: “10.0.0.4”,
    “Service”: “dogpics”,
    “Version”: “1.0.0”,
    “Endpoint”: “/dogs/search”
    “Params”: “good dogs”
  },
  “User”: {
    “User_id”: “some_guid”,
    “Originating_Host”: “94.100.180.199”,
    “Remote_Host”: “10.0.0.35”,
    “User_Agent”: “dogsview”
  }
  “Response Code”: 503,
  “Latency”: 5046,
  “Message”: “Deadline exceeded fetching results from search store 10.0.0.10”,
  “Suspect” “10.0.0.10”
}

This story is much improved. We can start aggregating these events to capture a high level view of customer experience. When we notice customer experience degrading, we have a wide range of dimensions we can start slicing on to find the needles in the haystack. Are we failing requests on all endpoints or just a handful? Are all regions affected? Does it happen for everybody or just Android users? And once you narrow in, you have individual events that stand alone.

The event described above is enough to be a repro case all by itself. These are the kinds of things you want to see when you’re in a cold sweat trying to get the site back up. Not “oops.”

Have thoughts on this post? Let us know via Twitter @honeycombio.