Event Foo: What Should I Add to an Event?

When we’re talking with people about how they should start using Honeycomb, many ask for guidance about what should go into an event. Though there are longer posts on this blog about what it means to be an event, this one is a “short” list of things to consider when you’re building events.

What is actually useful is of course dependent on the details of your service, but most services can get something out of these suggestions. As a point of reference, the Honeycomb front end web server generates events with an average of 40 fields (±20), and our API server is closer to 70 (±30). Building and collecting wide events (with many fields) gives you the context you’ll need later when trying to understand your production service.

Broad principles

  • Add redundant information when there’s an enforced unique identifier and a separate column that is easier for the people reading the graph to understand. For example, at Honeycomb, the Team ID is globally unique, and every Team has a name. We add the ID to get a unique breakdown and add the Name so that it’s easier to recognize (“honey” is easier to remember than “122”).
  • Add two fields for errors: the error category and the returned error itself, especially when getting an error back from a dependency. For example, the first might describe what you’re trying to do in your code (error reading file) and the second what the dependency returned (permission denied).
  • Opt for wider events (more fields) when you can. It’s easier to add in more context now than it is to discover missing context later.
  • Don’t be afraid to add fields that only exist in certain contexts. For example, add user information if there is an authenticated user, don’t if there isn’t. No big deal.
  • Think about your field names a bit, but don’t bikeshed (http://bikeshed.com/). Common field name prefixes help when skimming the field list since it’s alphabetized.
  • Add units to field names, not values (such as parsing_duration_µs or file_size_gb).
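To make these principles concrete, here’s a minimal sketch of a “wide” event that follows them. All of the field names and values here are illustrative, not a required schema:

```python
# A sketch of a wide event following the principles above.
def build_event(team_id, team_name, err_category=None, err_detail=None):
    event = {
        # redundant on purpose: a unique ID plus a human-readable name
        "team_id": team_id,
        "team_name": team_name,
        # units live in the field name, not the value
        "parsing_duration_us": 1870,
        "file_size_gb": 1.2,
    }
    # fields that only exist in certain contexts are fine to omit
    if err_category is not None:
        event["error"] = err_category       # what we were trying to do
        event["error_detail"] = err_detail  # what the dependency returned
    return event

event = build_event(122, "honey", "error reading file", "permission denied")
```

Note how the error fields simply don’t exist on the happy path, rather than being set to an empty string.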

Ok, let’s talk about specific fields!!

Who’s talking to your service?

  • Remote IP address (and intermediate load balancer / proxy addresses)
  • If they’re authenticated
    • user ID and user name (or other human-readable identifier)
    • company / team / group / email address / extra information that helps categorize and identify the user
  • user_agent
  • Any additional categorization you have on the source (SDK version, mobile platform, etc.)

What are they asking of your service?

  • URL they request
  • Handler that serves that request (the Rails route, Goji handler, Django view, or whatever it’s called these days)
  • Other relevant HTTP headers
  • Did you accept the request? or was there a reason to refuse?
  • Was the question well formed? (or did they pass you garbage as part of the request)
  • Other attributes of the request (was it batched? gzipped? if editing an object, what’s that object’s ID? etc.)

How did your service deal with the request?

  • How much time did it take?
  • What other services did your service call out to as part of handling the request?
  • Did they hand back any metadata (like shard, or partition, or timers) that would be good to add?
  • How long did those calls take?
  • Was the request handled successfully?
  • Other timers (such as around complicated parsing)
  • Other attributes of the response (if an object was created, what was its ID? etc.)

Business-relevant fields

Obviously optional, as this type of information is often unavailable to each server, but when available it’s surprising how useful it can be at empowering different groups to easily use the data you’re generating. Some examples:

  • Pricing plan - is this a free tier, pro, enterprise? etc.
  • Specific SLAs - if you have different SLAs for different customers, including that info here can let you issue queries that take it into account.
  • Account rep, business unit, etc.

Additional context about your service / process / environment

  • Hostname or container ID or …
  • Build ID
  • Environment, role, and additional environment variables
  • Attributes of your process, e.g. amount of memory currently in use, number of threads, age of the process, etc.
  • Your broader cluster context (e.g. AWS availability zone, instance type, Kubernetes pod name, etc.)
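Pulling the categories above together, one way to structure this is a shared base context merged into a per-request event. This is a sketch under assumed names (the environment variables and fields are illustrative):

```python
import os
import socket

def base_context():
    """Fields shared by every event this process emits (names illustrative)."""
    return {
        "hostname": socket.gethostname(),
        "build_id": os.environ.get("BUILD_ID", "unknown"),
        "env": os.environ.get("DEPLOY_ENV", "dev"),
    }

def request_event(remote_ip, url, handler, status, duration_ms, user=None):
    event = dict(base_context())
    event.update({
        "remote_ip": remote_ip,
        "url": url,
        "handler": handler,
        "status": status,
        "request_dur_ms": duration_ms,
    })
    if user:  # only present when the request is authenticated
        event["user_id"] = user["id"]
        event["user_email"] = user["email"]
    return event

ev = request_event("10.0.0.5", "/1/events/prod", "handle_event", 200, 12.5,
                   user={"id": 42, "email": "bee@example.com"})
```

Computing the base context once per process keeps per-request code focused on the fields that actually vary.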

Above all

Just start. You can always add fields later. Use your events to make your life easier, and have fun along the way :)

Use Derived Columns To Prioritize Development Work

We recently released a new feature for Honeycomb: derived columns. At the time, we promised we’d show you some more examples of how it can make your life easier. Here are a couple that are about helping you figure out wtf you should do next:

What SDK(s) should we work on the most/next?

We’ve got so much to do, but we can’t do it all at once. You know how it is. Sometimes, you’ve got to prioritize things.

Using a derived column, we can look at the contents of user_agent from data our customers send us and use it to figure out which SDK most people are using. If there’s a definite winner or winners, we should most likely tackle those first when adding features or fixing bugs.

(using the derived column sdk = REG_VALUE($user_agent, "libhoney-[a-z]*"))

And here’s the resulting output:

Oh, hm, those two Manticore entries are not really what we want here. Manticore is a Ruby HTTP library that’s used internally by our logstash plugin, but we don’t develop it, so let’s filter out anything that’s not really an SDK:

Ah, much better. (Aside: whoah, what happened with that libhoney-go/1.3.0 version?) In any case, libhoney-go/1.3.3 is the clear winner.
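If you’re curious what that derived column is doing under the hood, REG_VALUE pulls out the first substring of a field that matches a regex. A rough equivalent in plain Python (illustrative, not Honeycomb’s implementation):

```python
import re

def reg_value(value, pattern):
    """Roughly what REG_VALUE does: first regex match in value, or None."""
    m = re.search(pattern, value)
    return m.group(0) if m else None

ua = "libhoney-py/1.3.2 python-requests/2.18"
sdk = reg_value(ua, r"libhoney-[a-z]*")  # -> "libhoney-py"
```

Non-matching user agents (like the Manticore ones) simply come back as None, which is why they show up as their own bucket in the breakdown.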

When can we safely deprecate support of old API versions?

Older versions of APIs can be slow or buggy, so we do what we can to encourage our users to upgrade to the latest versions of our logstash plugins (and stop using the associated out-of-date API). However, they don’t necessarily write to tell us when they do (jerks! ;)). Have no fear, though: we can use a derived column to check whether anyone is still left behind, and if not, begin the process of decommissioning that old plugin.

(using the derived column logstash_plugin_version = REG_VALUE($extra_headers, "X-Plugin-Version:[0-9\\.]*"))

And the resulting output: (Note: The unnamed plugin at the bottom is all the traffic that doesn’t come from logstash.)

Hmm. Doesn’t look as though most people have upgraded to the new version yet, so we can’t really deprecate the old one. Guess we need to spend a little more time reminding people to try the new version…
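Outside of Honeycomb, this kind of check amounts to a regex extraction plus a group-by over recent traffic. A sketch, with the header format assumed for illustration:

```python
import re
from collections import Counter

def plugin_version(extra_headers):
    """Extract the version from an X-Plugin-Version header, if present."""
    m = re.search(r"X-Plugin-Version:[0-9\.]*", extra_headers or "")
    return m.group(0).split(":", 1)[1] if m else None

headers_seen = [
    "X-Plugin-Version:0.7.0",
    "X-Plugin-Version:0.7.0",
    "X-Plugin-Version:0.8.1",
    "",  # traffic that doesn't come from logstash
]
counts = Counter(plugin_version(h) for h in headers_seen)
```

The resulting counts tell you whether anyone is still on the old version; the None bucket is everything that isn’t logstash at all.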

Event Foo: Building Better Events

This post from new Honeycomber Rachel Perkins is the seventh in our series on the how, why, and what of events.

An event is a record of something that your system did. A line in a log file is typically thought of as an event, but events in Honeycomb can be a lot more than that—they can include data from different sources, fields calculated from values within the event or external to it, and more.

An event represents a unit of work. It can tell a story about a complete thing that happened–for example, how long a given request took, or what exact path it took through your systems. What is a unit of work in your service? Can you make your events better?

In this blog post, we’ll walk through a simple story in which the protagonist (hey, that’s YOU!) improves the contents of their events to get more information about why their site is slowing down.

Where should I start?

When looking to get more visibility into the behavior of systems, we often begin by thinking about how long key tasks take. A common starting approach is to focus on the things that tend to be the most expensive. For example, you might:

  • Instrument MySQL calls with a timer (how long are these calls taking?)
  • Instrument callouts to, say, Kafka with a timer as well.

To achieve this in Honeycomb (How to make and send events), you:

  • Build an event that includes a request_roundtrip_dur timer field alongside the MySQL and Kafka timers.
  • Send that data to Honeycomb.

Pro-tip: Name your event fields in a consistent way to make it easier to pick different types of things out from the column list. For example, use _dur in anything that’s a duration. We recommend you put the unit (_ms, _sec, _ns, _µs, etc.) in there too—you may be using units consistently now, but you know how that can go sideways fast.
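As a minimal sketch of the above (the query function and field names are stand-ins, not a real database client):

```python
import time

def timed_ms(fn, *args):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

def fake_mysql_query():
    """Stand-in for a real database call."""
    time.sleep(0.01)
    return "rows"

event = {}
rows, event["mysql_dur_ms"] = timed_ms(fake_mysql_query)
# ...time the Kafka callout the same way, then record the total:
event["request_roundtrip_dur_ms"] = event["mysql_dur_ms"]  # plus other timers
```

The _ms suffix in every timer field name is exactly the naming convention the pro-tip recommends.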

For a while, the site rocked pretty steady:

Your site's performance is steady

So you keep on keeping on until….

À la recherche du temps perdu

Oncall has been pretty chill, but a few days ago the team rolled out some new code, and things seem to have gotten a bit slower. Time to dig in and find out what happened. First, let’s see how long stuff overall is taking:

Things are slowing down a little

Things are getting slower overall, but it’s hard to see whether the cause is hiding in one of your existing timers. You want to find out if there is an obvious new culprit that needs a timer added—one way to do this is to set up a derived column that adds up all your existing timers and subtracts them from the average request time:

building a derived column to add up timers

Looking at your new derived column, it’s clear that the parts of this event that are slow are not currently being timed—77ms are missing entirely from our existing timers! (Click through for the detail):

the derived column adds up the timers
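The derived column itself is just arithmetic over fields that already exist in the event. Sketched outside Honeycomb, with illustrative numbers chosen to match the example:

```python
def unaccounted_ms(event, timer_fields):
    """Total request time minus everything the existing timers explain."""
    return event["request_roundtrip_dur_ms"] - sum(event[f] for f in timer_fields)

event = {
    "request_roundtrip_dur_ms": 200.0,
    "mysql_dur_ms": 80.0,
    "kafka_dur_ms": 43.0,
}
missing = unaccounted_ms(event, ["mysql_dur_ms", "kafka_dur_ms"])  # -> 77.0
```

When that remainder is large, you know the slowness lives somewhere you haven’t instrumented yet.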

It’s time to add a new timer around the next few obvious culprits, so you decide to see if it’s the JSON parsing that’s contributing this extra time:

  • Update your event with JSON parsing timing.

Stay on target

After a few days, you can tell that the JSON parsing accounts for a small part of the additional slowness, but not the majority of it, and it’s not taking significantly longer than it was a week or two ago:

Obviously the new code in the push is suspect, so it’s time to start looking at that in more detail. You know that changes were made to the auth module and to some validation performed by another piece of code so you start there and:

  • Add timings to the auth and validation subsystems.

AHA!

That change to the auth system broke caching—you hadn’t set up a timer there because it had always been fast from the cache before. Now you can see how that one issue impacted the overall system performance.

Of course there’s still a little bit of unknown time being spent, but you can put off investigating that for another day and get on with filing a ticket on the auth module caching :)

Introducing the New Honeycomb Quick Start

Today we are pleased to announce the release of the new Honeycomb Quick Start to help you in your quest to become an observability master.

In case you’re unfamiliar, Honeycomb is a tool to help you debug complex systems such as databases, distributed infrastructure, containers, microservices, and more. It improves on existing tools such as static dashboards and pre-aggregated metrics by encouraging an interactive workflow: locating “needles in haystacks” is easier when critical information hasn’t been thrown away.

To start using Honeycomb, the steps are:

  1. Send “events” to Honeycomb in the form of JSON
  2. Visualize these sent events in the Honeycomb web UI to identify areas of interest
  3. Continue to iterate on initial queries to deduce the source of your issue(s)
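Step 1 can be as simple as an HTTP POST with a JSON body. Here’s a hedged sketch against Honeycomb’s events endpoint (the dataset name and write key are placeholders; check the current API docs for the exact endpoint and headers):

```python
import json
import urllib.request

def event_request(dataset, write_key, fields):
    """Build (but don't send) a request for Honeycomb's events endpoint."""
    body = json.dumps(fields).encode("utf-8")
    return urllib.request.Request(
        f"https://api.honeycomb.io/1/events/{dataset}",
        data=body,
        headers={
            "X-Honeycomb-Team": write_key,
            "Content-Type": "application/json",
        },
    )

req = event_request("my-dataset", "WRITE_KEY",
                    {"status": 200, "request_dur_ms": 14.2})
# urllib.request.urlopen(req)  # uncomment to actually send the event
```

In practice you’d use one of the libhoney SDKs instead of raw HTTP, but seeing the wire format makes it clear that an event is just a flat JSON object of fields.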

Because we love you :), we’re working diligently to help you understand how you can get value out of Honeycomb. The new Honeycomb Quick Start will get you from zero knowledge to querying data as quickly as possible. Follow along at home and start learning what this observability thing is all about!

What’s observability all about?

In the quick start, we walk you through digging into a common problem: your users have reported that the website is slow, and you are trying to figure out the source of the issue. Honeycomb is uniquely positioned as a tool to help you pinpoint answers to these kinds of maddeningly open-ended questions.

After going through the Quick Start, you’ll be able to make and interpret charts like the following using the Honeycomb UI, and will have a much better feel for what you are aiming for when instrumenting code, sending data, and making queries in Honeycomb. The end result? You’ll have more reliable infrastructure, more empowered engineers, and happier users.

Try Honeycomb today

Sign up for Honeycomb today or go directly to the Quick Start tutorial (if you already have an account) to get started on your journey to become an observability master. If you encounter any issues or have questions, please feel free to reach out to us at support@honeycomb.io.