Honeycombers at LISA 2017

Did you go to LISA this year? I used to go back in the 1998-2003 timeframe (anyone remember playing the original Guitar Hero in that huge arcade in Seattle?) and I hope to make it back again someday soon. A lot of time has passed since those days, but the conference continues to offer attendees a wide range of useful and educational talks to choose from. In particular, the content on operating at scale has evolved upward much like the definition of “Large” since the conference’s inception :)

A couple of Honeycombers presented at LISA this year–here’s what they talked about:

Eben Freeman (@emfree): Queueing Theory in Practice: Performance Modeling for the Working Engineer

Although (like you, we assume) Eben doesn’t enjoy standing in lines, he does enjoy optimizing his way around them and then sharing what he’s learned. In this talk, he explains how to use the Universal Scalability Law to model system performance and make better capacity planning decisions. This can be complicated stuff, but don’t worry: he’s helpfully annotated his graphs for those of us who need that extra bit of guidance:

(Note: Although the video is about 45 minutes long, the talk is just 25 minutes of that–no need to budget quite so much time :))

Click through below to see the talk:

link to video
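
If you want to play with the model from the talk before (or after) watching, here's a minimal Go sketch of the Universal Scalability Law formula, X(n) = λn / (1 + σ(n−1) + κn(n−1)). The coefficient values below are made up for illustration and aren't from Eben's talk:

package main

import "fmt"

// uslThroughput returns the throughput predicted by the Universal Scalability
// Law for n concurrent workers (or nodes):
//
//   X(n) = lambda * n / (1 + sigma*(n-1) + kappa*n*(n-1))
//
// lambda: throughput of a single worker, sigma: contention (serialization)
// penalty, kappa: coherency/crosstalk penalty.
func uslThroughput(n, lambda, sigma, kappa float64) float64 {
    return lambda * n / (1 + sigma*(n-1) + kappa*n*(n-1))
}

func main() {
    // With even a small coherency term, throughput peaks and then declines as
    // concurrency grows -- the retrograde region capacity planning should avoid.
    for _, n := range []float64{1, 2, 4, 8, 16, 32, 64, 128} {
        fmt.Printf("n=%3.0f  predicted throughput=%.0f req/s\n",
            n, uslThroughput(n, 1000, 0.03, 0.0005))
    }
}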

Ben Hartshorne (@maplebed): Sample Your Traffic but Keep the Good Stuff!

Ben did us all proud back here at the hive by wearing his bee feelers/antennae for his talk. As is the tradition of our people, I have created a high-quality gif to celebrate this occasion:

bee like ben gif

If you’re trying to keep the volume of your instrumentation data at a reasonable level but want to maintain a high level of actual observability, Ben’s got you covered with this talk. Don’t aggregate your data–bee smart and sample it!

Click through below to see the talk:

link to video

Have questions? Let us know at support@honeycomb.io

Best Practices for Observability

Observability has been getting a lot of attention recently. What started out as a fairly obscure technical term, dragged from the dusty annals of control theory, has been generating attention for one simple reason: it describes a set of problems that more and more people are having, and that set of problems isn’t well-addressed by our robust and mature ecosystem of monitoring tools and best practices.

In a prime example of “this may be frustrating and irritating, but this is how language works” — observability, despite arriving on the computer architecture scene much later than monitoring, turns out to actually be a superset of monitoring, not a subset.

Monitoring is heavily biased towards actionable alerts and black box service checks — which is not to deny the existence of a long tradition of diagnostics or white box monitoring, some of which turn out to fit better underneath the emerging definition of observability, some of which do not.

Observability, on the other hand, is about asking questions. Any questions. Open-ended questions. Frustrating questions. Half-assed descriptions of a vague behavior from a sleepy user half a world away types of questions. Do you have the data at your fingertips that will let you dive deep and observe the actual behavior the user is reporting, from her perspective, and draw a valid conclusion about what is happening and why? Then your observability is excellent, at least for that answer.

Typically, in the past, we have not had access to that type of information; answering specific questions has been effectively impossible. For most of our history, this has frankly been due to storage costs. The latest generations of observability tooling were not made possible by the discovery of some fantastic new computer science; they were made possible by cheaper storage, and made necessary by the escalating complexity, feature sets, and architectural decoupling of the services we observe. Those trends were themselves made possible by cheaper hardware, so it’s Moore’s law all the way down.

In the past, all we could afford to look at and care about was the health of a system. And we bundled all our complexity into a monolith, to which we could attach a debugger as a last resort. Now we have to hop the network between functions, not just between us and third parties.

The time has come for a more open-ended set of mandates and practices. Monitoring has provided us with a rich set of starting points to mine for inspiration.

looking back through the rear-view mirror

Observability requires instrumentation

A system is “observable” to the extent that you can explain what is happening on the inside just from observing it on the outside.

Observability is absolutely, utterly about instrumentation. The delta between observability and monitoring is precisely the parts that are software engineering-focused. The easiest and most dependable way to get that instrumentation is to instrument your own damn code.

(Some people will get all high and mighty here and say the only TRUE observability consists of sniffing the network. These people all seem to be network-sniffing observability vendors, but there’s probably no correlation there. IMO sniffing can be super awesome, but tcpdump output is hard to wrangle, and the highest signal-to-noise ratio right now comes from developers pointing to a value and saying, “that one, print that one.” Obviously one should remember that this process is inherently imperfect too, but in my experience it’s the best we’ve got. Not EVERYTHING goes over the network, ffs.)

What? You didn’t write your own database?

You can’t instrument everything, though. Presumably most of us don’t write our own databases (cough), although they do tend to be well-instrumented. Pulling the data out can be non-trivial, but it’s incredibly worth the effort. More on this in the best practices list below.

So yeah–you can’t instrument everything. You shouldn’t try. That’s a better game for cheap metrics and monitoring techniques. You should try to instrument the most useful and relevant stuff, the stuff that will empower you to ask rich, relevant questions. Like, instead of adding a counter or tick or gauge for everything in /proc (lol), focus on the high-cardinality information, timing around network hops, queries.

(A little of this will be a bit Honeycomb-specific (also somewhat Facebook-specific, with a splash of distributed tracing-specific), because these are the tools we have. Much like early monitoring manifestos annoyingly refer to “tags” and other graphite or time-series implementation specifics. Sorry!)

Guiding principles for observability

  • The health of each end-to-end request is of primary importance. You’re looking for any needle or group of needles in the haystack of needles. Context is critically important, because it provides you with more and more ways to see what else might be affected, or what the things going wrong have in common. Ordering is also important. Services will diverge in their opinion of where the time went.
  • The health of each high-cardinality slice is of next-order importance (for each user, each shopping cart, each region, each instance ID, each firmware version, each device ID, and any of them combined with any of the others.)
  • The health of the system doesn’t really matter. Leave that to the metrics and monitoring tools.
  • You don’t know what questions you’re going to have. Think about future you, not current you.

Best practices for observability

  • You must have access to raw events. Any aggregation that is performed at write time is actively harmful to your ability to understand the health and experience of each request.
  • Structure your logs/events. I tend to use “event” to refer to a structured log line. It can be either submitted directly to Honeycomb via SDK or API, or you can write it out to a log and tail it/stream it to us. Unstructured logs should be structured before ingestion.
  • Generate unique request IDs at the edge, and propagate them through the entire request lifecycle (including to your databases, via the query comments field).
  • Generate one event per service/hop/query/etc. A single API request should generate, for example, a log line or event at the edge (ELB/ALB), at the load balancer (nginx), at the API service, at each microservice it gets passed off to, and for each query it generates on each storage layer. There are other sources of information and events that may be relevant when debugging (e.g., your DB likely generates events reporting queue length and other internal statistics, and you may have a pile of system stats), but one event per hop is the current easiest and best practice.
  • Wrap any call out to any other service/data store as a timing event (a minimal sketch of this pattern follows this list). In Honeycomb, stash that value in a header as well as a key/value pair in your service event. Finding where the system has gotten slow will usually involve either distributed tracing or comparing the view from multiple directions. For example, a DB may report that a query took 100ms, while the service insists it actually took 10 seconds. They can both be right… if the DB doesn’t start counting time until it begins executing the query, and it has a large queue.
  • Incentivize the collection of lots of context, because context is king. Each event should be as wide as possible, with as many high-cardinality dimensions as possible, because this gives you as many ways to identify or drill down and group the events and other similar events as possible. Anything that puts pressure on your developer to collect less detail or select only a limited set of attributes to index or group by, is the devil. You’re hunting for unknown-unknowns here, so who knows which unknown will turn out to be the key?
  • Adopt dynamic sampling … from day one. It controls costs, prevents system degradation, and encourages right-thinking in your developers. All operational data should be treated as though it’s sampled and best-effort, not as though it has the fidelity of a billing system. This trains you to think about which bits of data are actually important, not just important-ish. Curating sample rates is to observability as curating paging alerts is to monitoring — an ongoing work of art that never quite ends.
  • When you can’t instrument the code — look for the instrumentation provided, and find ways of extracting it. For example, for mysql, we usually stream events off the wire, heavily sampled, AND tail the slow query log, AND run mysql command line commands to dump innodb stats and queue length. Shove em all into a dataset. Same for mongodb: at Parse we printed out all mongodb queries with debugLevel = 0 to a log on a separate block device, rotated it hourly, sampled heavily and streamed off to the aggregator… and we ran mongo from the command line and printed out storage engine statistics, queue length, etc and injected those into the same dataset for context.
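
To make a few of these practices concrete, here’s a minimal Go sketch of the request-ID + wide-event + timing-wrapper pattern, using only the standard library. The field names and the stdout “emitter” are stand-ins for illustration, not Honeycomb’s SDK:

package main

import (
    "crypto/rand"
    "encoding/hex"
    "encoding/json"
    "log"
    "net/http"
    "os"
    "time"
)

// newRequestID generates a unique ID at the edge so every downstream event
// (and even DB query comments) can be correlated back to this request.
func newRequestID() string {
    b := make([]byte, 8)
    rand.Read(b)
    return hex.EncodeToString(b)
}

// handler emits exactly one wide, structured event per request: request
// metadata, high-cardinality fields, and a timing for each downstream hop.
func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    event := map[string]interface{}{
        "request_id": newRequestID(),
        "path":       r.URL.Path,
        "user_agent": r.UserAgent(),
        // ...plus user ID, build ID, instance ID, and any other context you have
    }

    // Wrap the call out to any other service or data store as a timing event.
    dbStart := time.Now()
    // rows, err := db.Query(...) would go here
    event["db_duration_ms"] = float64(time.Since(dbStart)) / float64(time.Millisecond)

    w.WriteHeader(http.StatusOK)
    event["status"] = http.StatusOK
    event["total_duration_ms"] = float64(time.Since(start)) / float64(time.Millisecond)

    // Emit one structured JSON line per request; a log tailer or SDK can
    // forward these to your observability tool.
    json.NewEncoder(os.Stdout).Encode(event)
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}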

Coming soon, the logical part 2: “Monitoring’s best practices are today’s observability anti-patterns.” Until then, give Honeycomb a try!

Debug Better By Throwing Information Away

The Addiction

Like many developers in today’s Brave New Distributed World, I’ve started to develop an addiction lately: I’m addicted to data. Data, whether it’s small or big or consultant big, is a critical make-or-break factor for businesses today. Once you figure out that you can store and analyze every interaction on the website or happening on your servers, it seems to be only a matter of collecting all the right details and turning the proper knobs to grow your app and ensure your status among the unicorns.

It therefore wouldn’t surprise me if the idea of losing some of that precious data is keeping you up at night.

Carl couldn't explain why the bounces bothered him so much. Little did he know
that User 485 had to take a burning pizza out of the oven at exactly the
moment before they converted, then suddenly thought better of the purchase.
There was nothing Poor Carl could have done.

The craving to collect data is especially strong for those of us tasked with keeping the system up, and for engineers who want to test their code in production the right way. The dream, of course, is to observe everything – to collect every drop of data we might need, and query it at a blazing fast rate. To divine outages before they happen. To blast through our systems like we’re using a Cerebro for code.

Do you even Cerebro?

It’s a good dream.

But pretty soon into our journey to become Debugging Geniuses…

…Reality intervenes.

It starts slowly. Maybe your home-grown centralized logging cluster becomes more difficult to operate, demanding unholy amounts of engineer time every week. Maybe engineers start to find that making a query about production is a “go get a coffee and come back later” activity. Or maybe monitoring vendors offer you a quote that elicits a response ranging anywhere from curses under the breath to blood-curdling screams of terror.

The multi-headed beast we know as Scale has reared its ugly visage.

As some of you may have already guessed from the title, I’m going to discuss one way to solve this problem, and why it might not be as bad as you might think.

Take some of your precious information and throw it in the garbage. In lots of cases, you can just drop those writes on the floor as long as your observability stack is equipped to handle it.

In other words, sample.

“Sample? Like they have at Costco?”

Well, this type of sampling is far less delicious, but arguably more rewarding. Although, now that I’m thinking about it, maybe you can pitch your boss to buy you new snacks with the money you’ll save…

What is sampling, then? It’s sending only a subset of the total collected information (such as events, which are JSON blobs describing what’s happening in your system) to your debugging tool. Using sampling, you can mimic having all of the data without entailing all of the costs of that data, e.g., the terabytes of storage needed (and subsequent horrendously slow query performance) if you were to store everything. In most systems, you can declare a static sample rate up-front and the system will take note of the fact that data is being sampled at this rate. In our product, Honeycomb, you can even set a per-event sample rate so that you can make sure not to lose important data like errors. More on that later.
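
A static sample rate is only a few lines of code. Here’s a rough Go sketch (the field name and stdout “sender” are placeholders, not Honeycomb’s SDK): keep roughly 1 out of every N events, and record N on each kept event so totals can be multiplied back up at query time:

package main

import (
    "encoding/json"
    "math/rand"
    "os"
)

// A sample rate of N means "keep roughly 1 out of every N events."
const sampleRate = 20

// maybeSend keeps 1/sampleRate of events at random and records the rate on
// each kept event, so query results can be weighted back up by N.
func maybeSend(event map[string]interface{}) {
    if rand.Intn(sampleRate) != 0 {
        return // drop this event on the floor
    }
    event["sample_rate"] = sampleRate
    // Stand-in for whatever actually ships events to your backend.
    json.NewEncoder(os.Stdout).Encode(event)
}

func main() {
    for i := 0; i < 1000; i++ {
        maybeSend(map[string]interface{}{"type": "page-load", "duration_ms": 42})
    }
}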

“But… my precious data….”

Well, that’s fair. I hate settling for anything less than omniscience too.

But if you reflect on the problem, and try sampling out, you might find that with sampling you lose less important information than you might think. If you need to get an eye into something that’s going wrong, it’s likely to show up multiple times and/or be a persistent problem. Therefore, even when sampling heavily you’re likely to catch it eventually. And if it doesn’t show up again or cause major issues, then it’s one of many inevitable ephemeral blips in your application’s lifespan anyway.

One helpful analogy might be to think of sampling like JPEG compression. While technically “lossy”, the tradeoff is worth it, like in this example below (from Colt McAnlis’s blog). An almost indiscernible reduction in quality results in an image which is about 30% of the size, helping to slash the bandwidth and storage bill.

In Honeycomb’s case, you still have access to the raw data from the events that you do send – so you can continue to slice, filter, and deep dive with the Honeycomb workflow you know and love. Sampling therefore lets you keep harmony between your storage quota, visibility into macro-level trends, and the ability to dig into fine-grained details. Your queries will also run faster, because the storage engine doesn’t have to churn through so many redundant rows.

And using Honeycomb, you can sample intelligently to keep what you care about the most. Let’s take a look.

Smart Sampling

Let’s say that you’re in charge of shepherding a high-traffic website or API. You probably have a lot of traffic you don’t need to check up on very closely, because things are usually operating well or because the paths being exercised are not high value. On the flip side, you might have a subset of traffic that you need crystal-clear insight into, because it relates to core business functionality such as collecting payments, or because it comes from customers of critical importance.

If we set a static sample rate (e.g., “Keep 1 out of every 5 requests”) we’d keep more of the boring stuff and lose more of the interesting anomalies.

Luckily, with Honeycomb events we can sample normal, boring events at a high rate (with a sample rate of N indicating that we’re keeping 1/N events) and keep all of the interesting bits. For instance, the image below demonstrates dropping 99 out of every 100 “boring” HTTP 200s that return in a reasonable amount of time, while keeping every HTTP 500-level response, every request from our customers of high importance, and every request that doesn’t meet our desired latency SLA.
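
That policy is easy to express in code. Here’s an illustrative Go sketch (not the open source library mentioned below, and the thresholds and rates are invented): keep every error, every request from a high-importance customer, and every SLA-busting request, and sample the boring fast 200s at 1 in 100:

package main

import (
    "fmt"
    "math/rand"
)

// chooseSampleRate picks a per-event sample rate: 1 means "always keep,"
// N means "keep 1 in N."
func chooseSampleRate(status int, durationMS float64, highValueCustomer bool) int {
    switch {
    case status >= 500: // keep every server error
        return 1
    case highValueCustomer: // keep everything for our most important customers
        return 1
    case durationMS > 1000: // keep anything that blows the latency SLA
        return 1
    default: // boring, fast 200s: keep 1 in 100
        return 100
    }
}

// shouldKeep makes the keep/drop decision for a given sample rate.
func shouldKeep(rate int) bool {
    return rate == 1 || rand.Intn(rate) == 0
}

func main() {
    rate := chooseSampleRate(200, 35.0, false)
    fmt.Println("sample rate:", rate, "keep this event:", shouldKeep(rate))
}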

We even open sourced an implementation of dynamic sampling techniques that can determine proper sample rates on the fly. You can simply set the fields you’d like to base the sampling on and let it rip.

Become a Debugging Genius

Using sampling, you’ll be able to get answers to questions faster. By querying faster, you’ll be able to try out more hypotheses, and ultimately become a better debugger. Using the techniques outlined above you should be able to separate the wheat from the chaff and mostly keep the golden data that you absolutely must hang onto.

Like a DJ cutting the bass on one track to cross-fade in another and keep the crowd grooving, you’re not losing effectiveness in your role by trimming some information. You’re gaining it! So don’t be afraid to give it a try – the documentation is available here. And as always, we’d love it if you give our Honeycomb free trial a whirl to see how event-based debugging can change the way you develop software!

Instrumenting browser page loads at Honeycomb

“Nines don’t matter if users aren’t happy” – my boss

As web applications have grown more complex (Responsive design! React! Redux!) and browser capabilities have grown more awesome (Web sockets! Prefetch/Prerender! Service Worker!), the mapping between a single http request and what the customer actually experiences when loading the page has gotten fuzzier and fuzzier. As a result, the best way to understand what our users experience when using our app is to instrument the browser directly.

The Problem

Unfortunately, browsers are usually ground zero for exactly the sorts of high-cardinality problems and cross-product problems that can crop up with traditional metrics approaches. As that person who cares about client-side instrumentation, I’ve repeatedly had this experience attempting to send my browser metrics to Graphite:

  1. Start capturing page load time metrics, with a key like:

    browser.performance.page_load_time
    
  2. Realize I want to slice and dice those by controller & action to understand which pages are slow, and start capturing page load time metrics with this key:

    browser.performance.[controller].[action].page_load_time
    
  3. Realize I also want to capture which browsers are slow, and start also capturing page load time metrics with this key:

    browser.performance.[operating system].[browser name].[browser version].page_load_time
    
  4. Realize I really need both together to find the true worst performance hot spots. Capture both of these:

    browser.performance.[controller].[action].[operating system].[browser name].[browser version].page_load_time
    browser.performance.[operating system].[browser name].[browser version].[controller].[action].page_load_time
    

    in Graphite and then create an Extremely Large Dashboard™ where I can scroll for many minutes, looking at each major browser, controller, and action combo to see which things are the slowest. Repeat across the various metrics I care about (time to first paint, dns resolution time, ssl handshake time, initial js parse & execute time…) until tired.

  5. Nervously wait by the phone for a call from The Metrics Team asking why I’m blowing up the keyspace with all these thousands of new Graphite keys. Hang up the phone and quietly slink off into the night.

bunny falls asleep at desk

These events have been dramatized for television, but you get the idea.

Enter Honeycomb

This is all a bit better with Honeycomb. Instead of all these combinations, I can send an event for each page load with OS, browser, controller, action, and various timings as keys. Rather than having to look at graphs of thousands of different metrics, we can use breakdowns in the Honeycomb query builder to find which browsers or pages are slow, and then drill in to figure out what the outliers have in common.

If that sounds a bit abstract, here’s how we do this at Honeycomb. We haven’t spent too much time on our browser instrumentation and our setup is fairly simple right now, so I’ve included the whole thing below.

The client-side bits

We collect page load stats at two points: right after the page loads, and right as the user leaves the page. This lets us collect data about both the initial page load and also the user’s experience interacting with the page — for example, whether they ran into any javascript errors and how long they kept the tab open.

This is the code we use to construct the “page-load” event:

// Send a user event to Honeycomb every time someone loads a page in the browser
// so we can capture perf & device stats.
//
// Assumes the presence of `window`, `window.performance`, `window.navigator`,
// and `window.performance.timing` objects
import _ from "underscore";
import honeycomb from "../honeycomb";

// Randomly generate a page load ID so we can correlate load/unload events
export let pageLoadId = Math.floor(Math.random() * 100000000);

// Memory usage stats collected as soon as JS executes, so we can compare the
// delta later on page unload
export let jsHeapUsed = window.performance.memory && window.performance.memory.usedJSHeapSize;
const jsHeapTotal = window.performance.memory && window.performance.memory.totalJSHeapSize;

// Names of static asset files we care to collect metrics about
const trackedAssets = ["/main.css", "/main.js"];

// Returns a very wide event of perf/client stats to send to Honeycomb
const pageLoadEvent = function() {
  const nt = window.performance.timing;

  const event = {
    type: "page-load",
    page_load_id: pageLoadId,

    // User agent. We can parse the user agent into device, os name, os version,
    // browser name, and browser version fields server-side if we want to later.
    user_agent: window.navigator.userAgent,

    // Current window size & screen size stats
    // We use a derived column in Honeycomb to also be able to query window
    // total pixels and the ratio of window size to screen size. That way we
    // can understand whether users are making their window as large as they can
    // to try to fit Honeycomb content on screen, or whether they find a smaller
    // window size more comfortable.
    //
    // Capture how large the user has made their current window
    window_height: window.innerHeight,
    window_width: window.innerWidth,
    // Capture how large the user's entire screen is
    screen_height: window.screen && window.screen.height,
    screen_width: window.screen && window.screen.width,

    // The shape of the current url, similar to collecting rail's controller +
    // action, so we know which type of page the user was on. e.g.
    //   "/:team_slug/datasets/:dataset_slug/triggers"
    path_shape: document.querySelector('meta[name=goji-path]').content,

    // Chrome-only (for now) information on internet connection type (4g, wifi, etc.)
    // https://developers.google.com/web/updates/2017/10/nic62
    connection_type: navigator.connection && navigator.connection.type,
    connection_type_effective: navigator.connection && navigator.connection.effectiveType,
    connection_rtt: navigator.connection && navigator.connection.rtt,

    // Navigation (page load) timings, transformed from timestamps into deltas
    timing_unload_ms: nt.unloadEventEnd - nt.navigationStart,
    timing_dns_end_ms: nt.domainLookupEnd - nt.navigationStart,
    timing_ssl_end_ms: nt.connectEnd - nt.navigationStart,
    timing_response_end_ms: nt.responseEnd - nt.navigationStart,
    timing_dom_interactive_ms: nt.domInteractive - nt.navigationStart,
    timing_dom_complete_ms: nt.domComplete - nt.navigationStart,
    timing_dom_loaded_ms: nt.loadEventEnd - nt.navigationStart,
    timing_ms_first_paint: nt.msFirstPaint - nt.navigationStart, // Nonstandard IE/Edge-only first paint

    // Some calculated navigation timing durations, for easier graphing in Honeycomb
    // We could also use a derived column to do these calculations in the UI
    // from the above fields if we wanted to keep our event payload smaller.
    timing_dns_duration_ms: nt.domainLookupEnd - nt.domainLookupStart,
    timing_ssl_duration_ms: nt.connectEnd - nt.connectStart,
    timing_server_duration_ms: nt.responseEnd - nt.requestStart,
    timing_dom_loaded_duration_ms: nt.loadEventEnd - nt.domComplete,

    // Entire page load duration
    timing_total_duration_ms: nt.loadEventEnd - nt.connectStart,
  };

  // First paint data via PerformancePaintTiming (Chrome only for now)
  const hasPerfTimeline = !!window.performance.getEntriesByType;
  if (hasPerfTimeline) {
    let paints = window.performance.getEntriesByType("paint");

    // Loop through array of two PerformancePaintTimings and send both
    _.each(paints, function(paint) {
      if (paint.name === "first-paint") {
        event.timing_first_paint_ms = paint.startTime;
      } else if (paint.name === "first-contentful-paint") {
        event.timing_first_contentful_paint_ms = paint.startTime;
      }
    });
  }

  // Redirect count (inconsistent browser support)
  // Find out if the user was redirected on their way to landing on this page,
  // so we can have visibility into whether redirects are slowing down the experience
  event.redirect_count = window.performance.navigation && window.performance.navigation.redirectCount;

  // Memory info (Chrome) — also send this on unload so we can compare heap size
  // and understand how much memory we're using as the user interacts with the page
  if (window.performance.memory) {
    event.js_heap_size_total_b = jsHeapTotal;
    event.js_heap_size_used_b = jsHeapUsed;
  }

  // ResourceTiming stats
  // We don't care about getting stats for every single static asset, but we do
  // care about the overall count (e.g. which pages could be slow because they
  // make a million asset requests?) and the sizes of key files (are we sending
  // our users massive js files that could slow down their experience? should we
  // be code-splitting for more manageable file sizes?).
  if (hasPerfTimeline) {
    let resources = window.performance.getEntriesByType("resource");
    event.resource_count = resources.length;

    // Loop through resources looking for ones that match tracked asset names
    _.each(resources, function(resource) {
      const fileName = _.find(trackedAssets, fileName => resource.name.indexOf(fileName) > -1);
      if (fileName) {
        // Don't put chars like . and / in the key name
        const name = fileName.replace("/", "").replace(".", "_");

        event[`resource_${name}_encoded_size_kb`] = resource.encodedBodySize;
        event[`resource_${name}_decoded_size_kb`] = resource.decodedBodySize;
        event[`resource_${name}_timing_duration_ms`] = resource.responseEnd - resource.startTime;
      }
    });
  }

  return event;
};


// Send this wide event we've constructed after the page has fully loaded
window.addEventListener("load", function() {
  // Wait a tick so this all runs after any onload handlers
  setTimeout(function() {
    // Sends the event to our servers for forwarding on to api.honeycomb.io
    honeycomb.sendEvent(pageLoadEvent());
  }, 0);
});

And here’s the code we use to construct the “page-unload” event, which runs when the user closes the tab or navigates way from the current page:

// Send a user event to Honeycomb every time someone closes or navigates away
// from a page in the browser, so we can capture stats about their usage of a
// particular page.
//
// Assumes the presence of `window`, `window.performance`, and
// `window.performance.timing` objects
import honeycomb from "../honeycomb";

// Import these numbers from our earlier "page-load" event so we can correlate
// load and unload events together and check memory usage increases.
import { pageLoadId, jsHeapUsed } from "./page_load";

// Capture a _count_ of errors that occurred while interacting with this page.
// We use an error monitoring service (Sentry) as the source of truth for
// information about errors, but this lets us cross-reference and ask questions
// like, "are we ever failing to report errors to Sentry?" and "was this user's
// experience on this page potentially impacted by JS errors?"
const oldOnError = window.onerror;
let errorCount = 0;
window.onerror = function() {
  // call any previously defined onError handlers
  if (oldOnError) { oldOnError.apply(this, arguments); }
  errorCount++;
};

// Returns a wide event of perf/client stats to send to Honeycomb
const pageUnloadEvent = function() {
  // Capture how long the user kept this window or tab open for
  const openDuration = (Date.now() - window.performance.timing.connectStart) / 1000;

  const event = {
    page_load_id: pageLoadId,
    error_count: errorCount,
    user_timing_window_open_duration_s: openDuration,
  };

  // Memory info (Chrome) — also send this on load so we can compare heap size
  // and understand how much memory we're using as the user interacts with the page.
  if (window.performance.memory) {
    event.js_heap_size_used_start_b = jsHeapUsed;
    event.js_heap_size_total_b = window.performance.memory.totalJSHeapSize;
    event.js_heap_size_used_b = window.performance.memory.usedJSHeapSize;
    event.js_heap_change_b = window.performance.memory.usedJSHeapSize - jsHeapUsed;
  }

  return event;
};

// Only attempt to send stats if the browser is modern enough to have nav timing
if (window.performance && window.performance.timing) {
  window.addEventListener("pagehide", function() {
    honeycomb.sendEvent(pageUnloadEvent());
  });
}

The server-side bit

Our primary customer-facing web app, Poodle, has an endpoint that forwards these events to Honeycomb’s ingestion API endpoint. Sending events through Poodle lets us keep our Honeycomb write key private (we don’t recommend exposing it in the browser for now, since write keys allow you to read markers and create new datasets) and also allows us to add extra metadata using our server-side instrumentation code. For example, since Poodle knows the currently logged-in user and the team they are viewing, we can add fields like user_id, user_name, user_email, team_id, and team_name to our events before sending them to Honeycomb.

We send these events to the same Honeycomb dataset that we use for our product analytics. In addition to letting us reuse our proxy endpoint logic, this means we can query the dataset for both technical and behavioral questions about the Honeycomb web experience, and more easily look for correlations between technical details (device type, performance, errors) and customers’ product usage (number of queries run, triggers created, boards shared).

Here’s the handler function we use to forward these events to the Honeycomb API:

func (h *UserEventsHandler) sendToHoneycombAPI(eventType string, metadata map[string]interface{}, user *types.User) {
    ev := h.Libhoney.NewEvent()
    ev.Dataset = "user-events"      // Name of the Honeycomb dataset we'll send these events to
    ev.AddField("type", eventType)  // Name of the type of event, in our case either "page-load" or "page-unload"
    ev.Add(metadata)                // All those event fields we constructed in the browser

    // And then we add some fields we have easy access to, because we know the
    // current user by their session:
    ev.AddField("user_id", user.ID)
    ev.AddField("user_email", user.Email)

    // Send the event to the Honeycomb API (goes to our internal Dogfood
    // Honeycomb cluster when called in Production).
    ev.Send()
}

The Libhoney.NewEvent() call will also pick up any global metadata we’ve added to send to Honeycomb, so we’ll get extras like build ID and environment for free.

How it’s going

From these three files, we get some relatively actionable data. Here’s a board of the queries I look at the most:

browser instrumentation board

My favorite graph so far is this one of our javascript and css bundle sizes. (It just needs markers so we can correlate these changes to specific deploys.)

js and css bundle sizes

Using the data in this graph, I can set up a trigger in the Honeycomb UI and get an email or Slack alert the next time I accidentally import “lodash” instead of “underscore” in our JS and inadvertently deploy another 20k of dependencies. (Whoops.)

washed away

We can also use it as a quick and dirty way of enforcing performance budgets.

js and css bundle sizes

What next?

With a few hundred lines of code, we already have good visibility into what our customers are experiencing in the browser. But there’s a lot more we could start collecting without too much trouble, if we got curious:

On page load:

  • Active feature flags and their values
  • Referring search term, if on a search result page
  • Browser capability support. Using a library like Modernizr, we could check to see what percentage of our customers’ browsers support features we’d like to use (emoji, video element, web workers, css grid, etc.) to help us decide if we can use these features in our code.

On page unload:

Any custom user timings. For example, we might want to time how long it takes to fetch and render data for async-loaded UI elements, but not hold up sending the initial page load event to capture these.

New events:

There are some significant actions users can take while interacting with a page that may warrant their own event. For example, we now want to time how long it takes to run a query, poll for the result, and render the resulting graph on screen. We already send an analytics event whenever a user runs a new query, but we may want to add those performance timings (initial request, polling duration, render duration, overall user-perceived wait time) to the event so we can start to understand how query performance affects customers’ usage of queries.

Server-side:

  • Better user agent to device type, operating system, and browser mappings. We don’t currently use a library to transform UA strings to friendly names, but we should. Using “browser name” and “browser version” in Honeycomb breakdowns would help us spot browser-specific performance issues more quickly.
  • Geo-IP mapping. Using a library to map users’ IP addresses to approximate location could help us better understand app performance for customers in Europe, Australia, etc.

Let us know

This post covers collecting general-purpose data that would be relevant to many web applications, but as you continue to instrument, you’ll likely find more app-specific and domain-specific bits of your experience to capture in your events too.

Happy instrumenting! If you run into any interesting use cases or stories in your browser instrumentation journey, we want to hear about it: support@honeycomb.io