The time has finally come: you can now run the Honeycomb AWS Bundle with a high availability setup! No more anxiously hoping your instance stays up, or that you can restart it rapidly when it doesn’t!

image of a cupcake with a hurray flag

Let’s talk about the changes:

How state is maintained

Let’s use Elastic Load Balancer log ingestion as an example: honeyelb, the component of the Honeycomb AWS Bundle that processes ELB logs, works by asking S3 for the last hour of logs from your specified ELBs.

Here’s how it works without high availability:

  • honeyelb checks a local state file containing a list of processed S3 objects, recorded by their S3 path and timestamp
  • If an object is not found in the state file, it is downloaded and parsed, and events are sent to Honeycomb
    • The object’s path and timestamp are then added to the processed objects file
  • honeyelb publishes these events and waits for new logs to arrive in the S3 bucket
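
To make the limitation concrete, here is a rough sketch of that per-instance loop. (honeyelb itself is written in Go; this is just an illustration of the logic in TypeScript, with hypothetical file and function names, not the actual implementation.)

import * as fs from "fs";

// Illustrative only: in the non-HA setup, state lives in a local file keyed by
// S3 object path, so it is invisible to other agents and lost if the box dies.
const STATE_FILE = "./processed-objects.json";

function loadState(): Record<string, string> {
  return fs.existsSync(STATE_FILE)
    ? JSON.parse(fs.readFileSync(STATE_FILE, "utf8"))
    : {};
}

function processRecentObjects(objectKeys: string[]): void {
  const processed = loadState();
  for (const key of objectKeys) {
    if (processed[key]) continue; // already handled — but only by *this* instance
    // ...download the object, parse each log line, send events to Honeycomb
    //    (the actual download/parse/send work is elided here)...
    processed[key] = new Date().toISOString(); // record path + timestamp
  }
  fs.writeFileSync(STATE_FILE, JSON.stringify(processed, null, 2));
}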

In this setup (without high availability), there’s no way for agents running on different instances to ensure that a log hasn’t already been processed. Since the file of processed objects is stored locally on each instance, we can’t check across multiple agents. What happens if one of your agents goes down? Or the box OOMs and restarts? If that happens, we’ve lost all records of what objects have already been processed!

How do we fix this? What if we could move the state that maintains processed objects out of the instances themselves?

Using DynamoDB for High Availability

We decided to solve this problem by using a global conditional write/lock system. DynamoDB provides the backend we need to make this happen. As each log object is processed, a record of it is written to DynamoDB, where it can be referenced from any Honeycomb AWS Bundle agent.

Entries in DynamoDB have a TTL (Time to Live) of 7 days, and the backfill window defaults to 1 hour. This means an agent looks back at most one hour for S3 logs, and after 7 days the DynamoDB entries expire, so your table will not grow endlessly. You can also specify how long you’d like the backfill to be, up to 7 days.
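
Under the hood, the coordination primitive is a conditional write: an agent claims an S3 object by writing its key to the shared table only if no other agent has recorded it yet, and the item carries a TTL attribute so it expires on its own. Here is a minimal sketch of that idea using the AWS SDK for JavaScript v3 (the table and attribute names are made up for illustration; honeyelb’s actual schema may differ):

import {
  DynamoDBClient,
  PutItemCommand,
  ConditionalCheckFailedException,
} from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({ region: "us-east-1" });

// Returns true if this agent "won" the object and should process it.
async function claimObject(s3Key: string): Promise<boolean> {
  const ttl = Math.floor(Date.now() / 1000) + 7 * 24 * 60 * 60; // expire in 7 days
  try {
    await ddb.send(
      new PutItemCommand({
        TableName: "honeycomb-aws-processed-objects", // hypothetical table name
        Item: {
          S3Object: { S: s3Key },
          TTL: { N: String(ttl) }, // TTL must be enabled on this attribute
        },
        // The write only succeeds if no other agent has claimed this object.
        ConditionExpression: "attribute_not_exists(S3Object)",
      })
    );
    return true;
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return false; // another agent already processed (or is processing) it
    }
    throw err;
  }
}

Running with --highavail then looks like this: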

./honeyelb --highavail --writekey={{ Your Writekey }} --dataset="{{ Your DataSet }}" --backfill=120  ingest {{ Your LoadBalancers }}

INFO[2018-02-12T12:29:37-05:00] State tracking with high availability enabled - using DynamoDB
INFO[2018-02-12T12:29:38-05:00] Attempting to ingest LB                       lbName=shepherd-dogfood-lb
INFO[2018-02-12T12:29:38-05:00] Access logs are enabled for ELB ♥             bucket=honeycomb-elb-access-logs lbName=shepherd-dogfood-lb
INFO[2018-02-12T12:29:38-05:00] Getting recent objects                        entity=shepherd-dogfood-lb prefix=shepherd-dogfood/AWSLogs/702835727665/elasticloadbalancing/us-east-1/2018/02/12/702835727665_elasticloadbalancing_us-east-1_shepherd-dogfood-lb
INFO[2018-02-12T12:29:38-05:00] Downloading access logs from object           entity=shepherd-dogfood-lb from_time_ago=17h28m20.822074s key=shepherd-dogfood/AWSLogs/702835727665/elasticloadbalancing/us-east-1/2018/02/12/702835727665_elasticloadbalancing_us-east-1_shepherd-dogfood-lb_20180212T0000Z_107.23.163.113_3bgrbnrm.log size=5887379

You can Chaos Monkey away, and your logs will still be processed!

gif of a windup monkey

Wrapping up

With high availability now an option, Honeycomb AWS agents can run on multiple instances and all communicate with the same source of truth. For more on the requirements for running --highavail and setup details, check out the README. If this has piqued your interest in Honeycomb, sign up for a free trial!

Yes, being on call typically and anecdotally sucks. I understand! If you’ve heard me speak, I often point out that I’ve been oncall since I was 17 years old—so I know how terrible it can be.

grumpycat hates oncall and her coworkers

But I believe strongly that it doesn’t have to be. Oncall can and should be different. It can be like being a superhero—if you’re on call and an issue comes up, you should get to feel like you’re saving the world (or at least your users), in a good way. It shouldn’t eat your life, or have a serious negative impact on your day-to-day interactions and personal relationships. It just shouldn’t suck that much.

Let’s explore this possibility. The first questions to ask are: Why does on call even exist? Should it exist? What are its goals, and what is out of scope?

If we agree it should exist, then how can we do it well? How can oncall become a thing that is pedestrian, part of life, something that is possible to love, or at least tolerate?

What even is oncall?

What oncall should be about in most environments:

  • Bring the service up if it goes down
  • Respond to users when they complain (or do so by proxy via Customer Support)
  • Triage bug reports and escalate, wontfix or fix
  • Make small fixes (that are scoped to some threshold, maybe < 1 day or < 2 hours)
  • Do garbage collection tasks from a grab bag of work that doesn’t happen often enough to be worth fully automating or hiring a team for, requires human judgment, is a one-off, is critical and urgent
  • Represent back to management when the burden is becoming too heavy

Why are there on call rotations? Because it forces everyone to learn. It is terrible for just a single person to know how to do a given thing; terrible for them, terrible for everyone else. We have oncall rotations for the same reason that we round-robin DNS for our API: it’s about resiliency, and ultimately, sustainability. On call rotations are the primary mechanism by which most teams address the question of sustainability.

Services need owners, not operators

It really does take a village.

Responsibility for tending to brand new, baby code is too important, too hard, too all-consuming to be one person’s job. It will break them, burn them out so fast and so hard until all they want to do is drown that code in the bathtub. (Babies, by the way, are engineered by evolution to be so cute that you won’t kill them no matter how they scream. Your code is not this cute, I guarantee you.)

To build something is to be responsible for it. Writing software should only be done as a very last resort. Great senior engineers do everything possible to avoid having to write new code.

philosoraptor meme saying if you write software and no one uses it, does it actually exist?

Developing software doesn’t stop once the code is rolled out to production. You could almost say it starts then. Until then, it effectively doesn’t even exist. If users aren’t using your code, there’s no need to support it, to have an oncall rotation for it, right? The thing that matters about your code is how it behaves for your users, in production—and by extension, for the people who are oncall for your code running in production.

The craft of building software and the craft of owning it over its lifetime are effectively the same thing. The core of the craft of software design and engineering is about maintainability. These things should continue to matter to you after your code is in production.

If I ask a room of folks who among them works on building software, and then ask the same room which of them are part of the oncall rotation, I should see about the same people raising their hands. But I typically don’t see the same people raising their hands because their experience of oncall sucks (see above), so software engineers believe they don’t want to do it.

So how do we make it not suck (so much)?

On call is a relentlessly human-scale process. It is never “fixed” or “done” any more than managing a team is done. It requires eternal vigilance and creativity and customization to your specific resources and the demands upon them at any given time. There are no laws I can pass down that if you only apply them will make everything better. It is contextual and contingent at its very heart.

But we do know that nothing is sustainable if it is hateful to everyone who participates in it. Sustainable processes are low impact and use renewable resources. Burning people out is not sustainable. Making them dread being oncall is not sustainable. We have to make it suck less.

Be(e) the oncall you want to see in the world

If you’re leading a team responsible for oncall, set an example yourself.

Empower your team to get the work done, then value their work, their time, their personal lives. Make people mobile; provide them with whatever hardware they need to do their work on the go. Show them the impact they have on the service, on customer happiness. If you can, shrink the feedback loop so your team can internalize the impact of their own choices and outcomes.

Pair people up as primary/secondary so there’s very little friction to asking someone to take over for you for a few hours. Send people home or tell them not to come in after a rough night. Support them in making those decisions for themselves. Never page new parents, or anyone who is already being woken up by something. Never force someone to carry a pager. Create incentives and social pressure and team cohesion, and they will want to.

Help your team connect on a human level. Share food with each other, try to spend time together and share things that aren’t software. Make it easier for them to ask for help from you and each other.

Most of all, ensure you and your senior people are modeling this behavior; don’t just say it’s encouraged with your mouth. Nobody believes you ‘til they see it.

Care more about less stuff

Your team can’t handle everything all the time, so be ruthless in prioritizing, reducing responsibilities, setting expectations with the rest of your organization. Consider lowering your standards! Have reasonable expectations; set them as low as you possibly can.

Guard the critical path, architect for resiliency. Have daytime alerts and few night time alerts (and audit these often). Only page on something if the customer is going to notice it. Let lots of things fail until morning. Protect your team from unreasonable expectations.

Above all, pay attention and be outcome-oriented.

Paying attention? Here’s what to look for

Signs it’s bad:

grumpycat hates oncall and her coworkers
  • People dread it
  • People plan their lives around it
  • People cancel things
  • People talk about it a lot
  • Ghoulish gallows humor
  • People complain about things being unfair

Signs it’s going well:

  • People cheerfully and actively volunteer to cover for each other
  • People ask to be covered for like it’s no big deal
  • People don’t keep score obsessively about who owes who
  • Someone gets woken up 1-2 times a week at most
  • People can freely self-select which rotation to join
  • People move among rotations often
  • People can just take their laptop and wifi with them to the concert, camping, whatevs

The glorious future of being on call

One of the crappiest things about being oncall in the current reality for most folks is that you’re stuck getting paged for the same old shit every rotation, applying the same workarounds, restarting the same services. Instead, being on call should be about solving new and different problems, being a superhero—not about being the janitor.

Oncall can and should be a break in routine, freedom from the daily tedium of incremental progress factoring widgets, an opportunity to fix all the little things that are bothering you, to give your teammates and users sparks of delight. It can be a breather where you’re encouraged to investigate and pay down technical debt, learn new things, make your team-mates’ lives a little better.

And you want your oncall team to be freed up from handling the known-knowns, and the known-unknowns—because as complexity continues to accelerate, you need them fresh and bright-eyed and bushy-tailed to handle all those unknown-unknowns coming our way.

Hoping to see you someday in the glorious future of being oncall :)

This guest blog post from Anton Drukh of snyk.io is part of our series on structured logging.

From 1 service to over 50 today

We run a distributed engineering team at Snyk. As we scale our team and infrastructure, we want to make sure that we don’t make a mess of the daily operation of our system. We don’t want to fall into the trap of committing code that makes sense when it’s written, but not when it’s run in production.

We started with a single service in our stack. Two years into adopting a microservices architecture, we now span over 50 services. One aspect of this growth has been a constant replication of best practices throughout our team and codebase. We want to make sure that what helps us succeed as a team is ingrained in our code.

In this post, I’d like to show how we embed trust and transparency into our services, reflecting these qualities in the way we work as a team.

For example, you just finished working on a feature and pushed it to production. How likely is it that another team member can effectively troubleshoot this feature the following day? Does the effort involved in troubleshooting grow as the system grows, requiring more context to reason about? What are you doing to fight this, and is your system designed in a way that helps you?

A big part of the answer to this is in your logs. Teach your code to communicate with your team, and you will reap the benefits for as long as the code lives!

cabin made of logs

But we already do logging!

To support the examples in this post, I published a simple demo server that is very close to a boilerplate for a microservice in our stack. It is based on koa and bunyan, popular JS packages that we use extensively on the team. Please try it out yourself!

Our logs conform to a set of rules, supporting a strong standard for our team that helps us understand what is going on with any service in our production environment:

  • Each request hitting a server is logged at least twice (start and end).
  • Each request being handled by the server is assigned a request ID that accompanies all its logs.
  • Each response sent by the server is logged with:
    • a constant string message ('Reply sent').
    • a log level identifying the response status (e.g., info for 2xx, warn for 4xx, error for 5xx).
    • a duration for which the request was being processed by the server.
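
Here is roughly what those rules look like as koa middleware with a bunyan logger. This is a simplified sketch rather than the actual demo server, but it shows the shape of the thing: one child logger per request, a start log, and an end log whose level is picked from the response status.

import Koa from "koa";
import bunyan from "bunyan";
import { randomUUID } from "crypto";

const rootLogger = bunyan.createLogger({ name: "demo-service" });
const app = new Koa();

// Log every request at least twice (start and end), tagged with a request ID.
app.use(async (ctx, next) => {
  const log = rootLogger.child({ requestId: randomUUID() });

  const { method, originalUrl, query } = ctx;
  log.info({ method, originalUrl, query }, "Handling request");

  const start = Date.now();
  let error: Error | undefined;
  try {
    await next();
  } catch (err) {
    error = err as Error;
    ctx.status = 500;
  }
  const duration = Date.now() - start;

  // 2xx -> info, 4xx -> warn, 5xx -> error
  const fields = {
    method,
    originalUrl,
    query,
    duration,
    status: ctx.status,
    error: error && { name: error.name, message: error.message, stack: error.stack },
  };
  if (ctx.status >= 500) log.error(fields, "Reply sent");
  else if (ctx.status >= 400) log.warn(fields, "Reply sent");
  else log.info(fields, "Reply sent");
});

app.listen(3000);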

The server has a single API endpoint, and most of the logic around it is about logging. The logging scheme produces the following output for each request hitting the server:

{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"query":{},"originalUrl":"/","method":"GET","msg":"Handling request","time":"2018-01-27T16:07:51.505Z","v":0}
{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"sleepDurationInMs":100,"msg":"Going to sleep for a while","time":"2018-01-27T16:07:51.505Z","v":0}
{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"sleepDurationInMs":100,"msg":"Woke up!","time":"2018-01-27T16:07:51.611Z","v":0}
{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"query":{},"originalUrl":"/","method":"GET","duration":110,"status":200,"msg":"Reply sent","time":"2018-01-27T16:07:51.613Z","v":0}

That’s a bit hard to read for us humans, but knowing our logging standard, you can see the following:

  • Each message is a single-line JSON… Yummy for machine parsing!
  • Each message has the same requestId value of 5251fd1a-a41c-4bd7-be1d-c121ef38ebea.
  • Each message has a log level of 30 (info in bunyan-speak).
  • The first message has msg: 'Handling request', and the last has msg: 'Reply sent' - these are the basic two logs per request I mentioned.
  • The last message has duration: 110, indicating the processing time of this request in milliseconds.

What seems like a lot of logging for a few lines of code is actually observability heaven. The following important questions are easy to answer once these logs are parsed by an ELK stack:

  • How many requests have been served in the last 24 hours?
  • What are the 10 most frequent errors occurring for a specific request type?
  • What are the 99th, 95th and median percentiles for duration of request handling by this service?
  • What percentage of requests are failing with 4xx status codes? 5xx status codes?
  • Does the service’s error rate correlate with time-of-day? Count of requests served?
  • Let’s say a support request came in from a user stating that they experienced a service outage at approximately 2pm today. What errors occurred during that time? Can we identify the initial request made by that user, and trace it through all of our services to understand what went wrong?

Ok, that’s a deeper-than-usual level of logging, but why, really?

I’ve been told that not everyone is as excited about dashboards and automated alerts as I am, so allow me to try to convince you too: if all of your code conformed to this logging standard, how easy would it be to troubleshoot in production?

Let’s see an error case, included in the demo server example. Making an error-inducing request results in the following log (expanded for readability):

{
  "name": "demo-service",
  "hostname": "foo-bar-host",
  "pid": 58465,
  "requestId": "48eb72b0-088b-4470-b31b-ae19bfd561f0",
  "level": 50,
  "originalUrl": "/?error=yesPlease",
  "method": "GET",
  "error": {
    "name": "NotImplementedError",
    "message": "BOOM!",
    "stack": "NotImplementedError: BOOM!\n    at module.exports (.../koa2-bunyan-server/src/controllers/sleep.js:15:17)\n    at <anonymous>",
    "sleptInMs": 100
  },
  "duration": 104,
  "status": 501,
  "msg": "Reply sent",
  "time": "2018-01-27T16:25:58.787Z",
  "v": 0
}

What do we have here? A quick glance reveals the plentiful appearance of \n in that last log message. It makes me feel somewhat fuzzy inside, knowing the logs are practically screaming at me: “Here! Look here! The issue is here!” with that clean stacktrace. But stacktraces are not enough. They can tell you where you ended up, but not necessarily how you got there. This is where the rest of the logging structure and standards come in. Seeing the request ID on the error response log allows us to backtrack through the problematic flow all the way to its beginning. This, in turn, brings root cause analysis within reach.
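
Even without an ELK stack, the request ID makes that backtracking mechanical. A quick, illustrative script over the raw log output (assuming it has been captured to a file called service.log) is enough to reconstruct one request’s whole story:

import * as fs from "fs";
import * as readline from "readline";

// Print every log line belonging to a single request, in order, so the flow
// behind an error can be reconstructed from its requestId.
async function traceRequest(logFile: string, requestId: string): Promise<void> {
  const rl = readline.createInterface({ input: fs.createReadStream(logFile) });
  for await (const line of rl) {
    try {
      const entry = JSON.parse(line);
      if (entry.requestId === requestId) {
        console.log(entry.time, entry.level, entry.msg);
      }
    } catch {
      // skip lines that aren't single-line JSON
    }
  }
}

traceRequest("service.log", "48eb72b0-088b-4470-b31b-ae19bfd561f0").catch(console.error);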

When was the last time you said ‘let’s deploy this debug logging and see what happens’? With this type of structure in your logs, you’ll say that much less frequently.

Count me in, I’m sold!

High-level visibility into your services, combined with a laser-focused ability to get to the bottom of every issue that may occur? You should not expect anything less from your logs.

What makes this really worthwhile for me is the ROI. Set this up once for your service boilerplate, and reap the benefits with every new feature you’re shipping. What’s not to like?

The team aspect

All of the above shows the immediate benefit of structured logging. But we’re in this for the long run! Working in this way over the past 2 years has positively affected how we collaborate and share responsibilities as an engineering team.

logging team and logs on a cart

Onboarding

New hires joining our team spend a sprint getting to know the ins and outs of our system and codebase. Knowing that all services ‘speak the same language’ adds a very effective angle to this onboarding. New team members learn our system not just from the code and documentation, but from observing the production environment. Our preset dashboards and alerts tell a story worth listening to, helping new team members get up to speed with the real deal - our impact on our users.

Oncall

There are 12 developers in our team right now, each of us partaking in the oncall rotation for 7 days, approximately every 3 months. The week consists of both making sure our system is healthy, and working on customer support cases. We do not have an ‘ops team’, and are not planning to have one in the near future. Our shared responsibility is to build and operate our service for the benefit of our users.

Expecting every developer on the team to be able to efficiently troubleshoot any type of issue seems intimidating at first. Prior to their first oncall week, every developer pairs with a more experienced team member to learn the oncall tools. Having structured logging as a standard makes this training experience worthwhile. Instead of focusing on the individual intricacies of each part of our system, we train on the common tools to be used for almost every kind of problem. Learning what to expect of our structured logging, as well as how to use it, is a big part of this training.

Impact

My main responsibility on the team is to make sure we are making an efficient and positive impact. I’m always on the lookout for black holes that consume a lot of energy, but produce little result. One of those could be troubleshooting! Not knowing what to do to answer ‘why?’ given a symptom of the system can be very disempowering. When a customer support issue comes in, our logs are usually the first thing we look at. Their structure makes it easy to get answers, and also to ask the right questions.

I am really proud to say that our problem-to-root-cause identification cycle takes minutes in the vast majority of the cases. This motivates me to invest in this infrastructure component of our system (this post included!) so I can amplify the impact our team makes on a daily basis.

Structured logging helps build structured teams

Standardisation through structured logging creates a shared terminology that glues teams together. It’s a critical part of joint ownership and breaking down siloed knowledge. Consider troubleshooting a vague error or symptom. Knowing what to look for is the difference between a confident resolution plan and a wild goose chase.

Covering operational aspects is a natural part of how we run our engineering team at Snyk. Structured logging is a key factor enabling us to share this responsibility across the whole team. I hope sharing this practice can help your team as well, and if you like the sound of how we are doing things, check our jobs repo to see if anything looks interesting to you :)

For years, the “DevOps” community has felt focused on one main idea: what if we pushed our ops folks to do more development? To automate their work and write more code? That’s cool—and it’s clearly been working—but it’s time for the second wave of that movement: for developers (that’s us!) to own our code in production, and to be on point for operating / exploring our apps in the wild.

Observability is that bridge: the bridge from developers understanding code running on local machines to understanding how it behaves in the wild. Observability is all about answering questions about your system using data, and that ability is as valuable for developers as it is for operators.

gif of Valkyrie crossing bridge

During the development process, we’ve got lots of questions about our system that don’t smell like “production monitoring” questions, and aren’t necessarily problems or anomalies: they’re about hypotheticals, or specific customer segments, or “what does ‘normal’ even mean for this system?”

By being curious and empowered to leverage production data to really explore what our services are doing, we as developers can inform not only what features we build / bugs we fix, but also:

  • How we build those features / fix those bugs
  • How features and fixes are scoped
  • How we verify correctness or completion
  • How we roll out that feature or fix

When wondering… how to make a change

There’s a difference between knowing that something can be changed or built and knowing how it should be. Understanding the potential impact of a change we’re making—especially something that’ll have a direct, obvious impact on users—lets us bring data into the decision-making process.

By learning about what “normal” is (or at least—what “reality” is), we can figure out whether our fix is actually a fix or not.

screenshot of bursty and steady traffic
What's "normal" customer use here? Are the steady requests "normal," or the bursty ones? Which customers represent which behavior patterns?

In our Check Before You Change dogfood post, Ben described what it was like to code up an “obvious” feature request: unrolling JSON for nested objects. But instead of just diving in and shipping the change, by gathering data first (and instrumenting our dogfood payloads to include things like the depth of the nested JSON, and how many columns it would expand to if we unrolled it), we were able to protect most of our users by making the feature opt-in-only.

A bonus example of this is our Measure twice, cut once: How we made our queries 50% faster…with data dogfood post: by looking carefully at where we could trim down perceived latency for the end user, Toshok found a way to dramatically improve the experience for most folks without going into the weeds of performance optimization (plenty of that happening already!)

These examples are our debug statements in production and are so powerful for us day to day. They’re lightweight, they help validate hypotheses, they carry metadata specific to our business (e.g. customer IDs), and they let us suddenly describe the execution of our service’s logic in the wild. All of this together helps us plan and build a better experience for our customers.

When wondering… how to verify correctness

Developers rely on all sorts of tests—unit tests, functional tests, integration tests, sometimes even end-to-end tests—to verify that our code does what we think it does.

And testing code for correctness on your machine or a CI machine is all well and good, but the real test is whether your code does the right thing in production. So what do we do when we aren’t quite sure about our change, or want to check its impact in some lightweight or impermanent way?

Feature flags (shout out to LaunchDarkly!). We love ‘em for letting us try out experimental code in a controlled way, while we’re still making sure the change is one we’re happy with shipping.

Pairing feature flags with our observability tooling lets us get incredibly fine-grained visibility into our code’s impact, and Honeycomb is flexible enough to let us throw arbitrary fields into our data—then look at our normal top-level metrics, segmented by flags.
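
Concretely, that just means attaching the active flag values to the events we already send, so they become dimensions we can group and filter by later. A rough sketch of the idea (the event shape and flag names here are hypothetical, and your feature-flag client will look different):

// Hypothetical wide event for a request: whatever fields you already capture,
// plus one field per feature flag so you can segment any metric by flag state.
interface RequestEvent {
  durationMs: number;
  status: number;
  customerId: string;
  [field: string]: unknown; // arbitrary extra fields are welcome in a wide event
}

function annotateWithFlags(
  event: RequestEvent,
  flags: Record<string, boolean>
): RequestEvent {
  for (const [name, enabled] of Object.entries(flags)) {
    event[`flag.${name}`] = enabled; // e.g. "flag.new_storage_engine": true
  }
  return event;
}

// Usage: the same top-level metrics, now segmentable by flag (and combinations).
const event = annotateWithFlags(
  { durationMs: 87, status: 200, customerId: "cust_42" },
  { new_storage_engine: true, fast_path_parser: false }
);
console.log(JSON.stringify(event));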

screenshot of metric with feature flags
We're looking at a vanilla top-level metric, but segmented by the values of a few arbitrary feature flags (and combinations thereof!) that were set at the time.

This let us do wild things like: turn a feature flag on for a very small segment of our users, then carefully watch performance metrics for the set of folks with the feature flag on, compared to those without. In our The Correlations Are Not What They Seem dogfood post, Sam did just that, and illustrated the power of being able to slice the performance data by the precise dimensions needed to draw out insights and feel confident about his change.

Feature flags and Honeycomb’s flexibility let us use these ephemeral segments of our data to answer the exact sorts of ephemeral questions that pop up during development—so that we can do ad-hoc, on-demand canarying in a way that’d make our ops folks proud.

When wondering… how that fix was rolled out

As any ops person will tell you, the biggest source of chaos in a system is usually humans: developers pushing new code.

(If you’re lucky, they might even tell you twice. 💛)

The grown-up version of the ad-hoc, on-demand canarying I mentioned above (or less-grown-up, depending on how carefully the code was merged) is simply: being able to compare top-level metrics across build IDs and deploys.

Build IDs are a stealthy member of the “high cardinality” club: often a simple, monotonically increasing integer that can nevertheless take on an unbounded number of possible values. But they’re invaluable for figuring out which build introduced a problem, or where something got fixed.

Anyone with a sufficiently large number of hosts, though, knows that deploys aren’t instantaneous, so timestamps aren’t quite enough, and it’s useful (and sometimes mesmerizing) to watch a deploy roll out across hosts, traffic switching over from build to build.

screenshot of graph broken down by build_id
Breaking down by something as straightforward (but deceptively high-cardinality!) as `build_id` can help visualize the real, natural behavior of our systems.

That second wave of DevOps, and what observability means for it

By making our observability tools use the nouns that are already ingrained into our software development processes—build IDs, feature flags, customer IDs—we developers can move from “Oh, CPU utilization is up? I guess I’ll go… read through all of our benchmarks?” to “Oh! Build 4921 caused increased latency for that high-priority customer you’ve been watching? I’d better go take a look and see what makes their workload special.”

Observability isn’t just for operations folks or for “production incident response.” It’s for answering the day-to-day questions we have about our systems, so that we can form hypotheses, validate hunches, and make informed decisions—not just when an exception is thrown or a customer complains.

If you’re interested in learning more, read about how we instrument our production services or (in excruciating detail) the datasets we rely on, day to day, and the sorts of questions we ask about our own services.

Or take Honeycomb for a spin yourself! Our Quickstart dataset is available for anyone to poke around in, and we’d love to hear what you think.