hnycon Keynote
Join Honeycomb co-founders Christine Yen and Charity Majors for a look into the present and the future of observability, as well as the future of Dev and Ops. Rich Anakor, Chief Solutions Architect at Vanguard, will show how Vanguard is realizing that future today. Liz Fong-Jones will demo the latest and greatest in Honeycomb, including major product announcements revealed for the first time here.
Transcript
Christine Yen [CEO & Co-founder|Honeycomb]:
Welcome to hnycon, our first ever user conference. We couldn’t be more excited to have you here today. We have announcements for you and shiny new features and a fantastic agenda about how people are viewing Honeycomb. To kick the day off, I thought I would share a little bit about how we got here and the philosophies that underpin what Honeycomb is today.
When Charity and I first met back in 2012, our first few interactions at work looked a little something like me, as a new dev, new to the team, shipping some new component or feature, and her, the tenured ops person, grudgingly working with me to get my new thing production-ready. At the end of the process, she would point me to a dashboard and say, Okay. This is the one to watch to make sure everything is okay.
I would nod and say, Okay. Ready for production. What I didn’t admit to her was that I had no real confidence that I would know what to do if a graph started looking funny. I was pretty unfamiliar with shipping software at that scale, and, honestly, I was embarrassed. She seemed badass, so did all the ops people. I was the new dev on the team. When things started to go sideways, I was not helpful. I would mimic the ops folks and scroll through the dashboards, but I didn’t know how to map the graphs back to the logic. They were talking about ops things like Cassandra write throughput and I was trying to reproduce the problem locally and construct a test case and deploy a fix, you know, dev things.
We eventually found the issue, sometimes slapped a band-aid on it, and moved on to the next project. Neither of us was having fun working with each other, but we moved on. When we started working at Facebook, we started using an internal tool that eventually became the inspiration for Honeycomb. It didn’t look like any previous logging or monitoring tools. It let us throw all sorts of data into it without having to worry about schemas or cardinality. Before long, we found ourselves using the same tool, happily this time, to look at production. She was able to use the tool for traditional purposes, since she and her team still owned infra. But I was using the same tool to support the service that I built, trying to dig and dig and understand why my code was behaving strangely for some customers and not others.
With the benefit of hindsight, one might notice that those are the same behaviors, just on different time frames and with different levels of pressure. Without us really trying, once we were able to use one tool to combine her nouns, write throughput, host, CPU utilization, with mine, the deploy ID, user ID, API endpoint, we were using the same tool for everything. When we first started Honeycomb, we thought we were only building it for the giant systems of the world and the Charitys out there, carrying the pagers and dealing with the pressures of being on call.
But during the first year, we realized that, actually, we were building for everyone, because everyone is dealing with these problems of understanding production. Not only is dev buy-in necessary to capture the things that matter to the logic, but getting dev buy-in makes observability as a whole better, because it’s better when devs feel comfortable in production. Everyone is served when devs and ops can share ownership of production. So we reset our spot on the spectrum from hard ops to somewhere in between, because these roles are blurring, have been blurring. And since then, Honeycomb has been built for teams who recognize that, who want a tool to facilitate the blurring of boundaries, and who want to create a shared language across dev and ops.
5:00
We made a set of early predictions, bet the company on them, and they have only been confirmed over the last five years. First, the complexity of systems has, in fact, only increased. Yes, there’s been some backtracking, some macroservices instead of microservices, but it’s all more complicated than it used to be. More top engineering teams have started putting developers on call, like Outreach, who we heard from yesterday about the lessons they learned by putting their 200 developers on call. We knew customization would be key, and that doing the underlying baseline work would be part of that. OpenTelemetry has made it possible to build a more portable and customizable future.
Finally, this one is still on its way, but we thought developers would start to think about production as an extension of their development environments. It’s in its early days, but for anyone who caught yesterday’s talk with Michael Haberman, CTO of Aspecto, it’s happening. At our very first o11ycon, over three years ago, we included a set of open spaces where attendees could come together and talk about what observability meant to them. The goal was to come out with what observability could mean for the future and to share our emerging best practices. Today, we’re not debating definitions of observability anymore, and it’s about damn time. The industry has embraced it. The analysts have blessed it, and real teams are seeing real results. It’s no longer just about the promise of observability; it’s about demonstrated results.
Finally, we don’t all have to include a five-minute digression at the start of our talks about how it’s different from monitoring and the various data structures we use. Instead, we can have more fun conversations. We can take the word “observability” at face value. What did our teams gain today by being able to see into production? How do we level up? What processes can support this observability, and what supercharges it? And how do we ensure our ability to deal with complexity keeps up with the times? I know one way of dealing with complexity that doesn’t work: doubling down on what used to work. Instead of three dashboards, loading up 30. Building more runbooks, defining more time series. We’ve seen that world. Some of us have lived it. It ain’t pretty.
The other approach is the one we’ve talked about nonstop on stage for the first three years of Honeycomb: accepting that we’ve left the world of known unknowns behind, and that the only way forward is to embrace that chaos, to embrace the unknown unknowns, and to iterate and explore our way to clarity. The only way to do that is to foster attitudes that make our engineering teams better so they learn to think about and anticipate the weird stuff happening in production. Being curious, that willingness to frame questions like humans, dig into that funny spike on a graph, and examine our assumptions, that’s the way forward. That’s how we deal with unknown unknowns. That’s what we’re seeing more of today as folks adopt observability.
The ray of hope here is that this behavior isn’t brand new for most of us. We humans, we engineers, are great at exploring problems, asking questions, and iterating on a hypothesis. It’s a very human behavior at the core of this ability to observe and understand production systems, and it’s wonderful to see old-school ops and old-school devs recognize that they can and will and should learn to explore their production software like this too. That’s the part that has to happen outside the computer anyway. Under the hood, there are a few technical requirements that have been core to Honeycomb’s architecture and that are uniquely capable of unlocking those human behaviors. First, support for arbitrary combinations of high-cardinality data. You should be able to capture data that describes your business, your customers, your configuration in a way that makes sense to your engineers. You shouldn’t have to predict ahead of time which attributes will be important alongside other attributes so you can build indexes on them. Effective debugging cannot rely on predefined, pre-generated views. That flexibility is key.
Second, near-instant query performance. Humans have shorter attention spans than ever these days. Science has said it, and, thus, it must be true. Debugging can feel like a mystery hunt. You’re going from one clue to the next, looking under rocks, occasionally backtracking. Each of these clues is the result of a question we ask. If the answer doesn’t come quickly, we may move on to something else. It’s easy to give up and sweep remaining questions under the rug if it’s painful to keep investigating and the crappy answer we’ve got seems good enough.
10:49
Within Honeycomb, our query performance is a point of pride because we know it’s key to ensuring that our users are willing to keep digging and get to the truth. Speed and flexibility: when you boil it down, the ingredients for observability are not so complex after all. Your engineers always do, and always will, know best how to sniff out what stinks in your production systems. So a tool that allows them to be more themselves and do their investigative work more effectively just needs to be fast and fit to them.
One of the most exciting parts of growing with our customers is seeing what happens when they’re first armed with Honeycomb versus what happens after they’ve had a chance to grow with it and meld it with existing workflows. Everyone starts with incident response use cases, debugging hair-on-fire problems, and reducing downtime. But once folks are able to catch their breaths, the thought pops up: Hey, why don’t we shift this to be more proactive? Can we measure the performance of this in production? Can we use Honeycomb to understand and anticipate problems in this component we haven’t actually released yet? When you start to ask questions, be curious, and iteratively interrogate your systems, you find new points in the process to examine. You add new instrumentation and get new data, and your questions evolve over time. That’s the promise of observability. It’s that drive, that curiosity, the ability to understand not just how your applications and infra are behaving today but how to make your software perform better tomorrow.
If you look at the product Honeycomb has built and continues to build, you will see these core beliefs manifested in the experience we have crafted for you. We obsess over query performance because we know it affects how you use the tool, so we dig into our own query performance ourselves. We want to ensure that playing with data feels like playing with a hypothesis. We’ve opened up all the graphs that have frustrated us in the past and made sure you can stick your hands into any graph you want. We never want you to hit a dead end, so we offer a query history, an undo, and permalinks you can access forever. We want to make it easy and intuitive to reuse past work, so we leave breadcrumbs and interesting queries for you to jump off from, whether for your teammates or just your past self. And we optimize for you sending the data that matters to you.
We talk endlessly about the value of high cardinality data, like build ID or customer ID, and ensure that you can send whatever new fields are important to you today. People like to tell Charity and me we’re being too idealistic about the changes we were trying to make. All people wanted were better dashboards, but we’ve never lost sight of the very human side of understanding production systems. We’ve tried to build a reality that would allow all of you wonderful humans to be able to do just that, to be your clever, intuitive, contextual, exploratory, engineering human selves.
15:03
People don’t just want better tech. That’s not enough and never will be, not at the rate our world is changing around us. To borrow another phrase from Charity, the future of software is a sociotechnical problem. Technical advances unlock human behaviors, and human and organizational tendencies necessitate technical change. The socio and the technical are interwoven, not two distinct halves of a whole. So we’ve built this for the industry, for you humans, to use.
And over the years, this is what we’ve heard you’ve done with it to delight your users. You’ve made on-call suck less: Honeycomb helps you resolve things faster by being able to tease things apart rather than only understanding aggregates. You’ve made your customers happier: Honeycomb allows you to make performance a priority and optimize every aspect of how your software runs. You’ve built feedback loops into all sorts of adjacent processes: Honeycomb plays nicely with progressive delivery, among others, so you can embed it into your existing workflows. And you’ve built high-performing teams: Honeycomb helps teams pinpoint areas to improve, transfer knowledge, and ship fewer breaking changes in the first place.
It’s been pretty incredible, and that’s what you will hear about later today. That’s what Honeycomb is about, after all: hearing from real customers. But I promised you some new shiny. Liz, will you walk us through what all this looks like from an engineering perspective and maybe show off our new releases along the way?
17:12
Liz Fong-Jones [Principal – Developer Advocate|Honeycomb]:
Thanks, Christine. I’m really excited to be here at hnycon with you all today.
Hi. I’m Liz. I’m an engineer at Honeycomb. Today, I would like to show what Christine means when she talks about designing for curiosity with features that are fast and fit both developer and operational needs.
We’re going to look at what it means in practice to reimagine the workflow of releasing software to production. I’m going to show you how we release software at Honeycomb, and how we debug it, and how that differs from other places I have worked before.
Because, in a previous life, we did big-bang releases. We only shipped software about once a month, in big batches, to production. We wrestled with having many different release branches. We spent hours rolling back any time misbehavior turned up in prod, and we struggled to diagnose performance problems such as individual users seeing slow or erroring requests.
Now, it’s true that companies have gotten better at performing small-batch releases more frequently, but moving faster doesn’t necessarily translate to more safety. There’s often no real confidence that deployments are truly production-ready, and engineers don’t really have a good understanding of how their changes are going to behave in production. So we’re packaging up and shipping changes to production often and automatically, but the net result, unreliability and distrust of production, is the same. If issues happen once a feature reaches production, with older tools it’s difficult to diagnose why they’re occurring, and we often have to roll back as the only way to fix the problem.
But, fortunately, we’re able to move much faster and more reliably at Honeycomb. We practice continuous delivery, and we integrate observability into our entire software development life cycle. We call this observability driven development.
So today I would like to show you how I modify, release, and debug new Honeycomb features we’re shipping in production by using a second copy of Honeycomb. I’m going to show you how Instrumentation, Build, Service Level Objective, Querying, Metrics, and Collaboration Features all combine together to make for a fast and reliable developer experience that empowers our developers to have curiosity and understanding.
Let me show you Honeycomb’s Query Data API, a new feature being released today that allows you to run Honeycomb queries programmatically and obtain results as data you can integrate into your workflows any way you see fit.
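As a rough illustration of what running a query programmatically can mean here, the sketch below creates a query against a dataset from Go. It is not from the talk: the endpoint path, auth header, and payload fields follow my reading of the public Query Data API documentation and should be checked against the current docs, and the dataset slug and query spec are hypothetical.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	apiKey := os.Getenv("HONEYCOMB_API_KEY")
	dataset := "api-service" // hypothetical dataset slug

	// A simple query spec: count events grouped by team ID over the last hour.
	spec := map[string]any{
		"time_range":   3600,
		"breakdowns":   []string{"app.team_id"},
		"calculations": []map[string]string{{"op": "COUNT"}},
	}
	body, _ := json.Marshal(spec)

	req, _ := http.NewRequest("POST",
		"https://api.honeycomb.io/1/queries/"+dataset, bytes.NewReader(body))
	req.Header.Set("X-Honeycomb-Team", apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var created struct {
		ID string `json:"id"`
	}
	json.NewDecoder(resp.Body).Decode(&created)
	fmt.Println("created query:", created.ID)

	// Next step (omitted): ask for this query to be run against the dataset,
	// poll for the result payload, and feed the rows into whatever workflow
	// you like.
}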
For this demo, I would like to show you how I can make modifications to that new feature. Let’s suppose I want to improve the freshness of results for beta customers of the Query Data API. Let’s go ahead and cache the results for less time so they can get fresh data every 10 seconds, not every 10 minutes. And let’s allow them to do that not for just 24 hours of data but for an entire month at a time.
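The talk describes the parameter change but does not show the diff, so here is a purely hypothetical sketch of what such a change might look like in Go; the constant names are invented for illustration and are not Honeycomb’s code.

package querydata

import "time"

// Hypothetical before/after of the caching parameters described in the demo.
const (
	queryCacheTTL  = 10 * time.Second    // was 10 * time.Minute
	maxQueryWindow = 30 * 24 * time.Hour // was 24 * time.Hour, roughly one month
)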
So we’re changing these parameters, and we’re adjusting the handlers that handle querying, rate limiting, and caching. We’re going to kick off a pull request that will test this change before it’s released into production. Then CircleCI will go ahead and start building my software. Now, I can follow what’s happening inside the CircleCI UI, and that gives me the web of dependencies, but it doesn’t tell me what’s happening at what time, what happened when. So, instead, let’s look at the build job as a trace inside of Honeycomb.
When I look at that view, it enables me to understand why my build is unusually slow and what the slowest part was. This is something that Honeycomb Build Events enables; it integrates for us with the CircleCI orb, but you can also use GitHub Actions or just run a shell command inside of your build pipeline.
Once my build has finished, I can go ahead and think about releasing it to production by clicking on the merge button. But, wait a second. Let’s make sure it’s actually production ready. At Honeycomb, what this means is I want to have enough telemetry baked in so I know how my production service’s behavior is changing and which users are impacted. If the performance is negatively impacted for any reason, I need to know what I’m looking for inside of the prod code.
Let’s review how we measure and understand success. With OpenTelemetry, we’ve added custom instrumentation fields like cache hit or miss and team and dataset IDs, and we have inner spans for every individual function call that might take a lot of time. This is all in addition to the automatic instrumentation that OpenTelemetry adds on every API request.
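For readers who want to see what that kind of instrumentation can look like, here is a hedged sketch using the OpenTelemetry Go SDK. The attribute and function names (app.team_id, app.dataset_id, app.cache_hit, fetchResults, lookupCache) are stand-ins, not Honeycomb’s actual fields or code.

package querydata

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

var tracer = otel.Tracer("query-data-api")

// fetchResults annotates the current span with business-relevant fields and
// wraps the potentially slow cache lookup in its own child span.
func fetchResults(ctx context.Context, teamID, datasetID string) error {
	ctx, span := tracer.Start(ctx, "fetchResults")
	defer span.End()

	span.SetAttributes(
		attribute.String("app.team_id", teamID),
		attribute.String("app.dataset_id", datasetID),
	)

	hit, err := lookupCache(ctx, teamID, datasetID)
	span.SetAttributes(attribute.Bool("app.cache_hit", hit))
	return err
}

func lookupCache(ctx context.Context, teamID, datasetID string) (bool, error) {
	_, span := tracer.Start(ctx, "lookupCache") // inner span for the slow part
	defer span.End()
	// ... actual cache lookup elided ...
	return false, nil
}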
These bits of telemetry tell me how the caching performance has changed and who that change is impacting. Additionally, we also have service level objectives set for all of Honeycomb as well as for this specific API service. So we’ve defined what the success criteria are for our beta customers who are using this new query data functionality.
In this particular case, we would like Query Data API results to return in less than two seconds. If results take longer than two seconds to return, that’s a bad experience for Honeycomb customers, and we want to count those queries against our error budget. Remember: slowness is the new downtime.
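To make the error-budget idea concrete, here is a small, hedged sketch in Go. The two-second threshold comes from the talk; the target success rate and the arithmetic are illustrative assumptions, not Honeycomb’s actual SLO definition.

package querydata

// meetsSLI reports whether a single Query Data API request counts as good
// against the two-second threshold described above.
func meetsSLI(durationMs float64) bool {
	return durationMs < 2000
}

// budgetRemaining returns the fraction of error budget left, given how many
// events were good out of the total, for an assumed target success rate such
// as 0.999. A negative result means the budget is exhausted.
func budgetRemaining(good, total int, target float64) float64 {
	if total == 0 || target >= 1 {
		return 1.0
	}
	allowedFailures := float64(total) * (1 - target)
	actualFailures := float64(total - good)
	return 1 - actualFailures/allowedFailures
}

For example, with an assumed 99.9% target over a million requests, the budget allows roughly a thousand slow responses before it is fully spent, which is why a burn alert that fires early matters so much.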
Of course, we’ve also zoomed out and looked at all of our service level objectives just to make sure everything is in a good state before we push any releases.
As a developer, I’ve clearly defined success and failure criteria for how my changes will impact production. I know where in my code to look if I encounter any problems, and I understand the state of production before my change is introduced. That is what it means to be production-ready.
At Honeycomb, it’s my responsibility to watch how our changes impact users in production. When I ship, I’m expected to look at production behavior alongside whoever is on call from Engineering. Our observability-driven approach really helps us with being able to understand code changes as they’re being rolled out with the help of great tooling like Honeycomb.
Now, I can hit the merge button, and the change gets built within 10 minutes and automatically shipped to all of our environments within an hour. Let’s go ahead and wait an hour and see what happens. Well, that’s less good. Within an hour of my change shipping, the reliability of the Query Data API has really taken a nosedive. That’s unfortunate. The good news is that we found out before we exhausted our error budget, because we got a burn alert that told us proactively. So the engineer on call and I both started looking, because they’re on call for the service as a whole, and I’m responsible for watching my specific code.
So we can see there’s a dip in availability, and Honeycomb points out what factors are contributing to unavailability. Which properties are shared between all the failing requests that happened after my code was released to production? I want to understand “why did my code fail?” and “why didn’t this turn up in pre prod?”
Honeycomb’s integrated BubbleUp feature helps us understand which keys and what part of the keyspace are broken. You can see a few things. First of all, the heatmap lets us see that the majority of queries are still faster than two seconds, but some are slow. That’s a bad customer experience, and we consider those failed queries.
We can see the slow queries come from three specific partitions and one or two specific build IDs, and they all have a high number of results being returned, a high number of result groups. This helps me drill down and understand the common factors in the slow performance here, and how I might be able to stop this SLO burn from happening.
24:53
Let’s drill down even further and get a record of this customer’s queries using the Query Data API on this specific dataset. Now, as you can see, there are only maybe half a dozen customers here with early access to the Query Data API, but I could just as easily apply this across all Honeycomb queries being run across tens of thousands of datasets. It doesn’t really matter: Honeycomb can query and group by any number of arbitrarily high-cardinality fields, and query across arbitrarily many of them as well.
Let’s go ahead and look at the performance for this specific dataset, and let’s also zoom out, group by dataset ID, and have a look at all the datasets together, now that we’ve examined how recently this behavior started happening. Let’s use Honeycomb’s new Time Comparison feature, which allows us to understand how this customer’s queries, and all the queries against this API, have performed day over day and week over week. Is it that the customer is ramping up and this is normal behavior during the weekday? Or have they suddenly started sending us more queries than normal?
In this case, we can see that alongside the slow query performance, this customer is suddenly flooding us with a lot of queries, and we’re not getting cached results; we’re doing the work every time. We’re able to compare day over day and week over week to see what’s happening. Just to make sure it’s not just this one customer, let’s also have a look and see what’s going on with the other customers.
I’m going to note that there are a lot of other customers on this graph, and we may not necessarily want to look at all of them because there are a few customers here that have sent one query in the past week, and then they’ve gone away. They have not really sent us another query using the Query Data API. So what if I could declutter my graph and get rid of all those things that don’t necessarily matter in this particular investigation?
That’s where Honeycomb’s new HAVING clause feature is helpful. It lets us focus on the relevant time series by removing that clutter and showing only the groups that, for instance, either succeeded or ate into the error budget. I’m going to go in and set the HAVING clause to show only groups having a count greater than two. This is something you couldn’t do before: it lets you not just filter across individual events but filter across groupings of events. Doesn’t that graph look a lot cleaner?
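For anyone driving this through the Query Data API rather than the UI, a query with a HAVING-style clause might be specified roughly as in the sketch below. The field names reflect one reading of the query JSON format and should be verified against the API documentation; the column name is hypothetical.

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Group by dataset ID, but keep only groups whose count is greater than two.
	spec := map[string]any{
		"time_range":   86400, // the last 24 hours
		"breakdowns":   []string{"app.dataset_id"},
		"calculations": []map[string]any{{"op": "COUNT"}},
		"havings": []map[string]any{
			{"calculate_op": "COUNT", "op": ">", "value": 2},
		},
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out))
}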
Now we can see a couple of customers have been querying us and still seeing successes, but it’s only one customer that’s seeing consistently slow performance and sending consistently more queries. We can confirm this hypothesis by going into BubbleUp and highlighting a specific area that I want to look at, rather than trusting Honeycomb to only show me the queries that failed the service level indicator.
That enables me to do my own exploration and digging, where Honeycomb hints to me, with its machine smarts, which fields I might want to look at. But let’s also share what I know as a human. Let’s share that with the rest of the team by sharing the query, so teammates can see this behavior in their query history. By doing that, I’ve saved the results in a shared lab notebook so we can all look together and see what queries we’ve run and which have been annotated with titles.
Then I can go in or my teammate can go in, for instance, my teammate who’s on call, and they can see in team-saved queries what queries I’ve run and named. So this helps us, as a team, debug issues faster because we’re all able to understand what’s going on in production together.
But let’s go ahead and dive in a little bit more and have a look at an individual trace exemplifying the slow behavior. We can see here that this is a query that hasn’t just been buffering and sitting in a queue for two seconds. We’re actually spending two seconds of time in AWS Lambda as well as in the parent process of Retriever, which is our storage engine. So this user is not just hammering us with queries; the requests are taking a lot of time, and that time is spent chewing on the computation. So we should probably verify that the system as a whole is behaving well. We’re going to need to go back and have a look at the system holistically.
Let’s go ahead and have a look first at which hosts are most negatively impacted by this change. We can see here that some of the hosts are perfectly healthy, but there are three particular hosts seeing a high number of errors being returned and high latency. We could filter to that individual host name as well as being able to filter by dataset and so forth. But, in this case, I would like to show the metrics for my entire system together.
We can see the CPU of all the Retriever workers and the memory of all these Retriever workers, as well as a metric from AWS, the current number of concurrent Lambda executions. I can go ahead and do something like roll over an individual line to see the behavior of that individual host. I can also dive in and get an understanding of how that hostname’s behavior differs from the other hosts just by rolling over and comparing. But if I need to see things in more detail, I can, for instance, click on any of these graphs and see it blown up to full scale. And it’s not just Lambda executions; I could have picked any number of CloudWatch metrics to plot here that are automatically ingested into Honeycomb.
Let’s filter to just this individual host that’s behaving slowly. Let’s go ahead and apply that query filter, filtering specifically to that host to understand what its CPU utilization is. The filters and groups are reflected here, and that allows me to quickly understand everything that’s happening, all on one screen.
So overall we can see that since we deployed the shorter caching, each of these queries is running, and running on CPU, on the host and on the Lambdas. That means we need to do something to address this, because this one customer is really slamming us in production, and we didn’t anticipate or see this in pre-production.
So now is an opportunity for me to go back to my SLO and try to put a stop to this behavior before we burn through the entire error budget. This set of users is having a hard problem, so let’s go ahead and go into LaunchDarkly and flag this user off for now so that it stops impacting production. That way we’ll stop burning through our error budget, and we’ll be in a lot better shape. Looking at this change, you would think it would be a simple change, but it turned out to have large ramifications on my entire production system.
So that means I need to go ahead and make sure I’m mitigating the impact of my change, and that way I can take my time fixing the overall issue. Now that we’ve mitigated the issue by turning this user off in LaunchDarkly, we can go ahead and reflect on what we were able to do using Honeycomb.
Honeycomb enabled us to catch this issue before it burned through our entire error budget, and it allowed us to debug and understand what was happening: in this case, that the user’s queries were suddenly no longer being cached and therefore were being allowed to execute against the real live storage engine 60 times more often, once every 10 seconds instead of once every 10 minutes. That’s really part of the power of Honeycomb.
Honeycomb allows us to understand what’s happening in production, and it enables us to understand how those changes impact customer experience. We can make design decisions, like whether query cache duration should be a fixed setting or whether we need to make it adapt to the workload size of each customer. This is what it means to give developers fast tools to fit our needs so that both operations and development concerns can be understood from the same interface.
So, at Honeycomb, we’ve reimagined an experience of deploying to and debugging production.
And using these same approaches, your teams, too, can work together to release features quickly and reliably with Honeycomb. You can center around doing what you do best, building customer experiences that will delight your customers.
Thank you very much for your attention and enjoy the rest of hnycon.
33:30
Christine Yen:
Holy cow. Good thing we’re posting these for later so you can watch it at your own pace. Same Honeycomb bones, a new look. We have a lot to announce today. First, as you saw in the demo, we have a new query experience that’s available to all Honeycomb users starting today. For the long-time Honeycomb users out there, I hope you’re as excited as I am to see autocomplete for filter values and time-over-time comparisons. For the power users out there, you get a new HAVING clause. Being able to group by high-cardinality fields is great; being able to filter on the group output to surface outliers is better. And I’m personally thrilled about our dramatically improved keyboard friendliness, from a set of keyboard shortcuts to respecting undo and redo commands to all-around better keyboard-driven UX.
Second, at the core of Liz’s story, you saw her debugging a new feature we’ve been developing. Starting today, in open beta, we’re announcing our full Query Data API so that you can fold the ability to quickly and flexibly query over high-cardinality data into your workflows. Observability and Honeycomb are better with friends, so we can’t wait to see what creative solutions you all build with this API.
And, last but certainly not least, today we are announcing Honeycomb Metrics. Now, if you followed Honeycomb or Charity or Liz or me or anyone from Honeycomb for a while, you know that we’ve typically taken a very firm stance against using metrics for debugging. For a long time, we resisted building metric support into Honeycomb because when all you have is a metric and a dashboard, every engineering visibility problem seems solvable by adding another time series. And when it’s 10 times easier and 100 times cheaper to track infrastructure and CPU metrics rather than application logic and high cardinality business metadata, people choose the quick and easy rather than what’s hard and right, and, in doing so, limit themselves to the technical constraints of the past.
Five years ago, we focused Honeycomb on what was not yet possible: a fast, exploratory approach to engaging with rich events containing high-cardinality data, and connecting engineering concerns with what matters to the business. We think we’ve done pretty well at that so far because, at our core, Honeycomb is all about context, helping surface the right information in a graph, table, or visualization to help you, the human, understand whether a signal matters. Everything we build is meant to provide layers of meaning, not produce data silos.
What I love most about this approach is that we’re redefining what it means to have metrics data sitting alongside observability data. Observability is best suited to understanding how customers experience the code you write. It drives curiosity, and it gives developers and operations a common language to understand how business problems unfold in production. Fine-grained observability data is the best way to understand what’s happening in your applications. But not all engineering visibility problems are solvable by adding more fine-grained events. We know this. Sometimes the systems you care about won’t let you capture that level of fidelity.
And so aggregate measures like metrics help build a bridge between understanding what’s happening in your applications and understanding everything else that surrounds running those applications, like infrastructure or your runtime or anything else that constrains how well your code runs but doesn’t actually help you understand the behavior of the code itself. To focus on creating an unparalleled way to understand application behavior in production, we swung the pendulum hard in just one direction.
But we know system-level metrics still hold value. By making you go see those in another tool, we had been asking you to carry context in your head, to connect disparate parts of the whole on your own. So now it’s time for the pendulum between dev and ops to swing back. No more jumping between the tools and trying to piece together the whole picture from multiple sources of truth. Honeycomb can now be the source of truth about production and the center of gravity for your engineering team.
We have a bigger story to tell. I am sure you can’t wait to hear more about how it works. You can hear more about how we’re approaching the metrics world by attending two breakout sessions in the Honeycomb product track. At 9:00 a.m. Pacific, check out Alolita Sharma’s talk on building out OpenTelemetry. You can hear about how at Honeycomb we can use that work to get metrics into Honeycomb. Those are two deep-dive sessions you won’t want to miss if you’ve been waiting to get metrics alongside your observability data.
The customer experience is ultimately what matters to your users, your business, and your team, whether you’re dev or ops. And a wise ops engineer once said that nines don’t matter if your users are not happy. When your users are not happy, the best code in the world can’t figure out why. It takes the curiosity embedded in excellent engineering teams, aligned to serve customer needs. That’s the promise of observability that we’re here to deliver on.
Later today, we’ll hear from a range of different teams using Honeycomb, telling stories about how they’ve incorporated observability into their day-to-day. First, I would like to shine the spotlight on a very impressive leader, Rich Anakor, Chief Solutions Architect at Vanguard, because observability and leading-edge practices are not just for the unicorns of the tech world. Vanguard’s business is in a highly regulated industry where big change can be incredibly slow, but they’ve been able to use Honeycomb and OpenTelemetry to drive change fast across a very large enterprise team.
They’ve changed developer workflows and the relationships development teams have with production services, and they’ve done the human work to ensure that all of their engineers can understand the layers of complexity introduced as part of this modernization initiative. Here’s Rich to share more about Vanguard’s journey with Honeycomb.
40:54
Rich Anakor [Chief Solutions Architect|Vanguard]:
Hi, everyone. I’m Rich Anakor from Vanguard. Today I’m going to talk about Vanguard’s journey with OpenTelemetry and Honeycomb. I will talk about this in three ways: how we got started, where we are today, and where we’re headed. More importantly, I hope this serves as a template for organizations our size and for organizations in highly regulated industries, like the financial services industry.
To give some context: several years ago, Vanguard had this idea to move all of its workloads from our data centers to the public cloud. The transformation was supposed to happen sequentially. We would move from the data center to the private cloud, and from the private cloud we would go to the public cloud. What happened is we ended up in a state where we were running across these three environments simultaneously.
So we had services that had dependencies across all three. This layered in so much complexity for our support teams trying to understand what’s going on in our environments. I joined Vanguard about two years ago, and my job was really to help build instrumentation and build out the telemetry that would help our teams know what’s going on in the environment. We set goals, and those goals were about how we could support our applications through these layers of complexity and how we could know what’s going on in them. We needed an approach that would help us understand this modern production environment.
We knew that our current APM solution did not scale. It was not really bringing that engagement from our teams. We knew we had to solve this problem. So how did we get started?
The one thing I want to highlight about how we got started on this journey is that we started really small. Starting small is a really good technique that I think teams should learn from.
I come from financial services; I’ve been there for more than a decade. One thing that you see is that things like this require approvals and so much organizational involvement to really get an idea off the ground. In this case, that was not the case. There were only three of us: myself, an engineer on my team, and an engineer on one of our feature teams.
So when we came together, we knew the current approach did not scale. What can we use? What technology solutions are available out there that can tell us the patterns that are happening as calls are going from our data center to the private cloud to the public cloud and traversing back and forth? How can we see this? How can we help our teams reduce the mean time to recovery? That’s the goal we set.
So now what did we do? One of the technologies that was top of mind was distributed tracing; we knew we needed that approach. So we started looking at technologies out there, and OpenTracing came to mind. But we needed a backend to send this trace information to so we could begin interrogating our systems to see what’s going on. We looked at all the vendors, and Honeycomb became a partner that wanted to work on this journey with us.
We started really small, as I mentioned. We started with one of the services that had dependencies across these environments. How did we get started? It was a small and self organizing team. We started with the instrumentation. We were able to get early feedback. We were able to see what was happening.
But we did this initially with Beelines. One of the approaches we wanted was something that was vendor neutral. We didn’t want our engineers worrying about licensing, or what vendor agent we’re installing, and all that stuff.
We went with OpenTracing at the time. But OpenTracing didn’t give us the auto-instrumentation capability we were looking for. We knew about OpenTelemetry and the progress happening in that community. We decided to try it out. We brought in OpenTelemetry.
Honeycomb didn’t care. They said, Whatever you use, our backend can handle it.
45:20
We started with OpenTelemetry, and auto-instrumentation was one of the main drivers that really helped us. We were able to propagate context across application boundaries, across environments, and we were able to interrogate these systems and really understand what’s happening.
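As a generic illustration of what cross-boundary context propagation involves (this is not Vanguard’s code, and their services may well not be written in Go), here is a minimal sketch using the OpenTelemetry Go SDK to inject W3C trace context into an outgoing HTTP request. Auto-instrumentation libraries normally do exactly this for you.

package frontend

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// callDownstream injects the current trace context into the outgoing request's
// headers (the W3C traceparent header) so the downstream service, whether it
// lives in the data center, the private cloud, or the public cloud, can
// continue the same trace. Assumes a global propagator has been configured,
// for example otel.SetTextMapPropagator(propagation.TraceContext{}).
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}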
Let me tell you a bit about where we are now.
We have hundreds of teams now using OpenTelemetry and Honeycomb. We’re able to bring a different mentality in the way we are able to run and manage our production systems. We were able to really help our engineering teams. We’ve changed the culture.
One of the things I will highlight today is how we think about production systems. We often think about APM as something that’s only used in production. Or, we use it to fight fires and use it to respond to incidents.
With OpenTelemetry, we found that, yes, it’s actually very effective in doing that. But, also, it helps you do analysis. I’ll highlight two examples.
One main example that our teams discovered: there was a migration effort going on, and this team wanted to move some data to a new repository in the cloud. They wanted to know all the dependencies that were involved. They wanted to know all the user actions and how they mapped back to these backend stored procedures.
They’d been going for months with spreadsheets, looking at code, involving really smart people, engaging, and really trying hard to solve this problem. But they could not. Because this was on-prem and considered a legacy application, we did not think we could help. This application had dependencies with other workloads in our private cloud. But, we said okay. Let’s try this out.
With OpenTelemetry and Honeycomb, they were able to answer these questions within minutes. Minutes! So that was key. It just showed our teams that this is beyond just responding to an incident. You can actually understand how your systems are behaving.
Another important thing that I want to highlight today is when it comes to measuring what’s happening in your production systems, really measuring what matters. So I will talk about Service Level Objectives and the impact this had in the way that we manage our production systems.
One example I would like to highlight today is a testimony from one of our teams. They had an SLO defined for a critical service, and they got notified through a burn alert that they had to respond; if they didn’t respond within 30 minutes, there would be a customer-impacting issue. They were able to respond, figure out what the issue was, and remediate it before it became customer impacting.
These testimonials have energized our teams. We have a mandate, as we speak, that any application in our environment, any new service that’s being built, must be instrumented with OpenTelemetry and reporting traces to Honeycomb.
These have really changed the way we do business. It’s changed the way our engineers work. It’s changed the engagement. It’s made them more productive.
So now, where are we headed with this? As we make these mandates, I think it’s important to connect the journey, so we’ve set some goals. I will talk about our year-end goals. One of them is that we want to move 100% of our applications to Honeycomb using OpenTelemetry. That’s a difficult goal, but one that I’m confident, with the level of engagement we have from our teams, we’ll reach by year end.
One of the bits of culture we also want to adopt is to really drive down our mean time to resolve. How do you do that? By knowing (1) when there’s a problem, (2) where the problem is, and (3) how to solve it. While OpenTelemetry is not something that directly resolves the issue, it gives you that power, and using Honeycomb allows you to slice and dice this data any way you need to see it. That’s the goal we have: we want to reduce our MTTR significantly.
So this is a really, really powerful thing that has driven value and every stakeholder on our teams has come onboard. We no longer have to convince everybody that this is a good thing. Everyone wants to join the movement.
One thing I want to leave you with, I want to leave you with this thought. For organizations like ours—I keep saying that because that’s where I’ve spent most of my time with companies like Vanguard—this may seem like a difficult journey to embark on. But it’s important to start small. It’s extremely important to celebrate the small wins. And, above all, engage the right people early.
That’s what we did, these three things. The success really comes with the result.
People are going to get engaged when they see value. There are always questions that people have about their systems, but they cannot answer them. When you give them the avenue to answer these questions, it becomes extremely powerful. That movement is something I’ve seen at Vanguard, and we’re excited about where we’re going, and we’re excited about learning more and sharing with the community as we make progress.
All I’m leaving you with today is to really, really think about your current situation. Think about the challenges you have with your systems. Think about what the questions are that you want to ask about your systems that you don’t have answers to. Maybe tracing is a good way to look at it. Maybe OpenTelemetry can help you. Maybe Honeycomb can help you.
This is the Vanguard story, and I hope to continue to tell this story as we make more progress. If you have additional questions, my contact information is available. Feel free to reach out to me directly and ask about our journey. I will be very happy to share whatever information we have.
All right. Thank you so much for listening to me, and I hope you have a great day. Thank you.
51:26
Christine Yen:
I’ve been talking about her all throughout this keynote. Now I’m delighted to be joined by Charity for some reactions to that story. Charity, what did you think about how Vanguard is using Honeycomb to change developer workflows and their understanding of what’s happening in production?
Charity Majors [CTO & Co-founder|Honeycomb]:
It’s pretty exciting to see people building careers on making their lives better and the lives of their co-workers better. This is what we dreamed about. It’s astonishing to see. I’ve never worked at a company like Vanguard, and here they are making this quantum leap.
Christine Yen:
I like the idea of starting small. It’s tempting to want a huge initiative, to do a big thing and do it all at once.
Charity Majors:
Yeah. You do have to have the buy-in, I think, but we’re engineers. We need to be shown and proven to. It’s kind of like the old Google saying that you can only plan for something that’s an order of magnitude bigger than what you have now. Like, you can’t plan for the whole thing. You can have a hazy idea of the outcomes you want to achieve, but you have to start small and celebrate the small wins. What I really thought was interesting was when he said, Engage the right people early. What do you think that means?
Christine Yen:
I mean, I think it’s what we talked about early on. You have to find the people currently feeling the pain and fighting the fires, but there’s also that thing of, Hey, you build it; you own it. What does it look like? It’s not scary.
Charity Majors:
I think anyone who feels strong ownership over the systems, they have to get onboard. Sometimes that’s the engineers and execs. Often, it’s the most senior engineers, but people who care deeply about their work, you know, they don’t want to just come to work and do a checklist, right? They want to make things better. So you need to show them it’s better.
I like the range between experiment to mandate. At some point, you have to have a mandate, right? But you have to prove yourself through the experiment so people want to jump onboard with you and come with you wherever you’re going.
Christine Yen:
There’s a real thread of, like, reducing fear through this, right? Hey, look, it’s an experiment. We’ll try this out. Hey, we’ve got a burn alert. Things are not bad yet.
Charity Majors:
Because people have a system that works. It’s good enough. You’re making money. Your salary is safe. It’s good enough. Who wants to rock that cradle? You hear it can be better, but can it? You’re taking a lot on faith. That’s why I think it’s so smart to show your work as you go.
Christine Yen:
Yeah. That burn alert story is, again, what a win.
Charity Majors:
Yeah.
Christine Yen:
What a win to get something, fix something.
Charity Majors:
Before your users are upset. SLOs are, like, precognition when they’re done well. What’s the movie, the one with the precogs?
Christine Yen:
Minority Report.
Charity Majors:
That’s the one. You know people are going to be unhappy, and you’ve got a time window to affect the outcome. It’s so great because, you know, if you’ve adopted the SLO story, then you’ve also set down the burden of alerting on every single spike and ding in CPU and all the symptoms, which is what burns people out. This is why people don’t want to be on call for their systems. It’s not because they don’t want ownership. It’s the long history of masochism: we’re going to fall on this grenade; anytime anything wiggles, wake me up. That doesn’t work anymore when you have this giant, massive, global system.
Christine Yen:
Yeah. I really enjoyed the kind of growing drumbeat of the human factors conversation. People were talking about it in 2016, but with the rise of SRE people, it’s all about burnout and fatigue and how we avoid that, and how we use the tech that we have.
Charity Majors:
Using observability to shine the light on the dusty corners of the system that are not working well and that are tripping everyone up. I really liked that. I also liked, and of course I would because they’re talking about partnering with us, but I like the admission that you don’t have to have all this expertise in-house when you start out. You can partner with people like Honeycomb who have fought this battle before, who are willing to jump into the trenches with you. It’s not that we have all the answers, but we see part of the story and you see part of the story, and together, we can figure it out.
Christine Yen:
There’s something really special about our industry that’s so based on…
Charity Majors:
Apprenticeship.
Christine Yen:
Shared knowledge, helping each other and learning from each other’s successes and failures. I’m grateful and humbled and determined to keep…
Charity Majors:
Not everyone can have Liz Fong-Jones on staff, but you can rent them, right? And it’s one thing to solve a problem once. I wrote that post about being a senior engineer, the difference between knowing your system and being a senior engineer. I didn’t put that very well, but one is knowing your specific system, and the next level is being able to extrapolate, to repeat that, and to bring the entire industry up one system, one service, one team at a time.
Christine Yen:
And with that, we end today’s keynote and start the sessions. There are two tracks you should check out. One is sharing practical lessons by customers on their journey. The other is about different mysteries customers have solved with Honeycomb and how they’ve been able to use the product’s various capabilities to better understand their production environments. There’s also a Honeycomb product track that includes not just these deep dives on metrics but also shares best practices for incident response and covers our strategy with OpenTelemetry and instrumentation.
If you have any questions or comments, or want to find out which sessions are happening when, check out our website or ask in our Pollinators Slack in the o11ycon hnycon channel. Thank you for joining us on this inaugural hnycon day. We’re all here to build great customer experiences, and we at Honeycomb look forward to continuing to partner with you on your observability journey. Please enjoy the rest of your day here.