The Business Value of Observability
This panel moves beyond the technical aspects of observability to examine the business case for adopting this new way of working. Observability can help teams lower their MTTR, lower change failure rates, and speed up time to delivery in production. How can you make a case to business stakeholders to prioritize observability adoption initiatives? Which outcomes should you expect and what practical hurdles will you hit along the way?
Transcript
George Miranda [Senior Director of Product Marketing|Honeycomb]:
Let’s introduce the panelists for this session. First, we have Bryan Liles. He’s a fixture in the cloud native ecosystem. He’s a Principal Engineer from VMware where he pushes folks to be a better version of themselves and he does talks like this. He’s always got great things to say. Next, it is my pleasure to introduce Dr. Nicole Forsgren. She is the VP of Research and Strategy at GitHub. Previously she was the lead author and investigator for the DevOps Reports. I’ve been lucky enough to work with her in the past and a lot of what we’re going to be talking about today is based on her research. And lastly, rounding out this all-star panel is James Governor, Analyst and Co-founder of RedMonk. James is all about focusing on the developer experience. He’s always outspoken. He’s got sharp insights. And so I have no idea how I’m going to keep up with this panel.
But let’s welcome everybody to the stage. Hi, everyone. Good morning. Thank you for joining us. I’m excited to get started. We have a lot to talk about today. Let’s do this. I’ll start here. Coming out of the o11ycon keynote with Charity and Nora, Charity mentioned the metrics and the value of shipping features fast. So I want to start there. Specifically with the DORA metrics. I want to go around and get an idea from the panel what impacts have you personally seen those metrics and the way of looking at our practices, what impact have you seen those have on organizations you worked with? I’m going to pick on Nicole last for obvious reasons. So maybe, Bryan, how have they worked?
Bryan Liles [Principal Engineer|VMware]:
Well, this is a really interesting question at this time in my life, where I’ve moved and I work in telco now. People are thinking of data centers and lots of hosts. I think in terms of telephone poles, and in regions there can be tens of thousands of telephone poles. So think of huge distributed data centers with software that now, let’s say, we upgrade every two years.
When I think about DORA, and I took a note because I wanted to say this correctly: with DORA and DevOps, it’s deployment frequency, mean time to recovery, change failure rate, and lead time for changes. I think that these are great metrics for really anywhere, but I think the application of these metrics, especially in my domain, is a little bit different than the advice that I’m hearing right now from the status quo.
I will say this. The DORA metrics are a great base to start from, but with any advice we need to take into account where we want to end up and who our customers are. We want to apply that even to some of the great things that Charity Majors says. We can’t apply those things directly just because there’s some real physics that are stopping us. That’s just my idea there.
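For readers who want to see how the four metrics Bryan lists could be computed, here is a minimal sketch. It assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; these field names, and the choice to measure time to restore from the moment of deployment, are illustrative assumptions and not definitions taken from the DORA reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a failure in production?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_summary(deploys: list[Deployment], window_days: float) -> dict:
    """Summarize the four metrics over a window of `window_days` days."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    failures = [d for d in deploys if d.failed]
    # Assumption: a failure starts at deploy time and ends at restored_at.
    restores = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_hours": (
            sum(restores) / len(restores) / 3600 if restores else None
        ),
    }


if __name__ == "__main__":
    start = datetime(2021, 6, 1, 9, 0)
    sample = [
        Deployment(start, start + timedelta(hours=3)),
        Deployment(start, start + timedelta(hours=5), failed=True,
                   restored_at=start + timedelta(hours=6)),
    ]
    print(dora_summary(sample, window_days=1))
```

A team that upgrades every two years, as in the telco example, would feed the same function a much longer window. The calculation does not change, only the numbers that count as good in that context.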
George Miranda:
That’s an interesting lens. I love that. It’s like the practices might be sort of aligned correctly, but the specific measures might need to change depending on application or industry. I don’t know, James. What’s your take on that? You see a lot of different companies in different settings.
James Governor [Analyst & Co-founder|RedMonk]:
Yeah. I think it’s a great question. And they’re all, I mean, in a way, it’s either an icebreaker or it’s a, please read the book. I speak to a ton of different people from different industries. And I’m going out there and what are the proxies for where they are on their journey? You know, if you’re in a group where everyone is like, I know about the DORA metrics and I read Accelerate, which everybody must read, then you get a sense of where they are on the journey, what they’re trying to kind of achieve. Are they ready for this change?
You know, sometimes you talk to an organization and you’re in a perspective where they’re, like to Bryan’s point, maybe they’re like, well, we want to change things twice a year. And you’re, like, okay. Maybe baby steps are required. What I’ve seen is organizations that are thinking about the work that, frankly, Nicole has done that are thinking about, you know, what it means to deploy more frequently, what it means to improve your mean time to recovery. What it means to do these things. You know they’re on the journey. It’s to my mind, it’s a bit like, let’s have this conversation. That is a great idea. And, you know, they may or may not. For me it’s a, are they on a journey or not? So that’s kind of what it means. Are they moving forward to more effective software delivery or are they, you know, maybe not so comfortable with that?
George Miranda:
Nicole, I’m dying to hear your take. What do you make of that?
6:05
Dr. Nicole Forsgren [VP Research & Strategy|GitHub]:
I love this whole conversation. This really is the right approach to take. Because it’s… so I’ve worked with and chatted with so many teams. And by teams, I mean developers and SREs, and admins. I’ve worked with leadership teams and I’ve worked with teams across a gamut of industries. I have worked with internet companies. I’ve worked with telecom. I’ve worked with so many folks.
And I love James’s point that it’s about the journey. It’s about improving. And I really, really love what Bryan said; right? It’s about taking the measure and applying it to your context and sitting there. I think I get a little, I’m getting old. I’m getting old and twitchy. I’m getting very get off my lawn. When people call me saying you need to do a DevOps report because you did it wrong that I’m not in your chart. You are, but you’re just a low performer and it’s fine.
All that means is either we need to improve or it means we need to think about how we can apply it. Because this has been done in printed media. This has been done in silicon chips. This has been done in things. And what it might mean is we need to think about how we can apply it contextually and smartly within our context.
I mean, I actually had an orthopedic surgeon call me and say he read the book. This does not make any sense, but there were large portions of the book that really made me rethink what this meant within the hospital and what this meant within my operating room. And I hadn’t thought about it this way. And I was like, I’m sorry, what? But what it does is makes us rethink the opportunity and the possibility to apply some of these concepts within our domain. And within that, I really, really loved Bryan what you said about what does this mean within some things, it’s just physics. It’s not going to be real. But if you read this and your reflex is, no, this isn’t for me and you toss it. Maybe not. Maybe what this means is I need to look for an opportunity to look for ways to improve. What does this mean within my context? What does this mean for a continuous improvement journey? Because really the goal is improving outcomes for our customers. And outcomes for our developers and our development teams to have better ways of working.
George Miranda:
I love that view of looking at everything in context. One of the things I love about the research you’ve done, Nicole, is year over year there are additional practices that are identified. How we improve, is that continuous journey. And it is all about finding the right context for the folks that are participating, are letting us know what practices look like in their organization.
And so to that end, I want to switch the focus a little bit over to observability. And where does that fit into what folks are doing? So where does observability fit with either, you know, high performing teams or in the telco industry or across the industry? Where does that fit within the spectrum of practices?
James Governor:
I’m going to jump in, because, you know, yesterday we had a great example. Fastly fell over, you know. A huge amount of sites then fell over. There was a bunch of content we couldn’t see. But that team got together and was able to get things up and running again in under an hour. And it was extremely impressive. And pretty clearly, you know, they as an operations team, as an engineering team have a great set of practices, have a great set of people, great set of processes and tools where they are able to identify problems that are causing problems for their customers and deal with them.
I think that for me there were a few routes you can take. Some people are purely thinking about velocity. And, you know, it might be for them where are we on the CI/CD journey? You know, we’re doing testing. How comfortable are we with automating those deployments? It might be another lens around, you know, just frankly developer philosophy. How do we do that? They might be coming in from a perspective of, oh. We want to be able to decouple deployment and think about progressive delivery. We want to be able to deploy to different cohorts.
For me, observability is going to be key to really succeeding in any of those dimensions. Good instrumentation, understanding the system. So yeah, observability. I like to talk about, are you comfortable shipping code on a Friday? And on a Friday afternoon. And a lot of organizations hear that and they’re like, there’s no way I want to deploy my new digital service on a Friday afternoon. If it goes wrong, the rest of the weekend everything’s going to be terrible.
You know, you ask Charity Majors, and she’s going to think, look. Not deploying on a Friday afternoon, that’s like a bad smell. That’s like you need to do a bunch of things. Be confident that you can ship at any time. And so for me, I think observability is one of the conversations you have. It’s one of the underpinnings for production excellence, which really touches a number of different dimensions.
So yeah, I think there are different ways you can look at the journey. If we think about it, you all are now doing the observability maturity model, thinking about that. And to my mind, that builds on the work of folks like Nicole, that builds on the fact that there are lots of routes to improving how we build and deploy software. But yeah. Good troubleshooting tools are not optional.
12:50
Bryan Liles:
I have to comment. Years back, we talked about monitoring. Shoutout to people who knew what NetSaint is. But then, shoutout to Charity again for helping popularize this whole idea around observability. You know, lots of people were talking about it over watching what Charity’s been doing the last few years. That’s a lot of the thoughts are around that. But you know what? All this is garbage. We all have something to sell. And generally what we’re trying to sell is what we lead with. Nicole is out here. She’s selling books. Her book is on the shelf behind me right this second.
I’m trying to break this down further. Observability. What is that? We want to have a system that is observable so we could take action and make changes on it. At the business level, we think about SLAs and SLOs, and SLIs. We think only in that realm. When we think about it, we too quickly hop to metrics, tracing, and logs. And I’m like, you know what? That’s just how the systems are. That’s just the limitation of our thinking. And that’s why I’m still pushing people up to SLAs. I have an agreement with somebody else that I’m going to deliver something, then I use my objectives to determine whether I can, and then the indicators below that to make sure that tactically I’m doing the right thing.
You can take from business metrics at the executive level down to the directors down to the engineers or people writing code. Then down to your operations people and SRE teams. It all translates differently and it’s all different code. But at the end of the day, it’s SLA, SLO, SLI. The reason I think that way, it allows us to drive forward without getting lost in the mess of Honeycomb and then whatever other product you want to do. Or, you know, OpenTelemetry. Or I work at VMware. Well, no. What we’re really trying to do is agreements, objectives, and indicators. And everything else is noise. Necessary noise, but let’s focus on what’s important.
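The agreement, objective, and indicator layering Bryan describes can be sketched in a few lines. This is an illustrative example only: the `ServiceTargets` type, the function names, and the 99.5% and 99.9% targets are assumptions made for the sketch, not figures from the panel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTargets:
    """Hypothetical targets, expressed as a fraction of successful requests."""
    sla: float  # agreement: what was promised to the customer
    slo: float  # objective: the stricter internal goal that protects the SLA


def availability_sli(good_events: int, total_events: int) -> float:
    """Indicator: the measured fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the SLO's error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def status(targets: ServiceTargets, good_events: int, total_events: int) -> str:
    """Translate the indicator back up into objective and agreement terms."""
    sli = availability_sli(good_events, total_events)
    if sli < targets.sla:
        return "SLA breached"
    if sli < targets.slo:
        return "SLO missed, SLA still met"
    return "within SLO"


if __name__ == "__main__":
    targets = ServiceTargets(sla=0.995, slo=0.999)
    print(status(targets, good_events=999_750, total_events=1_000_000))
    print(error_budget_remaining(targets.slo, 999_750, 1_000_000))
```

Metrics, traces, and logs never appear in the sketch. They are how `good_events` gets counted, which is the sense in which Bryan calls them necessary noise.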
George Miranda:
Yeah, I think that’s a really good point. It is about, I guess, those indicators that show that you are delivering a better experience; right? And to your point, it’s not about data types. It is what are those capabilities that are unlocked. What are those capabilities that help me understand those systems so I can deliver those better outcomes? And I know, Nicole, observability. Where does it fit in the DORA metrics approach?
15:47
Dr. Nicole Forsgren:
I really love Bryan’s point there. It depends on how you define it. When we ended up investigating this, I want to say it was 2018. In order to try to study something, like, figure out if it makes a difference, at least in a bunch of the DORA work, or times I could have studied things, I steered clear of any tool. What we do is focus on the capabilities it ends up delivering.
When we studied this, we found out that observability and monitoring loaded together. People kind of conceptualized them similarly, but it’s probably because they do similar functions. But they were both predictive of, you know, this high-performance concept or software development performance. Our ability to ship software with both speed and stability. But what we ended up, how we ended up defining these, and by the way, we talk about it in the report. We think if we focused on folks who specialize in this, it would probably tease apart into separate things.
The way we defined them was that monitoring is probably a set, not probably. We defined monitoring as a set of pre-defined metrics. So things like logs. Things like, back in my day. Things that we were used to having set up. Logs, metrics. Things that are already coming in. Things are setting up in a database. And then observability as a way to… it’s like a set of unknown unknowns. It’s almost like this version of debugging. If it’s not already there, but I know there’s a problem.
Based on my definition or a definition of a handful of people in the observability space, I think that that is something unique and important, and interesting in our ability. To Bryan’s point, deliver those SLAs and then meet those SLOs and then think about at least a couple of those SLIs maybe. Or if we’re not quite meeting the SLIs, how do I think about focusing on our users. And I peeked in the chat. I will answer that better. But how do I pay attention to the right MTTRs; right?
If one thing has gone down, it’s not affecting users. But if it’s gone down but my logs can’t answer it yet; right? That’s monitoring. It can’t tell me what’s wrong. I think this is the unique case where observability steps in and it gives me this superpower to kind of link through and debug and explore and dig through my system in ways that I might not have pre-built into the systems because either it’s intensive, expensive. Or it just didn’t even occur to me before because I haven’t had that failure before. It really wasn’t there.
And I think this is kind of that combination where we want to have both monitoring and observability and how that helps us contribute to serving our users, serving our team, serving our customers.
James Governor:
Yeah, I want to jump in a bit and agree with Bryan a whole lot more. I think sometimes we get confused about, like, yeah. Okay. Agreements, objectives, indicators. I think a great example, my business partner Stephen, he used to work as an SI and he goes to one of his clients and they’re like, we want this, we’re going to do that, we need to work out how much this is going to cost us to do 24/7/365 support. And, you know, we think that’s a good goal. We’re going to do that as a business.
And he was like, okay. Let’s talk about that. And they’re like, no. It’s really important. We need to do this. And he was like, well, who’s using the system? And they said, oh, that’s our customer service operatives. He’s like, oh. And what are their working hours? And they’re like, well, they work from 9:00 until 6:00. And he was like, yeah, okay. So you’re talking about a system to support 24-hour operations and yet you only actually need to be working for, like, eight hours a day. And in terms of U.S. time zones, maybe a little bit longer than that. But really thinking about, the business needs to think about what’s the agreement. And then that’s what you optimize for.
I think that’s one of the things. You know, it’s easy to get confused about, the thing is if you are optimizing for 24/7/365, that costs a lot. And you have to understand why you’re investing in that. So I think the agreement objective indicator point that Bryan makes is honestly right. And sometimes the business doesn’t understand it. That’s one of the things we’re trying to do here. Have a better conversation between the technology, the people running the technology, and then the business.
Dr. Nicole Forsgren:
We also need to have better conversations with the business people about the technology costs. Because sometimes they’ll walk in and just say oh, I need five nines. And I’m like, but do you? Do you understand five nines versus four nines or four nines versus three nines, because there’s a big delta there. Do you really need this and do you understand?
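The delta Nicole mentions is easy to make concrete. Assuming a 365-day year, each extra nine cuts the allowed downtime by a factor of ten: roughly 8.8 hours per year at three nines, about 53 minutes at four, and a little over 5 minutes at five. The sketch below shows the arithmetic; the function name and its parameters are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(nines: int, window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget for an availability target made of `nines` nines.

    nines=3 means 99.9%, nines=4 means 99.99%, and so on. `window_minutes`
    is the period the target applies to; it defaults to a full year of
    round-the-clock operation.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # e.g. 0.001 for three nines
    return window_minutes * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Passing a smaller `window_minutes`, for example only the hours when customer service staff are working, gives the budget for a system that does not need round-the-clock availability, as in James’s story.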
21:20
George Miranda:
Let’s talk about that, do you really need this. I want to ask a more brass-tacks level question. We’ve talked a lot about the right indicators and the outcomes that we’re going for. But I want to zoom back a little bit and think about our audience a bit; right? If I am an engineer that owns my code in production, how do I pitch some of these things to business stakeholders that might be demanding things like, I want those five nines?
Dr. Nicole Forsgren:
Turn it into money.
George Miranda:
In a start-up, maybe they’re three or four levels away from me. Seven or eight levels or more. As an engineer working on a team, what can I do to help align that vision for what we can actually deliver and what my stakeholders expect?
Dr. Nicole Forsgren:
I turn everything into money and risk.
Bryan Liles:
Yeah. Money and risk. Well, really, this is a life lesson. Learn how to convert everything you’re doing into business value. You know what? I don’t care what language you use. I have 300 or so odd engineers in my immediate purview and access to a thousand more. I don’t care what language you use. I write Go and Java and Typescript. But I don’t talk about those. I talk about the integration between systems. I talk about the outcomes of the systems. I talk about business impact.
And you know what? If you want to convince people who have purse strings, don’t just talk about you will make more money if you do this. No. Say I’ve reviewed all the scenarios. And if we do this over this one, this will allow our customers to have this much more success over time. And not just for the next quarter, but over the next few years. That’s what people want to hear. That’s what MBA types want to hear. That’s what people with money want to hear. But we, as tech people, don’t think that. We think I’m smart. I’m going to use quarter words and people are going to be awed. Surprise. They’re not.
Dr. Nicole Forsgren:
The quarter word. Clear, concise, easy language is much better. And you can tie it back to a recent event, that’s helpful too. If you just had an outage, this will avoid something like that.
James Governor:
Yeah. Lots of conversations.
Dr. Nicole Forsgren:
Something good like this.
James Governor:
Like I say, we had that big outage yesterday, had that conversation. I think Spotify does this really well. Because everything in engineering is from the perspective of FTEs. So how many full-time employees and engineers are required? If you’re going to invest in technology, how many people are you saving? If you’re going to invest, there has to be a reason for doing that. And the reason for doing that is you can have more people working on new problems.
And so everything for them and this is true of, you know, the engineers. They’re driving cost management of their cloud spend into the engineers thinking about this. The engineers see it, like, how many engineers can we gain the business if we make this change? And I think that’s again, that’s, like, people are money. In technology today, I think everyone understands that, well, no. Not everyone understands it. Everyone should understand that talent is the most important thing. And so your ability to hire is super important. If you could say, hey. You know, we’re going to save five people over the course of X amount of time so we can invest in them as a business, that’s a great metric. I think it’s really smart.
George Miranda:
Well, I love it. With that, Bryan, Nicole, James. I want to thank you for participating today. We’re almost up on time, so thank you. And I want to remind our audience that you still have time to squeeze in a few last questions. Go to the o11ycon panel track Slack channel. We’re going to drop in there for additional Q&A. And we’ll see you there.