Driving Observability Adoption Forward as an Internal Champion... During a Pandemic
Josh from Amperity covers how they planned, strategized, and organized their observability tooling evaluation. Josh also shares how the COVID-19 pandemic impacted that evaluation. He covers the engineering outcomes they were aiming for, given Amperity’s engineering structure, and the areas of focus that ultimately helped set them up for success.
Transcript
Josh Parsons [Senior Site Reliability Engineer|Amperity]:
Hi, everyone, my name is Josh Parsons. My pronouns are he/him. Today I’m going to talk about driving observability adoption forward as an internal champion of observability. I would like to share our experience about why we evaluated new observability tooling and talk about what I think helped us and set us up for success when we did most of our evaluation during the COVID-19 pandemic. Before I get into the areas of focus that I felt were important during our evaluation, I want to give a bit of background about why we wanted to evaluate observability tooling to begin with. At Amperity, we employ a distributed on-call model where each team has its own on-call rotation and ownership within production. Within that model, I recognized that we were asking our on-call engineers to become proficient and knowledgeable about a lot of different tools, patterns, and interfaces to be able to do their job: dashboards of metrics, third-party systems with live log output, and tools for querying unstructured logs, among other tools. You know, two of the three pillars of observability.
We started to introduce and teach observability-centric concepts to help bolster the experience of emergent issue investigation and curious exploration. We needed something to help capture rich context in structured events in arbitrarily wide event maps.
We didn’t feel quite ready to adopt OTel, the OpenTelemetry pattern, so to facilitate that process of collecting rich and wide context, we rolled our own integration library. For this to work for our evaluation, it was important to ensure that the patterns for collecting context and observing code would be as simple, natural, and understandable as inserting a log output line in code. The library had to be responsible for tracking trace and span ID relationships on behalf of the user, and adding or merging new context into event maps had to be simple.
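To make that pattern concrete, here is a minimal, hypothetical sketch of what an integration library along those lines could look like. This is not Amperity’s actual library; the Span class, field names, and JSON output below are assumptions for illustration. The point is that trace and span ID bookkeeping is handled for the caller, and adding context to a wide event map reads about as naturally as adding a log line.

```python
# Hypothetical sketch only: not Amperity's actual integration library.
# It illustrates the pattern described above: trace/span ID relationships are
# tracked on behalf of the caller, and merging context into a wide event map
# is as easy as adding a log line.
import contextvars
import json
import time
import uuid

_current_span = contextvars.ContextVar("current_span", default=None)


class Span:
    """One unit of work that emits a single wide, structured event."""

    def __init__(self, name):
        parent = _current_span.get()
        self.name = name
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.parent_id = parent.span_id if parent else None
        self.span_id = uuid.uuid4().hex
        self.fields = {}
        self._start = time.monotonic()

    def add(self, **fields):
        # Merge arbitrary key/value context into this span's event map.
        self.fields.update(fields)

    def __enter__(self):
        self._token = _current_span.set(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        _current_span.reset(self._token)
        event = {
            "name": self.name,
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            "parent_id": self.parent_id,
            "duration_ms": round((time.monotonic() - self._start) * 1000, 2),
            "error": repr(exc) if exc else None,
            **self.fields,
        }
        print(json.dumps(event))  # in practice, ship this to the event backend


# Usage reads much like inserting log statements into existing code.
def handle_request(customer_id):
    with Span("handle_request") as span:
        span.add(customer_id=customer_id, region="us-west-2")
        with Span("load_profile") as child:
            child.add(cache_hit=False, rows=42)
```

The context-variable approach shown here is one way to propagate the active span without threading it through every function signature, which keeps call sites close to the “just add a log line” experience described above.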
With this first version of our library ready to use, it was then important to define how we wanted to improve engineering outcomes. When I constructed our plan for how we were going to evaluate our tooling, regardless of the combination of tools we settled on, we had to focus the evaluation on observability’s effect on a few key areas of engineering experience. Are we further empowering engineers with service ownership in production to see and understand how their services behave in live environments? Are we improving our collective confidence about what we deploy to production? Are we improving our mean time to detection and mean time to recovery, particularly around the ugly and nefarious types of issues that can come up within a distributed services model?
I constructed a few use cases tailored to our situation that sought to measure the impact of our tools on those outcomes. I needed to focus on tangible engineering experiences in order to do a proper comparative analysis. It was important for me to hold open the possibility that the status quo was a valid outcome. Maybe the tools I selected to evaluate wouldn’t improve the outcomes at all. It was important to be open and receptive to the status quo as an outcome and still learn from the evaluation experience. The use cases also let me do both qualitative and quantitative analysis. Not just, am I reducing mean time to detection and mean time to recovery, on the quantitative side, but also questions like: when you’re investigating an issue you get paged on, is this tool frustrating to use? Do you feel like using this tool helps you better understand your service? And so on.
4:43
Around the time we began the evaluation process, the effects of the COVID-19 pandemic fundamentally changed how we came to do work. Looking back at our tooling evaluation, there were four areas of focus that I anchored to that I think set us up for a successful evaluation. The four areas were learn, engage, instrument, and measure. For learn, it was important for me to maintain an attitude of learning and growing. There’s a lot of wisdom and experience in this field, and it was important for me to give proper credit to those who have been sharing these ideas for a long time and to continue to grow and expand my own knowledge in this field of observability.
For engage, it was vitally important for me not only to find engineers across our adjacent engineering teams to be key stakeholders for this process, but also to do the work to keep them engaged and involved in it. Because we were all working remotely, I couldn’t depend on hallway conversations, in-person whiteboard and brainstorming sessions, or any of that. Given our remote work arrangement, I had to stay on top of keeping the key stakeholders involved in the process; otherwise, I risked carrying the bulk of this evaluation on my own, which I don’t think would have led to a successful outcome.
After we connected the integration library across our product code, I helped coordinate training sessions with the product teams to make sure they understood how to use each of the potential new tools. I worked with key stakeholders and shared observations I was making from their event telemetry inside the tools. I made myself available to answer their questions and to walk them through any observability concepts they were curious about. For instrument, you’ll recall that I said one of the goals of our observability library was to provide simple patterns to follow. This was the time to ask the key stakeholders about their experience using those patterns. It was also important to find opportunities for instrumentation on challenging problems. I tried to pay close attention to tickets and pages that would get opened that a key stakeholder would be asked to solve. If I thought applying instrumentation patterns to a problem could improve how future occurrences of it were handled, I would talk to the key stakeholder about exploring those possibilities.
For measure, in order to justify the recommendations we were making, I had to collect informal feedback and conduct surveys. I sought out examples of key stakeholders using the tools on their own and noted what they had to say about it. We ended up getting wonderful feedback from our key stakeholders, and I think it helped me when it came time to share my findings with decision-makers at Amperity. These areas of focus are applicable whether there’s a pandemic or not, but I think the impact of the pandemic elevated the emphasis I needed to place on these areas in order to succeed.
As a result of our evaluation process, we did choose to bring on a new tool, and I think the evaluation process and focus on these areas helped us make a well-reasoned decision that benefited us all and moved us in the direction of better and happier engineering outcomes, despite the remote-only work circumstances. And that brings me to the end of my talk. Thank you all very much for your attention today. I’ll be more than happy to share more detail about our experience and to take your questions.
Corey Quinn [Cloud Economist|The Duckbill Group]:
Josh, thank you very much for giving that talk. A common observation has been that COVID-19 has done more to advance your company’s digital transformation than your last five CIOs combined. And if we take a look across the landscape, that is directionally correct. My question for you: how much of these initiatives (and I don’t want to use the word learning, so let’s pretend I didn’t) would have been possible without the forcing function of, surprise, we’re all suddenly very remote and had no time to prep for it?
Josh Parsons:
I think that, as I said in the last slide, the engagement part was so important regardless. All four of those points were important; they would have been important either way. But engage, in particular, and doing the legwork to keep people involved in the process, keeping adjacent engineers involved in the process, was so important because I couldn’t shoulder that burden by myself. Whether you’re face-to-face, having the hallway conversations, or not, I think I still would have emphasized that engagement process and really building the case by getting a coalition of people who understand the outcomes we were driving toward.
10:15
Corey Quinn:
Ben asks, what was the timeframe for this experiment? In other words, many of these changes take time to sink in. Were you able to measure a change in the timeframe that you had available? And again, if not, please don’t make us go back on lockdown so you have more time to evaluate the results of this.
Josh Parsons:
For context, before the tooling evaluation process, we as an engineering team had dedicated a month toward increasing velocity, and observability was one of the pieces highlighted as a place where we could improve. That was in October of 2019. I took that time to write out rationale documents and make the case for here is how we want to do observability at Amperity. Write it down. Talk about the goals and the outcomes that you want to have and share them; I had to share them with adjacent engineering teams. That took a few months, two or three months probably. Part of that process is giving yourself momentum to spread the word of observability and what it’s meant to do in terms of getting to those outcomes. You need a few months to propagate the idea and get people on board, mentally speaking. Then the evaluation process started right after the pandemic started; it was like April or May when we began our tooling evaluation. By then, I had already selected my key stakeholders and had direct channels of communication. And we finished the evaluation in August. So that was three or four months of tooling evaluation.
Corey Quinn:
And then it’s over, time to call it. Yeah, I wish.
Josh Parsons:
But the scope, to your previous point about the scope, yeah, because we didn’t have a lot of time for evaluation, we had to be very selective about the kinds of areas that we wanted to be tooling against.
Corey Quinn:
This one got a whole bunch of feedback questions. Can you share insight on how you get teams to even try a new tool in a mature environment? Because trying to get a decent-sized org to try an overlapping vendor is daunting, and, oh, my stars, yes.
Josh Parsons:
Yes, I can definitely commiserate with that. That is a very daunting thing to have to undertake. I said in the talk that the status quo is a possibility whenever you undertake something like adding tools. I felt like, at least for me, in this process, I had to remove some of my ego from the process. Which is to say, yeah, I have my preferred tool, but it might not work for everyone else. You have to keep that in mind and be open to the possibility that the way you see things is not necessarily how everyone else sees it. So first that.
But then, when you build on top of that, you have to be able to do the analysis of the tools you’re selecting and how they relate to the outcomes you’re trying to get to. The tool you’re thinking about, does it actually help you achieve the goals you’re trying to get to: the engineering outcomes, the happiness? I don’t want people to get paged in the middle of the night. I want people on the frontline, when they are investigating emergent issues, to be able to get to the underlying contributing factors to incidents, and so on. Certainly there’s a cost analysis that goes into that. But whatever tool you’re selecting, to get people on board, you have to communicate. You have to keep hammering home the point that the tool should always be directed towards your goals and outcomes.
Corey Quinn:
I guess a related question beyond that comes from Ethan. Do you have experience implementing observability in services owned by teams who know little to nothing about those services? Gee, I’ve never seen a team like that. How do you make progress?
15:00
Josh Parsons:
Hmm. That’s a good question. I guess if you have engineers who have switched teams, or people who onboarded and joined a team where they own services, I would direct it towards mentorship and being available to answer questions. And, again, that engagement level of understanding where people are in the process of understanding the system they own in production. It’s important to meet people where they are, engage with them, and try to be as helpful as possible. Talk them through examples of problems that are particularly challenging for them, whether they come to you for mentorship or with questions, or if they say in Slack, like, “I’m having a huge problem with this.” Just be willing to engage and meet people where they are in that process. I think that’s sort of universal, but it would be especially useful for more junior people or people who are new to the services.
Corey Quinn:
Yeah, I’m assuming you’re not planning on a second pandemic so you can wind up running the experiment again.
Josh Parsons:
Nope.
Corey Quinn:
Well, I was planning for it, but.
Josh Parsons:
No, I think I’ve had quite my fill of pandemics. But even with this pandemic, there’s still a lot we can retrospectively look back on and process that’s yet to be processed. And of course we don’t want another pandemic. But, you know, I think again of that learn bullet point at the end: always have an attitude of looking back and growing your knowledge. Actually, the previous talk mentioned the shoulders of giants; I see as far as I do because I stand on the shoulders of giants. There’s a lot of great wisdom out there from people who have been doing this a lot longer than I have. As long as you adopt that mentality, I think it will serve you well.
Corey Quinn:
A great question just came in. Do you have a take on guerrilla instrumentation, where your team adds instrumentation and people realize after the fact just how useful it is? Like elves in the night.
Josh Parsons:
Yeah, hmm.
Corey Quinn:
My apologies if that’s an inappropriate term against the elvish.
Josh Parsons:
I don’t want to bail out on that answer by saying it depends, but, again, if you keep the goals and outcomes in sight when you’re thinking about, quote, unquote, guerrilla instrumentation: what purpose is it serving, ultimately? Are you trying to make a point, like, “Look, I have a vision for where this needs to go, and I’m going to do it and show you how it’s done”? I mean, yes, that could be effective, I suppose. But my stance is that I would rather do coalition-building than just go my own way, as in, “No, I know how to do this. This is how we’re going to do it here.”
Corey Quinn:
Right. I’m with you on that. “Move, fool, I’ll do your job for you.” And that doesn’t win friends or influence people in a positive way.
Josh Parsons:
Yeah, I think the best way, at least for my success, the way I was able to convince upper management to bring on a new tool, was through the coalition-building that I did and the engagement that I did.
Corey Quinn:
Which is the next question: how do you get buy-in from the organization and engineers to adopt new tools and the observability mindset? It feels like the early days, when the reason COVID-19 was so “successful” was that everything was changing and we weren’t sure how or why, but while you’re panicking, run this three-liner, please. You can sneak it in early on, but there has to be a buy-in story later?
Josh Parsons:
Yeah, you can think of it as a snowball effect of sorts. You just have to think of it organizationally, in layers. Talk with your team, talk with your manager. Have the manager talk with adjacent engineering teams’ managers about the process. Keep it front and center, especially when you’re doing a post-incident review, like the keynote at the top of the conference about post-incident review. If you are a person who is passionate about observability and the outcomes it drives towards, you’re going to find opportunities to talk about the principles of observability, apply them to everyday work situations, and find ways to drive those conversations. You have to think about it organizationally and build it from the bottom up, I think, to get that buy-in.
20:22
Corey Quinn:
It’s interesting how many questions are coming back to the same exact thing, which is: it’s not about the tool, it’s about the culture and how you drive change. Turns out the hard problems in observability are not that it sucks, who knew. This is a buzzwordy one: how do you move an org over to more modern observability concepts? That feels slippery and hard to grab on to. And you don’t move away from three pillars if your entire marketing campaign is built around them, but all right.
Josh Parsons:
I mean, yeah, you’ll notice that the only time I mentioned the pillars was sarcastically, because I understand the appeal of framing it like that. But nonetheless, I think leaning on white papers and blog posts from people in the industry who know what they’re talking about, and actually sharing the wisdom within those, you know, that sort of doesn’t need to use those terms. Yes, all the logs, all the traces, all the metrics. If you have read or watched talks from people who know what they’re talking about —
Corey Quinn:
And certainly never given one like that. Go on.
Josh Parsons:
But I would say there’s a stark difference in the writings of people who are practicing observability and don’t need to depend on the three pillars. The literature doesn’t need to depend on it; I think you can still talk about observability without necessarily having to resort to the three pillars. If it distills things down into an easily understandable way for people, that’s one thing. I get that, to some degree. But, sorry, I lost my train of thought. I think from the literature, blogs, and talks, you can get a lot and share that information to get the same point across.
Corey Quinn:
Thank you very much for taking the time to take us through your experience and give that great talk. Really appreciate it.
Josh Parsons:
Thank you so much.