Nailing the Deployment Part of CI/CD with Honeycomb
Summary:
Everyone says they focus on CI/CD, but how many of those teams are really doing the deployment part well? Many teams are excited to start down the path to continuous deployment until they hit snags, struggle to fix them, and eventually spend more time maintaining their build pipelines than they spend delivering software. Don’t settle for just the CI part; clear a path to CD by adding observability to your build process. Honeycomb’s Charity Majors (Co-founder & CTO) and Pierre Tessier (Team Lead, Solutions Architect) show you what it’s like to have observability in your builds and why that’s a key part of what lets you ship code to production often and quickly. In this webinar, we'll also look at Honeycomb's own internal build process to see how we used observability to cut our deployment times by 30%.
Learn how to:
- Apply distributed tracing concepts to your CI/CD pipelines
- Optimize those pipelines using Honeycomb Build Events
- Use a free Honeycomb account to speed up your builds
Transcript
George Miranda [Senior Director of Product Marketing|Honeycomb]:
Hello, everyone. Welcome to Nailing the Deployment Part of CI/CD With Honeycomb. We’re going to give people just a few minutes. So we’re not starting just yet. Hang tight, and we’re going to start this webinar promptly at 9:02. In the meantime, can everyone see and hear me? If you can, write me a note in the chat and let me know. Excellent. Looks like folks can hear us. Today’s walk-on music was selected by Pierre. It’s a great way to spend these first couple of minutes. All right. We have quite a few people joining us. It’s 9:02. We’re going to go ahead and get going.
I’m particularly excited about today’s content. I hope you are too. We’re very much looking forward to hearing from our speakers. Before we start today’s webinar, we’re going to cover a few housekeeping items to keep in mind throughout today’s presentation. First, this webinar is being recorded. After we complete today’s content and the recording is processed, if you are registered to attend, you will receive a link to watch the recording. So if there’s anything you miss, or anything you would like to refer back to, click that link to watch this webinar on-demand. Next, we’ll be taking questions at the end of today’s presented content. You can ask questions at any time by clicking the Q and A box at the bottom of your screen. We encourage you to ask questions, as they come to you, and we should have plenty of time at the very end. Last but not least, today, we’re joined by Kimberly from Breaking Barriers Captioning who will be providing live captions throughout this webinar. To follow along with the live captions, just click the CC, or live transcript button, and click “view transcript.” That’s at the bottom of your screen. Let’s get today started. Nailing the Deployment Part of CI/CD With Honeycomb with Charity Majors and Pierre Tessier. Charity, would you like to kick us off with introductions?
Charity Majors [CTO & Co-founder|Honeycomb]:
George, you’re so formal and official-sounding. It’s too early for that. Hi. I’m Charity. I’m the CTO and Co-founder of Honeycomb. If you don’t follow me on Twitter, you may do so. And I’m joined by Pierre.
Pierre Tessier [Team Lead, Solutions Architect|Honeycomb]:
Thanks, Charity. I’m Pierre Tessier. I’m a Solutions Architect with Honeycomb. I don’t have anywhere near the following that Charity has on Twitter.
Charity Majors:
They’re mostly bots, Russian bots.
Pierre Tessier:
Sure, something like that. I think today, you know, Charity, it’s great because today is the culmination of a Slack message you sent me some time ago. You asked me flat out: Pierre, what do you think of CI/CD? And it was an interesting conversation. It started off a whole thing. I think it’s kind of funny. The first question that really came out was: Well, isn’t everyone already doing CI/CD? Isn’t that a thing we all do, that should be ingrained? Everyone says they’re doing it, Charity.
Charity Majors:
Well, we sure as hell talk about it a lot. We all have blog posts. We pay money to CircleCI. So clearly we’re doing CI/CD. We go to conferences. Everybody talks about it a lot, but, no, almost nobody is doing CI/CD. I mean there’s an argument to be had about continuous delivery versus continuous deployment, but, you know, the whole point of CI/CD is getting it deployed. Right?
Pierre Tessier:
So what you’re saying is maybe we should be focusing on the actual payoff itself?
Charity Majors:
I think that the important thing is the interval of time between when someone writes the code and when it goes live. Right? Because of that interval of time, that building block is kind of foundational. It’s about making it short enough that it can be muscle memory. You write some code, you’re instrumenting as you write it, you merge to main, and then you go look at it. And you look at it through the lens of your instrumentation, and you ask yourself, Is it doing what I expected it to do? And does anything look weird? When people look at it within minutes after they’ve written it, they’re going to find bugs. You know, 80% of all bugs, they’re going to find then and there.
The longer it goes between when you write the code and the bug is found, the harder it will be to find, the more likely your customers are going to find it. Few of us are paying attention to keeping our tests parallelized and quick and short and everything that goes with getting it out the door in 15 minutes or less.
6:13
Pierre Tessier:
I think you hit on something right there, about the longer it takes, the worse it is. It’s like that context-switching thing, right? Like, I just wrote some code; it’s in my memory. I’ve got it here. I’m going to hit the deploy button, go grab a quick cup of coffee, take a quick break while that thing is doing its thing, but when it hits production, I know what’s in my head.
Charity Majors:
Yeah. When it’s been days or weeks, you’ve long since paged that out.
Pierre Tessier:
It’s like, two years later, it takes two hours just to set up my dev environment again. What branch was that again? What did I have to do?
Charity Majors:
The thing is, the longer that interval becomes, the more pathologies creep in. Having it be that brief enforces a lot of really good behaviors: really small diffs, quick reviews, forward- and backward-compatible migrations. You know, there are just all these behaviors that come from doing deploys such that you’re shipping one engineer’s single set of changes at a time, not bunching them all together. As soon as you’re up to an hour or more to deploy, I guarantee that you’re going to start bunching together diffs. You’re going to start, you know, shipping artifacts with not just one engineer’s small set of changes but, like, two, three, four, five, six, seven, eight, nine, ten. Now you have broken the fundamental contract of engineers owning their services. If I’m getting paged for changes that aren’t mine, I end up feeling so much less responsibility for them.
Pierre Tessier:
Which probably leads to this next big one right here. So when we grow our teams…
Charity Majors:
It’s the opposite.
Pierre Tessier:
When we have more people doing more things, it’s actually harder to deploy as well because now we have more people that got in there.
Charity Majors:
This is why it has to be automated. Right? This is why you can’t rely on humans to go and push the deploy button every time to make sure they get it in before another diff gets merged. The whole pipeline has to be automated so when you merge your code to main, it kicks off a full CI run, generates an artifact, and then deploys that artifact. You cannot have any human gates in there or you’ve broken the contract. You’ve made it unpredictable, which means you can’t rely on engineers, you can’t count on them to look at it through the lens of their instrumentation and own their own software in production.
Pierre Tessier:
That’s a big one, right? It’s the automation part.
Charity Majors:
It’s huge. And there’s this death spiral that teams get into. Because it takes so long to deploy their code, they’re batching together a bunch of diffs. The amount of time is variable and long. So nobody is looking at their code in production after they’ve written it. Literally, nobody is looking at it. So all kinds of bugs are getting shipped and just lurking out there.
You don’t know when your code is going out. You don’t know who’s responsible for the changes. You have no accountability or autonomy. Your engineers are going to spend a way higher percentage of their time waiting on each other, you know, because the diffs get bigger since you’ve got to get more out at once, and bigger diffs mean the reviews take longer. It starts compounding upon itself. And these deploys, when you’re shipping a whole bunch of people’s changes at once, they get worse. News flash. They get flakier. They go down more often. Pretty soon you’re going to need, like, an SRE team just to deal with the fact that, you know, if you have a failed deploy at 2:00 p.m., the rest of your day may be shot. Right?
Whoever the VP is isn’t going to want their software engineers to be biting the bullet. So they’re going to hire, you know, SRE teams and ops, QA, build engineers. Now you need more managers, more TPMs. You need more software engineers to write the code. Everyone is spending more time waiting on each other than writing the code and looking at it. And before long, you’re a big, expensive company. This cost is real. These are back-of-the-envelope calculations, which are completely made up but also not.
If you’re shipping your code automatically in 15 minutes or less, let’s call that N. That’s how many engineers you need to write and run it. If it’s shipped in hours, you need twice as many engineers. If it’s shipped in days, you need to double it again. If it’s shipped in weeks, you need to double it again. So what you could be running and writing with 10 engineers on a 15-minute deploy loop takes 20 engineers at hours, 40 at days, 80 at weeks.
Pierre Tessier:
Doubling a few times. We’re up to two hours.
Charity Majors:
Yeah.
Pierre Tessier:
Which is interesting, right? And it makes a good case for smaller, nimble microservices managed by smaller, nimble teams focused on just their service, just their part, able to push the deploy button, see it hit production 15 minutes later. Still in that context mode. My IDE is still set up with what I was working on. I can go look at it in production, and I don’t have to worry about…
Charity Majors:
And you don’t have to break the flow. Right?
Pierre Tessier:
Yeah.
11:45
Charity Majors:
The thing is, like, when it comes to software, speed is safety. This is counterintuitive to us as humans because when we get nervous or anxious, we freeze up. Right? But that is exactly the wrong response. You need to think of it more like riding a bicycle or maybe ice skating, where the slower you go, the more complicated your balance gets, and the harder it is to stay upright. The quicker and faster it is, the easier. This is what I’m talking about here. This is not difficult. This is the easy way to write software: when you have this tight feedback loop.
The hard way is what people are doing right now. You know, it gets slower, it gets harder, it gets more complicated, you need more resources, and those have side effects; they create their own complexities and surface areas. The easy way is to strip that away and be disciplined about keeping it to 15 minutes or less.
Pierre Tessier:
I like that. Fifteen minutes or bust. And this is the point, right? This is the point that we’re all trying to say. CI is fine. CD today is broken.
Charity Majors:
It’s great. CI is wonderful. Yes.
(Laughter).
Pierre Tessier:
Because we’re not achieving CD in 15 minutes.
Charity Majors:
We’re not achieving CD at all.
Pierre Tessier:
Let’s be clear. We’re talking about deployment here, not delivery. It’s about seeing it live in production. It needs to be deployed. There it is. I can see it.
Charity Majors:
And this doesn’t mean having users automatically running the code. It’s about decoupling your release from your deployment. It’s about making your deploys fast and continuous, happening many times a day. And you’ve got feature flags, right? You’ve got some mechanism in place where you can release that code to users in a controlled way that’s completely separate from shipping your code to production. Shipping your code to production is basically hygiene. Now, you have other considerations when it comes to releasing to users: all at once, subsets, canaries. Decoupling those two is super key.
Pierre Tessier:
Right. I think you may have hit on something right there, as well. Your users don’t have to be running your code for it to be deployed. It can just be there. You can hide it behind a feature flag that you can test to ensure it is doing what you intended to do before you open the canary up to the world.
Charity Majors:
Yeah.
Pierre Tessier:
This is the point of 15 minutes. This is the point of the small feedback loops, because there is no such thing as staging being prod. There’s only one spot where prod is.
Charity Majors:
Only production is production.
Pierre Tessier:
There is no other spot. I think the real success of how you make CI/CD work, how you make the CD part work, it’s right here. It’s the time from when I commit to the time that it’s in production. If you can keep that within 15 minutes, a cup of coffee, going out to go get one, then we’re going to be able to have…
Charity Majors:
It becomes muscle memory. You do it without thinking. You look at your code in production. You watch users using it. You make sure you haven’t broken something hugely adjacent to it. Then you move on with your day. It’s so much saner than saying, Okay. My code is going to be deployed by someone in the next day or week. I don’t know. That’s not, that’s not how good engineering teams function. You know? And what we see is all of these pathologies kind of fall out of this long deploy loop. You know, a lot of people, a lot of really great people spend time fixing them, trying to enforce smaller diffs, trying to get people to turn their code reviews around quicker, trying to get people to care about operations, trying to get people to do all of these things.
But they’re expending so much energy doing it the hard way, instead of focusing relentlessly on the length of time between when you write it and when it goes live. Fix it at the source. Don’t worry about the symptoms. It’s like if the patient is wandering into the emergency room and spurting blood from all of their orifices, oh, let’s mop up after him. Well, let’s get the patient to stop spurting blood. That may be more effective and efficient here.
Pierre Tessier:
We do this at Honeycomb. I remember one time I was coming to use our product, and we were working on the progressive shifting of the buttons in our query window. Things happen. The browser I was using, the way I was using it, the size of my browser at the time triggered a condition that we didn’t catch. But I happened to see it. Right away, I clicked on it and thought, Did something change? The engineer who committed that code had committed it just 10 minutes earlier. He said, Let me work on it. Literally, I went to get a bite to eat, came back to my desk, and it was fixed. That’s 15 minutes or bust. The engineer behind it had all the context, knew exactly where to go look, was back in the code within minutes, and we pushed a PR to get it fixed. This is why 15 minutes or bust is so important, everybody. It’s that muscle memory.
17:32
Charity Majors:
Yeah. And, like, we talk about this in the ways of how important it is for the business and for the users and everything, but, like, I also want to make sure we don’t lose sight of the fact that this makes engineering a whole hell of a lot more fun. You know, shipping lots of code doesn’t burn engineers out, not shipping is what burns engineers out. Being decoupled from your work and the impact of your work is what burns people out. The more you have a tight feedback loop, it’s like you get the payoff. You get the dopamine hit. You’re moving so quickly and with so much speed and control. Right?
You’re not out of control, just like Wee! But it’s like, you know, being in a fucking Ferrari or something instead of just lumbering down and being blindfolded in your wagon. I don’t know. It makes your work so much more satisfying and fun. I can’t say enough about… This is why anytime you talk to an engineer that’s worked with continuous deployments before, they’re unwilling to go back. This is what they’re looking for in a job from now on. It’s like a whole different profession. It’s life-changing.
Pierre Tessier:
It’s like smiling every day instead of one big smile every three weeks. Right?
Charity Majors:
There’s the whole Dilbert comic thing, you know, just about how frustrating computers are and (growling) anger. Yes, anger. Obviously, I love anger. It’s my favorite emotion, but it has to be anger coupled with joy. Right? Not anger coupled with frustration and torment and futility, which is what a very long, fake CD process inspires in me.
Pierre Tessier:
So we’re saying 15 minutes or bust. We’re saying do this. Now somebody is going to say, you know, Why is that so hard?
Charity Majors:
It’s hard to generalize because we don’t know where people are coming from. I will say it’s not hard, technically. It’s hard politically, usually. Technically, it’s like, okay, if your CI process takes a couple of hours to run, you just start chipping away. It’s just engineering. It’s just engineering. But it’s engineering that probably every single person listening to this webinar can do. But you need the time. You need to get the time devoted to it. You need to convince people that it’s not scary and that it will actually be better and it’s worth sacrificing some reliability in the short term to gain more in the longer term and some time away from features. Right?
And, fundamentally, don’t get me wrong, I think the failure to adopt continuous deployment more widely is the single biggest failure of our technical leadership class of the past two decades. Just like fall down flat on our fucking faces failure of political will to, like, get this through. But the way to start changing this is, I believe, to make the arguments to the people who make these decisions, but don’t just cast it in terms of engineering things. Like, Oh, this would be a better CI/CD pipeline, blah, blah, blah, blah you know, terms and everything.
Talk to them in terms of dollars and people and attrition and users and, you know, the speed of innovation, and the amount of time that you’re… like, pull up the Stripe developer report where they show that engineers spend about 42% of their time just fucking wasted, time that’s not going to move the business forward. That waste is, I guarantee you, almost entirely pathologies related to a long CD interval. Or pull up the DORA report. The best way to make progress on those four key metrics is to start with this fucking interval. Or get someone to read Accelerate. The more people you can get to read Accelerate, the better. That’s the long version of this webinar. Read the first half of Accelerate. It’s the best collection of data, like real data, that we have about what makes engineering teams high performing and effective and efficient and delightful to work on.
Pierre Tessier:
I want to come back. Early on, you talked about how it’s not hard technically. You’re right. It’s hard politically. You need buy-in from leadership to make this work.
Charity Majors:
Yeah.
Pierre Tessier:
And if we don’t have buy-in from leadership, it is going to be a struggle. This is where…
Charity Majors:
But if we did…
Pierre Tessier:
We have to show there’s a lot of value here, and it may start off a little rocky, like all the big changes you’re doing, you’re changing your methodology and changing your approach.
Charity Majors:
Oh, I can’t forget to say: if you’re introducing CD to a very mature organization or repo that has not had CD for years and years, it’s going to be somewhat challenging. But do you know what’s really fucking easy? Starting off that way. Don’t fall into the pit. Make a commitment that for any new project you spin up, you make your build pipeline on day one, completely automated from merge to artifact to deploy, and it will just become as natural and easy as breathing. It’s like air. Pretty soon, people will be questioning, Why can’t everything be this easy? And that will put pressure on a lot of people to change the other stuff. It’s easy. This is the easy way to write software. I can’t say that enough. So, Pierre, we’re here. We are a technical product. If you do have the buy-in, where would you start?
Pierre Tessier:
I don’t know, probably the build pipeline.
Charity Majors:
Instrumentation and observability.
Pierre Tessier:
That is kind of fun because that’s how we got started on this a while ago.
Charity Majors:
Yeah.
Pierre Tessier:
We’ve always done the deploy to production. Right? It was a bash script when we first started at Honeycomb, but we matured.
Charity Majors:
Yeah, it was a bash script. I literally just wrote a bash script to, like, copy the artifact from CircleCI out to production. Now it’s a slightly more sophisticated bash script.
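That first script really can be that small. A hypothetical sketch of the shape (the artifact URL, hosts file, and paths are illustrative placeholders, not Honeycomb's actual setup):

```bash
#!/bin/bash
# Hypothetical sketch: pull the CI-built artifact and roll it out to each host.
set -euo pipefail

ARTIFACT_URL="https://ci.example.com/artifacts/app-${GIT_SHA}.tar.gz"  # illustrative

while read -r host; do
  ssh "$host" "curl -sf '${ARTIFACT_URL}' | sudo tar xz -C /srv/app \
    && sudo systemctl restart app"
done < production_hosts.txt
```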
24:00
Pierre Tessier:
A little bit more.
(Laughter)
But it did work. A few years ago, we started applying our principles to the pipeline. What we started doing was thinking about: What is CI/CD? Can we gain more knowledge from it? Well, CI/CD, a pipeline, is basically running bash scripts. You’re going to run Go tests. You’re going to run Go build. Run JS tests. You’re maybe going to run some other linter of some kind, and then you’re going to compile.
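Stripped of vendor configuration, such a pipeline really is little more than a script. An illustrative sketch (the tool choices and paths are placeholders):

```bash
#!/bin/bash
set -e

golangci-lint run ./...    # some linter of some kind
go test ./...              # Go tests
yarn test                  # JS tests
go build -o app ./cmd/app  # compile the artifact
```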
Charity Majors:
It happens across time. What else do we do that we visualize by time?
Pierre Tessier:
Hmm. It sounds like a user transaction, where a request comes in, bounces through a couple of different services, and then gets returned.
Charity Majors:
It sure does.
Pierre Tessier:
Well, while we’re here, let’s talk about how we achieve that. We did that at Honeycomb. We said, Why don’t we treat our builds like a distributed trace?
Charity Majors:
Traces.
Pierre Tessier:
And this is one of the very first ones we did, from many years ago, 2018. This is one of the first builds that we actually went off and traced. It was simple back then. This thing here called Build Events is what we created that helped us do that. You’re going to see the words Poodle, Doodle, Retriever, Shepherd. These are dog breeds.
Charity Majors:
We’re a dog company.
Pierre Tessier:
We love dogs. Puppies are great. So our services at Honeycomb are named after breeds, fittingly enough. Poodle is a show dog. So Poodle is the front end, if you’re wondering what that means there. And Retriever lets you go get all your stuff. Retriever is the query engine for Honeycomb. But we can see all these steps going through, all the way down to the Go install. It was almost like a piece of magic that happened. We learned a lot. What did we learn? Well, we learned that Poodle build and Go test seemed to take a lot of time. That makes sense. Our code grew. Do you know what else we learned? We learned that there was nothing wrong with this. We learned that it was fine, because now we had knowledge of what was going on inside our builds, and though somebody had an inkling that our builds were getting long, yeah, 400 seconds is fine. They seemed to be doing what they were supposed to.
And we left it alone. We just let it run. We kept the instrumentation going, with occasional checks on it. You know what? That’s one of the good things you can do. You can do nothing. That’s an okay decision to make. And we made that one. A year goes by. We’re still instrumenting. Those builds went from, you know, four, five, six, seven minutes, and now we’re up to 12, 13, 14 minutes. They were getting longer. But we had the data. We could draw this out. We could see the trend over time. And this is a year of data we’re looking at. And we could start to understand that. Okay. Now our builds, now our pipeline, our ability to go from code to deployment, we’re pushing the 15-minute threshold. We’re getting tight on it. And this is just the CI part. We wanted to start adjusting it.
We started looking at what our builds were doing. We can click on this. We could see what was happening inside that build itself. We were spending time on this script thing. We learned about it. And we made the decision to change CI platforms. There’s a lot of reasons you do these things. I’m not going to say one CI platform is better than another. Every organization is different. Every organization has different needs and wants to look at things their own way. We decided to search for a different platform because we felt if we could parallelize these steps, we could achieve better build times. So we used containers instead. So you can see the switch-off happen right around here. Now, there’s a lot more variance involved when you do builds by container because, you know, the start-up time. We knew that variance would happen. What we really care about is the average duration. We’re looking at the heat map here.
Because we have that data, let’s go ahead and pull up an average duration of how long these builds took. And we could draw that line and see that, yeah, we did actually save probably 30%. And that’s great! We’re winning. We’re winning at what we’re trying to achieve. And we could continue to look at this in various different ways because we have this data. So I could say, Hey, group this by name, for example. Name: all the different steps we have inside of our build. Now I can see what steps are taking too long. I can see here Go test is clearly our biggest one. Our stack is mostly Go, but we saved a lot on Poodle build. That got parallelized quite a bit. We could see the drop that happened. I can hover over these lines as well and see the same effect there too. Because we instrumented our builds, because we had this information, we’re able to see what is going on. We’re able to understand where to optimize next.
Charity Majors:
The first step of the engineering effort is to measure it, right, so you’re not just driving in the dark.
29:40
Pierre Tessier:
That’s exactly it. Right? You’ve got to know where you’re going before you go there. And we can go look at one of these builds. I will click on one of these here. These are more recent. You can see a lot of steps running in parallel. I can even see what branch it came from. We have the syntax: user, dot, and whatever the branch is. So Molly here worked on this, and that was her branch. We have all the information, we have build IDs, and all this stuff in here to understand what we need to do. And that’s what happened, right? We optimized the build, moved to a different CI platform, things were faster, and we’re back to achieving the 15-minute deploy-to-production threshold that we’re trying to achieve.
Time goes by, we grow, and we’re engineers. We hire. We add more features. I have a saying I like: disrupt or be disrupted. If you’re not adding new features, your competitor is doing it for you. As such, our builds got longer. This is the world we live in. But what’s great is we did more with our builds. We started adding markers to our platform as well. So, here, this is Poodle. This is looking at Poodle. It’s early in the morning right now, so the engineers are not making builds. But you can see here every hour we push what is in there. We make sure it’s pushed, no matter what. And I can go into any of these, and I can click through it. I can learn more about it. I can go all the way down to the actual commit message of what we actually did right here, from Doug, who made a change.
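Those markers can be emitted from the pipeline itself. A minimal sketch against Honeycomb’s markers API (the dataset name and message here are illustrative):

```bash
# Post a deploy marker so the deployment shows up on your Honeycomb charts.
curl https://api.honeycomb.io/1/markers/poodle \
  -X POST \
  -H "X-Honeycomb-Team: ${HONEYCOMB_API_KEY}" \
  -d "{\"message\": \"deploy ${GIT_SHA}\", \"type\": \"deploy\"}"
```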
But this is another important part. It’s not just instrumenting your builds. It’s projecting your deployments onto your observability tooling, so you can see those changes and click on them and learn more about them. I could probably even link back to the trace for that build if we really wanted to get detailed like this. So, as I said, time moved on. We were learning more things. We were going further. And we were noticing things getting a little slower. So we’ve got this wonderful engineer on our team. His name is Dean. I love Dean. He’s so great. He had the idea: I can make these faster. He understood that we were spending a lot of time in our JS tests; even though we were parallelizing them, cumulatively they were too much. He said, why don’t we use a RAM disk instead, and we improved our build times significantly by doing this.
So here is a chart. I think it’s nine months or so that we’re looking at, maybe a little bit more. And I drew a marker on here so you can see when that commit actually happened. And you can see there was a drop, a drastic 30% drop in our build times. And this is part of what you want to do. You want to measure. You want to put that observability in there. You want to understand it. Now, you don’t need to go look at this every day, but it’s good to understand it. Your CI platform is your compute. Whether you host it yourself or use somebody else’s, you’re paying for that compute.
Charity Majors:
Or you can set up a trigger for yourself to let you know if this ever exceeds 15 minutes.
Pierre Tessier:
We actually have an SLO for that. If it matters to you, you have to put an SLO on it. Right?
(Laughter)
I can see this. I can see what’s going on. I know not long ago, Charity, we did a big push. We actually released this on Monday, right? The query annotations, saved queries?
Charity Majors:
Yeah.
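For reference, the SLI behind a build-duration SLO like the one Pierre mentions can be expressed as a derived column over the build spans. A sketch, assuming the duration_ms field that Build Events sends; it returns true when a build beats the 15-minute (900,000 ms) budget:

```
LT($duration_ms, 900000)
```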
Pierre Tessier:
I can see a little thing right here. I can see more builds. I’m going into BubbleUp mode. For those who don’t know, BubbleUp is this awesome feature in Honeycomb that says, I see something in the chart. Tell me about that.
Charity Majors:
It’s weird. Tell me why.
Pierre Tessier:
Tell me why. So I’m going to draw a little yellow box around that bump right there, and it’s going to pull things out about it. All of these builds are unique. We’re able to group down. Build numbers are all going to be different. But I can look at different things. I can see what repo is standing out. Owner. Okay. Toshok. We have a saying at Honeycomb, Toshok has a branch for it?
Charity Majors:
Toshok has a branch for that.
Pierre Tessier:
I’ve got all these insights fast. I can know who is doing what, what are they doing, are they doing it right?
Charity Majors:
It computes, like, what is different inside the bubble from the baseline around it. Whether that’s one thing or 20 things. Which is so mind-blowing when you’re coming from a world of metrics where you can’t do that kind of correlation, because you’ve stripped all that context away.
Pierre Tessier:
And I could even take this and say, Hey, what else is Toshok doing? These are only Toshok’s builds.
Charity Majors:
Jesus Christ.
Pierre Tessier:
He does a lot of builds. We may want to have a chat with him.
(Laughter)
But we continue that. I can drill into it and look at the trace behind that build and learn more. So all of these capabilities, we call them Build Events. It’s a tool we created. We make it available to our customers out there. We have a repo for it. Pardon me, I did not have the repo tab ready and open. So I’m just going to pull it up here real quick: github.com/honeycombio. It’s called buildevents. You can go here and learn more about it. We have a lot of different instructions on how you use this. You put this on your CI pipeline. You bring it all the way to deployment, and you can start visualizing your builds. You can start measuring that 15-minutes-or-bust goal and start to achieve it.
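To make that concrete, here is roughly what wiring buildevents into a shell-based pipeline looks like, following the patterns documented in the buildevents README. The build ID, step name, dataset, and commands are illustrative:

```bash
export BUILDEVENT_APIKEY="${HONEYCOMB_API_KEY}"   # your Honeycomb API key
export BUILDEVENT_DATASET="buildevents"           # illustrative dataset name

BUILD_ID="${CI_BUILD_NUMBER:-local-$(date +%s)}"  # unique per build
BUILD_START=$(date +%s)

# Each `buildevents cmd` runs the wrapped command and emits a span for it.
STEP_ID="tests"
STEP_START=$(date +%s)
buildevents cmd "$BUILD_ID" "$STEP_ID" go-test -- go test ./...
buildevents cmd "$BUILD_ID" "$STEP_ID" js-test -- yarn test
# `buildevents step` closes out the span covering the whole step.
buildevents step "$BUILD_ID" "$STEP_ID" "$STEP_START" tests

# ...wrap your build, lint, and deploy steps the same way...

# Finally, emit the root span for the entire build.
buildevents build "$BUILD_ID" "$BUILD_START" success
```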
Charity Majors:
A lot of people, when they roll this out for the first time, there’s a lot of shocked looks. What? We’re spending how much time on what?
(Laughter)
Pierre Tessier:
Always. And this is the thing where you look at the value versus the cost. The cost, you can do this with a free account.
Charity Majors:
You can totally use the free tier.
Pierre Tessier:
Absolutely. But the value you gain is massive. So we definitely recommend it to a lot of people.
36:15
Charity Majors:
It’s a really fun way to get started with Honeycomb. There’s literally no way you can break anything in prod while you’re playing with everything. It’s a free tier. It’s a permanent free tier. It’s not like you’re going to have it just for 30 days or whatever. But, yeah, it’s incredibly valuable.
Pierre Tessier:
That’s exactly it. We also have integrations with GitHub Actions. We use CircleCI. Build Events itself will integrate with just about any pipeline in the world. Someone wrote a Jenkins plugin for it. It’s a universal thing that’s used to wrap every step in your pipeline and emit a span as part of one long distributed trace. So you can really gain those insights.
Charity Majors:
Cool.
Pierre Tessier:
So with that, I’m going to switch over to this. Sorry?
Charity Majors:
Cool.
Pierre Tessier:
There we go.
Charity Majors:
I’m just agreeing with you.
Pierre Tessier:
Definitely instrument your pipeline for free. You can do this on a free tier of Honeycomb.
Charity Majors:
Surprise your team today!
(Laughter)
Pierre Tessier:
And definitely go ahead and try it yourself. You can go get Build Events. You can search for the GitHub Action; it’s available there as well as for CircleCI. If we don’t have your preferred CI platform of choice, you can find us on the Pollinators Slack team. You will get that when you join Honeycomb. Go ahead and ask us. There’s a channel out there for you about it.
Charity Majors:
Yeah.
Pierre Tessier:
With that, I think we have 15 or 20 minutes left. I think it’s a good idea to open us up for questions. Those are all the URLs. You can go ahead and get the GitHub Actions and CircleCI integrations. I will hand it off to you, George. Let’s see what we have.
George Miranda:
Yes. That’s great context. I’m back with my official tone, but it’s 40 minutes later, so maybe it’s not too early for that. I love the walkthrough. Just to reiterate, this is one of the things you can do with a free Honeycomb account. Like, the usage that you would get is well within what the free tier supports, and it’s a really easy way to get started with observability. We’ve got a ton of interaction. There’s a lot of great stuff in the chat. So we’ll start making our way through those. Again, if you haven’t submitted a question and you have one, use the Q and A button at the bottom of the screen, and we’ll try to get to it. You talked about using feature flags to deploy features and releases. Are feature flags necessary, or is having observability enough to know what’s happening when you’re releasing features?
Charity Majors:
It depends. Who are your users? What are your products? If you’re a bank, you should have feature flags. You know, we didn’t start out with feature flags. We added those later. So it comes down to your appetite for risk and your engineers’ operational maturity level, I guess. The thing that having feature flags lets you do, though, is not have to be so strategic about when you’re allowed to merge your diff. Making it so that people don’t have to think about whether it’s a good time to merge this diff or not means there’s never enough time for those diffs to grow and grow and become monstrous. No, they’re not absolutely necessary. For God’s sake, we’ve only really had feature flags in the last couple of years. But I think you’ll pretty quickly find yourself craving them a bit.
George Miranda:
Nice. Okay. There’s a follow-up question along the same lines. If the concept of release is going to change from the way that we think about it currently… so, basically, is a single feature, if tested, automatically otherwise, like, ready to go? Does it just go right into the destination? Does it just go straight to prod? Is that what we’re talking about?
Charity Majors:
As opposed to what?
George Miranda:
As opposed to, I guess, the way we think about it today. I don’t know. The question isn’t very clear.
Charity Majors:
Are they talking about are we going to have staging environments to promote them from or?
George Miranda:
Yes. And there are actually quite a number of questions about that. Basically, and let me pick one of the ones that are related. Is there any place in your vision about lower environments? Like, what does this mean for dev, QA, staging, et cetera?
Charity Majors:
Yeah. So despite my reputation on the Internet, I am not anti-staging. My point has simply been that I think we need to invert the priority. I feel like too many engineering teams have spent, like, you know, so much of their cycles on trying to replicate production data in staging, generating a different dev environment for every engineer, all these elaborate setups. So when it comes to production, we’re out of time. We’re just going to let it be there, barren. I’m just saying production first. I think we’re seeing a big gravitational shift right now, in the last four or five years, shifting a lot of attention away from pre-production hardening and toward giving you more tools and more fine-grained abilities in production.
That doesn’t mean there’s no role for staging. In fact, Honeycomb has a couple of staging environments. And the way our autodeploy works is, when you merge, it autodeploys to staging. And then, after a while, if things are fine, and there are automated checks as well as people using it throughout the day, it promotes it and deploys to our Dogfood environment, which is the Honeycomb for Honeycomb. After a while, it then deploys to production. So there’s a promotion path that goes on there that helps us gain a lot of confidence. Your mileage may vary. Only you know what you need. But I guess the point is that your job is not done when you’ve deployed to staging. Too many people are like, I’ve deployed to staging. Now I have CD. And, no, that’s not actually true.
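Sketched as pseudo-pipeline stages, that promotion path looks roughly like the following. The helpers and bake time here are hypothetical placeholders, not Honeycomb’s actual tooling:

```bash
# Hypothetical promotion pipeline: merge -> staging -> dogfood -> production.
deploy_to() { echo "deploying $2 to $1"; }       # placeholder
run_automated_checks() { echo "checking $1"; }   # placeholder
BAKE_TIME=1800                                   # e.g., 30 minutes

deploy_to staging "$ARTIFACT"
run_automated_checks staging    # plus humans using it throughout the day
sleep "$BAKE_TIME"

deploy_to dogfood "$ARTIFACT"   # the Honeycomb-for-Honeycomb environment
run_automated_checks dogfood
sleep "$BAKE_TIME"

deploy_to production "$ARTIFACT"
```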
43:20
George Miranda:
Nice. There are a couple of other environmental related questions, but I think they’re slightly different, so I’m going to pivot a little bit with our next question. I love this one. Once your team is already in a slow-release cycle, what suggestions do you have to start reducing that time down from weeks to something much faster for deploy?
Charity Majors:
Weeks? Oof! First of all, if you’re spinning up any new projects, any new services, any new repo, or whatever, make sure you get in on the ground floor and start your CD from day one. The difference in how delightful it is interacting with that service versus the pain of the others will in itself create a lot of pressure for change because it will no longer be hidden from everyone how miserable their lives are. It sounds to me like there are a lot of human gates involved, probably humans that have to sign off on things, humans that are doing manual tests for things. And I don’t know what your title is or your job position. Any one lone engineer is going to have a hard time bucking the system and improving it. I think I would start with trying to spread, you know, spread the good word. Give them the sacred text of Accelerate. Like, leave copies on everyone’s desk. Just get everyone thinking about how much better it could be.
This is a political problem. The engineering this takes is trivial; it’s convincing people to let go, convincing people to participate in a brave new world, that is the difficult part. It’s hard to be more specific without knowing more about your environment. I would be happy to chat about that, you know, if you want to hit me up on Twitter DMs or something.
George Miranda:
Sweet. Those are all really good places to start. There’s a couple of variations of this next question. I think it’s worth clarifying a bit here. And so this particular question is worded as: So 15 minutes to production. What does that mean for peer reviews where automation is not possible? Where do peer reviews fit into that picture?
Charity Majors:
Absolutely. You write your code. You get it reviewed. Then you merge.
George Miranda:
So we’re talking about once it merges, then it needs to go?
Charity Majors:
Yes. Yes. Yes. So everything between when you merge and when it’s live is what is automated. But, absolutely, get it reviewed beforehand.
George Miranda:
Awesome. I love this variation of the question which is really around what you’re exposing and sort of the risk for deployment. So the question is: Do you still handle the timing of releases during peak usage, like on a Tuesday, if something breaks, you might have thousands of customers that are affected versus doing it on a Saturday morning where if something breaks, you might expose the problem to fewer users. Does traffic change how you approach releases?
Charity Majors:
I can think of maybe some very rare and unusual scenarios where it might, but the whole point of this is deploying has to be ordinary. It has to be unremarkable. Shipping software should not be scary. It should just be part of your everyday life and not something you think about a whole lot. It should just go live. You know, there’s all this trust that we’ve built, social expectations and, you know, don’t deploy on Fridays! Like, shipping software is the heartbeat of your company. If you’re a technology company, it should be just that boring and unremarkable and reliable.
There’s something really wrong if you have to special-case deployments on a regular basis. To be clear, there are ranges of danger. Right? In general, the closer you get to laying bits down on disk, the more risk-averse you should be. If you’re doing a database major version upgrade, there’s a real argument there for doing, you know, maybe capturing 24 hours of traffic and replaying it onto an identical cluster while tweaking some variables in the config file or whatever. Right? So that’s the kind of paranoia that you can adopt.
For some storage changes at Honeycomb, like when we added compaction or compression, what we did was first ship a change that did nothing but instrument a lot. That gave us the ability to analyze our instrumentation to see what changes would happen and what the impact would be on users. How much would users reclaim? Would their footprint grow? So we shipped this instrumentation. We let it bake for a couple of days, and then we started shipping changes that were reversible and partial. You start thinking about shipping your code differently, in smaller chunks that are controllable, reversible, and much less scary. These should not be big-bang releases. They should be just normal: let’s get a little bit closer today than we were yesterday.
George Miranda:
Nice. This is a really good follow-up to that.
Charity Majors:
You’re muted, George.
George Miranda:
Am I?
George Miranda:
Can you hear me?
Charity Majors:
I can’t hear you.
Pierre Tessier:
I can hear you just fine.
Charity Majors:
Oh, I can’t hear Pierre either.
George Miranda:
Oh, well then, that’s Charity’s volume. Hello? Can you hear us, Charity?
Charity Majors:
Oh, I see what happened.
49:32
George Miranda:
All right. Yay! I was saying a good follow-up is: There were some variations of this question as well, which is basically like, Hey, let’s get into that 15-minute time frame. Right? Like, what are we really talking about here? Specifically, one of these questions is around sort of the time frame of what it means when you’re deploying at scale. Let’s say, as part of this, you have to replace hundreds of pods in your infrastructure, right. How does that really impact that 15-minute window we’re talking about? Are we taking everything into consideration?
Charity Majors:
Yeah, so we’re talking about that 15 minutes. I only count until the first node goes out. Right? Until your canary goes or until your pods begin rolling. The length of time it takes to roll your infra or deploy it fully doesn’t matter. You can be as careful and cautious as you want. It’s a good practice, if you have the ability, to ship to one canary and have the engineer look at that canary and slowly promote it to larger percentages of your system. That’s a great best practice. It doesn’t count within the 15 minutes whatsoever. It’s all about the human, right? It’s about the person sitting there writing code, being able to, you know, expect that their code is going to be live within minutes so they can use their muscle memory to go look at it. That’s true whether it’s one node or 100%.
George Miranda:
All right. Let me ask a couple of quick ones, I think. So, one: Earlier, you mentioned Accelerate and you mentioned the Stripe report? Was there another report mentioned? Someone didn’t seem to catch what all the references were.
Charity Majors:
Just the DORA reports.
George Miranda:
Okay. So DORA being DevOps Research and Assessment, the folks behind the Accelerate book. I think that is the totality of what we’re talking about. Great. There was another one in terms of support, so I’ll lump these together. Build Events that you showed, does that work with GitLab CI/CD? And, hey, if my build pipeline is built in Windows using a .NET environment, are there libraries to support platforms like that?
Pierre Tessier:
I can take that. So Build Events is a small Go library, a small Go program I should say, that’s meant to be run from bash or through a script. For GitLab, I don’t know it specifically, but I’m confident it can run script commands effectively. So instead of running your standard command, like go test, you’re going to run Build Events and pass go test in as an argument to Build Events. And that’ll take care of doing all the wrapping for you. It can work with any of the pipelines out there, and for the Windows .NET one equally. It’s going to run from the… I’m assuming there’s a PowerShell script of some kind that’s running. So it’ll just run as part of that script.
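In other words, a step that used to invoke a tool directly gets wrapped, something like this (the build and step IDs are illustrative):

```bash
# Before: the step runs the tool directly.
go test ./...

# After: buildevents runs the same command and emits a span for it.
buildevents cmd "$BUILD_ID" "$STEP_ID" go-test -- go test ./...
```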
George Miranda:
Nice. I really like this one, actually. The question, or the comment, is: I liked your point about shipping instrumentation first, before shipping a reversible change. How does Honeycomb reverse changes? Right? Like, what do we do when things go wrong? Feature flags, perhaps?
Charity Majors:
We use feature flags a lot. There are the occasional changes that we need to, you know, ship new code to fix. In general, for most changes… we use SLOs extensively. It’s usually more about burning down an error budget than everything being down. Right? We generally fail forward rather than rolling back because it’s fast and easy. You know, our engineers can usually figure it out. Just like Pierre was saying, you just wrote the code, it’s fresh in your mind, you write a fix, you merge it, and you roll it out.
On a rare occasion, I think that we have done rollbacks. Those are usually when something is catastrophically wrong and we’re really not sure what. Then sometimes we will attempt to roll back, but usually, we’re pretty good about just rolling forward; and, usually, it’s caught by feature flags but not when it’s cycled.
54:25
George Miranda:
Nice. I think we have time for one last question, which is: How long does it typically take to set things up like this in Honeycomb and start feeding data in?
Pierre Tessier:
It depends on your language. So there are two things here. We’re talking about instrumenting your build pipeline. That depends on how fast you can get in there.
Charity Majors:
About an hour, I would think.
Pierre Tessier:
I was going to say about an hour. If you’re instrumenting code and sending data from your application into Honeycomb, it could be a few minutes. If you have access to the source code, it could be a few minutes. For some agents, like Java, we have the ability to make it even easier.
Charity Majors:
I see one other question here, George. How do you address the moral argument that as an engineer you should be manually deploying your code and testing it manually because that’s what good engineers do? How do you convince people that we have to automate these things? Humans are failure-ridden. Like, I think this is like one of those, it’s just like the instinct to freeze up and get slow when we don’t understand things. It’s just not how software works. I think that it’s really important to make humans do what humans are good at and make machines do what machines are good at.
Do you know what machines are really good at doing? Repetitive bullshit. You know, running tests, making sure that the things you’ve already found that are wrong, that there are no regressions, you know, that’s not stuff that humans are good at. We’re terrible at those things. We’re good at speaking out and figuring out new things, assigning meaning to events. So let the computers do what computers are good at.
Pierre Tessier:
Humans are good at missing step 2 of a 10-step process.
(Laughter)
George Miranda:
Well, that’s all we have time for today. I want to thank our hosts, Charity and Pierre; you’ve been amazing at walking us through today’s presentation. For the viewers, remember, a link to this webinar will be appearing in your inbox. You can watch it on demand. Please feel free to share this webinar using that link if you liked today’s content. You can find more webinars and other helpful content at Honeycomb.io in our resource library. And you can follow our hosts on Twitter: Charity Majors, @mipsytipsy, or Pierre Tessier, @PuckPuck, and follow Honeycomb @honeycombio. So thank you for joining us. And happy week, everyone.
Charity Majors:
Bye. I see someone say, Now I get George’s shirt!