“Can a subject matter expert ever step down?”
Have you written anything about how to solve the ‘nobody wants to be on-call as a subject matter expert to support people handling incidents’ problem? My best idea is ‘pay people more,’ but that doesn’t sound so compelling as a pitch.
– Kate, @thingskatedid
First, I’d like to say that pager duty isn’t something we should treat like chronic pain or diabetes, where you just constantly manage symptoms and tend to flare-ups day and night. Being paged out of hours is as serious as a fucking heart attack. It should be RARE and taken SERIOUSLY. Resources should be mustered, product cycles should be reassigned, until the problem is fixed.
Yes, paying people more is certainly one thing you CAN do, and sometimes it might be a good idea TO do so, but it isn’t an effective solution to the problem of people not wanting to be on call. It’s like duct taping over the leak in the dike. The solution is to make sure it doesn’t fucking suck to be on call.
Of course you are going to resent being on call if it means your entire life gets put on hold for a week. If you are tethered to a laptop and can’t make plans with friends or go to the movies; if you dread getting woken up repeatedly; if you get paged about flappy alerts or the same problems week after week; of course you’ll hate it.
It is engineering’s job to own their software, but it is management’s job to make sure your time is respected and your sleep schedule is valued. It is management’s job to make sure real dev time gets allocated to fixing the damn problems, and they don’t get pushed under the bed to molder away into tech debt. I believe managers’ own performance should be judged in large part on how effectively they protect and insulate their people from interruptions and wakeup calls. They should be evaluated in part by the four DORA metrics, plus a fifth: How often is your team alerted outside of working hours?
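If you want to actually track that fifth number, here is a minimal sketch of counting out-of-hours pages. The 09:00 to 18:00 weekday window, the function names, and the sample timestamps are all assumptions made up for illustration; it also assumes each timestamp is already in the local time of whoever got paged.

```python
from datetime import datetime, time

# Assumed working hours: Monday to Friday, 09:00 to 18:00 local time.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def is_out_of_hours(paged_at: datetime) -> bool:
    """Return True if a page landed on a weekend or outside working hours."""
    if paged_at.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= paged_at.time() < WORK_END)


def count_out_of_hours(pages: list[datetime]) -> int:
    """Count how many pages interrupted someone outside working hours."""
    return sum(1 for paged_at in pages if is_out_of_hours(paged_at))


# Hypothetical sample data, not real incident history.
pages = [
    datetime(2024, 3, 4, 3, 12),   # Monday 03:12, out of hours
    datetime(2024, 3, 4, 14, 30),  # Monday 14:30, working hours
    datetime(2024, 3, 9, 11, 0),   # Saturday 11:00, out of hours
]
print(count_out_of_hours(pages))  # prints 2
```

In practice you would feed this from your paging tool's export rather than a hard-coded list, and report it per team, per week.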
I believe it is reasonable to ask anyone who works on a 24×7 highly available system to support that system. I believe it’s reasonable to expect to be woken up 2-3 times a year, as an engineer for that system. There will be exceptional situations where this is impossible, like during hypergrowth. But more than that, as a matter of course? That really is not okay. Allocating sufficient resources towards system stability is a management problem, and managers should be held accountable for their results.
How to accommodate life disruptions when distributing on-call load
I agree with all of these things, but something still doesn’t quite fit for me, and I’m trying to work out why. In this case, I’m the one who doesn’t want to be on call, after five years in this role. I agree with all the things you said, about pulling your weight and owning your code and doing your part, but I just don’t want to do it, and I’m not sure how to reconcile that in my head.
Oh! I am so glad you pushed me on this. I do tend to make very sweeping, absolute statements, when in reality, no answer is ever universally true. Because this is a system made up of humans, there are naturally plenty of edge cases and grey areas.
Let’s take a look at a few of those edge cases that a team might need to work around:
- Someone has a kid. It would be inhumane to ask someone to live with TWO tiny agents of chaos, prone to melting down at all hours. I would never ask someone with a young child to also carry a pager. It’s just too much. Wait until the kid can sleep through the night, for goodness’ sake.
- Someone is dealing with chronic insomnia. I’ve had terrible bouts of this myself. It is murder when you’ve been trying to fall asleep for four hours, and then juuuust after you’ve dozed off, DINGDINGDINGGGGG!—the pager goes off and you want to scream. Your head isn’t clear enough to be any good as a debugger, and it isn’t relaxed enough to get the rest you need. Fuck everything.
- Someone has extreme anxiety. I once had someone on my team who really wanted to be on call, really wanted to pull his weight, but he had so much anxiety about it that he couldn’t fall asleep when he was on call, like at all. It didn’t matter if the pager was peaceful all week long—the mere fact of it kept him wired and anxious all night long.
If someone willingly signed up to join the team and share the load of ownership, but then became temporarily or permanently incapable of participating in on call, you address this as a team by looking for other ways that person can pitch in and do their share of the less-than-glamorous work.
Maybe that means they are in charge of failed builds and keeping the CI/CD run time under 15 minutes, or maintaining the dev environment. It doesn’t matter what it is, as long as the entire team feels like it is a fair distribution of labor, and as long as there are enough people left in the rotation to make it reasonable (usually 5+).
Exceptions and edge cases are always negotiable. But then there is a second category of problems, like the pickle you are in now, that may look the same—someone doesn’t want to be on call—but are very different in type and solution.
These issues are deeper and trickier, because they are actually organizational problems disguised as on-call problems. For example,
- The bus factor. You’re responsible for a mission-critical component of the system, and have been since the start. You are the debugger of last resort; despite some attempts to bring new engineers up to speed, you still end up getting called in anytime something goes wrong with this component. It’s not entirely clear whether anyone else could fix it if you weren’t there or how long it would take them in your absence. Which means you feel emotionally tethered, shackled to this work. You take pride in your work, so you stick around and shoulder the load, even though it’s slowly sapping your motivation and energy and will to work.
- Individuals owning things. In a healthy engineering organization, there are no gaps in coverage. Every critical component is owned by a TEAM, not a person. People practice pairing and buddying up for code reviews for just this reason, to make sure other people know about the tricky bits, the twiddly bits, the history, how to debug, how to ameliorate. The more critical the component, the more urgent this coverage becomes.
- Working on the same thing for too long. In an organization of any size, you really want to encourage engineers to move around every 2-3 years, not sit on their area of expertise and stay there. This is better for the engineers: it staves off burnout and keeps high performers learning and motivated. It is also better for the codebase: fresh eyes bring waves of standardization and best practices, so it looks and feels like the rest of the code. And it’s better for the company, because it helps them retain their best engineers by keeping them interested and engaged, and it mitigates the bus factor risks.
- Insufficient coverage. If you’re in a rotation of 2 or 3 people, that’s not a real team—and it’s not a real rotation. The problem isn’t the pager responsibilities; the problem is the team structure within the organization.
All of these may look and feel like people not wanting to be on call. But if you look a little closer, they’re actually problems of resource allocation, risk mitigation, and other organizational dysfunctions.
I hear echoes of these dysfunctions in your story. You’ve been working on the same component for five years. You get looped in every time it breaks. That’s a long fucking time to care for one part of the codebase, no matter how critical. It’s too long. If a component has been owned by the same person for 5+ years, it’s going to look and feel different from the rest of the codebase in ways that make it harder for others to learn it and make changes.
If you have been owning it solo, you have a bus factor problem. If you technically share ownership with the help of a team, yet you personally end up getting pulled in every time the pager goes off, your manager needs to rotate you off of this component completely, immediately, and send you on a nice, long, three-week vacation on a tropical island with no cell service. (Or at least make you the escalation point and don’t let anyone escalate until they’ve been trying to fix it for half an hour.) They’ll figure it out. I promise.
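The "don't escalate for half an hour" rule is simple enough to write down as a check. This is only a sketch: `may_escalate` and `MIN_ATTEMPT` are hypothetical names, not the API of any paging tool, and the 30-minute threshold is just the half hour mentioned above.

```python
from datetime import datetime, timedelta

# Assumed policy: the on-call engineer works the problem for at least
# 30 minutes before pulling in the subject matter expert.
MIN_ATTEMPT = timedelta(minutes=30)


def may_escalate(started_at: datetime, now: datetime) -> bool:
    """Return True once the responder has been on the incident long enough."""
    return now - started_at >= MIN_ATTEMPT


# Hypothetical incident that started at 02:00.
started = datetime(2024, 3, 4, 2, 0)
print(may_escalate(started, datetime(2024, 3, 4, 2, 10)))  # False, keep digging
print(may_escalate(started, datetime(2024, 3, 4, 2, 30)))  # True, escalation allowed
```

Most paging tools let you configure an escalation delay directly, which is a better home for this rule than custom code; the point is that the delay is explicit and agreed on by the team.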
To sum up: No. It is not your job to be on call for this code.
You seem to understand this intuitively, but find it hard to articulate why. Try this:
“I shouldn’t be on call because I’ve been working on it for too long, because I’m a single point of failure, and because it needs to be owned by a proper team.”
It is not your job to shoulder the burden and suffer for the sake of harmony, or to go on and on doing things just because others don’t currently know how. It is your job to surface these problems to management and help them to do the right things.
Then go build something new after a good long vacation. You’ve earned it.
Have a question for Miss O11y? Send us an email!