The Future of Ops Careers

Have you seen Lambda: A Serverless Musical?

If not, you really have to. I love Hamilton, I love serverless, and I’m not trying to be a crank or a killjoy or police people’s language. BUT, unfortunately, the chorus chose to double down on one of the stupidest and most dangerous tendencies the serverless movement has had from day one: misunderstanding and trash-talking operations.

“I’m gonna reduce your… ops
I’m gonna reduce your… ops”

Well, I hate to tell you, but…

“No, I am not throwing away my… ops.
And you’re not throwing away my… ops.”

Or anyone else’s for that matter.

Even if you don’t run any servers or have any infrastructure of your own, you’ll still have to deal with operability and operations engineering problems. I hate to be the bearer of bad news (not really), but the role of operations isn’t going away. At best, the shifts that supposedly reduce your ops are simply delegating the operability of your stack to someone who does it better. The reality for most teams is that operations engineering is more necessary than ever.

Beyond Hamilton clap backs, that distinction matters because it has real career ramifications for engineers who, like me, are operationally minded. Where are Ops careers heading?

Where Does Ops Fit, Anyway?

In some corners of engineering, “ops” is straight up used as a synonym for toil and manual labor. There is no good ops, only dead ops. The existence of ops is a technical failure: a blemish to be automated away, eradicated by adding more and more code. Code defeats toil. Dev makes ops obsolete. #NoOps!

If this is such an inexorable march towards utopia, maybe someone can explain to me why the shops that flirt the hardest with #NoOps have been, without exception, such humanitarian disasters?

Or, I’ll start. Operations is ridiculously important. When you denigrate it and diminish it, that’s the first sign that you aren’t doing it well. The way to do something well generally starts with adding focus and rigor, not writing it off.

Consider Business Development and Operations. Business is the why, development is the what, operations is the how. Operations is the constellation of your organizational memory: patterns, practices, habits, defaults, aspirations, expertise, tools, and everything else used to deliver business value to users.

The value of serverless isn’t found in “less ops.” Less ops doesn’t yield better systems than more ops, any more than fewer lines of code means better software. The value of serverless is unlocked by clear and powerful abstractions that let you delegate running large portions of your infrastructure to other people who can do it better than you — yes, because of economies of scale, but more so because that’s their core business model. YOUR core business model probably has nothing to do with infrastructure.

Because of that, a great sort is now happening between software engineering, infrastructure operations, and core business value.

What Is Infrastructure?

Infrastructure is software support. It’s the prerequisite thing you have to do, in order to get to the stuff you want to do. It’s not what you want to be doing, yet your business goals presume its existence.

An important quality of infrastructure is that it typically changes less often and is more stable than the software that constitutes your core business value. The features you ship to customers are typically under constant or frequent development, and they change at the rate of pull requests and commits (in fact, the velocity of these changes can be a critical competitive advantage). Infrastructure, on the other hand, changes at a more glacial pace — at the rate of package managers, OS updates, and new machine images. It’s seconds-to-minutes versus hours-to-days.

This dividing line between infrastructure and core business value even holds true for companies whose business model is building infrastructure for other companies. For example, a company providing email focuses on products that consist of email workflow features that are constantly being developed and shipped to users. There isn’t much new business value to be wrung out of modifying commodity SMTP transport layers or optimizing IMAP servers.

To its credit, serverless is perhaps the first trend to have really understood and powerfully leveraged that dividing line. IaaS, PaaS, and full-service suites like GitLab were all germinal forms of this shift. “Cloud native” was also, arguably, another lurch in that direction. But where has that taken our industry?

*-As-a-Service Is Really Just Code for “Outsourcing”

IaaS, PaaS, and even FaaS/serverless are really all just types of outsourcing. So why don’t we call it “outsourcing” when we rely on companies like AWS to run our datacenters and provide compute or storage, or when we use Google apps for our email, documents, and spreadsheets?

Historically, “outsourcing” is what we call shifting work off-premises when we aren’t yet comfortable with the arrangement, whether because the fit is awkward, the support is incomplete, or the service isn’t on par with what we could do ourselves. With infrastructure outsourcing, service quality is now creeping up the stack. More and more complex subsystems are becoming commodity components, and other companies use them to build their own businesses (or other infrastructure!) on top.

When I started my career, I was a jack-of-all-trades systems person. I ran mail, web, db, DNS, cache, deploys, and CI/CD; I patched operating systems, built debs and rpms, and so on. Most engineers don’t do those things now, and neither do I. Why would I, when I can pay someone else to abstract those details away, so that I can spend my time focusing on delivering customer value?

Increasingly, as an industry, we are outsourcing any bits that we can.

As a more personal example, why would you want to run your own observability team or build your own in-house monitoring software, if that’s not your core business? Why split your focus to building a bespoke and unsustainable version of a thing when you can readily buy a world-class version? If my company has had ten or twenty full-time engineers working on that solution, how long will it be until your team of three or five can catch up?

In a post-cloud world, we’ve learned that it’s usually much better and far easier to buy than it is to build those things that don’t add business value.

How to Outsource Things Well

In my personal example, buying doesn’t mean that you shouldn’t have an observability team. It means that the observability team should turn their gaze inward. That team should take a page out of the SRE or test community’s books and focus on providing value for your org’s developers whenever they interact with this outsourced solution.

That team should write libraries, generate examples, and drive standardization, ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

We already know from industry research that the key to success when outsourcing is to embed those off-prem contributions within cross-functional teams, which manage integrating that work back into the broader organization.

Monstrous amounts of engineering work create the stack that ships value to your customers. Trying to save work, some teams build complicated Rube Goldberg machines that are brutal to run, change, and debug. It’s much harder to build simple platforms with operable, intelligible components that provide a humane user experience. Bridging that gap requires quality operations engineering to streamline that outsourcing for successful user adoption.

That’s why even if you run no servers and have no infrastructure of your own, you still have operability and operations problems to contend with. Getting to the point where your org successfully has no infrastructure of its own takes a lot of world-class operations expertise. Staying there is even harder. Any jerk with a credit card can just go spin up a server you’re now responsible for. Try being any sort of roadblock and see how quickly that happens.

What This Means For Operationally Minded Engineers

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing. The world doesn’t need thousands of people who can expertly tune Postfix, SpamAssassin, and ClamAV; the world has Gmail. You might find your next job by following the trail of technologies you know, like getting hired as a MySQL expert. But technologies come and go, so you should think carefully before hitching your wagon to any particular piece of software. What will this mean for your career?

The industry is bifurcating along an infrastructure fault line, and the long-held indistinguishability between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. These are becoming two different roles and career paths at two different kinds of companies: infrastructure providers, and everyone else. Those of us who love optimizing, debugging, maintaining, and tackling weird systems problems far more than writing new greenfield code now have a choice to make: go deep and specialize in infrastructure, or go broad on operability.

If the mission of your company is to solve a category problem by providing infrastructure to the world, then operations will always be a core part of that mission: your company thrives by solving that particular operability problem better than anyone. So you are justified in going deep and specializing in it, and figuring out how to do it better and more efficiently than anyone else in the world — so that other people don’t have to. But know that even this infrastructure-heavy backend work also needs design, product management, and software engineering work — just like those non-infrastructure focused companies!

If your chosen company isn’t solving an infrastructure problem for the world, there are still loads of opportunities for ops generalists here too. But know that a core part of your job is critically examining the cycles your company devotes to infrastructure operations and finding effective ways to outsource that work or minimize the in-house developer cycles it consumes. Your job is not to go deep if there is any alternative.

I see operationally-minded engineers working cross-functionally with software development teams to help them grow in a few key areas: making outsourcing successful, speeding up time to value, and up-leveling their production chops.

They’re evolving very crude “build vs. buy” industry arguments (often based on little more than whimsical notions) into sophisticated understandings of how and when to leverage abstractions that radically accelerate development. They build and maintain the bridges that make outsourcing successful.

They’re evolving release engineering to fulfill the delivery part of CI/CD. Far too many teams are perfectly competent at writing software, yet perfectly remedial when it comes to shipping that software swiftly and safely.

They’re also up-leveling the production operational skills of software engineers by crafting on-call rotations, counseling teams on instrumentation, and teaching observability. As teams leave behind dated metrics and logs, they start using observability to dig themselves out of the ever-increasing massive hole where everyone constantly ships software they don’t understand to a production system they’ve never understood.

Everyone needs operational skills, even teams who don’t run any of their own infrastructure. Ops is the constellation of skills necessary for shipping software; it’s not optional. If you ship software, you have operations work that needs to be done. That work isn’t going away. It’s just moving up the stack and becoming more sophisticated, and you might not recognize it.

I look forward to the improved chorus of Lambda: A Serverless Musical:

I’m going to improve your… ops.
Yes, I’m going to improve your… ops!

Charity Majors

CTO

Charity Majors is the co-founder and CTO of honeycomb.io. She pioneered the concept of modern Observability, drawing on her years of experience building and managing massive distributed systems at Parse (acquired by Facebook), Facebook, and Linden Lab building Second Life. She is the co-author of Observability Engineering and Database Reliability Engineering (O’Reilly). She loves free speech, free software and single malt scotch.