(Field notes from O’Reilly’s Velocity 2019 Show, San Jose.)
It was steaming hot in San Jose during O’Reilly’s Velocity show, and attendees escaping the 104-degree heat welcomed the normally frigid AC in the expo hall. It got so bad that Charity Majors labeled the city Satan Jose, and the nearby Marriott hotel lost power for almost two full days, leaving guests hot under more than just their collars.
Outages are something software engineering teams deal with fairly regularly, and for many, relief can come in the form of a blameless culture, but the work is high pressure and high stakes nonetheless. Developers spend their time building a shared understanding of potential system vulnerabilities, then working out the best way to fix and resolve them so that systems scale and users stay happy and engaged. It sounds pretty straightforward, but in reality it is a highly complex challenge.
This year’s Velocity show attracted over 2,000 attendees as O’Reilly merged their Solution Architect event with Velocity’s engineering, operations and SRE attendees. Traffic in the expo hall was consistent, and what was particularly interesting was how the conversations varied with each attendee’s functional ownership and area of responsibility. The role and responsibilities of DevOps teams have morphed considerably over the last decade, and the newer role of the SRE has caused further confusion about who does what and who is responsible for which pieces of the software engineering cycle. Some of the keynotes covered this topic, and if you didn’t catch them live, you can watch them on demand. Thanks to O’Reilly for sharing.
What it means to be an SRE
In his keynote, The SRE I aspire to be, Google’s Yaniv Aknin covers what this function means and outlines how it differs from the role of someone in DevOps. An SRE team is responsible for site design and applies a scientific approach to reliability, with the goal being to preserve what you have today, contrasted with what you had yesterday. 100% reliability is usually not the right goal: if you are hitting it, chances are you’re not innovating fast enough and are instead focused on optimizing for uptime. In fact, users probably don’t need 100% uptime, and may not even notice the difference. Tracking the cost of that uptime is what an SRE should care about. It’s the balancing that’s difficult, and it doesn’t really matter whether the team is small, medium or large, because at some point you must figure it out for the business. Aknin goes on to share the details of what an error budget is and how to understand what users and the business can tolerate, classifying it as essentially the measure or constraint that can be traded off relative to other constraints.
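To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming an availability SLO measured over a 30-day window. The function names, the 99.9% target and the downtime figure are hypothetical and are not taken from Aknin’s talk.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO allows within the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Error budget left after the downtime already spent (negative means overspent)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO with 10 minutes of downtime so far.
    print(f"budget:    {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, 10.0):.1f} min")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of budget per 30 days, while a 100% target leaves none, which is why a team chasing 100% has no room to ship risky changes.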
Coming up with those measures and getting agreement across the entire organization is critical. Finding shared language and translating those metrics isn’t as easy as you might think. Mean Time to Recover (MTTR) and Mean Time to Failure (MTTF) are measures a developer can understand, but they need to be translated for product management and business stakeholders so they don’t freak out when there’s an outage, or when latency is experienced by, say, a particular customer.
Knowing the defined Service Level Objectives (SLOs), sharing them with relevant team members, and then having the ability to adjust thresholds over time can be pretty powerful. Optimize for less ‘ops toil’: as long as you’re not hurting customers and the business, you can spend an error budget and maintain a happy outcome for both devs and customers. We can all understand that unwanted alerting burdens on-call teams, causing burnout that has a detrimental effect on the business overall. Aknin closes with: ‘Align what users care about’ and ask yourself, ‘Is it all about 9’s?’
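One way to translate these measures for non-engineering stakeholders is to turn them into expected downtime. The sketch below uses the standard steady-state availability formula, MTTF / (MTTF + MTTR); the function names and the example figures are hypothetical and only illustrate the conversion.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_minutes(mttf_hours: float, mttr_hours: float,
                              window_days: int = 30) -> float:
    """Expected minutes of downtime in the window, given MTTF and MTTR."""
    total_minutes = window_days * 24 * 60
    return (1.0 - availability(mttf_hours, mttr_hours)) * total_minutes


if __name__ == "__main__":
    # Hypothetical service: fails about once every 719 hours, takes 1 hour to recover.
    a = availability(719.0, 1.0)
    d = expected_downtime_minutes(719.0, 1.0)
    print(f"availability: {a:.4%}, expected downtime: {d:.0f} min per 30 days")
```

A statement like "about an hour of downtime a month" tends to land better with product and business stakeholders than the raw MTTF and MTTR figures.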
It’s about more than just technology
Truss CEO Everett Harper’s keynote talked about how teams can avoid missing important details when building complex, innovative systems, and encouraged attendees to think outside the box and include diversity when hiring. Team members can often feel vulnerable when raising new ideas and doing something different. Fear of judgement can stifle people from coming forward with something new, which requires both bravery and vulnerability. In Harper’s words,
There’s no act of courage without vulnerability.
Doing something different requires not only taking action but also mobilizing others to follow, which comes with risk and reward. At Truss they decided to be completely transparent about team compensation, to the point where salaries are actually printed on employees’ business cards. It was quite extraordinary and risky, and it worked. With over 70 employees, Harper shared how Truss went about normalizing such a thing inside the organization. Conducting pre-mortems and post-mortems and making sure it could scale safely was critical to the roll-out. With this new level of transparency, they had to fix pay bands, and while the bold move exposed some weaknesses in systems and processes, they were duly fixed and the team now spends less time stressing and politicking over the topic.
As the industry matures…
During the show, Honeycomb hosted a MeetUp where we shared our new Observability Maturity Model Framework. We divided the room into five tables, one for each of the most critical software engineering process areas where observability has an impact. Using a ‘reverse panel’ format, attendees openly shared how they approach observability in their orgs and its impact on their team’s quality of life, as well as on their customers and the business overall.
The most popular table topics were technical debt and operational resilience (aka incident response), which is perhaps indicative of where pain is most felt as teams feel under pressure to innovate faster while maintaining reliability.
Velocity comes with a price and striking the balance is what everyone strives for. If you’re just getting started on your observability journey, or you want to learn more about the Honeycomb maturity model, join the webcast with Charity and Liz on July 10th at 10am PDT.