I’ve read a steadily increasing stream of articles about using AI in SRE, and I have yet to find one that inspires my trust. These articles make impressive claims about AI’s capabilities and the ways it can be applied to SRE tasks, but the vast majority are light on details.
AI tools, and especially LLMs, are advancing incredibly quickly, and I think they have a ton of potential. Still, in site reliability engineering we take on significant risk when adopting these tools, so I need to be confident that the benefits are likely to outweigh the risks.
This article is not going to be a curmudgeonly takedown of AI from an old-guard SRE afraid of change. I’m going to lay out a few straightforward guidelines for AIOps tool vendors, and if you follow them, you’re going to catch my eye. I’m drawing a lot of inspiration from John Allspaw’s excellent article, An Open Letter To Monitoring/Metrics/Alerting Companies.
Will you make my life easier?
What kinds of tools am I talking about? Anomaly detection, machine learning, log analysis, automated root cause analysis, automated incident response, and the like. When these techniques, and LLMs in particular, are applied to making SREs’ lives easier, I pay close attention. This field has so much potential, and if you can build a tool that really delivers, I want it.
What if it doesn’t deliver?
What if a tool’s root-cause suggestion or recommended action during an incident points me in the wrong direction? I could waste critical time chasing a dead end, or the tool could narrow my focus, biasing me away from more useful lines of investigation.
What if it delivers too much?
Anomaly detection tools can provide significant value by surfacing unexpected problems early on. However, they can also alert on conditions that are benign, or even expected. If this happens too often (and at 3 a.m.), SREs will lose trust in the tool and be less likely to pay attention to the truly useful alerts it produces.
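To make this failure mode concrete, below is a minimal sketch of a naive rolling z-score detector. It’s a toy of my own, not any vendor’s method, but it shows how a benign, scheduled spike trips an alert exactly as a real incident would:

```python
import statistics

def zscore_alerts(values, window=30, threshold=3.0):
    """Flag points more than `threshold` standard deviations
    from the mean of the trailing window."""
    alerts = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent) or 1.0  # flat window -> stdev of 0
        if abs(values[i] - mean) / stdev > threshold:
            alerts.append(i)
    return alerts

# Steady traffic: no alerts.
steady = [100] * 70
print(zscore_alerts(steady))   # []

# A benign, scheduled batch job doubles request volume for a few
# minutes; the detector can't tell it apart from a real incident.
spiky = [100] * 60 + [205, 210, 208] + [100] * 10
print(zscore_alerts(spiky))    # [60, 61, 62] -- pages someone at 3 a.m.
```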
What if it takes an incorrect action?
Some agent-based AIOps tools can act on a system directly, for example by responding to incidents automatically. What if such a tool takes an action that damages the system? And if it incorrectly determines that its action resolved the problem, will it fail to alert a human?
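For what it’s worth, the kind of answer that would reassure me here is a verify-then-escalate guardrail around every automated action. The sketch below is hypothetical (the hook names are my invention, not any tool’s API); the point is that anything short of a confidently verified fix pages a human instead of quietly closing the incident.

```python
from enum import Enum

class Verdict(Enum):
    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"
    UNCERTAIN = "uncertain"

def remediate(take_action, verify, page_human):
    """Run an automated remediation, then independently verify it.
    Any verdict short of RESOLVED escalates to a person rather than
    silently closing the incident."""
    take_action()
    verdict = verify()
    if verdict is not Verdict.RESOLVED:
        page_human(f"Automated action ran, but verification says: {verdict.value}")
    return verdict

# Example wiring: restart a service, check the error rate, page on-call if unsure.
remediate(
    take_action=lambda: print("restarting checkout service"),
    verify=lambda: Verdict.UNCERTAIN,  # error rate is still noisy
    page_human=lambda msg: print("PAGE:", msg),
)
```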
It’s unreasonable to expect perfection, and I don’t need a peer-reviewed scientific study of an AI SRE tool. I do need to see that a vendor understands the risks I take on by adopting their tool. No tool, action, system, or even hiring decision comes without risk. Responsible engineers evaluate those risks and decide whether the potential benefits tip the balance toward adopting a tool, or at least giving it a serious evaluation.
How to gain an SRE’s confidence
What would inspire confidence in an AIOps tool?
- specific examples of what the tool has done in the real world (the more detail, the better!)
- a discussion of the ways the tool can fail and what you’re doing about it
- an understanding of the risks I could face in implementing the tool
- data on how often your system produces useful, actionable results (a rough sketch of the arithmetic I mean follows this list)
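On the last point, the evidence doesn’t need to be elaborate. Here’s a minimal sketch of the arithmetic I have in mind, run over a made-up alert log (the field names are my own, purely for illustration):

```python
# Hypothetical alert log: each entry records whether the on-call
# engineer judged the alert actionable.
alerts = [
    {"id": "a-101", "actionable": True},
    {"id": "a-102", "actionable": False},  # benign nightly batch job
    {"id": "a-103", "actionable": False},  # flapping check, self-resolved
    {"id": "a-104", "actionable": True},
]

actionable = sum(1 for a in alerts if a["actionable"])
precision = actionable / len(alerts)
print(f"actionable alerts: {actionable}/{len(alerts)} ({precision:.0%})")
# -> actionable alerts: 2/4 (50%)
```

Even a rough actionable-alert rate like this, measured across real deployments, tells me far more than a capability claim.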
AIOps tool vendors have done a great job of helping SREs understand how we could benefit from their tools. Now it’s time to lean into the other side: acknowledging the risks we face and helping us evaluate those tools critically.
Very few articles about AIOps tools touch on the points above, and that’s a shame, because there’s a real opportunity here: meet us at our level, show that you understand the risks of implementing your solution, and you’ll stand out from the crowd. Give us the information we need to critically evaluate your tool, and we’ll be first in line to try it.