Rethinking "Root Cause"


10/15/2020

by Timothy Carioscio

While "root cause analysis" has been the standard approach following incidents, it may be time to change our approach

A few weeks ago, one of our customers experienced an incident causing slowness in their quality assurance S/4HANA instance. Once the instance was recovered, Aaron, one of our Senior SAP on AWS experts, set about performing a root cause analysis. He quickly found that things were not so simple. While the trigger of the incident turned out to be network slowness combined with a number of other peculiar circumstances, he didn't feel comfortable calling any one of those circumstances the *cause*. Steps were taken to ensure that the issue would not recur, but it didn't fit neatly into a single root cause.

Performing root cause analysis is the industry standard, and has been for as long as I've been using computers. I come from a family of tech professionals, and I was taught at an early age to find *the root cause*. I have memories of sifting through lines of BASIC as a child, trying to figure out why my computer wasn't behaving the way I expected it to. In the years since, I have worked with teams across industries and platforms, and root-cause thinking has been a near constant. We're not still writing programs in BASIC, so why are we still approaching issue remediation the same way we have for the past three-plus decades?

In the past five to ten years, CONTAX has seen fewer and fewer of our SAP clients hosting their own servers and maintaining their own infrastructure. It has become standard practice to hand off the hosting and maintenance of critical infrastructure to hosting providers and, increasingly, to the cloud. When I was writing BASIC and looking for root causes of failure, I controlled the whole technology stack on my personal computer. If a cable was loose, it was within my power to reattach it. In today's cloud-based paradigm, that is no longer possible. Companies have no access to the physical hardware, and gladly offload its upkeep to their provider. As technology stacks become more complex and abstracted, it is no longer possible to control for every root cause, and root cause analysis is less meaningful as a result. A catastrophic hardware failure or a severe weather event may have been the *cause* of your issue, but there is little that can be done to remediate those causes. More often, as was the case with Aaron's incident a few weeks back, many little causes roll together to create a problem.

When AWS published its latest update to the Well-Architected Framework on July 10, 2020, I was delighted to see that the Operational Excellence pillar now prefers the terminology "post-incident analysis" over "root cause analysis". The new phrase goes a long way toward curing what had bothered Aaron: the stale "root cause" framing is insufficient to describe the nuances of modern cloud architecture.

Failures will occur: drives will corrupt, the power will go out, the internet will go down. We understand that these things will happen, and yet we still act as if there is one thing we could have caught, and one fix that needs to be made. That may have been true in the past, but it is not today. Instead of trying to prevent these failures, the best companies plan for them. They view incidents as opportunities to learn. They make incremental changes to make their architectures more resilient and fault-tolerant. They don't get caught up searching for a single root cause.

Root cause analysis is dead. Long live post-incident analysis!




About the Author: Timothy Carioscio

Tim is an AWS evangelist. Rather than having his head in the clouds, he lives with the Cloud in his head.