
Beyond Root Cause Analysis in Complex Systems

2021/4/27

Code[ish]

People
Marcus Blankenship
Robert Blumen
Topics
Robert Blumen: I think traditional root cause analysis falls short in the face of complex systems. In IT, medicine, aviation, and other fields, we often encounter complex system failures, and these failures are usually not caused by a single cause but are the result of multiple factors acting together. Simply tracing back through the "five whys" does not really reveal the essence of the problem. I believe we should adopt a more holistic perspective, viewing the system as a network of a large number of heterogeneous parts and paying attention to the complex interactions between those parts. The Tenerife air disaster is a classic example: weather, airport conditions, human error, technical shortcomings, and other factors intertwined and ultimately led to the tragedy. We should therefore give up the obsession with finding a single "root cause," focus instead on the multiple jointly contributing factors, and take the corresponding improvement measures.

Marcus Blankenship: I agree with Robert. In complex systems, simply asking "why" does not effectively help us understand the problem. The traditional view tends to assign blame and attribute the problem to individuals, but this often ignores systemic factors. I think we should pay more attention to the environment the operator is in: did they have enough information to make the right decision? Did we put them in a situation without the right information? Improving the human environment so that operators can make better decisions is the key to solving the problem. In the Tenerife disaster, for example, a pilot mishearing an instruction was an important factor, but we cannot simply blame the pilot; we should instead ask whether the air traffic control system had shortcomings and whether measures could be taken to prevent similar errors from happening again.


Chapters
Root cause analysis attempts to find the single cause of a failure using methods like the '5 Whys'. However, this approach is inadequate for complex systems, where failures arise from multiple interacting factors rather than a single root cause. The arbitrary nature of the '5 Whys' method highlights its limitations in understanding complex system failures.
  • Root cause analysis is used across various domains (IT, medicine, industrial accidents, etc.)
  • The '5 Whys' method is popular but arbitrary, often failing to uncover the true cause in complex systems
  • Complex systems exhibit emergent properties where the whole is greater than the sum of its parts

Transcript


Hello and welcome to Code[ish], an exploration of the lives of modern developers. Join us as we dive into topics like languages and frameworks, data and event-driven architectures, and individual and team productivity, all tailored to developers and engineering leaders. This episode is part of our Tools and Tips series.

Welcome to this episode of Code[ish]. I'm Marcus Blankenship, a senior engineering manager at Salesforce. And today my guest is Robert Blumen, a lead DevOps engineer at Salesforce. Welcome, Robert. Thanks, Marcus.

I'm really excited about this episode because it's near and dear to my heart. We are going to be talking about alternatives to root cause analysis, especially when problems happen and things go wrong. We're going to discuss common root cause analysis formats and why they aren't the best way to go about thinking about complex system failures. And we're going to end with some thoughts about better ways to think about how to improve complex systems.

So Robert, what is a root cause analysis? Okay, so root cause analysis refers to the different methods people have of analyzing a failure after the fact to identify the cause. This is not only something we face in IT. As I looked into the literature about this, there are people in many different fields like medicine, industrial accidents, shipping,

aeronautics, where you have, let's call it, an incident or a failure: something bad happened, something you didn't want. In the case of IT, it means people can't check their email or they can't obtain services from a business. In other fields, maybe the patient dies, a ship capsizes, a plane crashes. You're talking about very serious outages or failures that can even result in loss of life.

The assumption here is that the world is governed by laws of cause and effect. So if you understand the cause and effect that led up to this failure, and if there is such a thing as the root cause, then you'd have an idea of how we would prevent this thing from happening again and what changes to make. You need to go through that analysis process and find out what cause or causes led to that accident.

Well, that sounds pretty reasonable. I mean, there's a lot of things in my life that are cause and effect. So what kind of steps might somebody today take when they're doing a root cause analysis? Are there some forms that are popular? It's very popular in IT, something called the five whys.

And the idea is that if you ask why this incident happened, and you ask five times, then the fifth answer is the root cause. This does make a certain amount of sense, because let's say we had an outage

in our system, so why did this happen? The first thing we find is that one of the servers went down. And clearly that is relevant, but then you might say, well, aren't we supposed to have multiple servers so we can handle the load if one of them goes down, so it would shift the load? So clearly the first answer to why did this happen is probably not a full understanding of why your outage happened.

The idea of the five whys is that if you go back five levels deep, you will find something called the root cause. Is there anything special about the number five? No, it's completely arbitrary. And that's one of the problems with this method. So I can see that that would lead us to a deeper understanding. Why is it not a helpful way to reason about complex systems?

One of the researchers in this field, Dr. Erik Hollnagel, has a great slide in one of his slide decks on the pervasiveness of the row-of-dominoes metaphor in

media coverage and among people writing about accidents. The idea is that a system is like five dominoes, and the way failures occur is that a domino falls over and knocks over the subsequent four dominoes, the fifth domino being the point where the failure becomes visible to the user or the customer. And so if you walk five dominoes back, you find the first one and you're done.

All of the industries and the domains where people are concerned about accidents are what are called complex systems. As I delved into the literature around this, which is broadly known as the new view of human error,

I found this distinction between simple and complex systems. Simple systems are like five dominoes, one, two, three, four, five, they fall over. Complex systems, you have a large number of heterogeneous pieces and the interaction between

the pieces is also quite complex. If you have n pieces, you could have n squared connections between them. In an IT system, you could have n squared connections, but across each connection, you could have many different protocols.
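To put a rough number on that n-squared growth, here is a quick illustrative calculation (the figures are my own, not from the episode):

```python
# With n heterogeneous parts, the number of possible pairwise connections is
# n * (n - 1) / 2, which grows on the order of n squared.
for n in (5, 50, 500):
    print(f"{n} parts -> {n * (n - 1) // 2} possible connections")
# 5 parts -> 10, 50 parts -> 1225, 500 parts -> 124750
```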

A lot of the behaviors that we're interested in are really emergent properties of the system. You can lose a server, but if you're properly configured to have retries and round robin, then your next level upstream should be able to find a different server. That's a pretty complex interaction that you've set up to avoid an outage. Now, on the difference between simple and complex systems, I'm going to quote one of the researchers in the field, Kevin Heslin. He said,

Simple systems fail in simple ways; complex systems fail in complex ways. In the case of a complex system, generally there is not one thing that was the root cause. For a complex system to fail, it means all of the defenses and retries and redundancy you built in, for some reason, did not work. In order to understand what went wrong, generally you'll surface that there were multiple things,

and all of those things had to happen all at once to have a failure. The idea is that multiple jointly contributing causes are the explanation of the failure, not one single root cause. If you focus on the one root cause, you miss a lot of these other jointly contributing causes and don't get a realistic understanding of why the failure occurred.
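As a rough sketch of the retry-and-round-robin defense Robert described a moment ago (the server pool, URLs, and function name here are hypothetical, not from the episode):

```python
import itertools
import urllib.request

# Hypothetical pool of identical app servers; losing any one of them should
# not take the caller down, because the client rotates to the next server.
SERVERS = ["http://app-1.internal", "http://app-2.internal", "http://app-3.internal"]

def fetch_with_retries(path: str, attempts: int = 3) -> bytes:
    rotation = itertools.cycle(SERVERS)
    last_error = None
    for _ in range(attempts):
        server = next(rotation)
        try:
            with urllib.request.urlopen(server + path, timeout=2) as response:
                return response.read()
        except OSError as error:        # connection refused, timeout, DNS failure
            last_error = error          # move on to the next server in the pool
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```

The point of the sketch is that surviving the loss of one server is not a property of any single server; it is a behavior of the client, the pool, and the retry policy acting together.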

So if we go back to simple systems for a moment, I'm imagining the line of dominoes. But any single line of dominoes, whether 5 or 50 or 500, is linear. So therefore, it is a simple system. So we don't have to necessarily think that the system is small to be simple. It can be big, but it needs to exist in a certain configuration, dominoes where one thing leads to another.

But complex systems sound fundamentally different in this way. We've got so many different variables that just asking why five times isn't going to contribute to our understanding in a meaningful way. I'd love to hear an example of some failure of a complex system. Do you have any? There are a bunch, and some of the more interesting ones are not

in the information technology world. One of the more well-known examples that's studied is the deadliest air traffic accident in history. It occurred on an island called Tenerife, in the Canary Islands.

It was just some incredibly bad luck of a whole number of things, which all happened at once. It started with there were two 747s that were not supposed to land there, but due to some kind of weather conditions, they both got rerouted to this same airport.

The airport wasn't used to handling 747s; there was no regular 747 traffic there. So the air traffic controllers didn't have a great idea of how to guide those planes. And there were bad weather conditions.

But it gets way worse from there. The airport did not have the proper kind of radar used to guide these more modern planes. These two 747s were on the tarmac at once. There were some misunderstood commands between air traffic control and the cockpits of the two planes.

Some kind of failsafe failed, where the pilots missed cutoffs in their route. The end of the story is that one plane began its takeoff while the other was still on the runway, and the two collided. I hope that explanation gives you an idea of how many different things have to go wrong

all at once when you have a problem domain like air traffic control where there are so many built-in fail-safes. I'm not an expert in this, but I understand there are a lot of protocols between the cockpit and the controller to ensure that instructions are properly understood. So that had to fail

for this accident to happen. So there were many contributing factors. You listed a whole bunch. The weather, the fact the airport wasn't meant to handle it, human error, insufficient radar technology. So there were all these factors, and had any one of them not been there, the outcome might have been different. That's absolutely right, Marcus.

I'm curious, you used the term emergent properties. Complex systems have emergent properties. Could you tell us what that means? It's a property of the system as a whole that is not a property of any one particular part of the system. One great example of that comes from economics, where the market price

depends on the marginal buyer and the marginal seller, and then the super marginal and sub marginal buyers and sellers. In order to identify who those are, you need to look at the entire market and identify the bid and ask of every participant in order to identify which ones are marginal. Another example would be something we're concerned about in IT, which is the availability of a system

We are used to now building systems out of unreliable components. We know that servers can go down, and yet the availability of the entire system can still be much greater than the reliability of any single component. So it's the interaction of components and those properties versus just how one component acts.
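As a back-of-the-envelope sketch of that emergent availability, assuming replicas fail independently (an illustration, not a formula quoted in the episode):

```python
# If replicas fail independently, the system is down only when every replica
# is down at the same time, so availability compounds quickly.
def system_availability(replica_availability: float, replicas: int) -> float:
    return 1 - (1 - replica_availability) ** replicas

# Three replicas that are each only 99% available give roughly 99.9999% overall.
print(system_availability(0.99, 3))   # 0.999999
```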

If we go back to that airplane accident example, I feel like a traditional view of a problem like this or a situation like this would be to start holding people accountable, to blame the pilot. I mean, they're the ones who were in control. Is that still a useful way to think about these kinds of problems? Generally not. That particular point you're bringing up, it is very much emphasized in this literature. As I mentioned, it's called The New View of Human Error.

The analysis of what went wrong involves many components, and some of those components are decisions or actions made by a human. Humans do make errors, but as we've been discussing, the human role in the outcome is one of multiple contributing factors which led to

the final result. It's not the sole factor. Some of the researchers in this field have suggested that there is a cognitive bias people have here. If I presented to you three or four different things that happened:

some hardware failed, there was bad weather, a plane landed at the wrong airport, and the air traffic controller screwed up, and you were asked which is the cause, people are more inclined to focus on the human error, even though it may have been equally important as the other things, not any more or less so.

But there is a little bit of a deeper answer to that. One of the landmark papers in this area is by Dr. Richard Cook, who is an MD. It's a fantastic paper about complex systems. He talks about something he calls the dual role of operators. Now, operators is a general term for the people in charge of the system,

the people who try to cover for the fact that something went wrong. In the end, systems depend on people. We may set up rules, like we have a cluster manager: if it sees a server that's unhealthy, it will pull it out and put a new one in. And to a certain extent, we do trust rules. But

to really keep a system up, you need the human operator who can look at something and say, I don't think the rules we set up are working the way they're supposed to. I'm going off on a little bit of a tangent here, but there is a whole literature around the avoidance of nuclear war, which has come down to this a number of times: something on a screen showed that a nuclear attack was coming in, and somebody whose job was to launch the counterstrike said, I don't think that's a nuclear attack, and it turned out to be a flock of birds.

It's important that we have creative, smart people who are problem solvers to have a check on automation and the rules that we've built into systems. Now, what Dr. Cook talks about is this concept called the dual role of operators.

The dual role is that they need to preserve the operation of the system. We're a business; we need to keep the business up and running so that customers can obtain the valuable services they've paid for,

and we need to avoid errors. Everything an operator does is with those two objectives in mind. And everything an operator does is a calculated risk, because any change you make might succeed, meaning preserving outputs, or might fail, which could mean making the system worse or causing an outage.

This is another one of the cognitive biases that researchers have identified: at the time, when operators are making a decision and taking a calculated risk to preserve outputs, they don't tend to get credit for that. But in those cases where they take a risk and fail, we're very quick to say, "Hey, Marcus, what were you thinking? Did you not know you were going to crash the entire system if you changed that configuration?"

But nine out of ten times, you made a bunch of really great decisions and kept the system up, and we don't have a postmortem to say, let's look at how great a job Marcus did the last nine times he was on call at keeping the system up. There's a little bit of

unfairness, after the fact when we have more information, in pointing at the person and saying you made an error. And maybe you did, but that's not really taking into account the full complexity of the situation, which is that there was a lot going on, you were doing the best you could, and it is your job to take calculated risks.

And the reason that you were in that position where you had to take a calculated risk is that other things were going wrong and you were trying to stop them. I feel like if this were a basketball metaphor, we would be criticizing the player who missed 1% of the shots rather than celebrating the fact that they made 99% of the shots and scored the points there. Sure. Well, if you're in a game that is lost by one point,

There were 100 other plays where somebody either made a basket or did not make a basket.

to get to that point where the score was tied. So you can't just blame the one guy who missed the one shot at the last second of the game. That's a great point. Something else you said earlier that I hear a lot, and I think it's a place where the question really matters, is the hypothetical question, well, what was the cause of that problem? Just that phrase, the cause,

is very singular and focused; it implies there must be one cause, and one cause only, that we have to go identify. We can understand simple systems, the five dominoes. One of the properties of complex systems is that no one can fully understand them.

The failures tend to occur because of a cascading series of failures that no one had thought of. If you had thought of, well, A, B, and C could all go wrong at once, and that would be a failure, you might have put in some mitigation so if those three things happened, it would not fail. Other times,

You would say the chance of all those things happening all at once is so remote that it's not cost effective to mitigate it; we'll live with that. That would be a business decision. Every system has an SLA, and with the SLAs in our industry, nobody strives for 100 percent. It's not achievable in our business.

Perhaps in some of these other fields where human life is at stake, they may be striving for 100%. In our business, it would not be cost effective and maybe not even possible. To go back to the airline accident analogy, you mentioned that one of the errors was that a pilot misheard instructions.

And I have to be honest, I'm thinking about all the times every day, probably even as we speak or as people are listening, that some pilot somewhere is misinterpreting or missing some instructions from the tower. And yet, I'm going to guess that doesn't result in a crash. I think I've heard that called something like a latent failure. This is another term from the field of complex systems.

A moment ago, I was telling you how no one fully understands complex systems. One of the consequences of that is that they are always manifesting a subset of partial failures at any time. Let's consider some kind of outage where five things would have to go wrong all at once for you to have an outage. Now, maybe three of those things have

gone wrong, but nobody notices, because it's not showing up: you don't monitor those things, or they haven't produced an impact on something you do monitor, or you just changed something and it broke something, but you haven't noticed it yet.

These complex systems are always in a state of being partially broken. You don't necessarily discover that until there's an outage, and then you go back through the postmortem and realize you had a failure: there were five things that went wrong, and three of them had been broken for weeks, and no one noticed. It's fairly common in IT. You hear about an outage, and some data was lost.

And people found, well, the backup hadn't run for two weeks. Someone had broken the backup script, and maybe you didn't monitor whether the backup happened, or maybe the thing that monitors the backup was also broken. That happens very commonly. Now I'm going to go off on a tangent, because your question did bring up another point. I was on a tech support call, and in the past I've had the tech support agent read me

a certain key or password and I'm trying to type it. What if I misheard it? I type it in wrong. Nothing terrible would happen, but I wouldn't be able to get access to this resource. In this call, the agent texted me the key and I pasted it into my form and that avoided one source of human error.

One of the reasons for human error is that the system puts people in a situation where they need to do things which perhaps a person is not good at. So human error, it can result from humans being put in a position to do things which we are error prone at. And whose fault is that? It's not really the person's fault.

And maybe not the air traffic controller's fault. I was reading something recently about air traffic controllers now being required to wear masks at work, and that they are having more difficulty speaking clearly or being understood. You could argue whether you're mitigating some other risk by wearing masks, but you're putting people in a situation in which accuracy is impaired, and that's not their fault.

Well, we've talked a lot about what can go wrong in complex systems and simple systems, but let's turn our attention toward maybe better ways to do things. What's a more useful way of thinking about the causes of failures in a complex system? Asking why is important, and the overall guiding principle of cause and effect is also important.

What I found, as I started reading about these accidents, was that I started making graphs. The key insight here is that the understanding of why something went wrong is not a linked list.

Five dominoes would be a linked list. It's more like a tree structure or an acyclic graph, where you have one node at the edge, which is the outage or incident. Then if you step back one layer, you ask, rather than what is the cause, what were the contributing causes to this, and you might find one or two or three or some number. And then from each

node in the graph, you would ask what are the causes or multiple contributing factors to this. And there could be any number: two, three, four, five. Some of them may be things you already identified, and then, rather than putting in a new node, you would draw a line from the node you have to that cause you'd already identified.
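A minimal sketch of what recording an analysis this way might look like; the node names are invented for illustration, not taken from the episode:

```python
from collections import defaultdict

# Record the incident analysis as a graph instead of a single chain of whys:
# each node can have several contributing causes, and a cause you have already
# identified gets an extra edge rather than a duplicate node.
causes_of = defaultdict(set)   # effect -> its direct contributing causes

def add_cause(effect: str, cause: str) -> None:
    causes_of[effect].add(cause)

add_cause("customer-visible outage", "server-3 crashed")
add_cause("customer-visible outage", "failover did not trigger")
add_cause("failover did not trigger", "health check misconfigured")
add_cause("server-3 crashed", "disk full")
add_cause("disk full", "log rotation broken for weeks")   # a latent partial failure
```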

You can go back not necessarily five levels; you could go back three levels or six levels, however many levels, as long as you're still surfacing useful information. And then you'll have this graph, and instead of five nodes it might have 15 nodes that you can look at and say, well, what are the things that we want to fix? You are not obligated to fix 15 things if you find there are 15 contributing factors.

Maybe only three or four of them are important, or you may not have the money to fix everything. It may not be cost effective to fix everything, but you can take all of the contributing factors that you've identified, rank them, and then decide which are the most important ones to fix. Or you could rank them by some combination of their impact and their cost and say, let's fix the most cost-effective ones.
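A rough sketch of that ranking step, with made-up factors and scores purely for illustration:

```python
# Rank contributing factors by expected impact per unit of fix cost;
# the factors and the numbers are invented for this example.
factors = [
    {"name": "health check misconfigured", "impact": 8, "cost": 1},
    {"name": "log rotation broken",        "impact": 5, "cost": 2},
    {"name": "no alert on backup job",     "impact": 6, "cost": 3},
    {"name": "single region deployment",   "impact": 9, "cost": 50},
]

# Highest impact per unit of cost first: a simple "most cost-effective" ordering.
for f in sorted(factors, key=lambda f: f["impact"] / f["cost"], reverse=True):
    print(f"{f['name']:28s} impact={f['impact']} cost={f['cost']}")
```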

That, I think, is a better way of making systemic improvements to your system that will result in greater stability and avoidance of outages.

I'm imagining your example earlier when a system went down. If we ask why once, someone might say the server went down, and I'm seeing a branching possibility there. On one hand we could ask, well, why did the server go down? And that leads to a whole set of factors. And, as you said earlier, asking why didn't we have a backup server leads to a whole different set of factors.

Even in that simple example, I can immediately find two different lines of inquiry to start backing my way to understanding the factors that led to the outage. I was describing this to a friend and she pointed out, oh, so you could either go depth first or you could go breadth first. And since we're programmers, we know how to traverse over trees. And that would give you a couple of different ways to do it.
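A small sketch of those two traversal orders over the same kind of hypothetical contributing-factor graph as above (the structure and names are assumptions for illustration):

```python
from collections import deque

# effect -> its direct contributing causes (invented example data)
causes_of = {
    "customer-visible outage": ["server-3 crashed", "failover did not trigger"],
    "server-3 crashed": ["disk full"],
    "failover did not trigger": ["health check misconfigured"],
    "disk full": ["log rotation broken for weeks"],
}

def walk(root: str, breadth_first: bool = True):
    """Visit every contributing factor, breadth-first or depth-first."""
    seen, frontier = set(), deque([root])
    while frontier:
        node = frontier.popleft() if breadth_first else frontier.pop()
        if node in seen:
            continue                      # shared causes are visited only once
        seen.add(node)
        yield node
        frontier.extend(causes_of.get(node, []))

print(list(walk("customer-visible outage", breadth_first=True)))
print(list(walk("customer-visible outage", breadth_first=False)))
```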

I also really like that you pointed out that as you see all these factors, weighting or sorting them, whether by what is most likely to cause a big problem, what's easiest to fix, or what's cheapest to fix, those kinds of rankings, and doing that multiple times, sorting the list differently, will reveal your top three action items, or top N action items, that you could take to prevent this in the future. Yes.

So you've kind of laid out a process for using a different way of thinking, this tree model. Do you have any other steps or advice as people want to begin to stop thinking about their system in terms of simple systems and start thinking about them in terms of complex systems? One of the things that's in a lot of the literature about this new view of human error is

that in dealing with people, you have an awareness that you may have a cognitive bias toward focusing on the portions of the system where a person made a decision. The idea is that human error is kind of a label for the part of the system where the person sits that did not work or was a contributing cause.

Then you can ask, what was the context that person was facing at the time? Did they have enough information to make a good decision? Are we putting people in impossible situations where they don't have the right information in front of them? Was there adequate monitoring? Was there a runbook if this was a known problem? And then look for ways to improve the human environment

so that the operator can make better decisions if the same set of factors occurs again. That's great, Robert. Thank you so much for being on the show today. Thank you, Marcus. Thanks for joining us for this episode of the Code[ish] podcast. Code[ish] is produced by Heroku, the easiest way to deploy, manage, and scale your applications in the cloud.

If you'd like to learn more about Code[ish] or any of Heroku's podcasts, please visit heroku.com/podcasts.