Learning from Failure - Part 1

Jeramiah Dooley, Sr. Cloud Advocate at Microsoft, gave an insightful talk at Microsoft Ignite that everyone can learn from, whether you’re in the trenches handling the outage or overseeing the whole team and process. Coming from personal experience on the operations side, I know how important it is for production to always be available, but that obviously isn’t always possible. Incidents happen, plain and simple, but there were two things I really didn’t enjoy while in Ops:

  1. Troubleshooting an outage on multiple bridges at 2 am

  2. The blame game that would ensue via the RCA

Jeramiah mentions that post-mortems end up being a chore forced upon people rather than providing value, especially when the same incidents happen over and over and we’re not learning anything from the process. This is where we need to take a step back and look at how we are performing our post-incident review. In many cases, incidents don’t fall neatly into a specific checkbox or category for the root cause analysis.

A good example he brought up was a specific aircraft in WW2 that was involved in a significant number of identical accidents. The aircraft would land successfully, but as the plane was taxiing, the landing gear would retract without warning. Obviously, this was a big issue since every plane was needed during the war.

http://www.americanairmuseum.com/aircraft/10376

Investigators would look at the mechanical and electrical components and find nothing wrong, so the cause was deemed pilot error. This was an unsatisfying answer considering numerous planes, specifically the B-17, were having the same issue. The investigators were correct in their finding, yet the problem persisted.
Dooley proposes there was something missing in both the conclusion they came to and the investigation itself. Later on you will learn the outcome of this scenario.

This leads us to the Dickerson Hierarchy of Reliability, which is similar to Maslow’s hierarchy of needs: we need a foundation in order to continue up the stack. The hierarchy also frames Microsoft’s OPS learning path; OPS10 through OPS50 are separate sessions that build upon each other. To watch the whole talk track, I recommend you visit https://myignite.techcommunity.microsoft.com/ and stream the sessions in order.

For his talk track, OPS30, Jeramiah focused on the post-incident review.

Now we get into the following topics on the agenda:

  • Why Learn from Incidents - There are different layers to actively asking not how, but why

  • Post Incident Review Baselines - We need to build common terminology amongst the team(s)

  • Four Common Traps - Teams fall into common traps during the post incident review

  • Four Helpful Practices - How can we make the reviews better and learn from them

Why Learn From Reviews/Incidents

Most of the systems we work on are complex, whether in development, on-premises, or in the cloud. There are a lot of parts involved, and it is not simply one straightforward component. As Dooley put it, “Behavior of that system is driven by the interaction of the parts, not the individual parts themselves.” It would be great if we only had to worry about one component, but in Ops we have to worry about complex environments and applications.

Dr. Richard Cook wrote a paper, “How Complex Systems Fail,” that applies to healthcare as well as to software operations. All complex systems have similar traits and will experience failure.

The takeaways he pointed out were:

  • “Complex systems contain changing mixtures of failures latent within them.” We may have built a system where something is not properly configured but is running for now. All systems have issues, but you may not know it at the time because of resiliency, which leads us to:

  • “Complex systems run in degraded mode.” With the availability we’ve built in, we can run in a degraded mode.

  • Lastly, “Catastrophe is just around the corner.” I know that makes you feel warm and fuzzy, but with complex systems and all their moving parts, failure is going to happen.

From that we can focus on two points: prevention of catastrophes and responsiveness to them.

Jeramiah mentions a very familiar use case. There are many times we introduce automation, something to prevent catastrophe, or a friendlier UI to make things easier, but it ends up ultimately impairing our ability to troubleshoot quickly. These systems are not just about the technology; they are human systems, because there are human interactions with the technology involved. He even states that we as Ops people don’t work on the system but inside the system, and we can be part of why that complex system fails. It is so important to realize that the humans involved are just as important as the technology.

The human response to an incident is just as important as prevention.

This is where understanding that language matters comes in, because people respond differently under stress. People are complex themselves, with different personalities and skill sets, and the words we use regarding an incident can affect what and how we learn.

Certain safety-critical industries, such as healthcare, fire and rescue, and aviation, are seeing the same patterns and learning from failure time and time again. They call it resilience engineering, and from a systems standpoint we can learn a lot from their findings.

This also leads to the four common traps that Jeramiah will cover later on.

Post Incident Review Baselines

Every incident has a lifecycle, as you can see from this diagram. The issue is detected, responded to, and hopefully remediated, but then it is time to analyze what actually happened.

Here we are going to focus on the analysis of the incident (highlighted in blue) so we can learn from the incident response before rebuilding or creating fixes to the process and proceeding to the readiness stage.

To me this is very important, and perhaps it is why we see similar failures repeat. If we are too quick to analyze the incident and move on, then we might keep making the same mistakes.

What should we do after an incident?

  • Perform a post-incident learning review. Perhaps get away from the term post-mortem; others use the term retrospective. This is where language is important.

  • Get everyone involved in the incident into the room as quickly as possible. We tend to forget details 24-36 hours after the event. This could also be a neutral one-on-one interview, depending on the personality of the engineer.

  • The post-incident review must be blameless. We’re not on a hunt to find who did something wrong, but rather asking why we built a process that made it easy to delete an entire environment (paraphrasing Jeramiah), or as he put it, “What does this button do at 2 am?”, to much laughter. We are after context and knowledge, not who did what.

I believe that once post-incident reviews are blameless, you’ll get more information out of the people involved. On some of my teams, what we called RCAs (root cause analyses) was ultimately about finding out who to blame and perhaps write up, causing people to keep their mouths shut instead of offering their insights and knowledge. Which leads us to:

Sure, you can do the post-mortem and fire those involved, but that will leave you with an empty team, maybe an inexperienced team, or perhaps no one will want to join your team at all. It definitely won’t lead you to a reliable system. This is where I felt like shouting “Amen” to Dooley but did not want to disrupt the session.

Again, the whole process should be an honest inquiry with those involved and lead to the discovery of facts. People’s different perspectives help, and not everyone is necessarily going to agree, but we need all of that brought to the table. It is a HUMAN process that happens to involve technology.

What it should not be:

  • A document or report - the review may generate one to gather all the data, but that’s not the primary goal. We have to learn first.

  • A determination of causality (guessing what went wrong or why); the intent is to learn. There might be a time when you find X caused Y, but again, the purpose is learning.

  • A list of action items - don’t just randomly put together a punch list of patching or of talking to specific people. If one comes out of it after the due diligence of learning what went wrong, that’s fine, but don’t rush to produce a list.

Reminder: A post-incident review is a LEARNING review.

We need a timeline, or shared chronology, to collect the data, and it won’t necessarily be linear. Certain processes may have been happening at the same time, while others occurred before the incident. There will also be different perspectives on what was happening at the time. Humans are different: there are people who respond well under pressure, people for whom time collapses, and so on. In order to build this timeline, we need to collect the data, and the best ways to do that are:

  • Collect the conversation. Have a place for everyone to communicate. Discuss what worked and what failed with the historical information you have. Maybe someone is scared to try something, or someone else knows what has worked well in the past.

  • Determine the context of what failed by using your monitoring. We need the point in time of the failure in order to correlate the data.

  • Find the changes using activity and audit logs. This may be difficult to dig through and correlate with the incident, but we can do all of this with Azure DevOps (a small sketch of the general idea follows this list).
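
To make the “point in time” idea concrete, here is a minimal sketch of my own (not from Jeramiah’s session) that pulls Azure activity-log entries from a window around a failure timestamp so the changes can be laid onto the shared timeline. It assumes the Azure CLI is installed and signed in; the resource group name and the timestamp are hypothetical examples.

# A minimal sketch (not from the session): gather activity-log entries from a
# window around the failure time so they can be lined up on the incident timeline.
# Assumes the Azure CLI ("az") is installed and logged in; the resource group
# and failure timestamp below are hypothetical examples.
import json
import subprocess
from datetime import datetime, timedelta

failure_time = datetime(2019, 11, 4, 2, 0)   # when monitoring flagged the failure (example, UTC)
window = timedelta(hours=2)                  # look a couple of hours on either side

result = subprocess.run(
    [
        "az", "monitor", "activity-log", "list",
        "--resource-group", "my-prod-rg",    # hypothetical resource group
        "--start-time", (failure_time - window).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "--end-time", (failure_time + window).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "--output", "json",
    ],
    capture_output=True, text=True, check=True,
)

# Print a simple chronology: timestamp, who made the change, and which operation it was.
for event in sorted(json.loads(result.stdout), key=lambda e: e["eventTimestamp"]):
    print(event["eventTimestamp"], event.get("caller"), event["operationName"]["value"])

The output is only a starting point for the timeline; the human context (the conversation, who was paged, what was tried) still has to be layered on top of it.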

Part 2 covers the Four Common Traps, the Four Helpful Practices, and how to perform all of this within Azure DevOps.

This blog post is based on OPS30: Learning from Failure by Jeramiah Dooley. His session can be found at https://myignite.techcommunity.microsoft.com/

Note: All images came directly from Jeramiah Dooley’s presentation.

—Amy Colyer