Learning from Failure - Part 2

In Part 1 of this blog post, we covered what is needed for a successful post-incident review: collecting the conversation, determining the context, and finding any changes through logs. Now we'll go on to actually gathering the data:

Collect the conversation

Azure DevOps can bring this all together. From Jason Hand's previous session (OPS20: Responding to Incidents), we have an incident where he used Azure Boards, Azure Storage, and Microsoft Teams to respond to and track the incident. Teams was used for all communication for this particular outage.
Following up from that, we currently have the incident in Azure Boards, and we can create a wiki for the post-incident review by going to Overview > Wiki.

In the Wiki, start off by giving it a name that corresponds to the outage.
We can bring in data like:

  • Who was on call

  • Time to detect the issue

  • Time to acknowledge

  • Time to remediate

    Note that we can tag everything directly without having to rebuild it, because it is all in the same system. A quick sketch of deriving the timing metrics above follows.
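
To make those timing metrics concrete, here is a minimal sketch (purely illustrative, not from the session) of deriving time to detect, acknowledge, and remediate from timestamps you might record on the incident work item; the timestamp values and the exact metric definitions are placeholders.

```python
from datetime import datetime, timezone

# Placeholder timestamps - in practice these come from the incident work item or alerts.
started      = datetime(2020, 3, 1, 14, 2, tzinfo=timezone.utc)   # outage begins
detected     = datetime(2020, 3, 1, 14, 9, tzinfo=timezone.utc)   # monitoring alert fires
acknowledged = datetime(2020, 3, 1, 14, 13, tzinfo=timezone.utc)  # on-call engineer responds
remediated   = datetime(2020, 3, 1, 15, 27, tzinfo=timezone.utc)  # service restored

metrics = {
    "Time to detect": detected - started,
    "Time to acknowledge": acknowledged - detected,
    "Time to remediate": remediated - acknowledged,
}

# These values can be pasted straight into the post-incident review wiki page.
for name, delta in metrics.items():
    print(f"{name}: {delta}")
```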

After creating the wiki, we can collect the conversation from the Teams channel that was created for this specific incident by using Graph Explorer to query Microsoft Graph. We use it to pull specific messages out of a Teams channel. First, we query the teams we have joined and highlight the ID of the team we want to dive deeper into; then we append it to the URL and hit Run Query:

The output now shows all channels visible to us, including the channel that was created for the incident. To see the messages, highlight the channel ID, paste it into the URL, and run the query:

Following this, we are inside the channel for the incident and can see all related messages that were posted. Each message is threaded individually, so you can see everything pertaining to the outage:
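
If you would rather script this than click through Graph Explorer, here is a minimal sketch of the same three-step flow against the Microsoft Graph REST API. It assumes you already have an access token with the appropriate Teams read permissions, the name-matching filters are just illustrative, and the exact endpoints and permissions may differ slightly from what the demo used.

```python
import requests

GRAPH = "https://graph.microsoft.com/v1.0"
access_token = "<access token with Teams read permissions>"  # e.g. copied from Graph Explorer
headers = {"Authorization": f"Bearer {access_token}"}

# Step 1: list the teams we have joined and pick the one that handled the incident.
teams = requests.get(f"{GRAPH}/me/joinedTeams", headers=headers).json()["value"]
team = next(t for t in teams if "incident" in t["displayName"].lower())  # illustrative filter

# Step 2: list that team's channels and pick the channel created for the outage.
channels = requests.get(f"{GRAPH}/teams/{team['id']}/channels", headers=headers).json()["value"]
channel = next(c for c in channels if "outage" in c["displayName"].lower())  # illustrative filter

# Step 3: pull every message posted in the incident channel.
messages = requests.get(
    f"{GRAPH}/teams/{team['id']}/channels/{channel['id']}/messages",
    headers=headers,
).json()["value"]

print(f"Collected {len(messages)} messages from {channel['displayName']}")
```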

This can also be made more legible by pasting the JSON into a viewer, where you will see all of the messages (even emojis - yay!):

Finally, the data has been collected, and we can insert the JSON into the wiki and archive it to help us in the learning process.
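
If you want to automate that archiving step, the Azure DevOps Wiki Pages REST API can publish the page for you. Here is a minimal sketch (not from the session): the organization, project, wiki name, page path, and personal access token are placeholders, and `messages` stands in for the conversation list pulled from Microsoft Graph in the earlier sketch.

```python
import json
import requests

messages = []  # replace with the conversation list collected from Microsoft Graph

# Placeholders - substitute your own organization, project, wiki, and PAT.
ORG, PROJECT, WIKI = "my-org", "my-project", "my-project.wiki"
PAGE_PATH = "/Post-Incident Reviews/OPS20 incident conversation"
PAT = "<personal access token with wiki read/write scope>"

url = f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/wiki/wikis/{WIKI}/pages"
content = "Incident conversation archive\n\n" + json.dumps(messages, indent=2)

# PUT creates the wiki page if it does not already exist; authentication is basic auth
# with an empty username and the PAT as the password.
resp = requests.put(
    url,
    params={"path": PAGE_PATH, "api-version": "6.0"},
    auth=("", PAT),
    json={"content": content},
)
resp.raise_for_status()
print("Archived conversation to", PAGE_PATH)
```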

Next, we’ll determine the context of the data.

Determine the context

To determine how to correlate what the operators were seeing with the incident itself, we go to the Azure portal to create a dashboard. We can pin the items that operators may have been looking at when the incident occurred:

To do this, we go to App Insights, where we can see right away from the graphs that failed requests went up along with server requests. Click the pin icon on each card to put them on the new dashboard:

From here, we can override the settings so the charts don't have independent time scales and stay consistent. We can see a spike in server requests and in the number of failed requests:

In the image below, we can see the same time frame for server and failed requests. From there, we can drill down to a specific period (one hour) to select only the relevant window we are interested in.
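
The same drill-down can be reproduced outside the portal. Below is a minimal sketch (not part of the demo) that runs a Kusto query against the Application Insights REST API over a fixed one-hour window, so server requests and failed requests are counted over exactly the same time range; the app ID, API key, and time window are placeholders, and the query itself is only illustrative.

```python
import requests

# Placeholders - found under "API Access" in the Application Insights resource.
APP_ID = "<application-insights-app-id>"
API_KEY = "<api-key>"

# Count total and failed requests in 5-minute bins over one example hour.
query = """
requests
| where timestamp between (datetime(2020-03-01T14:00:00Z) .. datetime(2020-03-01T15:00:00Z))
| summarize serverRequests = count(), failedRequests = countif(tobool(success) == false)
    by bin(timestamp, 5m)
| order by timestamp asc
"""

resp = requests.get(
    f"https://api.applicationinsights.io/v1/apps/{APP_ID}/query",
    headers={"x-api-key": API_KEY},
    params={"query": query},
)
resp.raise_for_status()

# Each row is [timestamp, serverRequests, failedRequests] for one 5-minute bin.
for timestamp, server_requests, failed_requests in resp.json()["tables"][0]["rows"]:
    print(timestamp, server_requests, failed_requests)
```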

The dashboard can be shared out, or it and all of its components can be downloaded as JSON. With everything downloaded, we now have the context of what was going on at the time so we can investigate. (Note: to watch the entire demo, go to the Ignite site and search for OPS30.)
This leads us to finding the changes…

Find the changes

This is the final step. In his Ignite session, Jeramiah suggests looking for the delta between where we started and where we ended. Were the changes that were made seen or unseen? All of this can lead us into the four common traps:

The Four Common Traps

We want to avoid the four common traps that teams can fall into, and to illustrate them we go back to the WW2 airplane accidents.

Pilot error was not a satisfactory answer, so the Air Force asked Alphonse Chapanis, a military psychologist, to investigate. All of the accidents were notable in B-17s and a small handful of other planes, while thousands of C-47s (transport aircraft) landing on the same runways in the same areas were not having the issue. After interviewing pilots, he took a look at the design of the cockpits:

He found two toggle switches three inches apart: one for the flaps and one for the landing gear. When landing, your flaps will be down; once you have landed the plane, you retract the flaps. These switches being so close together and having the same mode of operation (toggle) led to error. The design of the cockpit increased the possibility of a failure happening and thus made humans more error prone. To fix the issue, he glued a rubber wheel to the landing gear switch and a small wing shape to the flap switch.

He is now known as a father of ergonomics, or human-centered design. The design of the cockpit led to human error; human error was a symptom of the design. The design Alphonse Chapanis created is still used today and is shown below in an Airbus A320: on the left is the gear lever, which can't be mistaken because of the wheel on it, and on the right is the wing-shaped flap lever. This design is mandated by federal law to avoid further accidents.

This leads us to:

1st Common Trap - Attribution to Human Error

Human error is a symptom of the larger system. Blaming human error is problematic in investigations because what people did in the moment made sense to them at the time. People rarely make mistakes with the intent to cause issues.

2nd Common Trap - Counterfactual Reasoning

With counterfactual reasoning, we use things that never actually occurred to explain something that did happen. It gives us an excuse not to find out what actually happened.

3rd Common Trap - Normative Language

Normative language implies there was an obvious solution and that we are judging what the operator should have done. The outcome is the one piece of information the operator didn't have at the time!

4th Common Trap - Mechanistic Reasoning

This trap amounts to finding a scapegoat: "Everything would've been fine if it weren't for X."
Jeramiah points out that many production systems couldn't stay running without human intervention at some point: "The meddling kids are part of the solution."

“Mechanistic reasoning makes us think once we found the faulty human we’ve found the problem.”

Now that we know the four common traps to avoid, let's take a look at the helpful practices mentioned during the talk.

Four Helpful Practices

  1. Run a facilitated post-incident review.

    • Get all of the incident participants in a room to discuss the facts right away. Do not make it a marathon by locking people in a room for days; the longest most people can focus is about 90 minutes.

    • Use a neutral facilitator that was not actively involved in the incident.

    • If necessary, conduct one-on-one interviews, depending on the individual.

    • Not every incident needs this process.

  2. Ask better questions.

    • Language matters; ask "how" or "what" questions rather than "why" questions.

    • Collect input and learn from everyone. Different individuals will have different viewpoints.

    • Don’t just ask what went wrong but also what went right. Also, don’t forget to ask about what normally happens. That delta between what normally happens and what went wrong could be insightful.

    • Jeramiah recommended the Etsy debriefing guide for more questions to ask.

  3. Ask how things went right.

    • Don't just ask how the outage happened; also ask how the system was recovered.

    • What insights, skills, tools or people were involved to help?

    • How do people know and decide what to do?

    • Are there any themes?

    • Is it the same breakdown each time, whether in documentation, in people and how they interact, or in the systems and how they are configured?

    • What do we know now that we didn’t previously?

    • Remember that we care about both response and prevention.

  4. Keep the review and planning meetings separate.

    • Avoid discussing future mitigation during the review; that happens afterward.

    • Hold separate smaller planning meetings 24-48 hours later and give people time to reflect. This can help keep everyone focused on what actually happened and not worried about next steps. You will get better results if you give time between discovery, incident response and next steps to fix.

In Summary:

We are always going to have incidents, and how we respond to them is what matters. Human error is a terrible label and is not helpful when looking at what went wrong; remember, it's a symptom, not a cause. We want to learn how and what happened, not just what the humans did. Facilitated post-incident reviews of specific outages provide value and expose both what went wrong and what went right.

All of Jeramiah’s demos can be found at aka.ms/OPS30MSLearnCollection

As I mentioned before, this session really resonated with me as a former production engineer who was responsible for thousands of systems. I only hope more organizations take on this sort of approach to incident resolution.

This blog post is based on OPS30: Learning from Failure, by Jeramiah Dooley and his session can be found at https://myignite.techcommunity.microsoft.com/

Note: All images came directly from Jeramiah Dooley’s presentation.

—Amy Colyer