Andi Smith


Recovering from Failure

  • By Andi Smith


Being able to recover quickly from failure is more important than having failures less often.

  • John Allspaw

In technology, failure is inevitable. Things will go wrong. So rather than hoping they don't, it's better to plan for when they do.

Here are some of the things I think work well when you have an incident. As you'll notice, most of them require upfront planning. If you're already in the middle of an incident, it's likely too late to put them in place.

Know what to do

Ensure there is a procedure for what to do in an incident. It should cover the following areas:

  • Identification
  • Containment
  • Resolution
  • Maintenance/Actions

Know how to contact the team

Set up a call tree with the team's numbers and locations. There may be an on-call rota, so link to that too.

Also include the numbers for stakeholders who will need to know about the incident so they can handle customer complaints, the press, etc.
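
A call tree can be kept as a small piece of structured data so the calling order is never in doubt. The sketch below is one possible shape in Python, not a prescribed format: the `Contact` fields, the breadth-first calling order, and all names and phone numbers are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    role: str
    phone: str
    location: str
    # People this contact is responsible for calling next.
    calls: list["Contact"] = field(default_factory=list)


def call_order(root: Contact) -> list[Contact]:
    """Return contacts breadth-first: those closest to the root are called first."""
    order: list[Contact] = []
    queue = [root]
    while queue:
        contact = queue.pop(0)
        order.append(contact)
        queue.extend(contact.calls)
    return order


# Hypothetical example data; every name and number is a placeholder.
press_contact = Contact("Pat", "Press contact", "+44 0000 000004", "London")
support_lead = Contact("Sam", "Customer support lead", "+44 0000 000003", "Leeds",
                       calls=[press_contact])
on_call = Contact("Alex", "On-call engineer", "+44 0000 000002", "Remote")
incident_lead = Contact("Jo", "Incident lead", "+44 0000 000001", "London",
                        calls=[on_call, support_lead])

for contact in call_order(incident_lead):
    print(f"{contact.name} ({contact.role}, {contact.location}): {contact.phone}")
```

Keeping stakeholders in the same tree as engineers means one document answers both "who fixes it?" and "who needs to be told?".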

Create an incident checklist

Incidents often follow the same patterns, so a checklist lets you quickly find or rule out common issues. Some of the items to include on the list would be:

  • Has there been a new deployment at the same time as the incident?
  • Has anyone changed a configuration setting recently?
  • Are our system dashboards reporting healthy?
  • Are any of our vendors down?
  • Is there a spike in errors in our logging software?
  • Have we hit any API limits?

You can add links to the checklist so whoever is running the incident can just click through to check each one.
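
As a rough illustration of a linked checklist, the Python sketch below stores each question next to the place to look and renders the lot as a Markdown task list. The `CheckItem` type, the `render_markdown` helper, and every `example.com` URL are assumptions for the example; swap in the links to your own tools.

```python
from dataclasses import dataclass


@dataclass
class CheckItem:
    question: str
    link: str  # where to look to answer the question


# The URLs are hypothetical placeholders for your own dashboards and tools.
CHECKLIST = [
    CheckItem("Has there been a new deployment at the same time as the incident?",
              "https://example.com/deployments"),
    CheckItem("Has anyone changed a configuration setting recently?",
              "https://example.com/config-history"),
    CheckItem("Are our system dashboards reporting healthy?",
              "https://example.com/dashboards"),
    CheckItem("Are any of our vendors down?",
              "https://example.com/vendor-status"),
    CheckItem("Is there a spike in errors in our logging software?",
              "https://example.com/logs"),
    CheckItem("Have we hit any API limits?",
              "https://example.com/api-usage"),
]


def render_markdown(items: list[CheckItem]) -> str:
    """Render the checklist as Markdown task-list items with clickable links."""
    return "\n".join(f"- [ ] [{item.question}]({item.link})" for item in items)


print(render_markdown(CHECKLIST))
```

The rendered output can be pasted into the incident channel or document so each item can be ticked off as it is ruled out.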

Agree your war room strategy

When there is an incident, know how your team will collaborate.

  • Is it a video call or a Slack Huddle? Is it an existing room, or a room created especially for the incident?
  • Who will communicate to the rest of the business what is going on? How often will they do that?

Agree on it and document it so everybody knows.

Run a post-mortem

Running a post-mortem on an incident is a great way to stop it happening again. Even smaller incidents can bring to light issues that may cause bigger problems later on.

The post-mortem should be carried out with the team that was involved in the incident and any other security staff or developers who work in that area. Allow the team to talk through the problem and capture notes.

Most importantly, make sure you review the incident reports once a quarter to ensure the actions recorded actually get actioned!
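
The quarterly review is easier if the outstanding actions can be listed in one go. Here is a minimal Python sketch, assuming the action items from each report have been collected into a list; the `ActionItem` fields, ticket keys, and dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    incident: str
    ticket: str  # a ticket key or link (hypothetical values below)
    due: date
    done: bool = False


def overdue(actions: list[ActionItem], today: date) -> list[ActionItem]:
    """Return unfinished actions whose due date has passed, oldest first."""
    late = [a for a in actions if not a.done and a.due < today]
    return sorted(late, key=lambda a: a.due)


# Hypothetical action items gathered from past post-mortem reports.
actions = [
    ActionItem("Checkout outage", "OPS-101", date(2024, 2, 1), done=True),
    ActionItem("Checkout outage", "OPS-102", date(2024, 3, 1)),
    ActionItem("Login errors", "OPS-110", date(2024, 2, 15)),
    ActionItem("Login errors", "OPS-111", date(2024, 6, 30)),
]

for action in overdue(actions, today=date(2024, 4, 1)):
    print(f"{action.ticket} ({action.incident}) was due {action.due.isoformat()}")
```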

Here's a Markdown template of my Incident Post-Mortem report:

## Executive Summary

- 3 or 4 bullet points of what went wrong and how it was resolved.

## Stats

- Number of affected users:
- Time to acknowledge:
- Time to resolve:
- Total downtime:

## Timeline of Events

| Date/Time | Event |
| --------- | ----- |
|           |       |

## Client Impact

Details on any annoyed clients or critical client problems that were caused by the incident.

## Analysis
* Lead-up:
* Fault:
* Detection:
* Response:
* Recovery:
* Recurrence:

## Root Cause (5 whys)
### Why?

### Why?

### Why?

### Why?

### Why?

## Resolution
### Short term
* What was done in the heat of the moment (or shortly after) to resolve?

### Long term
* What needs to be done longer term to stop this from happening again?

## Lessons Learned
* 2 or 3 bullet points on what was learnt

## Action Items
* Links to JIRA tickets with resolution dates
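
The figures in the Stats section can be worked out from the timestamps in the timeline. The Python sketch below shows one way to do that. It assumes that "time to acknowledge" and "time to resolve" are both measured from when the incident was detected, and that total downtime is the sum of the periods when the service was actually unavailable; teams define these differently, so adjust to match your own definitions. All timestamps are invented example data.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps taken from the "Timeline of Events" table.
detected = datetime(2024, 1, 15, 9, 0)
acknowledged = datetime(2024, 1, 15, 9, 12)
resolved = datetime(2024, 1, 15, 10, 30)

# Periods when the service was actually unavailable, as (start, end) pairs.
outages = [
    (datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 9, 40)),
    (datetime(2024, 1, 15, 10, 5), datetime(2024, 1, 15, 10, 30)),
]


def incident_stats(detected, acknowledged, resolved, outages):
    """Compute the template's timing stats under the assumptions stated above."""
    total_downtime = sum((end - start for start, end in outages), timedelta())
    return {
        "Time to acknowledge": acknowledged - detected,
        "Time to resolve": resolved - detected,
        "Total downtime": total_downtime,
    }


for label, value in incident_stats(detected, acknowledged, resolved, outages).items():
    print(f"- {label}: {value}")
```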

Good luck with your next incident!


Andi Smith is a passionate technical leader who excels at building and scaling high-performing product engineering teams with a focus on business value. He has helped businesses of all sizes, from start-ups and scale-ups to enterprises, build value-driven solutions.
