IT Post Mortem Guidelines
Guess the Five Ws (5W)
What?
A postmortem (also called a major incident review in the ITIL world) is a document-driven process.
As an important part of a global improvement strategy, the postmortem document highlights what worked well during an incident and what didn’t.
Who?
With a blameless policy, the postmortem becomes a meeting free of shame and finger-pointing, valuable for everyone involved in terms of learning and improving their own efficiency.
Why?
By acting on both the human side and the processes, you end up with a quality postmortem document and a better knowledge base, which in turn helps improve production quality.
When?
A postmortem should be opened for any critical issue (financial loss, major technical event, etc.) and held once the mitigation or resolution is in place. Do it as soon as possible, while people still have the incident fresh in mind.
Where?
Wherever you want: the subject and the people involved are the only real prerequisites for running a postmortem. The communication method is up to you (email, website, BlueJeans, Slack, physical meeting,…).
The postmortem document
The postmortem document contains the following information (adapt it to your business):
Header of the incident:
- Postmortem title: usually the incident ticket number,
- Associated issues: tickets related to your incident,
- Associated applications: applications or dependencies impacted by the issue,
- Priority of incident: P0, P1, P2, P3 or Minor, Major, Critical, Blocker,
- Impact: user and/or business impact, with a short description,
- Business impact: xxK€ loss,
- Detection date: when the issue was detected, usually by a human or an alert,
- Start date: when the issue actually began,
- End date: when the issue ended, usually when it was resolved,
- Related teams: teams involved in the incident resolution,
- Invited: names of the people invited to the postmortem meeting.
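As a rough illustration, the header above could be modeled as a small data structure, which also makes it easy to derive the time to detect and the total duration from the three dates. This is only a sketch: the class name, the field names and the sample incident are all assumptions made for the example, and the business impact field is left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class PostmortemHeader:
    """Header fields from the list above; all names are illustrative."""
    title: str             # usually the incident ticket number
    priority: str          # e.g. "P1" or "Critical"
    impact: str            # short user/business impact description
    detected_at: datetime  # when a human or an alert noticed the issue
    started_at: datetime   # when the issue actually began
    ended_at: datetime     # when the issue was resolved
    associated_issues: list[str] = field(default_factory=list)
    associated_applications: list[str] = field(default_factory=list)
    related_teams: list[str] = field(default_factory=list)
    invited: list[str] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        """Delay between the real start of the issue and its detection."""
        return self.detected_at - self.started_at

    def duration(self) -> timedelta:
        """Total length of the incident, from start to resolution."""
        return self.ended_at - self.started_at


# Hypothetical incident: every value below is invented for illustration.
header = PostmortemHeader(
    title="INC-0000",
    priority="P1",
    impact="Checkout unavailable for all users",
    detected_at=datetime(2024, 3, 14, 9, 17, tzinfo=timezone.utc),
    started_at=datetime(2024, 3, 14, 9, 5, tzinfo=timezone.utc),
    ended_at=datetime(2024, 3, 14, 10, 35, tzinfo=timezone.utc),
    related_teams=["SRE", "Payments"],
)

print(header.time_to_detect())  # 0:12:00
print(header.duration())        # 1:30:00
```

Keeping the start date and the detection date as separate fields is what makes the detection delay visible, which is often a useful input for the monitoring and alerting post actions.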
Timeline of the incident:
- Date DD/MM/YYYY HH:MM: a quick explanation of what happened.
This is the incident overview through time. Keep only the most important facts, not every action. By convention, and especially if you work in an international context, it’s better to use UTC time.
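If responders noted events in their local time, a small helper can normalize them to UTC before they go into the timeline. The sketch below assumes each local time comes with a known UTC offset in hours; the function name and the sample events are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Same layout as the timeline entries: DD/MM/YYYY HH:MM
TIMELINE_FORMAT = "%d/%m/%Y %H:%M"


def to_utc_entry(local_time: str, utc_offset_hours: float, event: str) -> str:
    """Turn a local timestamp plus its UTC offset into a UTC timeline line."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(local_time, TIMELINE_FORMAT).replace(tzinfo=local_tz)
    utc = local.astimezone(timezone.utc)
    return f"- {utc.strftime(TIMELINE_FORMAT)} UTC: {event}"


# Hypothetical events reported from two different time zones.
entries = [
    to_utc_entry("14/03/2024 10:05", 1, "First errors on the checkout service"),
    to_utc_entry("14/03/2024 04:17", -5, "Alert received by the on-call engineer"),
]
for entry in entries:
    print(entry)
# - 14/03/2024 09:05 UTC: First errors on the checkout service
# - 14/03/2024 09:17 UTC: Alert received by the on-call engineer
```

A fixed offset is used here to keep the example dependency-free; with named time zones and daylight saving rules, a time zone database would be the safer choice.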
Remediation of the incident:
- How the team(s) mitigated and/or solved the incident. Make it actionable.
Correction/Prevention/Postactions of the incident:
We use a table with 3 columns to describe all our prevention actions:
- Task description;
- Due date;
- Ticket number
A common checklist can be put in place like this:
- monitoring;
- alerting;
- procedure;
- documentation;
- post-release tests;
- unit tests;
- capacity planning
To get a complete overview of the defined post actions, whether or not they fall within your team’s scope, tag them with a specific label in your tracking system. Every action must have a deadline, and we use the 5 Whys interrogative technique to dig down to the root cause.
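The rules above (a task description, a due date, a ticket number and a dedicated label for every post action) are easy to check automatically. Here is a minimal sketch of such a check; the label value, the class and the function are assumptions made for the example and do not refer to any particular tracking system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical label used to find all post actions in the tracker.
POSTMORTEM_LABEL = "postmortem-action"


@dataclass
class PostAction:
    """One row of the post-action table: task, due date, ticket."""
    description: str
    due_date: Optional[date]
    ticket: Optional[str]
    labels: tuple[str, ...] = ()


def problems(action: PostAction, today: date) -> list[str]:
    """List what is missing or late for a single post action."""
    issues = []
    if action.due_date is None:
        issues.append("missing due date")
    elif action.due_date < today:
        issues.append("overdue")
    if not action.ticket:
        issues.append("missing ticket")
    if POSTMORTEM_LABEL not in action.labels:
        issues.append("missing label")
    return issues


# Hypothetical post actions, invented for illustration.
actions = [
    PostAction("Add an alert on checkout error rate", date(2024, 4, 1),
               "OPS-0001", (POSTMORTEM_LABEL,)),
    PostAction("Document the rollback procedure", None, None),
]
today = date(2024, 3, 20)
for action in actions:
    print(action.description, "->", problems(action, today) or "ok")
```

Run regularly, a check like this gives a quick view of which post actions are untracked or late, across teams.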
Conclusions of the incident:
- Lessons learned: answer the questions of what worked and what didn’t.
- Root cause: it must be clear and fit in one sentence.
The postmortem meeting
First, schedule the postmortem meeting once the document is drafted. Be careful with short-notice scheduling.
Usually, this kind of meeting takes 30 to 45 minutes at most, to keep it quick and focused. An effective sample agenda can be:
- 2-3 minutes to explain the format and initial expectations, with a quick overview of the postmortem header and the timeline,
- 10 minutes to walk through the timeline,
- 20-30 minutes to discuss the root cause and the post actions needed with the participants. Spend most of the time confirming the root cause (understood and accepted) and, most importantly, the actions that prevent recurrence.
Then, don’t hesitate to use a TV screen or a board to show the draft and confirm it with participants. Don’t be afraid to cut off-topic discussions short and to challenge all participants on their ideas. Team members are accountable, but they are not to blame and are not being negatively judged. This is very important: if you’re looking for punishment, you’re creating fear, and fear causes people to hide facts or the truth. And the one thing you want from a postmortem is the facts.
Lastly, the postmortem meeting can also be held on demand.
Finally
Post mortems will (ideally of course):
- have a clear document, stored and approved by everyone, with a header, a timeline, a root cause and actionable points,
- be done quickly after a significant failure,
- not be focused on blaming someone,
- be focused on RCA (root-cause analysis) in order to define actions that prevent recurrence,
- have tickets that track the changes and ensure they are actually made.
Make it available: you can now publish your postmortem and notify people (attendees and top management too). From my point of view, the success of a postmortem is measured by the team’s focus on the action plans that prevent any recurrence. The more post actions people tackle, the fewer incidents you will have.
Happy Post mortem…