IT Post Mortem Guidelines

Answer the Five Ws (5W)

What?

Who?

Why?

When?

Where?

The postmortem document

Header of the incident:

  • Postmortem title: usually the incident ticket number,
  • Associated issues: tickets related to the incident,
  • Associated applications: applications or dependencies impacted by the issue,
  • Priority of the incident: P0, P1, P2, P3 or Minor, Major, Critical, Blocker,
  • Impact: user and/or business impact, with a short description,
  • Business impact: xxK€ loss,
  • Detection date: when the issue was detected, often by a human or an alert,
  • Start date: when the issue actually began,
  • End date: when the issue ended, often when it was resolved,
  • Related teams: teams involved in resolving the incident,
  • Invited: names of the people invited to the postmortem meeting.
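As an illustration, the header can be captured as a small structured record, which also makes it easy to derive the time to detect and the time to resolve from the three dates. The sketch below is a hypothetical Python model: the class name, field names, method names and sample values are assumptions made for this example, not part of any specific tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PostmortemHeader:
    """Hypothetical record mirroring the header fields listed above."""
    title: str
    priority: str
    impact: str
    start: datetime      # when the issue actually began
    detected: datetime   # when a human or an alert noticed it
    end: datetime        # when the issue was resolved
    associated_issues: list[str] = field(default_factory=list)
    associated_applications: list[str] = field(default_factory=list)
    related_teams: list[str] = field(default_factory=list)
    invited: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The three dates only make sense in this order.
        if not (self.start <= self.detected <= self.end):
            raise ValueError("expected start <= detected <= end")

    def time_to_detect(self) -> timedelta:
        return self.detected - self.start

    def time_to_resolve(self) -> timedelta:
        return self.end - self.start


# Sample values are invented for illustration only.
header = PostmortemHeader(
    title="INC-0000",
    priority="P1",
    impact="Checkout unavailable for some users",
    start=datetime(2024, 1, 10, 9, 0),
    detected=datetime(2024, 1, 10, 9, 20),
    end=datetime(2024, 1, 10, 11, 0),
)
print(header.time_to_detect())   # 0:20:00
print(header.time_to_resolve())  # 2:00:00
```

Keeping the start date separate from the detection date is what makes the detection delay visible, which is often an input for the monitoring and alerting post-actions.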

Timeline of the incident:

  • Date DD/MM/YYYY HH:MM: a quick explanation of what happened.
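Timeline entries are often collected out of order from chat logs and alerts. A minimal sketch, assuming each entry is written as "DD/MM/YYYY HH:MM: description", shows how they could be parsed and sorted chronologically. The function name and the sample entries are hypothetical.

```python
from datetime import datetime


def parse_timeline_entry(line: str) -> tuple[datetime, str]:
    """Split 'DD/MM/YYYY HH:MM: description' into a timestamp and its text."""
    # The first ": " (colon followed by a space) separates the timestamp
    # from the description; the colon inside HH:MM has no space after it.
    stamp, _, description = line.partition(": ")
    when = datetime.strptime(stamp.strip(), "%d/%m/%Y %H:%M")
    return when, description.strip()


# Sample entries are invented for illustration only.
entries = [
    "10/01/2024 09:20: Alert fires on checkout error rate",
    "10/01/2024 09:00: Faulty release deployed",
    "10/01/2024 11:00: Rollback completed, service restored",
]

timeline = sorted(parse_timeline_entry(entry) for entry in entries)
for when, description in timeline:
    print(when.strftime("%d/%m/%Y %H:%M"), "-", description)
```

Parsing into real timestamps matters here because DD/MM/YYYY strings do not sort chronologically when compared as plain text.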

Remediation of the incident:

  • How the team(s) mitigated and/or solved the incident. Make it actionable.

Correction/Prevention/Post-actions of the incident:

For each action, record:

  • Task description;
  • Due date;
  • Ticket number.

Typical areas to cover:

  • monitoring;
  • alerting;
  • procedure;
  • documentation;
  • post-release tests;
  • unit tests;
  • capacity planning.
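Post-actions only prevent recurrence if they are followed up. As a rough sketch, assuming each action carries a description, a due date and a ticket number, the open actions that have passed their due date could be listed like this. The class, function, ticket identifiers and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PostAction:
    """Hypothetical post-action with the three fields named above."""
    description: str
    due: date
    ticket: str
    done: bool = False


def overdue(actions: list[PostAction], today: date) -> list[PostAction]:
    """Return the actions that are still open after their due date."""
    return [a for a in actions if not a.done and a.due < today]


# Sample values are invented for illustration only.
actions = [
    PostAction("Add alert on checkout error rate", date(2024, 1, 20), "OPS-1"),
    PostAction("Document the rollback procedure", date(2024, 2, 15), "OPS-2"),
    PostAction("Add post-release smoke test", date(2024, 1, 25), "OPS-3", done=True),
]

for action in overdue(actions, today=date(2024, 2, 1)):
    print(action.ticket, action.description)
```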

Conclusions of the incident:

  • Lessons learned: answer the questions of what worked and what did not.
  • Root cause: the root cause must be clear and stated in one sentence.

The postmortem meeting

  • 2–3 minutes to explain the format and initial expectations, with a quick overview of the postmortem header and the timeline,
  • 10 minutes to walk through the timeline,
  • 20–30 minutes to discuss the root cause and the post-actions needed. Take more time to confirm the root cause (understood and accepted) and, most importantly, the actions that prevent recurrence.

Finally

A postmortem should:

  • produce a clear document, stored and approved by everyone involved, with a header, a timeline, a root cause and actionable points,
  • be done quickly after a significant failure,
  • not focus on blaming someone,
  • focus on RCA (root-cause analysis) in order to define actions that prevent recurrence,
  • have tickets that track the changes and ensure they are made.

VP Engineering @akeneopim. Follow me on https://twitter.com/kwa29 or check out some of my initiatives at https://kwa29.com
