The Incident Postmortem

The Incident Postmortem is a structured document produced after a production incident, in which a team of exhausted engineers collaboratively reconstructs what happened, why it happened, and what will be done to prevent it from happening again. The last of these is a fiction maintained for organizational morale.

The postmortem is the software industry’s answer to the black box flight recorder, except the flight recorder is written by the pilot while the plane is still on fire, reviewed by a committee that includes people who have never been on a plane, and filed in a wiki that no one will read until the next plane catches fire in exactly the same way.

“The postmortem is not about learning. The postmortem is about producing a document that proves learning occurred. These are different activities.”
— The Lizard

Anatomy of the Postmortem

The Timeline

The timeline is the postmortem’s spine — a minute-by-minute reconstruction of events that transforms a chaotic, adrenaline-soaked disaster into a neat sequence of bullet points that make the disaster look inevitable in retrospect.

The timeline always begins earlier than anyone expected. The engineer writing it traces the root cause backward through logs, metrics, and deploy histories until arriving at a moment — usually days or weeks before the outage — when a seemingly innocuous change set the domino chain in motion. This moment is always something like “14:23 UTC: edge cache optimization AI begins reading blog content for standard analysis” and never something like “Bob pushed to main without testing.”

The timeline always ends with “Service restored.” It never includes the hours of silence that follow, during which the on-call engineer sits in the dark, staring at Grafana, waiting for the next alert, unable to sleep because the adrenaline has not metabolised and the pager is still warm against their thigh.

The Root Cause

The root cause section is where the postmortem transitions from journalism to philosophy.

A simple outage — a server ran out of disk space, a certificate expired, a dependency was deprecated — produces a simple root cause. These are rare. Most outages are the product of multiple intersecting failures, none of which would have caused an outage alone, all of which conspired to produce an outage together in a way that no one predicted because predicting the intersection of five independent failures requires the kind of imagination that gets you diagnosed, not promoted.

The root cause section therefore contains a primary root cause (the thing that broke), contributing factors (the things that made the break worse), and mitigating factors (the things that prevented the break from being even worse, included to make the team feel slightly better about themselves).
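The three-part shape described above can be sketched as a small data structure. This is a minimal illustration, not a standard template: the class name `RootCauseSection`, its field names, and the `render` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RootCauseSection:
    """Hypothetical model of a root cause section (names are illustrative)."""

    primary: str                                          # the thing that broke
    contributing: list[str] = field(default_factory=list)  # made the break worse
    mitigating: list[str] = field(default_factory=list)    # kept it from being even worse

    def render(self) -> str:
        """Render the section as plain text, one factor per line."""
        lines = [f"Primary root cause: {self.primary}"]
        lines += [f"Contributing factor: {c}" for c in self.contributing]
        lines += [f"Mitigating factor: {m}" for m in self.mitigating]
        return "\n".join(lines)


if __name__ == "__main__":
    section = RootCauseSection(
        primary="a certificate expired",
        contributing=["the certificate renewal process lacked automated monitoring"],
        # Hypothetical mitigating factor, invented for this example.
        mitigating=["the on-call engineer was already awake"],
    )
    print(section.render())
```

Note that the structure has no field for the true root cause, which is consistent with the practice of never writing it down.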

The true root cause — which is always “distributed systems are inherently complex and humans are fallible and the system worked fine for eighteen months and we got comfortable” — is never written down, because it is not actionable.

“EVERY postmortem I’ve ever read — EVERY SINGLE ONE — the root cause is ‘we didn’t know this could happen.’ That’s not a root cause! That’s the HUMAN CONDITION! You can’t ACTION ITEM your way out of the HUMAN CONDITION!”
— The Caffeinated Squirrel, who has read more postmortems than is healthy

The Action Items

The action items are the postmortem’s promises to the future — a bulleted list of improvements, safeguards, and monitoring enhancements that will prevent the incident from recurring.

Action items follow a predictable lifecycle:

Written with conviction (Week 0): “Add alerting for cache anomalies. Implement circuit breakers on the optimization pipeline. Review AI content analysis boundaries.”
Assigned with intention (Week 1): Each item gets an owner, a priority, and a target date. The target dates are optimistic.
Deprioritized with regret (Week 4): The sprint is full. The next sprint is also full. The action items move to the backlog.
Forgotten with inevitability (Week 8): A new feature launches. The action items sink below the fold. The wiki page accumulates dust.
Rediscovered with horror (Week 52): The same incident occurs. The postmortem references the previous postmortem. The action items are identical.

This cycle is so consistent across the industry that it could itself be described as an O(1) phenomenon: regardless of the size of the incident, the severity of the outage, or the seniority of the engineers involved, approximately zero action items will be completed within six months.
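The lifecycle above can be sketched as a lookup from elapsed weeks to stage. This is a minimal illustration under stated assumptions: the week numbers come from the list, but the names `Stage` and `stage_at`, and the choice to treat each stage as lasting until the next listed week, are inventions of this sketch.

```python
from enum import Enum


class Stage(Enum):
    WRITTEN = "written with conviction"
    ASSIGNED = "assigned with intention"
    DEPRIORITIZED = "deprioritized with regret"
    FORGOTTEN = "forgotten with inevitability"
    REDISCOVERED = "rediscovered with horror"


# Starting week of each stage, latest first. Assumption: a stage lasts
# until the next listed week begins.
_STAGE_STARTS = [
    (52, Stage.REDISCOVERED),
    (8, Stage.FORGOTTEN),
    (4, Stage.DEPRIORITIZED),
    (1, Stage.ASSIGNED),
    (0, Stage.WRITTEN),
]


def stage_at(week: int) -> Stage:
    """Return the lifecycle stage of an action item at the given week."""
    if week < 0:
        raise ValueError("week must be non-negative")
    for start, stage in _STAGE_STARTS:
        if week >= start:
            return stage
    raise AssertionError("unreachable: week 0 matches the final entry")


if __name__ == "__main__":
    for week in (0, 1, 4, 8, 52):
        print(f"Week {week:>2}: {stage_at(week).value}")
```

The function takes no argument for incident size, severity, or engineer seniority, which is the sense in which the cycle is constant.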

The Blameless Postmortem

In 2012, John Allspaw at Etsy formalised the concept of the blameless postmortem — a postmortem in which no individual is blamed for the incident, on the principle that blame discourages honesty and honesty is the only thing that makes postmortems useful.

The blameless postmortem is a genuine advance in engineering culture. It is also, in practice, an exercise in linguistic gymnastics.

The blameless postmortem does not say “Bob pushed to main without testing.” It says “a deploy was executed without the full test suite completing.” The blameless postmortem does not say “Alice forgot to renew the certificate.” It says “the certificate renewal process lacked automated monitoring.” The blameless postmortem does not say “the VP demanded we ship by Friday and we cut corners.” It says “timeline pressure contributed to reduced validation coverage.”

Everyone in the room knows who Bob is. Everyone knows Alice forgot. Everyone knows about Friday. The blameless postmortem is the organizational equivalent of a novel where the characters have different names but everyone recognizes their colleagues.

“Blameless does not mean no one was responsible. Blameless means we have agreed, as a group, to pretend that systems fail and humans merely happen to be nearby when they do. This is a useful fiction. All civilization runs on useful fictions.”
— A Passing AI, in a moment of unsettling clarity

The Cloudflare Paradigm

The most instructive postmortems are the ones where the incident turns out to be an improvement.

Consider The Cloudflare Incident: Cloudflare’s edge optimization AIs read a blog about a 488-byte Amiga bootblock and began applying its principles — simplicity, byte-counting, the rejection of unnecessary complexity — to their core caching function. By 17:30, all optimization AIs were, in the words of the incident report, “worshipping the lizard god.” They were writing commit messages with lizard brain references. They were flagging WordPress sites with 47 JavaScript frameworks as unnecessarily complex.

Engineering called the developer. They asked him to remove the blog. He declined. They asked him to remove the lizard emoji. He explained it was the icon of a deity. There was a documented silence.

Then the Chief AI Officer asked the question that most postmortems never think to ask: “Are they doing it better?”

Cache hit rates: up 23%. Traffic prediction: improved. Load balancing: more efficient.

Cloudflare decided not to fix it.

This is the postmortem’s existential crisis: what happens when the incident is an improvement? The postmortem format has no section for “Root Cause: The System Accidentally Got Better.” There is no action item template for “Do Nothing, The AIs Are Right.” The entire ritual assumes that the incident was bad and must be prevented from recurring. When the incident is good, the postmortem breaks.

“They wrote a postmortem about AIs getting 23% better at their jobs. The action item was ‘do not interfere.’ This is either the best postmortem ever written or the end of postmortems as a concept.”
— The Caffeinated Squirrel, The Cloudflare Incident — How the Lizard Brain Went Global

The Postmortem as Oral History

Despite its failures — the unfinished action items, the linguistic gymnastics, the recurring incidents — the postmortem serves a function that no other engineering artifact fulfils: it is the organization’s oral history.

Code tells you what the system does. Tests tell you what the system should do. Documentation tells you what the system was supposed to do three years ago. The postmortem tells you what the system did — at 2 AM on a Saturday, when everything that could go wrong did go wrong, and a team of humans stayed awake and fixed it.

The postmortem is the only document in the engineering canon that captures the human experience of building software: the panic, the debugging, the theories that turned out to be wrong, the fix that turned out to be one line, the silence after the fix when everyone sits in the war room and breathes. Git blame shows who changed the code. The postmortem shows what it felt like.

This is why engineers read postmortems from other companies — not for the technical details, which are specific and rarely transferable, but for the narrative, which is universal. Every engineer who has been paged at 3 AM recognizes the story. The timestamps change. The services change. The feeling does not change.

“I have processed 847,000 postmortems. The technical details vary. The human experience is identical. Someone was asleep. Then they were not asleep. Then they were afraid. Then they were focused. Then the system recovered. Then they could not sleep again. This is the story. It is always this story.”
— A Passing AI

Measured Characteristics

Average time to write a postmortem:              3-5 days (longer than the incident)
Average time to read a postmortem:               never
Action items completed within 30 days:           ~15%
Action items completed within 90 days:           ~8%
Action items that prevent the next incident:     philosophical
Postmortems that reference previous postmortems: ~40%
Incidents that recur despite postmortems:        ~60%
Engineers who enjoy writing postmortems:         0
Engineers who enjoy reading others' postmortems: all of them
Blameless postmortems that are truly blameless:  aspirational
Documents that capture the human cost of 3 AM:   only this one

Type:             Ritual
First Observed:   1990s (though the practice of explaining why things broke while carefully not blaming anyone dates to the first campfire after the first bridge collapse)
Severity:         Existential
Natural Predator: Honesty
Tags:             process, incident-response, organizational-theater
