
A Postmortem Template for AI Agent Incidents

Standard SRE postmortems miss the half of an AI incident that matters most: why did the agent decide to do that? Here's the template CallSphere has run for 11 production incidents in 12 months.

TL;DR — A good AI postmortem has eight sections. The one most teams skip is "Why didn't we detect this sooner?" Median time-to-detect for agent incidents is 14 days, not 14 minutes.

What goes wrong

```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

CallSphere reference architecture
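
The bridge's core transform, per the edge labels above: decode Twilio's μ-law 8 kHz frames and resample to the PCM16 24 kHz that OpenAI Realtime expects. A minimal sketch using the standard-library audioop; the function name and framing are illustrative, not CallSphere's production code:

```python
import audioop  # stdlib; deprecated since 3.11, removed in 3.13 — swap in a resampler there

def mulaw_8k_to_pcm16_24k(frame: bytes, state=None):
    """Convert one Twilio media frame (mu-law, 8 kHz) to PCM16 at 24 kHz."""
    pcm_8k = audioop.ulaw2lin(frame, 2)  # mu-law bytes -> 16-bit linear PCM, same rate
    # ratecv carries resampler state across frames so chunk boundaries don't click
    pcm_24k, state = audioop.ratecv(pcm_8k, 2, 1, 8000, 24000, state)
    return pcm_24k, state
```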

Teams run agent incidents through their old Google-style postmortem template. They find the bug in code or a prompt, write up "we shipped a bad change, we'll add a regression test," and move on. They miss two things:

  1. The detection chain — agent failures look identical to authorized activity in audit logs. Without a tripwire designed for agent-distinct patterns (sketched after this list), the incident shows up only when a customer notices.
  2. Model behavior root cause — the model is a non-deterministic dependency. "Why did the model choose this tool?" is part of the incident, not a footnote.
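
A sketch of such a tripwire: watch the tool-call distribution per agent rather than individual calls, since each call looks authorized on its own. The drift measure, window size, and threshold below are illustrative choices, not a prescribed rule:

```python
from collections import Counter

def tool_mix_drift(baseline: Counter, window: Counter, min_calls: int = 50) -> float:
    """L1 distance between the baseline and recent tool-call distributions.

    0.0 = identical mix, 2.0 = completely disjoint. Every call in `window` can be
    individually authorized; the tripwire fires on the *distribution* shifting.
    """
    total_b, total_w = sum(baseline.values()), sum(window.values())
    if total_b == 0 or total_w < min_calls:
        return 0.0  # not enough data to judge
    tools = set(baseline) | set(window)
    return sum(abs(baseline[t] / total_b - window[t] / total_w) for t in tools)

# Illustrative policy: page if drift > 0.5 for three consecutive windows.
```

L1 distance is a deliberately dumb choice; anything that compares distributions works, as long as it alerts on mix shifts that per-call authorization checks can't see.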

In 2025, one widely shared postmortem covered an agent that burned $4,200 in 63 hours before anyone noticed. The detection was a credit-card alert. That's the classic AI-agent failure mode.

How to monitor

Adopt a postmortem template with these eight sections:

  1. Summary & impact — what happened, who was affected, dollar/customer impact.
  2. Timeline — UTC timestamps from first symptom to resolution.
  3. Detection chain — how did we find out, and what would have to change for the next instance to be caught in 4 hours, not 14 days.
  4. Root cause — both code/config AND model behavior cause if applicable.
  5. What went well.
  6. What went wrong.
  7. Action items — owner, due date, blast-radius lever (test, runbook, alert, code, prompt, eval).
  8. Blameless lessons.
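
Stamping out a new postmortem with these eight sections can be a ten-line script. A sketch, assuming the Git layout described under Implementation below; the helper name and paths are illustrative:

```python
from datetime import date
from pathlib import Path

SECTIONS = [
    "Summary & impact", "Timeline", "Detection chain", "Root cause",
    "What went well", "What went wrong", "Action items", "Blameless lessons",
]

def new_postmortem(short_title: str, root: Path = Path("docs/postmortems")) -> Path:
    """Create YYYY-MM-DD-short-title.md pre-filled with the eight headings."""
    slug = short_title.lower().replace(" ", "-")
    path = root / f"{date.today():%Y-%m-%d}-{slug}.md"
    body = f"# {short_title}\n\n" + "".join(f"## {s}\n\n- TODO\n\n" for s in SECTIONS)
    path.write_text(body)
    return path  # next step: open a PR, per the workflow below
```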

Publish the postmortem to a public repo or wiki. Read it at the next all-hands. Track action item completion in Linear/Jira.


CallSphere stack

We've run 11 postmortems in 12 months across our six verticals. The template lives in /docs/postmortems/ in the monorepo as Markdown — every PM is a PR. Senior engineers review every PM within 48 hours. Action items become Linear tickets with the postmortem URL in the description.

  • Healthcare FastAPI :8084 — biggest incident was a prompt regression that increased hallucination of insurance plan names. Detection was a customer email; we now run an LLM-as-judge eval daily on a fixed test set (sketched after this list).
  • Real Estate 6-container NATS pod — message-loss incident when NATS upgraded; we added queue-depth alerts and a chaos drill.
  • Sales WebSocket / PM2 — restart storm when memory leaked; capped worker memory and added rolling restarts.
  • After-hours Bull/Redis queue — Redis OOM during a backlog; added queue size budgets.
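
The LLM-as-judge eval from the healthcare incident, sketched. The judge prompt, grader model, and PASS/FAIL protocol are illustrative, not our production eval:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a voice-agent transcript. The ONLY valid insurance plan "
    "names are: {plans}. Reply PASS if every plan name mentioned is on that "
    "list (or none are mentioned); otherwise reply FAIL and name the invented plan."
)

def judge_transcript(transcript: str, valid_plans: list[str]) -> bool:
    """Return True if the transcript mentions no invented plan names."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable grader model works here
        messages=[
            {"role": "system", "content": JUDGE_PROMPT.format(plans=", ".join(valid_plans))},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().startswith("PASS")
```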

Every postmortem ends with a published detection_chain_minutes field. Median across 11 incidents went from 47 hours (first 5) to 38 minutes (last 6) once we made detection a first-class outcome.
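
Because detection_chain_minutes is a literal field in every postmortem file (see the example under Implementation), the median is a short script away. A sketch, assuming one Markdown file per incident:

```python
import re
from pathlib import Path
from statistics import median

FIELD = re.compile(r"detection_chain_minutes:\s*(\d+)")

def median_detection_minutes(root: Path = Path("docs/postmortems")) -> float:
    """Median detection_chain_minutes across all postmortems in the repo."""
    values = [
        int(m.group(1))
        for pm in sorted(root.glob("*.md"))
        if (m := FIELD.search(pm.read_text()))
    ]
    return median(values)  # raises StatisticsError if no file carries the field
```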

Customers on the $1,499 enterprise tier (see /pricing) get a copy of any postmortem that affects them within 72 hours. Try it on the 14-day trial.

Implementation

  1. Template lives in Git. New incident → cp template.md YYYY-MM-DD-short-title.md. PR. Review.

  2. Detection chain section is mandatory.

```markdown
## Detection chain

- 2026-04-12 14:02 UTC — first impacted call
- 2026-04-12 16:51 UTC — automated alert fires (FTL p95 > 1500ms for 30m)
- 2026-04-12 16:53 UTC — on-call ack
- detection_chain_minutes: 169

### What would have detected this in <30 minutes?
- A 5-minute window FTL p95 alert (we had only 30m)
- An LLM-as-judge eval on a fresh sample (we ran daily, should be hourly)
```
  3. Action items have owners and dates. No "we should consider…"


  4. Root cause is two-pronged (a sketch of a tool-choice eval for the model prong follows this list).

```markdown
## Root cause

### Code
- Prompt change in PR #4421 added a new tool description that overlapped with two existing tools.

### Model
- gpt-4o-realtime preferred the new tool 38% of the time even when the old tool was correct, because the new description matched the user phrase more literally.
```
  5. Track meta-metrics. Median detection time. Repeated incident classes. Action item completion rate. Aim for >90% closed within 30 days.
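
The model prong is testable. A sketch of a tool-choice eval for the regression described in item 4; the phrases, tool names, and the choose_tool hook (whatever returns the tool the agent selected for a phrase) are hypothetical:

```python
# Fixed test set: the user phrase and the tool the agent *should* pick.
CASES = [
    ("Can you check my insurance coverage?", "lookup_coverage"),
    ("What plans do you accept?", "list_accepted_plans"),
]

def tool_choice_accuracy(choose_tool, trials: int = 20) -> float:
    """Run each phrase `trials` times (the model is non-deterministic);
    return the fraction of runs where the expected tool was chosen."""
    correct = total = 0
    for phrase, expected in CASES:
        for _ in range(trials):
            correct += choose_tool(phrase) == expected
            total += 1
    return correct / total

# The PR #4421 regression would surface here as accuracy on the affected
# phrases dropping to ~0.62 (the model picked the new tool 38% of the time).
```

Run this in CI on every prompt or tool-description change; a fixed test set plus repeated trials is what turns "the model sometimes picks the wrong tool" into a number you can gate merges on.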

FAQ

Q: Should the model vendor be in the postmortem? A: Yes if their behavior was a contributing factor. We've named OpenAI in two postmortems.

Q: How do I keep it blameless? A: Focus on systems, not people. "Our deploy process didn't catch this" not "Alice missed it."

Q: Are postmortems public? A: Internally always; externally for SEV1 with customer impact. We publish redacted versions.

Q: How long is too long for a postmortem? A: Aim for 1500 words. Longer ones don't get read. Link to the trace and the eval results.

Q: Should I use an AI to draft the postmortem? A: A summarizer that pulls from incident channel, traces, and PRs is fine — a human writes the lessons.

