Your team's incident doc folder has 47 files. The most-used is the second one ever written. The other 46 might as well not exist. You know this is bad, but rewriting them sounds worse.

The fix isn't a better doc tool — it's a tighter format. A four-field troubleshooting card: the error message, the root cause, the fix, how to verify the fix worked. Card-sized. Greppable. Reusable. Works equally well as a human runbook page and as context you paste into an AI agent.

Here's the format and how to actually use it.

The four fields, in detail

Error. Verbatim — the exact string a developer would search for. Not paraphrased. Include the stack frame name if it's the most distinctive part. The whole point of this field is that someone hitting the same error in the future can grep for it.

Cause. One sentence on the root cause. Not the symptom. "Connection pool exhausted because the ORM doesn't release connections in error paths," not "too many connections." If you can't write the cause in one sentence, the diagnosis isn't done yet.

Fix. The actual change that resolves it. Code, config, command. Not principles. "Wrap the request handler in a try/finally that calls pool.release()," with a code block. Not "use proper connection lifecycle management."

Verify. The single check that proves the fix worked. Could be a SQL query, a curl, a log search. Specific enough that you can copy-paste it. "Run SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction' — should be near 0 under load."
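The Fix example above mentions wrapping a request handler in a try/finally. Here is a minimal sketch of what the code block on such a card might look like. ToyPool and handle_request are hypothetical stand-ins written for illustration, not the API of any real pool library.

```python
class ToyPool:
    """Hypothetical stand-in for a connection pool; not a real library API."""

    def __init__(self, size: int) -> None:
        self.free = size

    def acquire(self) -> object:
        if self.free == 0:
            raise RuntimeError("pool exhausted")
        self.free -= 1
        return object()

    def release(self, conn: object) -> None:
        self.free += 1


def handle_request(pool: ToyPool, fail: bool = False) -> str:
    conn = pool.acquire()
    try:
        if fail:
            raise ValueError("simulated error path")
        return "ok"
    finally:
        # Runs on success and on error, so the connection always goes back.
        pool.release(conn)
```

Without the finally clause, every request that raises would leave the pool one connection short, which is the leak described in the Cause example.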

A worked example

ERROR
  django.db.utils.OperationalError: FATAL:  sorry, too many clients already

CAUSE
  Long-running ORM transactions held open in error paths use up
  every Postgres connection slot (max_connections).

FIX
  In settings.py, set CONN_MAX_AGE = 60 (was 0). For the specific
  view that triggered this (UserExportView), add an explicit
  transaction.atomic() block with on_commit handlers, and ensure
  the export streams (StreamingHttpResponse) close properly.

  See PR #4421 for the diff.

VERIFY
  Under load test (locust 50 users), pg_stat_activity should show
  < 30 connections instead of saturating at max_connections=100.
  Specifically:
    SELECT count(*) FROM pg_stat_activity WHERE datname = 'app_db';
  Should stabilize, not climb monotonically.

That's a complete card. Five minutes to write. Saves the next person an hour.
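Because the layout is this regular, a script can read a card too. The sketch below assumes the layout shown above (an uppercase header line followed by an indented body). parse_card and missing_fields are hypothetical helper names, not part of any existing tool.

```python
FIELDS = ("ERROR", "CAUSE", "FIX", "VERIFY")


def parse_card(text: str) -> dict[str, str]:
    """Split a card into its fields, keyed by the uppercase header lines."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        # A header is a field name on its own line with no indentation.
        if line.strip() in FIELDS and not line.startswith(" "):
            current = line.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line.strip())
    # Join body lines and trim surrounding blank lines.
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def missing_fields(card: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or empty."""
    return [name for name in FIELDS if not card.get(name)]
```

A check like missing_fields could catch a card that was pushed before anyone wrote the Verify step.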

Why this format wins over long incident docs

Three properties:

Greppable. Future you searches for the error string. The card pops up. The cause and fix are in the same screen. No clicking into a 12-section incident doc.

Mergeable. New variants of the same error append to the same card under "VARIANTS" instead of creating new docs. The card grows; the file count doesn't.

AI-friendly. Paste a card into Claude/Codex as context: "here's how we've handled this error before." The four fields map closely to what an LLM needs to reproduce the fix in a new context: the symptom, the diagnosis, the change, and the check. Long incident docs are mostly noise for an agent.
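The Mergeable property can be sketched in a few lines. This assumes VARIANTS is always the last section of a card and that each variant is one indented line. add_variant is a hypothetical helper name.

```python
def add_variant(card_text: str, variant: str) -> str:
    """Append an error variant under a VARIANTS section, creating it if absent.

    Assumes VARIANTS, when present, is the final section of the card.
    """
    body = card_text.rstrip("\n")
    if "\nVARIANTS\n" not in body + "\n":
        body += "\n\nVARIANTS"
    return f"{body}\n  {variant}\n"
```

Calling it twice on the same card adds two lines under one VARIANTS header, so the card grows while the file count stays the same.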

What's out of scope

Things that explicitly DON'T belong in a card:

The full incident timeline ("at 3:14 we noticed...")
Stakeholder communications
Postmortem analysis of the team's process
Forward-looking action items

All of those have value, but in different documents. The card is for the next person who hits the same error. Keep it that focused.

Where to put cards

A flat directory of .md files, named by error class:

runbook/
├── django-too-many-clients.md
├── nginx-502-upstream-prematurely-closed.md
├── celery-worker-not-consuming.md
├── docker-no-space-left-on-device.md
└── postgres-deadlock-detected.md

Each file is one card. Greppable from the command line, browsable from your editor, viewable on the team wiki via auto-render. No taxonomy, no categories — the filename IS the index.
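A flat directory also keeps lookups trivial to script. The sketch below assumes the cards are UTF-8 .md files directly inside one directory. find_cards is a hypothetical helper name.

```python
from pathlib import Path


def find_cards(runbook_dir: str, needle: str) -> list[str]:
    """Return names of .md cards whose text contains the needle (case-insensitive)."""
    needle = needle.lower()
    hits = []
    for path in sorted(Path(runbook_dir).glob("*.md")):
        if needle in path.read_text(encoding="utf-8").lower():
            hits.append(path.name)
    return hits
```

From a shell, `grep -ril "too many clients" runbook/` does roughly the same job. The Python version is only useful if you want to call the lookup from another script.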

For larger teams, keep a runbook/ directory in each service's git repo. Cards stay close to the code; PRs add new cards as you encounter new errors.

How to start

Don't try to backfill. Don't write 30 cards in a sprint. Start the next time you fix something that took more than 30 minutes to diagnose. Open a card. Five minutes. Push it.

After a month you'll have 5-10 cards. After three months, 20-30. After a year, your team's tribal knowledge is searchable. Long incident docs continue to gather dust because nobody opens them; cards get opened because they're optimized for the only path that actually happens — "I have this error, what do I do?"

Using cards as agent context

If you're using AI coding agents (Claude Code, Codex CLI, etc.), the runbook cards are among the highest-value context you can give them. When debugging a production issue, paste relevant cards into the agent's context window:

> here's our existing runbook for similar errors:
> [paste 2-3 cards]
>
> we're now hitting [new error]. propose diagnosis.

The agent's first reply tends to be much sharper because it has examples of how your team thinks about similar problems. It is also less likely to fall back on generic Stack Overflow-grade suggestions, because the cards demonstrate that your team prefers specific, verified fixes.
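If you paste cards often, the prompt above is easy to assemble with a small function. This sketch assumes the cards are already loaded as strings. build_prompt and the "---" separator are illustrative choices, not a format any agent requires.

```python
def build_prompt(cards: list[str], new_error: str) -> str:
    """Assemble the prompt shape shown above from card texts and a new error."""
    pasted = "\n\n---\n\n".join(card.strip() for card in cards)
    return (
        "here's our existing runbook for similar errors:\n\n"
        f"{pasted}\n\n"
        f"we're now hitting: {new_error}\n"
        "propose diagnosis."
    )
```

Keeping it to two or three cards, as in the prompt above, stops the context from turning back into a long incident doc.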

For 1DevTool users specifically: the per-project notes can hold these cards, and the multi-agent setup means one terminal can be debugging while another is searching the cards for similar prior incidents. The card format is the part that makes the rest work.

The harder part isn't the format

Adopting the format is easy. The harder part is the discipline to write the card during the fix, while the cause is fresh. Two days later you've already forgotten which of the eight things you tried was the actual fix and which were noise. Right after the resolution, before you context-switch, is the only time you have the right information loaded.

The team norm to bake in: a fix isn't done until the card is written. PR doesn't merge without the card. Five minutes well spent.
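One way to back that norm with a check is sketched below. It assumes cards live as .md files directly under a top-level runbook/ directory and that you already have the list of paths a PR changes. pr_has_card is a hypothetical helper name, and how strictly to enforce it is a team decision.

```python
from pathlib import PurePosixPath


def pr_has_card(changed_paths: list[str], runbook_dir: str = "runbook") -> bool:
    """True if the change set adds or edits at least one .md card in runbook_dir."""
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Only files directly inside runbook_dir count as cards.
        if path.suffix == ".md" and path.parent == PurePosixPath(runbook_dir):
            return True
    return False
```

The path list could come from something like `git diff --name-only` against the base branch, passed to the function by whatever CI step your team already runs.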