11. Prompt debugging — stop guessing why the model did that

~14 min read. Good debugging turns "weird output" into a concrete failure hypothesis.

Built on the ELI5 in 00-eli5.md. The Revision ledger — the record of changes and failures — helps only when we diagnose prompts systematically.

Start with the symptom, not the theory

Look. A model gives a bad answer. Many people react with a grand theory. "The model forgot everything." "The model ignored the system prompt." "The model cannot reason." Slow down. First isolate the symptom. Was the answer wrong? Was the format wrong? Was the refusal wrong? Was the citation wrong? Different symptoms imply different causes.

Picture first.

bad output
   │
   ├──→ wrong facts
   ├──→ wrong format
   ├──→ wrong boundary
   └──→ wrong tool / route

Simple, no? Prompt debugging starts like software debugging. You classify the failure. Then you test hypotheses. Do not rewrite the whole prompt immediately. That only destroys evidence.
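The classification step can be sketched in code. This is a minimal illustration, not a prescribed tool: the `Symptom` enum mirrors the four branches in the picture, while `FailureReport`, `tally`, the trace IDs, and the hypotheses are hypothetical names and data invented for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Symptom(Enum):
    WRONG_FACTS = "wrong facts"
    WRONG_FORMAT = "wrong format"
    WRONG_BOUNDARY = "wrong boundary"
    WRONG_ROUTE = "wrong tool / route"


@dataclass
class FailureReport:
    trace_id: str      # hypothetical identifier for the logged request
    symptom: Symptom   # what went wrong, as observed
    hypothesis: str    # a testable guess, written only after the symptom is named


def tally(reports: list[FailureReport]) -> Counter:
    """Count reports per symptom so the dominant failure class stands out."""
    return Counter(report.symptom for report in reports)


# Invented sample data, for illustration only.
reports = [
    FailureReport("t-001", Symptom.WRONG_FORMAT, "style cue invites commentary"),
    FailureReport("t-002", Symptom.WRONG_FORMAT, "schema buried mid-prompt"),
    FailureReport("t-003", Symptom.WRONG_FACTS, "retrieval returned a stale document"),
]

for symptom, count in tally(reports).most_common():
    print(f"{symptom.value}: {count}")
```

Tallying first shows which failure class dominates before any hypothesis gets tested.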

Now what is the problem? Prompt behavior is distributed. The Standing rulebook, Sample deliverables, Work order, Reply form, and Creativity dial all interact. A failure can come from any of them. So debugging must be systematic.

Read the prompt like the model reads it

See. The model does not read with your human intentions. It reads tokens in order. So inspect order, proximity, conflicts, and salience. Which instruction comes first? Which comes last? Which rule is buried in the middle of a long blob? Which example contradicts the current request? These questions often expose the bug.

prompt order scan
┌────────────────────────────┐
│ system rules              │
│ examples                  │
│ user request              │
│ output constraints        │
└────────────────────────────┘
        ▲            ▲
        │            │
   primacy effects   recency effects

Now what is the problem here? A rule may exist, but still be weakly attended. Maybe the refusal line is one sentence inside a long persona paragraph. Maybe the user pasted hostile text after the examples. Maybe the output schema appears before ten giant context blocks and loses salience by the time the model answers. Debugging means checking what the model likely attended to most.

Token-level attention maps are rarely available in everyday product tooling. That is fine. You can still reason about salience. Short, explicit, labeled instructions often beat hidden clauses. Recent context often matters more than mid-prompt clutter. Conflicting examples often overpower abstract rules. These are practical debugging heuristics.
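One rough way to apply the salience heuristics is an order scan that reports where each labeled section starts. The sketch below assumes prompts use bracketed headers such as `[SYSTEM]` on their own line. The function names, the sample prompt, and the 0.2 / 0.8 thresholds are illustrative assumptions. It measures character offsets, not tokens and not real attention, so treat the result as a hint only.

```python
import re

# Matches a bracketed header such as [SYSTEM] alone on a line.
SECTION_HEADER = re.compile(r"^\[([A-Z_ ]+)\][ \t]*$", re.MULTILINE)


def section_positions(prompt: str) -> list[tuple[str, float]]:
    """Return (section name, relative start offset in [0, 1)) in prompt order."""
    total = max(len(prompt), 1)
    return [(m.group(1), m.start() / total) for m in SECTION_HEADER.finditer(prompt)]


def buried_sections(prompt: str, low: float = 0.2, high: float = 0.8) -> list[str]:
    """Sections that start in the middle band, away from both ends.

    The thresholds are arbitrary illustration values, not measured ones.
    """
    return [name for name, pos in section_positions(prompt) if low < pos < high]


# Invented prompt: the output rule sits between two long blocks.
prompt = "\n".join([
    "[SYSTEM]",
    "Classify the message.",
    "[EXAMPLES]",
    "example text " * 20,
    "[OUTPUT]",
    "Return the label only.",
    "[CONTEXT]",
    "pasted document " * 20,
    "[USER]",
    "I was billed twice after renewal.",
])

for name, pos in section_positions(prompt):
    print(f"{name:<10} starts at {pos:.2f}")
print("mid-prompt sections:", buried_sections(prompt))
```

A section flagged here is a candidate for moving or relabeling, and the move itself is then one variable to test.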

Build minimal reproductions

Look. If the bad behavior happens only in a giant production transcript, shrink it. Create the smallest prompt that still reproduces the failure. This is the prompt equivalent of a minimal failing test. Remove irrelevant context. Keep the critical example. Keep the offending user message. Keep the format instruction. Then rerun.

full production trace
        │
        ▼
strip noise
        │
        ▼
minimal failing prompt
        │
        ▼
change one variable at a time
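The strip-noise step can be sketched as a greedy shrink loop. This is an assumption-laden illustration: `shrink` and `still_fails` are hypothetical names, and the predicate here is a stand-in that checks for required blocks instead of calling a model. A real predicate would rerun the prompt, ideally several times, and check whether the bad output still appears.

```python
from typing import Callable


def shrink(blocks: list[str], still_fails: Callable[[list[str]], bool]) -> list[str]:
    """Greedily drop one block at a time while the failure still reproduces."""
    if not still_fails(blocks):
        raise ValueError("the full prompt does not reproduce the failure")
    kept = list(blocks)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            if candidate and still_fails(candidate):
                kept = candidate  # this block was noise; restart the scan
                changed = True
                break
    return kept


# Stand-in for "call the model and check the output" (assumption for the demo):
# the failure is taken to need exactly these four blocks.
REQUIRED = [
    "Classify the message as billing, bug, feature_request, or account_access.",
    "Be helpful.",
    "User: I was billed twice after renewal.",
    "Return the label only.",
]


def fake_still_fails(blocks: list[str]) -> bool:
    return all(part in blocks for part in REQUIRED)


production_blocks = [
    REQUIRED[0],
    REQUIRED[1],
    "Release notes: the dashboard got a new theme.",  # noise
    REQUIRED[2],
    "Previous chat: thanks, that solved it!",          # noise
    REQUIRED[3],
]

minimal = shrink(production_blocks, fake_still_fails)
print(minimal)
```

The loop removes the two noise blocks and stops when every remaining block is needed to reproduce the failure.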

So what to do after you have a repro? Change one thing only. Move the schema lower. Or remove the contradictory example. Or lower temperature. Or add a negative example. Then rerun. If you change three things together, you learn little, because you cannot tell which change mattered.
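A small helper can enforce the one-change rule by construction. The sketch below is illustrative only: `one_variable_variants` and the configuration keys are hypothetical, standing in for whatever knobs a real prompt setup exposes.

```python
def one_variable_variants(baseline: dict, candidates: dict) -> list[tuple[str, dict]]:
    """Build variants that each differ from the baseline in exactly one key."""
    variants = []
    for key, new_value in candidates.items():
        if baseline.get(key) == new_value:
            continue  # same value as the baseline, so nothing to test
        variant = dict(baseline)  # copy, so the baseline stays untouched
        variant[key] = new_value
        variants.append((key, variant))
    return variants


# Hypothetical knobs matching the four interventions named in the text.
baseline = {
    "schema_position": "top",
    "contradictory_example": True,
    "temperature": 0.7,
    "negative_example": False,
}
candidates = {
    "schema_position": "bottom",
    "contradictory_example": False,
    "temperature": 0.0,
    "negative_example": True,
}

for changed_key, variant in one_variable_variants(baseline, candidates):
    print(f"variant changes only {changed_key!r} -> {variant[changed_key]!r}")
```

Each variant is rerun against the same repro, so any change in behavior can be attributed to a single key.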

The Revision ledger helps here. You can compare the failing prompt version with the last stable one. Maybe only one example changed. Maybe one field name changed. Maybe the model changed too. That narrows the search.
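Comparing two ledger entries can be as plain as a text diff. The sketch below uses Python's standard `difflib.unified_diff`. The two prompt versions are invented for illustration and `prompt_diff` is a hypothetical helper name.

```python
import difflib


def prompt_diff(stable: str, failing: str) -> list[str]:
    """Return a unified diff between the last stable prompt and the failing one."""
    return list(difflib.unified_diff(
        stable.splitlines(),
        failing.splitlines(),
        fromfile="stable",
        tofile="failing",
        lineterm="",
    ))


# Invented ledger entries, for illustration only.
stable_prompt = (
    "Classify the message as billing, bug, feature_request, or account_access.\n"
    "Return the label only."
)
failing_prompt = (
    "Classify the message as billing, bug, feature_request, or account_access.\n"
    "Be helpful.\n"
    "Return the label only."
)

for line in prompt_diff(stable_prompt, failing_prompt):
    print(line)
```

Lines that start with `+` or `-` are the candidate causes. If the diff is empty, the change is somewhere else, such as the model version or the sampling settings.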

Worked example — parser break from hidden drift

Suppose a triage model should return one label only. Live bug report: The parser started failing on 12% of requests. The prompt looks like this.

[SYSTEM]
Classify the message as billing, bug, feature_request, or account_access.
Be helpful.

[EXAMPLES]
User: I cannot sign in.
Label: account_access

[USER]
I was billed twice after renewal.

[OUTPUT]
Return the label only.

Possible bad model response.

billing — this seems related to duplicate payment after renewal

Now what is the likely issue? The Reply form says label only. But the system also says, "Be helpful." That soft style instruction encourages extra commentary. The conflict is small, but the parser feels it.

Minimal reproduction confirms it. Now change one thing: swap the soft "Be helpful." line for a strict output rule, stated once in the system block.

[SYSTEM]
Classify the message as billing, bug, feature_request, or account_access.
Return exactly one label and nothing else.

[EXAMPLES]
User: I cannot sign in.
Label: account_access

[USER]
I was billed twice after renewal.

Possible fixed response.

billing

Simple, no? We did not need mysticism. We found a local conflict. A vague helpfulness cue fought the strict output rule. The fix was to make the Reply form dominate clearly.
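To see why the parser feels the conflict, here is a minimal sketch of a strict label parser and a failure-rate check. The parser in the bug report is not shown in the source, so `parse_label`, `failure_rate`, and the sample responses are assumptions made for illustration.

```python
ALLOWED_LABELS = {"billing", "bug", "feature_request", "account_access"}


def parse_label(response: str) -> str:
    """Accept a bare label only; anything extra counts as a format failure."""
    label = response.strip()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"not a bare label: {response!r}")
    return label


def failure_rate(responses: list[str]) -> float:
    """Fraction of responses the strict parser rejects."""
    if not responses:
        return 0.0
    failures = 0
    for response in responses:
        try:
            parse_label(response)
        except ValueError:
            failures += 1
    return failures / len(responses)


# Invented sample: one response carries commentary, like the bad one above.
sample_responses = [
    "billing",
    "billing. This seems related to duplicate payment after renewal",
    "account_access",
    "bug",
]

print(f"failure rate: {failure_rate(sample_responses):.0%}")
```

Running the same check before and after the one-line fix turns "it looks better" into a number that can be compared.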

A practical debugging checklist

When output goes wrong, check these in order. First, was the task itself ambiguous? Second, do instructions conflict? Third, do examples teach the wrong pattern? Fourth, is the output contract explicit enough? Fifth, did sampling settings add drift? Sixth, did retrieval or tools supply bad evidence?

This order is useful because it moves from intent to implementation. Most prompt bugs are not magical. They are one of these six. If you debug with a checklist, you will fix bugs faster. If you debug with vibes, you will keep rewriting whole prompts forever.
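The checklist can be kept as ordered data so the checks always run in the same sequence. A minimal sketch follows. The keys, the `first_suspect` helper, and the findings dictionary are hypothetical. The findings are filled in by a person after inspecting the trace, not computed automatically.

```python
from typing import Optional

# (finding key, question) pairs, in the order given in the text.
CHECKLIST = [
    ("task_ambiguous", "Was the task itself ambiguous?"),
    ("instructions_conflict", "Do instructions conflict?"),
    ("examples_mislead", "Do examples teach the wrong pattern?"),
    ("contract_vague", "Is the output contract explicit enough?"),
    ("sampling_drift", "Did sampling settings add drift?"),
    ("bad_evidence", "Did retrieval or tools supply bad evidence?"),
]


def first_suspect(findings: dict[str, bool]) -> Optional[str]:
    """Return the earliest checklist question flagged as a problem, if any."""
    for key, question in CHECKLIST:
        if findings.get(key, False):
            return question
    return None


# Hand-entered findings for the worked example: two checks looked suspicious.
findings = {"instructions_conflict": True, "sampling_drift": True}
print(first_suspect(findings))
```

When several checks look suspicious, the earlier one is investigated first, which keeps the search moving from intent toward implementation.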

Where this lives in the wild

LangSmith and prompt-observability teams — engineers inspect traces, prompts, and outputs step by step because prompt failures are easier to diagnose with structured logs than with anecdotes.
GitHub Copilot product teams — prompt regressions are often debugged by reducing the issue to a minimal reproduction in a specific coding context instead of blaming the whole model stack.
Intercom Fin — support AI teams compare failing transcripts against prompt versions and examples to see whether policy drift or formatting drift caused a bad answer.
Anthropic or OpenAI application builders — prompt debugging often involves moving critical instructions to more salient positions and testing whether behavior changes under the same model settings.
Enterprise agent platforms — routing, retrieval, and answer prompts are debugged stage by stage because one opaque end-to-end trace hides the real culprit.

Pause and recall

Why should prompt debugging start with failure classification?
What does it mean to read the prompt like the model reads it?
Why is a minimal failing prompt so useful?
Which six checks form a practical debugging checklist?

Interview Q&A

Q: Why is changing one variable at a time essential in prompt debugging? A: Because prompt behavior has many interacting parts. If you change several at once, you cannot tell which intervention actually fixed or worsened the issue.

Common wrong answer to avoid: "Because prompt bugs are simple." They are often multi-factor. One-variable changes are about attribution, not simplicity.

Q: Why can a rule be present in a prompt and still fail behaviorally? A: A rule may be weakly salient, contradicted by examples, buried in clutter, or overridden by more recent context. Presence is not the same as effective control.

Common wrong answer to avoid: "If the sentence exists, the model must obey it." Models condition on all context, not on your intention alone.

Q: Why is a minimal reproduction better than debugging the full production trace first? A: It isolates the causal ingredients of the failure and makes experiments faster, cheaper, and easier to interpret.

Common wrong answer to avoid: "Because the full trace is too long to read." Length is inconvenient, but the deeper reason is causal isolation.

Q: Why should prompt debugging inspect retrieval, tools, and schemas too, not only wording? A: Many apparent prompt failures are actually evidence failures, routing failures, or output-contract failures. The prompt lives inside a larger system.

Common wrong answer to avoid: "Because every AI bug is a prompt bug." Many are not.

Apply now (5 min)

Exercise. Take one bad model output you remember. Classify the symptom first. Was it wrong facts, wrong format, wrong boundary, or wrong route? Then write the smallest prompt that could still reproduce it. Change only one line.

Sketch from memory. Draw the debugging ladder. Put symptom at the top. Then ambiguity, conflict, examples, format, sampling, and evidence below it, matching the six checks. Label the ladder as systematic diagnosis.

Bridge. Debugging tells us why ordinary prompts fail. But hostile inputs break prompts in special ways. So next we design prompts that resist injection, override attempts, and unsafe context mixing. → 12-prompt-security.md