08. Prompt chaining: break one hard brief into smaller jobs

~13 min read. When one prompt becomes a kitchen sink, split the work into stages.

Builds on the ELI5 in 00-eli5.md. The Work order (one job sheet for one task) often works better when one giant order is split into a sequence of smaller orders.

Why one giant prompt often fails

Look. A single prompt may need to classify intent, retrieve policy, apply rules, format output, and choose the next action. That is a lot. The model can do it sometimes. But as tasks pile up, failure modes multiply. The answer becomes harder to debug. When it fails, you do not know which substep broke.

Picture first.

one giant prompt
┌──────────────────────────────────────┐
│ classify + reason + cite + route     │
│ summarize + format + tone + refuse   │
└─────────────────┬────────────────────┘
                  ▼
           one opaque answer

prompt chain
┌───────────┐   ┌───────────┐   ┌───────────┐
│ step 1    │→  │ step 2    │→  │ step 3    │
│ classify  │   │ retrieve  │   │ answer    │
└───────────┘   └───────────┘   └───────────┘

Simple, no? A chain lets each prompt do one clean thing. That means each Work order stays short. Each Reply form stays tight. Each failure becomes easier to locate. That is why chaining feels like systems design, not just prompt writing.
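A minimal sketch of the three-box picture, assuming each stage is an ordinary function that takes the running state and returns an updated copy. The three stage bodies are stand-ins for model or tool calls, and the document names (`refund-policy.md`, `faq.md`) are hypothetical.

```python
from typing import Callable

# A stage takes the running state and returns an updated copy of it.
Stage = Callable[[dict], dict]

def classify(state: dict) -> dict:
    # Stand-in for a model call that returns a single label.
    intent = "billing" if "refund" in state["message"].lower() else "general"
    return {**state, "intent": intent}

def retrieve(state: dict) -> dict:
    # Stand-in for a lookup keyed by the label from the previous stage.
    docs = {"billing": ["refund-policy.md"], "general": ["faq.md"]}
    return {**state, "docs": docs[state["intent"]]}

def answer(state: dict) -> dict:
    # Stand-in for the final model call; it sees only what earlier stages produced.
    text = f"Answering a {state['intent']} question from {state['docs']}"
    return {**state, "answer": text}

def run_chain(stages: list[Stage], state: dict) -> dict:
    # Run the stages in order, passing each one the state left by the previous one.
    for stage in stages:
        state = stage(state)
    return state

result = run_chain([classify, retrieve, answer], {"message": "Can I get a refund?"})
print(result["intent"], result["docs"])
```

Because every stage adds its own key to the state, a wrong final answer can be traced back by reading `result` field by field.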

Now what is the problem? Chains add complexity too. More steps mean more latency. More steps mean more orchestration code. A bad intermediate output can poison later steps. So what to do? Chain only when decomposition creates real clarity. Do not split for vanity.

Good chains use explicit handoffs

The output of one stage becomes the input of the next. That handoff should be structured. If step one returns fluffy prose, step two must parse fluffy prose. Bad idea. Use clear state objects, labels, or tagged blocks. The Reply form of one stage should match the Work order of the next stage. That is the handoff contract.

step 1 output                       step 2 input
┌────────────────────┐             ┌──────────────────────┐
│ intent: billing    │   ───────→  │ route using intent   │
│ urgency: medium    │             │ retrieve billing docs│
└────────────────────┘             └──────────────────────┘

A clean chain often has three properties. First, each stage has one dominant goal. Second, intermediate outputs are small and typed. Third, later stages never need to reinterpret vague text from earlier stages. That keeps the chain sane.
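One way to make the handoff contract checkable before any prompt runs is to declare, per stage, which fields it needs and which it returns. The sketch below is an assumption about how a team might encode that. `StageSpec`, `check_handoffs`, and the `tone` field are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageSpec:
    name: str
    requires: frozenset[str]   # fields the stage's Work order expects
    produces: frozenset[str]   # fields the stage's Reply form returns

def check_handoffs(stages: list[StageSpec], initial: set[str]) -> list[str]:
    """Return one message per field a stage needs that nothing upstream supplies."""
    available = set(initial)
    problems = []
    for stage in stages:
        for missing in sorted(stage.requires - available):
            problems.append(f"{stage.name} needs '{missing}' but nothing upstream provides it")
        available |= stage.produces
    return problems

chain = [
    StageSpec("classify", frozenset({"message"}), frozenset({"intent", "urgency"})),
    StageSpec("retrieve", frozenset({"intent"}), frozenset({"docs"})),
    # 'tone' is deliberately unmet here, to show what a broken contract looks like.
    StageSpec("answer", frozenset({"message", "docs", "tone"}), frozenset({"reply"})),
]

problems = check_handoffs(chain, initial={"message"})
for line in problems:
    print(line)
```

A check like this runs in milliseconds and catches a mismatched Reply form before it shows up as a confusing answer.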

You can also insert tool steps. Maybe step one extracts search terms. A retrieval tool fetches documents. Step two answers using only retrieved context. That is still prompt chaining. The chain is not only model-to-model. It can be model-to-tool-to-model.
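The model-to-tool-to-model shape can be sketched as follows. Both model steps are replaced by plain functions, and the retrieval tool is a tiny in-memory dictionary. The corpus text and all function names are illustrative assumptions.

```python
def extract_search_terms(question: str) -> list[str]:
    # Stand-in for model step one: keep the longer words as search terms.
    words = [w.strip("?.,").lower() for w in question.split()]
    return [w for w in words if len(w) > 5]

# Hypothetical in-memory corpus standing in for a real retrieval tool.
CORPUS = {
    "refund-policy": "Refund requests are reviewed by billing operations.",
    "api-limits": "API usage is counted from the renewal date.",
}

def retrieve(terms: list[str]) -> list[str]:
    # Tool step: ordinary code, no model involved.
    return [text for text in CORPUS.values() if any(t in text.lower() for t in terms)]

def build_answer_prompt(question: str, passages: list[str]) -> str:
    # Model step two receives only the retrieved context plus the question.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "How are refund requests handled?"
terms = extract_search_terms(question)
passages = retrieve(terms)
prompt = build_answer_prompt(question, passages)
print(prompt)
```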

Common decomposition patterns

Pattern one is classify then answer. This works for support routing, compliance triage, and many agent front doors. Step one picks the lane. Step two applies the lane-specific prompt.
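Classify then answer can be sketched as a lookup from lane to prompt. The classifier here is a keyword stand-in for a model call, and the lane names, prompt texts, and `human_handoff` fallback are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical lane-specific system prompts.
LANE_PROMPTS = {
    "billing": "You are a billing support assistant. Answer only billing questions.",
    "technical": "You are a technical support assistant. Ask for error details first.",
}
FALLBACK_LANE = "human_handoff"

def classify(message: str) -> str:
    # Stand-in for step one; a real chain would ask a model for exactly one label.
    text = message.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "error" in text or "timeout" in text:
        return "technical"
    return "unknown"

def route(message: str) -> tuple[str, Optional[str]]:
    # Step two only ever sees the prompt that belongs to the chosen lane.
    lane = classify(message)
    if lane not in LANE_PROMPTS:
        return FALLBACK_LANE, None
    return lane, LANE_PROMPTS[lane]

print(route("I need a refund for last month"))
print(route("hello"))
```

Routing unknown labels to a fallback keeps a bad classification from silently picking the wrong lane.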

Pattern two is plan then execute. Useful for coding, research, and multi-tool agents. One prompt creates a short plan. Later prompts or tools execute each step. This reduces instruction overload in one giant prompt.
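A rough sketch of plan then execute, assuming the planning prompt returns a short list of steps and each step names one known action. The planner and both handlers are stubs, and the action names are hypothetical.

```python
def make_plan(goal: str) -> list[dict]:
    # Stand-in for the planning prompt: a short, ordered list of small steps.
    return [
        {"id": 1, "action": "search", "arg": goal},
        {"id": 2, "action": "summarize", "arg": "search results"},
    ]

def do_search(arg: str) -> str:
    return f"results for {arg!r}"

def do_summarize(arg: str) -> str:
    return f"summary of {arg}"

# Only actions listed here can run; anything else in a plan is rejected.
EXECUTORS = {"search": do_search, "summarize": do_summarize}

def execute(plan: list[dict]) -> list[tuple[int, str]]:
    log = []
    for step in plan:
        handler = EXECUTORS.get(step["action"])
        if handler is None:
            raise ValueError(f"plan step {step['id']} uses unknown action {step['action']!r}")
        log.append((step["id"], handler(step["arg"])))
    return log

log = execute(make_plan("rate limits"))
print(log)
```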

Pattern three is extract then decide. For example, step one pulls dates, plan type, and amounts from messy text. Step two applies business rules to those fields. That is usually safer than asking one prompt to read mess and decide policy in one leap.

See the visual.

raw input
   │
   ▼
┌──────────────┐
│ extract facts│
└──────┬───────┘
       ▼
┌──────────────┐
│ apply rules  │
└──────┬───────┘
       ▼
┌──────────────┐
│ craft answer │
└──────────────┘

Worked example: refund workflow as a chain

Suppose you are building a refund assistant. User message: "Our enterprise renewal happened two weeks ago. We already used around 6,500 API calls. Can we still request a refund?" A one-shot prompt might work. But let us chain it.

Step one extracts facts.

[SYSTEM]
Extract case facts.
Return JSON with keys: plan_type, days_since_renewal, api_calls_since_renewal, user_goal.
Return only JSON.

[USER]
Our enterprise renewal happened two weeks ago. We already used around 6,500 API calls. Can we still request a refund?

Possible step-one output.

{
  "plan_type": "enterprise",
  "days_since_renewal": 14,
  "api_calls_since_renewal": 6500,
  "user_goal": "request_refund"
}
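Before this object reaches step two, the orchestration code can check it. A minimal sketch, assuming the four keys from the step-one prompt and the value types shown in the sample output; `parse_case_facts` is a hypothetical helper name.

```python
import json

EXPECTED_TYPES = {
    "plan_type": str,
    "days_since_renewal": int,
    "api_calls_since_renewal": int,
    "user_goal": str,
}

def parse_case_facts(raw: str) -> dict:
    """Parse step one's reply and reject anything that breaks the handoff."""
    facts = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(facts, dict):
        raise ValueError("expected a JSON object")
    if facts.keys() != EXPECTED_TYPES.keys():
        raise ValueError(f"expected keys {sorted(EXPECTED_TYPES)}, got {sorted(facts)}")
    for key, expected in EXPECTED_TYPES.items():
        value = facts[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{key} should be {expected.__name__}, got {value!r}")
    return facts

raw_reply = """{
  "plan_type": "enterprise",
  "days_since_renewal": 14,
  "api_calls_since_renewal": 6500,
  "user_goal": "request_refund"
}"""
facts = parse_case_facts(raw_reply)
print(facts["api_calls_since_renewal"])
```

If the model answers "two weeks" instead of 14, this fails at the handoff instead of inside the policy step.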

Step two applies policy.

[SYSTEM]
You are a billing policy evaluator.
If api_calls_since_renewal > 5000, set eligibility to not_eligible and reason_code to usage_limit_exceeded.
Return JSON with keys: eligibility, reason_code.
Return only JSON.

[INPUT]
{"plan_type":"enterprise","days_since_renewal":14,"api_calls_since_renewal":6500,"user_goal":"request_refund"}

Possible step-two output.

{
  "eligibility": "not_eligible",
  "reason_code": "usage_limit_exceeded"
}

Step three writes the user-facing answer.

[SYSTEM]
You are a billing support assistant.
Write a calm answer in two bullets.
Do not invent policy details.

[INPUT]
Facts: {"plan_type":"enterprise","days_since_renewal":14,"api_calls_since_renewal":6500}
Decision: {"eligibility":"not_eligible","reason_code":"usage_limit_exceeded"}

Possible step-three output.

- Eligibility: The account is not eligible for a refund under the provided policy because usage since renewal exceeds the refund limit.
- Next step: If you want a manual review, contact billing operations with the renewal and usage details.

Simple, no? Each stage had one clean Work order. Each stage had a crisp Reply form. If the final answer is wrong, you inspect extraction, then decision, then wording. That is far easier than debugging one giant blob.
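The three steps can be wired together roughly like this. It is a sketch, not a production orchestrator: `call_model` is a hypothetical stub that returns canned replies taken from the example (the step-three reply is shortened), and the trace keys are made-up names.

```python
import json

# Canned replies based on the worked example; a real chain would call a model here.
CANNED = {
    "extract": '{"plan_type":"enterprise","days_since_renewal":14,'
               '"api_calls_since_renewal":6500,"user_goal":"request_refund"}',
    "decide": '{"eligibility":"not_eligible","reason_code":"usage_limit_exceeded"}',
    "respond": "- Eligibility: not eligible.\n- Next step: contact billing operations.",
}

def call_model(step: str, payload: str) -> str:
    # Hypothetical stand-in for a model client; it ignores the payload.
    return CANNED[step]

def compact(obj: dict) -> str:
    # Same compact JSON style as the [INPUT] blocks in the example.
    return json.dumps(obj, separators=(",", ":"))

def run_refund_chain(message: str) -> dict:
    trace = {}
    facts = json.loads(call_model("extract", message))
    trace["extract"] = facts
    decision = json.loads(call_model("decide", compact(facts)))
    trace["decide"] = decision
    # Step three gets the facts without user_goal, matching the example input.
    shown = {k: v for k, v in facts.items() if k != "user_goal"}
    step3_input = f"Facts: {compact(shown)}\nDecision: {compact(decision)}"
    trace["respond_input"] = step3_input
    trace["respond"] = call_model("respond", step3_input)
    return trace

trace = run_refund_chain("Our enterprise renewal happened two weeks ago.")
for stage, value in trace.items():
    print(stage, "->", value)
```

Reading the trace top to bottom follows the same debugging order: extraction, then decision, then wording.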

Chain design tradeoffs

Now what is the problem? Chains can amplify upstream mistakes. If extraction gets the date wrong, policy evaluation can apply its rule correctly and still return the wrong decision. So keep intermediate states inspectable. Log them. Version them. Test them separately.
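To see how an upstream slip passes straight through, the decision stage can be tested alone with fixed inputs. The sketch below uses a deterministic stand-in for the step-two prompt and simulates a misread usage count, since usage is the only rule the example states. The `undetermined` branch is a placeholder, because the example gives no rule for usage at or below the limit.

```python
API_CALL_LIMIT = 5000  # threshold taken from the step-two prompt

def evaluate_policy(facts: dict) -> dict:
    # Deterministic stand-in for the step-two prompt, used to test the stage in isolation.
    if facts["api_calls_since_renewal"] > API_CALL_LIMIT:
        return {"eligibility": "not_eligible", "reason_code": "usage_limit_exceeded"}
    # Placeholder outcome: the example states no rule for this branch.
    return {"eligibility": "undetermined", "reason_code": "no_rule_matched"}

good_facts = {"plan_type": "enterprise", "days_since_renewal": 14,
              "api_calls_since_renewal": 6500}
# Simulated extraction slip: 6,500 was read as 650.
bad_facts = {**good_facts, "api_calls_since_renewal": 650}

stage_log = []
for label, facts in [("correct extraction", good_facts), ("faulty extraction", bad_facts)]:
    decision = evaluate_policy(facts)
    # Record input and output together so each stage can be inspected later.
    stage_log.append({"case": label, "input": facts, "output": decision})

for entry in stage_log:
    print(entry["case"], "->", entry["output"]["eligibility"])
```

The decision stage behaves correctly in both cases. Only the logged input reveals that the second case started from a bad extraction.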

Latency and cost matter too. Three sequential model calls usually take longer and cost more than one. Sometimes a smaller model can handle step one. A larger model can handle step three. That is a useful production trick. Chains let you allocate model budget by step.
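Allocating budget by step can be written down as a small per-stage config. Everything in this sketch is a placeholder: `small-model` and `large-model` are not real model ids, and the cost numbers are made-up units that only illustrate the comparison.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageBudget:
    model: str              # placeholder tier name, not a real model id
    max_output_tokens: int  # illustrative cap per stage
    relative_cost: float    # made-up units, for comparison only

STAGE_BUDGETS = {
    "extract": StageBudget("small-model", 200, 1.0),
    "decide": StageBudget("small-model", 100, 1.0),
    "respond": StageBudget("large-model", 400, 10.0),
}

def chain_cost(budgets: dict[str, StageBudget]) -> float:
    # Total relative cost of running every stage once.
    return sum(b.relative_cost for b in budgets.values())

# The same chain with every stage moved to the larger tier.
all_large = {
    name: StageBudget("large-model", b.max_output_tokens, 10.0)
    for name, b in STAGE_BUDGETS.items()
}

print("mixed tiers:", chain_cost(STAGE_BUDGETS))
print("all large:  ", chain_cost(all_large))
```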

The Revision ledger should record chain structure, not only prompt text. A chain version is more than one file. It is prompt A, prompt B, the handoff schema, and the routing rule between them. Treat the whole pipeline as the unit of change.
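One possible way to treat the whole pipeline as the unit of change is to derive a single version id from the prompts, the handoff schema, and the routing rule together. This is a sketch under that assumption. `chain_fingerprint` and the example routing rule are hypothetical.

```python
import hashlib
import json

def chain_fingerprint(prompts: dict, handoff_schema: dict, routing: dict) -> str:
    # Serialize with sorted keys so identical content always hashes identically.
    payload = json.dumps(
        {"prompts": prompts, "handoff_schema": handoff_schema, "routing": routing},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

prompts = {
    "extract": "Extract case facts.",
    "decide": "You are a billing policy evaluator.",
}
routing = {"not_eligible": "respond"}  # hypothetical routing rule

v1 = chain_fingerprint(
    prompts,
    {"extract->decide": ["plan_type", "days_since_renewal",
                         "api_calls_since_renewal", "user_goal"]},
    routing,
)
# Same prompts, but one field dropped from the handoff schema.
v2 = chain_fingerprint(
    prompts,
    {"extract->decide": ["plan_type", "days_since_renewal",
                         "api_calls_since_renewal"]},
    routing,
)
print(v1, v2, v1 != v2)
```

The prompt text is identical in both versions, yet the id changes, which is exactly the case a prompt-only ledger would miss.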

Where this lives in the wild

RAG answer systems like Perplexity or Glean — one stage rewrites the query, another retrieves evidence, and a later stage answers with citations.
GitHub Copilot agentic flows — planning, tool selection, code edits, and final explanation are often separate stages because one mega-prompt is harder to control.
Intercom Fin support orchestration — intent detection, knowledge lookup, and final customer reply can be separated so each stage uses a tighter prompt.
Harvey legal workflows — extraction of clauses, issue spotting, and draft generation are often cleaner as chained prompts with typed handoffs.
Enterprise document-processing pipelines — OCR cleanup, field extraction, validation, and action recommendation are common prompt-chain stages owned by different teams.

Pause and recall

Why can one giant prompt become hard to debug?
What makes a good handoff between prompt stages?
Which common decomposition patterns appear in production systems?
Why should chain versions include schemas and routing, not just prompt text?

Interview Q&A

Q: Why choose prompt chaining instead of a single comprehensive prompt? A: Chaining can isolate subtasks, reduce instruction overload, and make failures easier to locate. It is useful when the task naturally decomposes into stable stages.

Common wrong answer to avoid: "Because more prompts always mean more accuracy." Extra steps help only when decomposition creates clarity. Otherwise they just add latency.

Q: Why should intermediate outputs be structured instead of free-form prose? A: Structured handoffs reduce parsing ambiguity and prevent later stages from reinterpreting vague text. They turn the chain into explicit contracts.

Common wrong answer to avoid: "Because JSON looks more professional." The real value is dependable orchestration.

Q: Why can a chain produce confidently wrong answers even when each step seems reasonable? A: An early extraction or routing error can propagate cleanly through later steps, so the pipeline becomes consistently wrong unless intermediate states are inspected.

Common wrong answer to avoid: "Later stages will usually correct earlier mistakes." Sometimes they do, but you should not rely on that.

Q: Why might different stages use different models or settings? A: Different subtasks have different cost, latency, and reasoning needs. Chains let teams reserve stronger models for the stages that truly need them.

Common wrong answer to avoid: "Using multiple models is always overengineering." It depends on task economics and control needs.

Apply now (5 min)

Exercise. Take one overloaded prompt from your domain. Split it into three verbs. For example, extract, decide, respond. Write one field-level handoff between each pair of stages. Then note where the Reply form of one step becomes the Work order for the next.

Sketch from memory. Draw three boxes in a row. Name each stage. Write the intermediate JSON object between them. Circle the box you would test first if results looked wrong.

Bridge. Chains make prompts easier to control. But now you have many prompts, not one. So next we need versioning discipline, or the whole system becomes impossible to maintain. → 09-prompt-versioning.md