03. Few-shot placement — where examples teach fastest

~13 min read. Good examples are expensive tokens, so their position must earn its keep.

Builds on the ELI5 in 00-eli5.md. The Sample deliverables (example outputs for the contractor) teach the pattern faster than abstract advice alone.


Why examples beat abstract instructions

See. If you say, "be concise," the model guesses what concise means. If you show one compact answer, the guess space shrinks. If you show three varied compact answers, the pattern becomes even clearer. Few-shot prompting works because examples demonstrate the target distribution. They show task shape, tone, structure, and boundary behavior together.

Picture first.

abstract rule only                    rule + examples
┌───────────────────────┐             ┌────────────────────────┐
│ be concise            │             │ be concise             │
│ cite sources          │             │ ex1: short + cited     │
│ stay factual          │             │ ex2: short + cited     │
└──────────┬────────────┘             └───────────┬────────────┘
           ▼                                      ▼
    model guesses pattern                  model copies pattern

Simple, no? The Sample deliverables show what success feels like in concrete form. They also expose hidden constraints. Maybe the answer should start with a verdict. Maybe bullets should use a fixed order. Maybe refusals should be calm and direct. Examples can teach all that at once.

Now what is the problem? Many teams add examples without design. They dump five nearly identical samples. They place them after a long user message. They forget edge cases. Then they say few-shot prompting is weak. Usually the placement or selection was weak.

Where should examples go?

There is no sacred law. But there are strong habits. A common structure is this. System prompt first. Then examples. Then the live user request. That keeps the Standing rulebook stable. Then the Sample deliverables teach the pattern. Then the new Work order arrives last.

┌──────────────────────────────┐
│ system rulebook              │
├──────────────────────────────┤
│ example 1                    │
├──────────────────────────────┤
│ example 2                    │
├──────────────────────────────┤
│ example 3                    │
├──────────────────────────────┤
│ live user request            │
└──────────────────────────────┘

Why does this often work? Because the model sees the contract first. Then it sees solved cases. Then it gets the new case to imitate. The live request benefits from recency. The examples benefit from being near the request. And the system layer still stays above both.
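The stacking order above can be sketched in a few lines of Python. This is a minimal illustration, not a fixed API: the `Example` class, the `build_prompt` function, and the bracketed block tags are assumptions chosen for readability, and the `User:` / `Label:` pair format is just one possible example shape.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One solved case (hypothetical structure for illustration)."""
    user: str   # sample input
    label: str  # the output we want the model to imitate


def build_prompt(system: str, examples: list[Example], live_request: str) -> str:
    """Assemble the prompt as rulebook, then examples, then the live request."""
    parts = ["[SYSTEM]", system.strip(), "", "[EXAMPLES]"]
    for ex in examples:
        # Every example uses the same User/Label shape so the pattern is easy to copy.
        parts += [f"User: {ex.user}", f"Label: {ex.label}", ""]
    # The live request goes last, closest to where the model starts writing.
    parts += ["[USER]", live_request.strip()]
    return "\n".join(parts)
```

Keeping each block behind a visible tag also makes the prompt easy to audit at a glance, which matters once several people maintain it.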

What about putting examples after the user request? Sometimes that works for retrieval-heavy prompts. But it can confuse task focus. The model reads the request, then must backtrack into demonstrations, then come forward again. For short prompts, this may still be fine. For long prompts, it often becomes messier.

Interleaving examples with instructions can help when each rule needs a concrete demonstration. But be careful. Mixed layouts are harder to audit. If your team cannot glance at the prompt and identify each block quickly, you made maintenance harder.

How many examples, and in what order?

Look. More examples are not always better. Each example spends tokens. Each extra example also risks teaching accidental noise. A bloated example set can crowd the Reply form or push the live request into the weak middle. So what to do? Use the minimum set that teaches the pattern and the edge.

A useful mental model is this. One example teaches the happy path. Two or three examples teach contrast. Four or five examples may help when the task has distinct subtypes. Beyond that, measure before you celebrate.
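"Measure before you celebrate" can be made concrete with a small harness. The sketch below is one possible shape, under stated assumptions: `classify` is a hypothetical callable that you supply (it would wrap your real model call), and `accuracy_by_shot_count` is an illustrative name, not a library function.

```python
from typing import Callable

# Each item is (user_message, expected_label).
Labeled = tuple[str, str]


def accuracy_by_shot_count(
    classify: Callable[[list[Labeled], str], str],
    examples: list[Labeled],
    eval_set: list[Labeled],
    shot_counts: list[int],
) -> dict[int, float]:
    """Score the first k examples, for each k, against a held-out eval set."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    results = {}
    for k in shot_counts:
        shots = examples[:k]  # use only the first k examples as demonstrations
        correct = sum(classify(shots, msg) == expected for msg, expected in eval_set)
        results[k] = correct / len(eval_set)
    return results
```

If accuracy stops improving between, say, three and five examples, the extra two are probably spending tokens without teaching anything new.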

Order matters too. The first example sets expectation. The last example often stays freshest. If one example is weird or low quality, it can contaminate the rest. Many teams place the cleanest representative example first. Then they place one boundary case. Then they place the hardest live-like case last. That is a practical pattern.

Diversity matters even more than count. Three cloned examples teach format. They do not teach robustness. A strong set covers different inputs, different surface wording, and at least one awkward edge. The Sample deliverables should teach the invariant pattern, not just one wording.

Here is the comparison.

bad set                              good set
┌──────────────────────┐             ┌─────────────────────────┐
│ ex1: refund email    │             │ ex1: normal refund ask  │
│ ex2: refund email    │             │ ex2: ambiguous ask      │
│ ex3: refund email    │             │ ex3: refusal boundary   │
└──────────────────────┘             └─────────────────────────┘
format copied                          format + judgment copied
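One rough way to catch the cloned set on the left before it ships is to flag example inputs that share most of their words. The sketch below uses Jaccard overlap on word sets. That is a crude proxy chosen for simplicity, the 0.6 threshold is an arbitrary assumption, and the function names are illustrative.

```python
import re
from itertools import combinations


def _words(text: str) -> set[str]:
    """Lowercase word set; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by total distinct words across both texts."""
    wa, wb = _words(a), _words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def near_duplicates(inputs: list[str], threshold: float = 0.6) -> list[tuple[int, int]]:
    """Return index pairs of example inputs whose word overlap meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(inputs), 2)
        if jaccard(a, b) >= threshold
    ]
```

Word overlap only catches surface clones. Two examples can use different words and still teach the same case, so a human read of the set is still needed.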

Worked example — bad placement, better placement

Suppose you are building a support triage classifier. The model must return one label. Allowed labels are billing, bug, feature_request, and account_access. A weak prompt might look like this.

Classify this message.
User: I was charged twice and need a refund.

Examples:
User: I cannot log in.
Label: account_access

User: Dark mode should support custom themes.
Label: feature_request

Possible model response.

This looks like a billing issue because the user mentions a refund.

See the problem. The examples come after the task. The allowed labels are never listed, and the live message has no Label slot to fill. So the model answered in prose instead of one label. The Reply form was too weak.

Now the repaired prompt.

[SYSTEM]
You are a support triage classifier.
Return exactly one label from this set:
billing | bug | feature_request | account_access
Return the label only.

[EXAMPLES]
User: I cannot log in after resetting my password.
Label: account_access

User: The app crashes when I export a report.
Label: bug

User: I was billed twice for the same month.
Label: billing

[USER]
I was charged twice and need a refund.

Possible model response.

billing

Same base model again. Better placement. Better diversity. Better answer shape. The Sample deliverables sat near the live request. The Reply form stayed explicit. The user case matched a learned pattern quickly.
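Because the repaired prompt asks for the label only, the reply can be checked mechanically. A minimal sketch, assuming the four labels from this worked example; `parse_label` is a hypothetical helper name.

```python
from typing import Optional

# The label set from the worked example above.
ALLOWED_LABELS = {"billing", "bug", "feature_request", "account_access"}


def parse_label(reply: str) -> Optional[str]:
    """Return the label if the reply is exactly one allowed label, else None."""
    candidate = reply.strip().lower()
    return candidate if candidate in ALLOWED_LABELS else None
```

With this check, the prose answer from the weak prompt returns `None`, while the bare `billing` reply passes. A `None` result is a useful signal to retry or to log the prompt for review.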

Now a subtle point. If every example shows only exact matches, then the model may overfit surface words. Include lexical variation. Use one example that says "charged twice." Use another that says "duplicate invoice." Use another that says "double billed." That teaches meaning, not mere phrase copying.

Common failure patterns with few-shot prompts

Failure one. Examples contradict the system prompt. Now the model sees two teachers fighting. The example often wins in practice. So audit them together.

Failure two. Examples are outdated. The product changed labels last month. The prompt still shows the old taxonomy. Then the model faithfully copies obsolete behavior. Bad examples are worse than no examples.

Failure three. One beautiful example sets the style but does not improve task success. Teams pick a sample because it sounds elegant. But it may hide missing citations or missing abstention. Remember, examples teach everything they contain, including accidents.

Failure four. There are too many examples. The prompt becomes a museum. Latency rises. Relevant instructions get buried. The live Work order loses salience. This is why few-shot prompting must be evaluated, not admired.
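The first two failures can be caught with a simple audit that compares example labels against the label set the system prompt currently allows. This is a sketch under assumptions: `audit_examples` is an illustrative name, and the label `payments` in the usage note is a made-up stand-in for an old taxonomy entry.

```python
def audit_examples(
    examples: list[tuple[str, str]], allowed: set[str]
) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (stale, uncovered).

    stale: examples whose label is not in the current allowed set.
    uncovered: allowed labels that no example demonstrates.
    """
    stale = [(msg, label) for msg, label in examples if label not in allowed]
    used = {label for _, label in examples}
    uncovered = sorted(allowed - used)
    return stale, uncovered
```

For instance, with an allowed set of billing, bug, feature_request, and account_access, an example labeled `payments` would appear in `stale`, and any label without a demonstration would appear in `uncovered`. Running a check like this whenever the taxonomy changes keeps the examples and the rulebook from drifting apart.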


Where this lives in the wild

  • Anthropic Claude prompt engineers — support and analysis workflows often place XML-wrapped examples between standing instructions and the live query so Claude sees pattern and scope together.
  • GitHub Copilot — classification and transformation prompts inside developer tooling often include terse examples because one good before-and-after pair teaches edit style faster than a paragraph of advice.
  • Harvey legal teams — prompt designers use few-shot memo snippets to show issue-spotting structure, citation style, and hedge language for different legal tasks.
  • Intercom Fin — conversation designers include edge-case support examples so escalations, refunds, and unsupported requests do not all collapse into one cheerful answer pattern.
  • Glean knowledge assistants — enterprise search teams use varied question-answer examples to teach citation placement and abstention across departments with different language habits.

Pause and recall

  • Why do examples often teach faster than abstract instructions alone?
  • Why is "system, then examples, then live request" a strong default?
  • Why can three diverse examples beat five near-duplicates?
  • What accidental behaviors can examples teach if you are careless?

Interview Q&A

Q: Why place few-shot examples after the system prompt and before the live user request? A: That ordering keeps durable rules at the top, demonstrates the target pattern, and leaves the live task close to the model's most recent context.

Common wrong answer to avoid: "Because transformers only remember the last few lines." Recency matters, but the main point is layered clarity plus pattern teaching.

Q: Why can more examples reduce quality instead of improving it? A: Extra examples consume tokens, add noise, and may bury the active task or format rules. Each example must justify its cost.

Common wrong answer to avoid: "More data is always better." Few-shot examples are not training data in the usual sense. They are expensive in-context demonstrations.

Q: Why should example sets emphasize diversity over stylistic polish? A: Because the goal is to teach invariant behavior across varied inputs, not to make the prompt look elegant. Diversity improves transfer to new cases.

Common wrong answer to avoid: "The prettiest example is the best example." Beauty without coverage teaches fragile behavior.

Q: Why can outdated few-shot examples be dangerous? A: The model may faithfully imitate obsolete labels, policies, or answer structure even when the rest of the prompt is correct.

Common wrong answer to avoid: "The model will know the examples are old." It usually will not. It imitates what you show.


Apply now (5 min)

Exercise. Take one task you know. Write one system rule. Then write three Sample deliverables. Make one normal. Make one edge case. Make one refusal or abstention case. Place the live request last.

Sketch from memory. Draw the stack with Standing rulebook first, then examples, then Work order, then Reply form. Mark the last example as the freshest pattern.


Bridge. Good examples teach the right moves. But examples can also teach the wrong moves by omission. So next we need explicit red lines, not just positive demonstrations. → 04-negative-examples.md