
06. Prompt Engineering for the Project — System Prompts, Few-Shot, Chain-of-Thought

~11 min read. A prompt is not magic text — it is a precise specification that the model tries to follow; treat it like code.

Built on the ELI5 in 00-eli5.md. The blueprint (the system design document) already defined the model's role. Now we turn that role into a prompt the model can actually follow.


A prompt is a specification, not an incantation

See. Many people think prompt engineering is about finding magic words. Add "think step by step" and it gets smarter. Add "you are an expert" and it improves. Sometimes those tricks help. Often they do not.

Real prompt engineering is specification writing. You are telling a very capable but literal executor exactly what to do.

Bad specification:    "Answer questions helpfully."
Good specification:   "You are a support assistant for Acme Inc.
                       Answer questions using ONLY the provided context.
                       If the answer is not in the context, say 'I don't know.'
                       Always cite the article title you are drawing from.
                       Respond in 3 sentences or fewer."

The bad version is a wish. The good version is a contract. Write prompts like contracts.


The three layers of a production prompt

Every production system prompt has three layers.

┌────────────────────────────────────────────────────────────┐
│  Layer 1: Role Definition (who the model is)               │
│  "You are a support assistant for Acme Inc. ..."           │
├────────────────────────────────────────────────────────────┤
│  Layer 2: Task Instructions (what to do and not do)        │
│  "Answer using ONLY the provided context. If ..."          │
├────────────────────────────────────────────────────────────┤
│  Layer 3: Output Format (how to structure the response)    │
│  "Respond in JSON: {answer: string, source: string}"       │
└────────────────────────────────────────────────────────────┘

Layer 1 sets the persona. A bounded identity gives the model less room to answer outside its scope, which tends to reduce hallucination. Layer 2 gives the rules. This is where "I don't know" behaviour is specified. Layer 3 enforces structure. Structured output is parseable downstream.

Simple, no? Never collapse layers. Do not mix format rules into the role definition. Keep them separate so you can tune each independently.
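One way to keep the layers separate in practice is to store each as its own string and join them only when building the request. The sketch below assumes that approach; the constant names and the `build_system_prompt` helper are hypothetical, and the layer text is taken from the Acme example above.

```python
# Each layer lives in its own constant so it can be tuned independently.
ROLE = "You are a support assistant for Acme Inc."

TASK = (
    "Answer questions using ONLY the provided context.\n"
    "If the answer is not in the context, say 'I don't know.'\n"
    "Always cite the article title you are drawing from."
)

OUTPUT_FORMAT = 'Respond in JSON: {"answer": string, "source": string}'


def build_system_prompt(role: str, task: str, output_format: str) -> str:
    """Join the three layers with blank lines so each stays a distinct block."""
    layers = [role, task, output_format]
    if any(not layer.strip() for layer in layers):
        raise ValueError("every layer must be non-empty")
    return "\n\n".join(layer.strip() for layer in layers)


SYSTEM_PROMPT = build_system_prompt(ROLE, TASK, OUTPUT_FORMAT)
```

Changing the output format then means editing `OUTPUT_FORMAT` alone, which keeps prompt diffs small and easy to review.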


Few-shot examples: when and how

Few-shot means including example input-output pairs in the prompt.

When to use few-shot:
  • The output format is unusual and not covered by model training.
  • You need to enforce a very specific tone or style.
  • The task requires a reasoning pattern the model misses zero-shot.

When NOT to use few-shot:
  • When the task is common (summarisation, translation, simple Q&A).
  • When examples add tokens without improving accuracy.
  • When examples are stale and the task has changed.

Few-shot template:

EXAMPLE 1:
User question: "What is the return period?"
Context: "Acme returns are accepted within 30 days of purchase."
Answer: {"answer": "The return period is 30 days.", "source": "Return Policy"}

EXAMPLE 2:
User question: "Can I return a digital download?"
Context: "Digital products are non-refundable."
Answer: {"answer": "Digital downloads cannot be returned.", "source": "Digital Policy"}

Look. Each example shows the reasoning pattern: use context, cite source, give short answer. Two examples are usually enough. Three if the task has edge cases. More than five rarely helps and always costs tokens.
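Keeping the examples as data rather than hand-typed text makes them easier to update when the task changes. The following is a minimal sketch under that assumption; `EXAMPLES` and `render_few_shot` are hypothetical names, and the rendered text follows the template above.

```python
import json

# The two examples from the template, stored as data instead of raw text.
EXAMPLES = [
    {
        "question": "What is the return period?",
        "context": "Acme returns are accepted within 30 days of purchase.",
        "answer": {"answer": "The return period is 30 days.", "source": "Return Policy"},
    },
    {
        "question": "Can I return a digital download?",
        "context": "Digital products are non-refundable.",
        "answer": {"answer": "Digital downloads cannot be returned.", "source": "Digital Policy"},
    },
]


def render_few_shot(examples: list[dict]) -> str:
    """Render examples in the EXAMPLE n / question / context / answer layout."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"EXAMPLE {i}:\n"
            f'User question: "{ex["question"]}"\n'
            f'Context: "{ex["context"]}"\n'
            f"Answer: {json.dumps(ex['answer'])}"
        )
    return "\n\n".join(blocks)


FEW_SHOT_BLOCK = render_few_shot(EXAMPLES)
```

Serialising the answer with `json.dumps` guarantees the examples themselves are valid JSON, so the model is never shown a malformed target.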


Chain-of-thought: picture before formula

Chain-of-thought (CoT) prompting instructs the model to reason before answering. But first — the picture.

Without CoT:

User: "Should I approve this refund? Order: 28 days old, original condition, receipt present."
Model: "Approve."  ← correct but you cannot audit the reasoning

With CoT:

User: "Should I approve this refund? Order: 28 days old, original condition, receipt present. Think step by step."
Model: "Step 1: Return period is 30 days. Order is 28 days old. Within limit.
        Step 2: Condition is original. Policy requires original condition. Satisfied.
        Step 3: Receipt is present. Policy requires receipt. Satisfied.
        Decision: Approve."

CoT is most valuable when:
  • The task involves multiple conditions that must all be satisfied.
  • You need to audit decisions for compliance or debugging.
  • The model makes errors without intermediate reasoning.

CoT is wasteful when:
  • The answer is simple and lookup-based.
  • Latency SLA is tight (CoT adds 200–600 tokens to the output).
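For the audit use case, it helps if the reasoning and the final decision can be separated by code. The sketch below assumes the model is asked to end with a line such as "Decision: Approve", as in the refund example; the instruction wording, the regular expression, and `split_reasoning_and_decision` are illustrative assumptions, not a fixed convention.

```python
import re
from typing import Optional

# Assumed instruction: asks for numbered steps and a machine-readable last line.
COT_INSTRUCTION = (
    "Think step by step. Number each step, then finish with a single line "
    "of the form 'Decision: Approve' or 'Decision: Reject'."
)

DECISION_RE = re.compile(
    r"^\s*Decision:\s*(Approve|Reject)\b", re.IGNORECASE | re.MULTILINE
)


def split_reasoning_and_decision(response: str) -> tuple[list[str], Optional[str]]:
    """Return the reasoning steps (for the audit log) and the final decision."""
    steps = [
        line.strip()
        for line in response.splitlines()
        if line.strip().lower().startswith("step")
    ]
    matches = DECISION_RE.findall(response)
    # Use the last match in case the model restates its decision.
    decision = matches[-1].capitalize() if matches else None
    return steps, decision


sample = """Step 1: Return period is 30 days. Order is 28 days old. Within limit.
Step 2: Condition is original. Policy requires original condition. Satisfied.
Step 3: Receipt is present. Policy requires receipt. Satisfied.
Decision: Approve."""

steps, decision = split_reasoning_and_decision(sample)
```

A `None` decision signals that the model did not follow the format, which is worth logging separately from an explicit rejection.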


Worked example: prompt token budget

You have a 4 000-token total context budget for the API call. Let us allocate it.

System prompt (role + instructions + few-shot × 2):  450 tokens
Context block (3 retrieved chunks × 250 tokens):     750 tokens
User query:                                           50 tokens
CoT output budget (model response):                  500 tokens
──────────────────────────────────────────────────────────────
Total input:    450 + 750 + 50 = 1 250 tokens  (input)
Total output:   500 tokens  (output)
Grand total:    1 750 tokens (input + output)
API cost:       gpt-4o-mini at $0.15 / 1M input + $0.60 / 1M output
                = ((1250 × 0.15) + (500 × 0.60)) / 1 000 000
                = 0.0001875 + 0.0003 = $0.0004875 per call

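See. At $0.0004875 per call, you can serve roughly 2 050 calls for $1. The call also uses only 1 750 of the 4 000 budgeted tokens, so about 2 250 tokens of headroom remain for longer queries or an extra retrieved chunk. Staying within the token budget matters for cost. Every unnecessary few-shot example costs money at scale.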
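The same arithmetic can be kept in a small script so the budget is recomputed whenever the prompt changes. This is a sketch using the per-million prices quoted above; the function and variable names are hypothetical, and the prices should be checked against the provider's current price list before relying on the result.

```python
# Prices in USD per 1M tokens, as quoted in the worked example above.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call in USD."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


system_prompt_tokens = 450
context_tokens = 3 * 250  # three retrieved chunks
query_tokens = 50
output_tokens = 500       # CoT response budget

input_tokens = system_prompt_tokens + context_tokens + query_tokens
cost = cost_per_call(input_tokens, output_tokens)
calls_per_dollar = int(1 / cost)

print(f"input={input_tokens} output={output_tokens} cost=${cost:.7f}")
print(f"calls per dollar: {calls_per_dollar}")
```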


Prompt versioning and testing

Prompts are code. Version them.

prompts/
  v1_support_assistant.txt   ← baseline
  v2_support_assistant.txt   ← added few-shot
  v3_support_assistant.txt   ← added CoT instruction

For every version, record:
  • What changed.
  • Eval score before and after.
  • Token count before and after.

Never deploy a new prompt version without running the inspection — the full eval suite. A prompt change that improves 3 test queries and breaks 8 others is a regression.
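The "improves 3, breaks 8" case is easy to miss if you only look at an aggregate score, so a per-query diff is useful. The sketch below assumes each eval run is a mapping from query id to pass/fail; the helper names, the sample data, and the rule "more broken than fixed means regression" are illustrative assumptions, and a stricter team might block on any broken query.

```python
def diff_eval_runs(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two eval runs query by query (only queries present in both)."""
    shared = sorted(old.keys() & new.keys())
    return {
        "fixed": [q for q in shared if new[q] and not old[q]],
        "broken": [q for q in shared if old[q] and not new[q]],
    }


def is_regression(diff: dict[str, list[str]]) -> bool:
    """Assumed gate: more queries broken than fixed counts as a regression."""
    return len(diff["broken"]) > len(diff["fixed"])


# Hypothetical results for 20 test queries (True = passed).
old_run = {f"q{i:02d}": i > 3 for i in range(1, 21)}               # q01-q03 fail
new_run = {f"q{i:02d}": not (4 <= i <= 11) for i in range(1, 21)}  # q04-q11 fail

diff = diff_eval_runs(old_run, new_run)
print(len(diff["fixed"]), "fixed,", len(diff["broken"]), "broken")
print("regression:", is_regression(diff))
```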


Where this lives in the wild

  • GitHub Copilot — system prompt specifies code-only responses; few-shot examples demonstrate code format with inline comments.
  • Duolingo Max — role definition includes language level of the learner; CoT used for grammar explanation tasks only.
  • Notion AI — output format layer enforces Markdown with specific heading levels to match Notion block structure.
  • Intercom Fin — strict "I don't know" instruction in Layer 2; without it, the model would hallucinate policies.
  • Perplexity.ai — CoT reasoning shown before the final answer; allows users to audit source usage.

Pause and recall

  1. What are the three layers of a production system prompt? Name them without looking.
  2. In the token budget example, what was the total cost per call?
  3. When does CoT add value? When does it waste tokens?
  4. How many few-shot examples are usually enough?

Interview Q&A

Q: "How do you write a system prompt for a production RAG assistant?"

A: Three layers: role definition, task instructions (including what to do when the context does not contain the answer), and output format. I enforce "I don't know" behaviour explicitly — do not answer from model knowledge, only from context. I version the prompt and test every change against the eval suite.

Common wrong answer to avoid: "I write a few sentences describing the assistant." A few descriptive sentences without explicit rules produce inconsistent, unauditable behaviour.


Q: "What is chain-of-thought prompting and when does it help?"

A: CoT instructs the model to reason step by step before giving a final answer. It helps on multi-condition tasks where each condition must be checked, and on tasks where you need to audit the decision. It hurts on simple lookup tasks and tight latency SLAs because it adds 200–600 output tokens.

Common wrong answer to avoid: "CoT always improves results." CoT adds latency and cost. On simple tasks it is waste.


Q: "How do you test a prompt change without deploying it?"

A: I run the full eval suite against both the old prompt and the new prompt on the same held-out test set. I compare precision, recall, and format compliance. I deploy only if the new prompt improves on all three metrics or improves significantly on one without regressing on others.

Common wrong answer to avoid: "I test on a few examples and check if they look better." Manual inspection of a few examples is not a test. It is an opinion.


Q: "How do you handle a model that ignores your output format instructions?"

A: First, move the format instruction to Layer 3 and make it explicit with an example. If that fails, add a few-shot example showing the correct format. If the model still ignores it, add post-processing to parse and coerce the output. As a last resort, use structured output mode (JSON mode or function calling) to enforce the format at the API level.

Common wrong answer to avoid: "I just parse whatever the model returns." Fragile parsing breaks on every model upgrade.
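The post-processing step mentioned in the answer above might look like the following sketch. It assumes the two-field JSON format used throughout this chapter and uses only the standard library; `parse_model_output` and `REQUIRED_FIELDS` are hypothetical names. The point is that it validates and fails loudly instead of accepting whatever the model returns.

```python
import json

REQUIRED_FIELDS = ("answer", "source")


def parse_model_output(raw: str) -> dict:
    """Extract the JSON object from a model response and validate its fields."""
    # Tolerate chatter around the object, e.g. "Sure! {...}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in model output: {exc}") from exc
    bad = [f for f in REQUIRED_FIELDS if not isinstance(data.get(f), str)]
    if bad:
        raise ValueError(f"missing or non-string fields: {bad}")
    # Drop any extra keys so downstream code sees a fixed shape.
    return {f: data[f] for f in REQUIRED_FIELDS}
```

A caller can catch `ValueError`, count it as a format-compliance failure in the eval suite, and retry or fall back as appropriate.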


Apply now (5 min)

Write the complete system prompt for your capstone project. Include all three layers. Be explicit about "I don't know" behaviour. Specify the exact output format — use JSON with field names. Count the tokens. Make sure you are under 500 for the system prompt alone.

Sketch from memory: Draw the three-layer prompt structure. Fill in your own project's role definition and output format from memory.


Bridge. Prompt written and versioned. Now we need a way to know if it actually works — not just on demo queries, but on the full user distribution. That is the inspection. → 07-evaluation-design.md