01. Opening Failure — Parts Pass, System Fails

~11 min read. You built every component to specification — then wired them together and the answer was wrong.

Built on the ELI5 in 00-eli5.md. The blueprint — the system design document that captures what you are building and why — does not exist yet. That is why integration breaks first.


The integration gap nobody warns you about

See. Each module in this curriculum tested one idea. The retriever returned the right chunks. Pass. The prompt produced the right format. Pass. The agent called the right tool. Pass.

Now chain them. The right chunks go into the wrong prompt format. The agent calls the tool, but the output is not what the next step expected. The whole pipeline returns a confident wrong answer.

This is called the integration gap. Components pass unit tests. Systems fail integration tests. Simple, no?

Here is the picture. Think of three pipes, each carrying clean water when tested alone. Join them, and what comes out of the last pipe is dirty.

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Retriever│────▶│  Prompt  │────▶│  Model   │
│  OK ✓    │     │  OK ✓    │     │  OK ✓    │
└──────────┘     └──────────┘     └──────────┘
       ▲                ▲                ▲
       │                │                │
   Tested alone    Tested alone    Tested alone
                                   ┌──────────┐
                                   │  Output  │
                                   │  WRONG ✗ │
                                   └──────────┘

The failure is in the interfaces, not the components. Interfaces are the spaces between the boxes.


Why interfaces fail

Interface failures come in three kinds.

Format mismatch. Retriever returns chunks as a list of dicts. Prompt template expects a plain string. No error is thrown. The list is silently coerced to its string representation. The model sees [{'text': 'answer...', 'score': 0.87}] as context. It produces garbage.
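The format mismatch is easy to reproduce. The sketch below assumes a hypothetical template and chunk shape; `PROMPT_TEMPLATE`, `build_prompt_naive`, `build_prompt_checked`, and `chunks_to_context` are illustrative names, not part of any library. The naive builder accepts the list without complaint. The checked builder refuses anything that is not already a string.

```python
# Hypothetical prompt template; the field names are assumptions.
PROMPT_TEMPLATE = "Context:\n{context}\n\nQuestion: {question}"

chunks = [{"text": "answer...", "score": 0.87}]


def build_prompt_naive(context, question):
    # str.format calls str() on whatever it receives, so a list of dicts
    # is silently rendered as its Python repr. No error is raised.
    return PROMPT_TEMPLATE.format(context=context, question=question)


def build_prompt_checked(context, question):
    # Fail loudly at the boundary instead of coercing.
    if not isinstance(context, str):
        raise TypeError(f"context must be str, got {type(context).__name__}")
    return PROMPT_TEMPLATE.format(context=context, question=question)


def chunks_to_context(chunks):
    # Explicit conversion: keep only the text, one chunk per paragraph.
    return "\n\n".join(chunk["text"] for chunk in chunks)


naive = build_prompt_naive(chunks, "What is the refund window?")
print(naive)  # the context section contains the raw list repr

clean = build_prompt_checked(chunks_to_context(chunks), "What is the refund window?")
print(clean)  # the context section contains only the chunk text
```

Passing `chunks` straight into `build_prompt_checked` raises a `TypeError`, which is the loud failure you want at this boundary.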

Semantic mismatch. Retriever returns technically relevant chunks. But the chunks answer a slightly different question than the user asked. The model is not told this. It tries to answer anyway. The answer is confident and wrong.

Latency cascade. Each component is fast alone. Retriever: 120 ms. Embedding: 80 ms. LLM call: 600 ms. In sequence: 800 ms. Under load, the LLM call spikes to 2 400 ms, so the sequence now needs 2 600 ms. The pipeline's request timeout fires at 1 500 ms. The whole pipeline returns an error, not a slow answer.
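The arithmetic of the cascade can be sketched in a few lines. This is a simulation only: it adds up the latencies from the paragraph above instead of calling real services, and it assumes the 1 500 ms timeout applies to the whole sequence. The function names are hypothetical.

```python
def pipeline_latency_ms(stage_latencies_ms):
    # Stages run in sequence, so their latencies add up.
    return sum(stage_latencies_ms.values())


def run_with_timeout(stage_latencies_ms, timeout_ms):
    # Simulated outcome: an error if the sequence exceeds the timeout.
    total = pipeline_latency_ms(stage_latencies_ms)
    if total > timeout_ms:
        return f"error: timed out after {timeout_ms} ms (needed {total} ms)"
    return f"ok: answered in {total} ms"


normal = {"retriever": 120, "embedding": 80, "llm": 600}
under_load = {"retriever": 120, "embedding": 80, "llm": 2400}

print(run_with_timeout(normal, timeout_ms=1500))      # ok: answered in 800 ms
print(run_with_timeout(under_load, timeout_ms=1500))  # error: needed 2600 ms
```

Only one number changed between the two runs. The retriever and the embedding step are exactly as fast as before, yet the user gets an error.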

Look. The lesson here is not that components are bad. The lesson is that interfaces need their own tests.


A worked example: the cost of one mismatch

Say your retriever returns 5 chunks, each 200 tokens. Your prompt reserves a budget of 1 000 tokens for retrieved context. That fits exactly: 5 × 200 = 1 000. Good.

Now the retriever team adds metadata to each chunk. Suddenly each chunk is 200 tokens of text + 50 tokens of metadata = 250 tokens. Total: 5 × 250 = 1 250 tokens. Over budget by 250 tokens.

What happens?

Context budget:  1 000 tokens
Actual context:  1 250 tokens
Overflow:          250 tokens  →  last chunk truncated mid-sentence

Chunk 5 before: "The refund policy covers all purchases made within 30 days..."
Chunk 5 after:  "The refund policy covers all purchases made within"

The model now reads an incomplete policy statement. It hallucinates the rest. The customer gets wrong refund information. No component failed. The interface failed.

This is the plumbing breaking — the connection between what the retriever produces and what the prompt assembly step consumes.
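A budget check at the prompt assembly step would have turned this silent truncation into a loud error. The sketch below reuses the numbers from the example. `check_context_budget` is a hypothetical helper, and it assumes the token count of each chunk is already known.

```python
def check_context_budget(chunk_token_counts, budget):
    # Refuse to assemble a prompt that would overflow the context budget.
    total = sum(chunk_token_counts)
    if total > budget:
        raise ValueError(
            f"context is {total} tokens, budget is {budget} "
            f"(over by {total - budget})"
        )
    return total


TEXT_TOKENS = 200
METADATA_TOKENS = 50
BUDGET = 1000

before = [TEXT_TOKENS] * 5                    # 5 x 200
after = [TEXT_TOKENS + METADATA_TOKENS] * 5   # 5 x 250

print(check_context_budget(before, BUDGET))   # 1000, exactly on budget

try:
    check_context_budget(after, BUDGET)
except ValueError as err:
    print(err)  # context is 1250 tokens, budget is 1000 (over by 250)
```

What to do on overflow is a separate design choice. Dropping the lowest-scoring chunk whole is one option. Cutting a chunk mid-sentence is the one to avoid.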


How to detect integration failures early

Do not wait for a user to report a wrong answer. Build a canary check — a small set of end-to-end test cases — that you run on every code change. Each test case has a fixed input, a known retrieval result, and an expected output property.

Use assertions at each boundary in CI:

Input query  →  assert retriever returns ≥ 3 chunks with score > 0.7
Chunks       →  assert total token count ≤ context budget
Assembled    →  assert prompt contains required sections (system, context, question)
LLM output   →  assert output is valid JSON if format was requested

This is defensive programming applied to AI systems. Each assertion is a smoke alarm. You want loud alarms, not silent smoke.
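The four boundary checks above could look like this in Python. This is a sketch under assumptions: each chunk is a dict with `score` and `tokens` fields, the assembled prompt is a dict of named sections, and the thresholds are the ones listed above. All function names are hypothetical.

```python
import json

CONTEXT_BUDGET = 1000  # tokens, taken from the worked example
MIN_CHUNKS = 3
MIN_SCORE = 0.7
REQUIRED_SECTIONS = ("system", "context", "question")


def check_retrieval(chunks):
    # Boundary 1: at least MIN_CHUNKS chunks score above MIN_SCORE.
    good = [c for c in chunks if c["score"] > MIN_SCORE]
    assert len(good) >= MIN_CHUNKS, f"only {len(good)} chunks above {MIN_SCORE}"


def check_budget(chunks):
    # Boundary 2: total token count fits the context budget.
    total = sum(c["tokens"] for c in chunks)
    assert total <= CONTEXT_BUDGET, f"{total} tokens exceeds {CONTEXT_BUDGET}"


def check_prompt(prompt_sections):
    # Boundary 3: every required section is present and non-empty.
    missing = [s for s in REQUIRED_SECTIONS if not prompt_sections.get(s)]
    assert not missing, f"prompt is missing sections: {missing}"


def check_json_output(raw_output):
    # Boundary 4: the model output parses as JSON.
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError as err:
        raise AssertionError(f"LLM output is not valid JSON: {err}") from err


# Made-up sample data, for illustration only.
chunks = [
    {"text": "Refunds within 30 days.", "score": 0.91, "tokens": 200},
    {"text": "Refunds need a receipt.", "score": 0.84, "tokens": 200},
    {"text": "Refunds go to the original card.", "score": 0.78, "tokens": 200},
]
check_retrieval(chunks)
check_budget(chunks)
check_prompt({"system": "You are a support agent.", "context": "...", "question": "..."})
parsed = check_json_output('{"answer": "30 days"}')
print(parsed)
```

Plain `assert` statements suit a CI test run. Python skips them when started with the `-O` flag, so checks that must also run in production are better written as explicit `raise` statements.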


The three questions to ask at every interface

Before connecting any two components, answer three questions.

1. What format does the upstream component produce? Write it down. Not in your head — write it.

2. What format does the downstream component expect? Write it down. Explicitly.

3. What happens when the two formats drift? Add a schema check or an assertion at the boundary. Make the failure loud and early, not silent and late.

See. Silent failures are the worst failures in AI systems. A loud crash is a gift. You know exactly where to look. A confident wrong answer can run in production for weeks.
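Question 3 can be answered with a small schema check at the boundary. The sketch below assumes the written-down contract is "a list of dicts, each with a `text` string and a `score` float". `REQUIRED_FIELDS`, `ContractError`, and `validate_chunks` are hypothetical names. A schema library would do the same job with less code.

```python
# The contract, written down. Field names and types are assumptions.
REQUIRED_FIELDS = {"text": str, "score": float}


class ContractError(Exception):
    """Raised when upstream output does not match the written-down contract."""


def validate_chunks(chunks):
    # Check the container first, then every field of every chunk.
    if not isinstance(chunks, list):
        raise ContractError(f"expected list, got {type(chunks).__name__}")
    for i, chunk in enumerate(chunks):
        if not isinstance(chunk, dict):
            raise ContractError(f"chunk {i}: expected dict, got {type(chunk).__name__}")
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in chunk:
                raise ContractError(f"chunk {i}: missing field '{field}'")
            if not isinstance(chunk[field], expected_type):
                raise ContractError(
                    f"chunk {i}: field '{field}' should be "
                    f"{expected_type.__name__}, got {type(chunk[field]).__name__}"
                )
    return chunks


validate_chunks([{"text": "ok", "score": 0.87}])  # passes silently

try:
    # Upstream renamed "text" to "content": the formats have drifted.
    validate_chunks([{"content": "ok", "score": 0.87}])
except ContractError as err:
    print(err)  # chunk 0: missing field 'text'
```

The error names the chunk and the field. That is the loud, early failure: you know where to look before the model ever sees the payload.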


Where this lives in the wild

  • GitHub Copilot — retrieval of code context feeds a prompt template; format contract between retrieval and prompt is explicit and versioned.
  • Notion AI — page content chunking must match summarization prompt expectations; the interface is tested in isolation.
  • Perplexity.ai — search snippets are structured before entering the answer-generation prompt; a schema mismatch would corrupt citations.
  • Intercom Fin — support document retrieval outputs feed a strict answer template; interface validation runs on every deployment.
  • Amazon Alexa LLM — voice recognition output format is normalized before entering the language model; interface contract is enforced by a marshalling layer.

Pause and recall

  1. What is the integration gap? Define it in one sentence without looking it up.
  2. Name the three types of interface failures from this file.
  3. In the token example, how many tokens of overflow occurred? What was the effect?
  4. What are the three questions to ask at every interface?

Interview Q&A

Q: "You said your RAG system works. How do you know the retriever and the generator are actually compatible?"

A: I test the interface explicitly. I capture the exact output of the retriever and run an integration test that feeds that output to the generator. I also schema-validate the retriever output on every call.

Common wrong answer to avoid: "They work together because I tested them both individually." Individual tests do not cover interface contracts.


Q: "A customer reports wrong answers. All your component tests pass. How do you debug this?"

A: I look at the interfaces first. I log the full payload at each boundary — what went in, what came out. I compare the retriever output format against what the prompt template expects. Silent format coercions are the first suspects.

Common wrong answer to avoid: "I re-run the unit tests." The unit tests already pass. That is the puzzle, not evidence of innocence.


Q: "What is a latency cascade and how do you prevent it?"

A: A latency cascade is when one component's slow response causes another component's timeout to fire, collapsing the whole pipeline. Prevention: set a timeout on each component independently, add circuit breakers, and test under load, not just for correctness.

Common wrong answer to avoid: "I increase the global timeout." That hides the real bottleneck and makes the user wait longer.
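If the interviewer asks what a circuit breaker looks like, a minimal sketch helps. The class below is an illustration under assumptions, not a production implementation: it counts consecutive failures, opens after `max_failures`, and closes again after a cool-down. The class name and parameters are hypothetical, and the `clock` argument exists so the behavior can be tested without waiting.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of waiting on a struggling component.
                raise RuntimeError("circuit open: skipping call")
            # Cool-down elapsed: close the circuit and start counting again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the count
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])


def slow_llm():
    raise TimeoutError("llm too slow")


for _ in range(2):
    try:
        breaker.call(slow_llm)
    except TimeoutError:
        pass

try:
    breaker.call(slow_llm)
except RuntimeError as err:
    print(err)  # circuit open: skipping call
```

While the circuit is open, callers get an immediate error and can fall back to a cached or degraded answer, so one slow component does not hold every request hostage.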


Q: "How do you write an integration test for an LLM pipeline?"

A: I fix a golden input, capture the retriever output for that input, run the full pipeline, and assert that the final output matches expected properties — not exact text, but structure, factual claims, and format. I re-run this on every merge.

Common wrong answer to avoid: "Integration tests are too slow, I skip them." Skipping integration tests is how silent failures reach production.
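A golden-input test of the kind described in that answer might be sketched as follows. Everything here is a stand-in: `fake_llm` replaces the real model call so the example is deterministic, `run_pipeline` is a hypothetical two-step pipeline, and the golden case reuses the refund sentence from the worked example. In a real suite the stub would be swapped for the actual model or a recorded response.

```python
import json

GOLDEN_CASE = {
    "query": "What is the refund window?",
    # Captured retriever output for this query (made up for illustration).
    "retrieved": [
        {"text": "The refund policy covers all purchases made within 30 days.",
         "score": 0.9},
    ],
    "must_mention": "30 days",
}


def fake_llm(prompt):
    # Stand-in for a real model call so the test is deterministic.
    return json.dumps({"answer": "Refunds are available within 30 days of purchase."})


def run_pipeline(query, retrieved, llm):
    # Assemble the prompt from the retrieved text, then call the model.
    context = "\n\n".join(chunk["text"] for chunk in retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nReply as JSON."
    return llm(prompt)


def test_golden_case():
    raw = run_pipeline(GOLDEN_CASE["query"], GOLDEN_CASE["retrieved"], fake_llm)
    output = json.loads(raw)                                # format: valid JSON
    assert set(output) == {"answer"}                        # structure: expected keys
    assert GOLDEN_CASE["must_mention"] in output["answer"]  # factual claim


test_golden_case()
```

The assertions check properties, not exact text. The wording of the answer can change between model versions without breaking the test, but a missing "30 days" will.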


Apply now (5 min)

Pick any two components you have built so far in this course.

  1. Draw the interface between them on paper.
  2. Write down the exact output format of component A.
  3. Write down the exact input format expected by component B.
  4. Find one place they could silently mismatch.
  5. Write one assertion that would catch that mismatch.

Sketch from memory: Without looking back, draw the three-pipe ASCII diagram showing components passing individually but failing together. Add labels at each interface.


Bridge. Now that you know what breaks and why, the question becomes: what should you build in the first place? Start with the user, not the model. That is the blueprint. → 02-system-design-blueprint.md