12. Integration Debugging — When the Whole System Breaks¶
~11 min read. Every integrated system breaks eventually — the engineers who debug fastest are the ones who prepared the most.
Built on the ELI5 in 00-eli5.md. The plumbing — data pipelines connecting all components — breaks in ways individual components do not. Integration debugging is a skill. This file teaches the method.
The first rule: do not debug from the output¶
See. Your system returns a wrong answer. The instinct: stare at the output. Adjust the prompt. Run again. This is the wrong method.
The output is the last thing produced. The wrong answer could have been caused by something at step 1 of 5. If you fix the prompt and the root cause was the retriever, you patched the symptom. The real failure is still there, waiting to manifest differently.
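The right method: trace backward from the output toward the input. At each step, ask: "Is the output of this step correct?" Keep walking back while the answer is "no." The failure is the earliest step whose output is wrong even though its input was correct.
This is closely related to bisection debugging. Strict bisection halves the search space with each check; a five-step pipeline is short enough to walk one step at a time. The principle is the same as in traditional software: isolate the fault before you change anything.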
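The backward walk can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation. It assumes each step's logged output can be judged by a yes/no check. `StepRecord`, `localize_failure`, and the sample checks are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepRecord:
    name: str                              # e.g. "retrieval"
    output: object                         # what the step logged as its output
    is_correct: Callable[[object], bool]   # hypothetical check for that output


def localize_failure(steps: list[StepRecord]) -> Optional[str]:
    """Walk from the last step back toward the first.

    Returns the name of the earliest step whose output is wrong,
    or None if the final output is already correct.
    """
    culprit = None
    for step in reversed(steps):
        if step.is_correct(step.output):
            break              # this step is fine, so the fault is downstream of it
        culprit = step.name
    return culprit


# Toy trace, listed from input to output. The assembly step produced an empty context.
steps = [
    StepRecord("query", "can i return a sale item", lambda q: bool(q)),
    StepRecord("retrieval", ["sale items excluded"], lambda c: len(c) > 0),
    StepRecord("assembly", "", lambda p: len(p) > 0),
    StepRecord("llm", "Yes, within 30 days", lambda a: "excluded" in a),
    StepRecord("output", "Yes, within 30 days", lambda a: "excluded" in a),
]
print(localize_failure(steps))  # assembly
```

The walk stops at the first step that checks out, so the fix goes to the step just after it and nowhere else.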
The debugging stack: what to check in order¶
┌───────────────┐
│ Step 5: Output│
└───────┬───────┘
▼
┌───────────────┐
│ Step 4: LLM │
└───────┬───────┘
▼
┌───────────────┐
│ Step 3: Build │
│ context │
└───────┬───────┘
▼
┌───────────────┐
│ Step 2: │
│ Retrieval │
└───────┬───────┘
▼
┌───────────────┐
│ Step 1: Query │
└───────────────┘
Step 5 (output): Is the final response correct and well-formatted?
→ If NO, check step 4.
→ If YES, the system is working.
Step 4 (LLM): Is the LLM producing the right response given the prompt?
→ Log the exact prompt sent. Paste into playground. Test manually.
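→ If the playground gives the correct answer, the prompt itself is fine. Suspect what differs in the pipeline's call: model version, temperature, token limits, or post-processing.
→ If the playground also fails, the prompt content (steps 1 to 3) or the model is the issue.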
Step 3 (assembly): Is the assembled context correct?
→ Log the full prompt before sending. Inspect every section.
→ Is the context block populated? Is it truncated? Is it relevant?
Step 2 (retrieval): Are the retrieved chunks correct?
→ Log the top-k chunks and their scores.
→ Are the chunks relevant to the query?
→ Is the score above the minimum threshold?
Step 1 (query): Is the user query being parsed correctly?
→ Log the raw input and the parsed query.
→ Is the query being transformed before embedding? Is that correct?
Look. You check step 5 first. If it is wrong, go to step 4. You never skip steps. Skipping steps is how you fix the wrong thing confidently.
Common integration failures and their signatures¶
Failure A: Correct retrieval, wrong answer. Signature: retrieved chunks are relevant, but the model ignores them. Cause: system prompt does not enforce "answer only from context." Fix: add explicit instruction. Test with adversarial query (ask about something NOT in context).
Failure B: Wrong retrieval, reasonable-looking answer. Signature: model produces fluent, confident answer but it is not grounded in the actual KB. Cause: retriever returning irrelevant chunks with moderate scores. Fix: raise the minimum retrieval score threshold. Check embedding model alignment.
Failure C: No retrieval results. Signature: retriever returns empty or below-threshold results. Cause: query embedding and document embedding are misaligned (different models or versions). Fix: verify that query and documents use the same embedding model.
Failure D: Correct answer, wrong format. Signature: response is factually correct but not in the expected JSON or structured format. Cause: model ignored the format instruction; or a new model version changed default format. Fix: move format enforcement to Layer 3 of prompt. Add post-processing validation.
Failure E: Intermittent wrong answers. Signature: same query sometimes returns a good answer, sometimes a wrong answer. Cause: LLM non-determinism (temperature > 0) combined with borderline retrieval scores. Fix: reduce temperature. Add retrieval score logging to identify borderline cases.
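Failures B and E both come down to how retrieval scores sit relative to the cutoff. A rough sketch of a threshold filter that also flags borderline scores is shown below. The cutoff `MIN_SCORE`, the band `MARGIN`, and the function name `filter_chunks` are assumptions chosen for illustration; real values have to come from your own evaluation data.

```python
MIN_SCORE = 0.65   # assumed cutoff; tune on your own data
MARGIN = 0.03      # assumed width of the "borderline" band around the cutoff


def filter_chunks(chunks, min_score=MIN_SCORE, margin=MARGIN):
    """Split (text, score) pairs into kept chunks and borderline warnings."""
    kept, borderline = [], []
    for text, score in chunks:
        if score >= min_score:
            kept.append((text, score))
        if abs(score - min_score) <= margin:
            borderline.append((text, score))   # log these: they can flip between runs
    return kept, borderline


kept, borderline = filter_chunks([("a", 0.82), ("b", 0.66), ("c", 0.41)])
print(kept)        # [('a', 0.82), ('b', 0.66)]
print(borderline)  # [('b', 0.66)]
```

Logging the borderline list per request makes it possible to check whether intermittent wrong answers cluster around the cutoff.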
Worked example: debugging a real integration failure¶
The system is returning wrong refund policy information. Precision@3 dropped from 0.78 to 0.54 after a KB update.
Start at the output (step 5 in the stack): wrong answer confirmed. The drop followed a KB update, so pull the retrieval logs (step 2) next.
Query: "Can I return a sale item?"
Retrieved:
Chunk 1 (score 0.71): "Sale items are excluded from the standard return policy."
Chunk 2 (score 0.68): "All items purchased at full price can be returned within 30 days."
Chunk 3 (score 0.67): "Gift cards are non-refundable."
Retrieval looks right: Chunk 1 is the relevant chunk and it is ranked first. So the failure is downstream of retrieval. Check step 3: assembly.
Context block received by LLM:
[Chunk 3 — score 0.67]: "Gift cards are non-refundable."
[Chunk 2 — score 0.68]: "All items purchased at full price..."
[Chunk 1 — score 0.71]: "Sale items are excluded..."
See. The chunks are sorted ascending, not descending. The least relevant chunk now sits at the top of the context, and the model appears to have leaned on it most heavily. The KB update shipped new chunking code that changed the sort order.
Root cause: a code change in the context assembly step reversed the chunk sort order.
Fix: change sort(ascending=True) to sort(ascending=False) in the assembly function.
Result: precision@3 recovers to 0.79.
This failure was not in the retriever. Not in the prompt. Not in the model. It was in the plumbing — the assembly step between retrieval and generation.
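A small sketch of the corrected assembly step, with a regression check on ordering, is shown below. It is an illustration under assumptions: the function name `assemble_context`, the `(text, score)` tuple shape, and the label format are hypothetical, not the actual code from the system in this example.

```python
def assemble_context(chunks):
    """Build the context block with the highest-scoring chunk first.

    `chunks` is a list of (text, score) pairs. The label format is hypothetical.
    """
    ordered = sorted(chunks, key=lambda c: c[1], reverse=True)
    return "\n".join(f"[score {score:.2f}]: {text}" for text, score in ordered)


retrieved = [
    ("Gift cards are non-refundable.", 0.67),
    ("Sale items are excluded from the standard return policy.", 0.71),
    ("All items purchased at full price can be returned within 30 days.", 0.68),
]
context = assemble_context(retrieved)
first_line = context.splitlines()[0]

# Regression check: the best-scoring chunk must lead the context block.
assert first_line.startswith("[score 0.71]")
print(context)
```

An assertion like the last one, run in CI, is the kind of check that would flag a reversed sort before it reaches production.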
Debugging tools and techniques¶
Tool 1: Trace logging
Log every step's input and output with a shared trace ID.
Pull by trace ID to replay any request.
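Trace logging can be as simple as one JSON line per step, keyed by a shared ID. The sketch below uses only the Python standard library. An in-memory list stands in for the real log sink, and the helper names `new_trace_id`, `log_step`, and `pull_trace` are hypothetical.

```python
import json
import uuid


def new_trace_id() -> str:
    """Create one ID per incoming request."""
    return uuid.uuid4().hex


def log_step(store: list, trace_id: str, step: str, inp, out) -> None:
    """Append one JSON line per pipeline step (the list stands in for a log sink)."""
    store.append(json.dumps(
        {"trace_id": trace_id, "step": step, "input": inp, "output": out}
    ))


def pull_trace(store: list, trace_id: str) -> list:
    """Return every logged step for one request, in the order it was logged."""
    records = (json.loads(line) for line in store)
    return [r for r in records if r["trace_id"] == trace_id]


store = []
tid = new_trace_id()
log_step(store, tid, "query", "Can I return a sale item?", "return sale item")
log_step(store, tid, "retrieval", "return sale item", ["Sale items are excluded..."])
for record in pull_trace(store, tid):
    print(record["step"], "->", record["output"])
```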
Tool 2: Playground isolation
Copy the exact assembled prompt. Paste into the model playground.
Remove context. Ask the question. Does the model answer from its own knowledge?
That tells you if the model is following the "only use context" instruction.
Tool 3: Retrieval inspector
Query the retriever directly with the raw query.
Print top-10 results with scores. Are they relevant? Are scores high enough?
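A retrieval inspector does not need the full pipeline. The sketch below scores a query vector against a tiny in-memory index with cosine similarity. The two-dimensional vectors are toy values, and `cosine` and `inspect_retrieval` are hypothetical helpers; a real inspector would call your actual vector store instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def inspect_retrieval(query_vec, index, k=10):
    """Return the top-k (score, text) pairs, best first. `index` maps text -> vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index.items()]
    return sorted(scored, reverse=True)[:k]


# Toy index: in practice the vectors come from the embedding model.
index = {
    "Sale items are excluded from the standard return policy.": [1.0, 0.0],
    "Gift cards are non-refundable.": [0.0, 1.0],
}
for score, text in inspect_retrieval([0.9, 0.1], index):
    print(f"{score:.2f}  {text}")
```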
Tool 4: Schema validator
Add a JSON schema validator on the output before returning to the user.
If validation fails, log the raw output. This catches format failures that would otherwise pass silently.
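A hand-rolled check is enough to illustrate the idea. The sketch below uses only the standard `json` module as a stand-in for a full JSON Schema validator. The response contract in `EXPECTED_FIELDS` and the function name `validate_output` are assumptions made up for this example.

```python
import json

EXPECTED_FIELDS = {"answer": str, "sources": list}  # hypothetical response contract


def validate_output(raw: str):
    """Return (parsed, errors). `parsed` is None when the output is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = [
        f"field '{name}' missing or not {typ.__name__}"
        for name, typ in EXPECTED_FIELDS.items()
        if not isinstance(data.get(name), typ)
    ]
    return (data if not errors else None), errors


parsed, errors = validate_output('{"answer": "No, sale items are excluded."}')
if errors:
    print("validation failed:", errors)  # log the raw output alongside this
```

Returning the error list instead of raising lets the caller decide whether to retry, fall back, or surface the failure.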
Tool 5: A/B replay
When a bug is reported, replay the exact query against both the current and previous version.
Compare outputs and trace logs. Spot the diff.
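Spotting the diff can be automated once both versions log per-step output. The sketch below compares two traces with the standard `difflib` module. The trace shape (step name mapped to logged text) and the function name `diff_traces` are assumptions for illustration.

```python
import difflib


def diff_traces(previous: dict, current: dict) -> dict:
    """Compare two traces (step name -> logged output text).

    Returns a unified diff for every step whose output changed.
    """
    changed = {}
    for step in previous.keys() | current.keys():
        old, new = previous.get(step, ""), current.get(step, "")
        if old != new:
            changed[step] = list(difflib.unified_diff(
                old.splitlines(), new.splitlines(),
                fromfile="previous", tofile="current", lineterm="",
            ))
    return changed


previous = {"retrieval": "chunk 1\nchunk 2", "assembly": "chunk 1\nchunk 2"}
current = {"retrieval": "chunk 1\nchunk 2", "assembly": "chunk 2\nchunk 1"}
for step, lines in diff_traces(previous, current).items():
    print(step)
    print("\n".join(lines))
```

In this toy replay only the assembly step differs, which points the investigation at the same place the worked example ended up.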
Where this lives in the wild¶
- LangSmith — trace replay tool; every LLM call logged with input/output; playback failed requests.
- Datadog AI Observability — span-level tracing for AI pipelines; identify which span introduced the failure.
- Anthropic's debugging practice — prompt isolation in playground before touching any retrieval or code changes.
- Glean engineering — retrieval inspector built into their debug UI; engineers can query the vector index directly.
- Linear AI debugging — reproducible test cases for every reported bug; replay logs to diagnose AI triage mistakes.
Pause and recall¶
- What is bisection debugging and how does it apply to AI pipelines?
- Name three common integration failures and their signatures.
- In the worked example, what was the root cause of the precision drop?
- Name three debugging tools from the last section.
Interview Q&A¶
Q: "A customer reports wrong answers from your AI system. Walk me through how you debug it."
A: I pull the trace for the failing request. I check step by step: was the query parsed correctly? What did the retriever return? What was assembled into the context? What did the model produce? I stop at the first step where the output is wrong. I fix only that step.
Common wrong answer to avoid: "I adjust the prompt and see if it helps." Adjusting the prompt without diagnosis is guessing. You might accidentally fix the symptom while the root cause persists.
Q: "How do you reproduce an intermittent AI bug?"
A: I log the full input to every step, including the sampling settings and the random seed where the API exposes one. For intermittent failures, I look at the retrieval scores first. Scores close to the minimum threshold cause flipping behaviour, because small variations in the query or its embedding push them above or below the cutoff. Fix: raise the threshold or lower the temperature.
Common wrong answer to avoid: "Intermittent bugs are caused by the model being non-deterministic, nothing we can do." Non-determinism is manageable. The cause is very often borderline retrieval scores, not pure randomness.
Q: "All your component tests pass but the end-to-end system gives wrong answers. What do you do?"
A: I add integration-level logging at every boundary. I replay the failing request through the full pipeline with detailed logging. I compare the interface contract at each boundary: what is produced vs. what is consumed. The failure is most likely at a boundary, not inside a component that passes its own tests.
Common wrong answer to avoid: "I re-run the component tests with different inputs." Component tests passing is the problem statement, not the diagnostic evidence. The failure is in the interfaces.
Q: "How do you prevent integration bugs from reaching production?"
A: Three defences. First, interface-level assertions in CI — schema checks at every boundary that run on every commit. Second, end-to-end integration tests in the staging eval suite — the full pipeline on representative queries. Third, canary deployment — real-traffic quality monitoring before full rollout catches the failures that staging misses.
Common wrong answer to avoid: "I rely on staging to catch everything." Staging cannot reproduce the full real user distribution. Canary is required.
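An interface-level assertion of the kind named in the first defence might look like the sketch below. It checks the boundary between retrieval and assembly. The contract itself is an assumption made for illustration: at most `k` chunks, non-empty text, scores in [0, 1], sorted best-first. `check_retrieval_contract` is a hypothetical name.

```python
def check_retrieval_contract(chunks, k=3):
    """Assert what the assembly step is assumed to rely on from retrieval."""
    assert len(chunks) <= k, "retriever returned more than k chunks"
    for chunk in chunks:
        assert isinstance(chunk.get("text"), str) and chunk["text"], "chunk text missing"
        assert 0.0 <= chunk.get("score", -1.0) <= 1.0, "score outside [0, 1]"
    scores = [c["score"] for c in chunks]
    assert scores == sorted(scores, reverse=True), "chunks not sorted best-first"


# Passes: two chunks, valid scores, best first.
check_retrieval_contract([
    {"text": "Sale items are excluded from the standard return policy.", "score": 0.71},
    {"text": "All items purchased at full price can be returned within 30 days.", "score": 0.68},
])
```

Run on every commit against a handful of representative queries, a check like this fails loudly at the boundary instead of letting a contract change surface as a wrong answer.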
Apply now (5 min)¶
Pick one failure mode from the five listed in this file. Write the exact log output you would inspect at each step to diagnose it. Write the fix in one sentence. Write the integration test assertion that would have caught it before production.
Sketch from memory: Draw the five-step debugging stack. Label each step. Write one question to ask at each step.
Bridge. You can build, deploy, and debug the system. But before you call yourself an AI engineer, you must reckon honestly with what we do not yet know. That is what comes next. → 13-honest-admission.md