02. Capstone Project — Narrative Explainer
Companion to 03_study_material.md. This file builds the system picture in your head. The study material gives the operating checklists. Read both.
Table of contents
- ELI5 — Building the whole house
- Chapter 1: Opening failure
- Chapter 2: System design
- Chapter 3: Implementation strategy
- Chapter 4: Evaluation and monitoring
- Chapter 5: Deployment and presentation
- Foundation-gap audit
- Bridge to Module 16
- Chapter 6: Recap
ELI5 — Building the whole house
Imagine you learned many house skills one by one. Carpentry is transformers. Plumbing is RAG pipelines. Electrical wiring is agents and tool use. Painting is generation quality and UX polish. Very nice. Each skill matters. But nobody lives inside a pile of separate skills. People live inside a complete house. Your capstone is that house.
Now the job changes completely. You are no longer proving one isolated craft. You are proving the whole building can stand. Keep five placeholders in your mind.
- The blueprint means system design. If the blueprint is vague, rooms connect badly.
- The foundation means infrastructure choices. If the foundation is weak, the house shakes.
- The plumbing means data pipelines. If the plumbing leaks, dirty water spreads everywhere.
- The inspection means the eval suite. If inspection is missing, hidden cracks stay invisible.
- The move-in day means deployment. If move-in day is ignored, nobody can live there.
That is the capstone challenge. The hard part is not one clever component. The hard part is making everything cooperate. The hard part is handling surprise. The hard part is shipping something people can trust. That is why this module feels different. It asks for integration, restraint, and judgment. Now picture the house.
    /\
   /  \
  /roof\
 /______\
|  app   |
| flows  |
|--------|
| evals  |
|--------|
|  data  |
|--------|
| infra  |
|________|
A beautiful kitchen means nothing if drainage fails. A strong wall means nothing if wiring is unsafe. A polished exterior means nothing if the roof leaks. Same story here. A smart model is not the product. A high retrieval score is not the product. A fancy agent loop is not the product. A reliable, useful, explainable system is the product. That is the picture for the rest of this explainer.
Chapter 1: Opening failure
1.1 The opening disaster
You built every component separately. The retriever works in a notebook. The prompt works in a playground. The tool wrapper works in a script. The evaluator works on saved examples. Naturally, you feel confident. Then you connect everything. Latency explodes from 5 seconds to 30 seconds. Costs jump to 10x the budget. Users wait in silence. Some answers arrive late and wrong. Some answers arrive expensive and still wrong. Worse, the component evals stay green. Why? Because those evals never checked the whole journey. They checked local correctness. They missed system failure. That is the opening lesson. A system can fail while its parts pass.
| Symptom | Local story | System truth |
|---|---|---|
| Latency grew sixfold | Each box added only a little time | Serial composition created a queue |
| Cost spiked | More context should improve quality | Prompt bloat amplified every downstream call |
| Quality drifted | The prompt still looked good | Bad retrieval and stale context poisoned generation |
| Logs existed | Each service emitted events | No shared trace linked the request end-to-end |
| Evals passed | Unit checks were healthy | No end-to-end task measurement existed |
Now look at the integration path.
Each extra arrow looks small. Together, the arrows become real pain. Each team optimizes a local metric. Retrieval wants higher recall. So top-k increases. Prompt assembly wants more context. So the prompt grows longer. Safety wants more certainty. So another call gets added. The system becomes slow, expensive, and harder to explain. That is the capstone shock.
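A rough sketch of why the arrows add up, assuming every stage runs serially on the request path. The stage names and millisecond figures are invented for illustration. They are not measurements from any real system.

```python
# Hypothetical per-stage latencies in milliseconds (illustrative only).
BEFORE = {"retrieval": 300, "prompt_assembly": 100, "generation": 2200}

# Each team makes one "small" local change.
AFTER = {
    "retrieval": 600,        # higher top-k for recall
    "prompt_assembly": 250,  # more context stitched in
    "safety_check": 1500,    # an extra model call added
    "generation": 4000,      # longer prompt, slower generation
}


def serial_latency_ms(stages: dict[str, int]) -> int:
    """In a serial pipeline the user waits for the sum of all stages."""
    return sum(stages.values())


def slowdown(before: dict[str, int], after: dict[str, int]) -> float:
    """Ratio of end-to-end latency after the changes to latency before."""
    return serial_latency_ms(after) / serial_latency_ms(before)


print(serial_latency_ms(BEFORE), serial_latency_ms(AFTER))
print(round(slowdown(BEFORE, AFTER), 2))
```

No single change looks alarming in isolation. The sum is what the user waits for.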
1.2 Why this matters for Lead-level roles
Lead-level interviews reward system thinking, not isolated cleverness. Panels listen for second-order effects. They want to hear how you cut complexity. They want to hear how you contained failure. They want to hear how you measured reality honestly. Anybody can say, "I used a vector database." Far fewer can say, "I capped top-k because context expansion hurt quality and cost." The second sentence sounds like lived engineering. That is what this module is trying to produce.
Chapter 2: System design
2.1 Start with the job, not the model
Do not start with a capability. Do not say, "I want to use agents." Do not say, "I want a reasoning model." Start with the user job. Who is the user? What painful recurring task do they have? What outcome would make them grateful? What failure would make them lose trust? Only then should the architecture appear. This is the blueprint stage. Do not skip it.
| Blueprint question | Good answer shape |
|---|---|
| Who is this for? | One persona, one pain, one workflow |
| What does the system do? | One primary job-to-be-done |
| How will we know it works? | Quality, latency, and cost targets |
| What must never happen? | One explicit trust-breaking failure |
2.2 Choose the right architecture
| Architecture option | Best when | Benefit | Main risk |
|---|---|---|---|
| Single pipeline | One narrow workflow | Simplest ownership and tracing | Can become rigid later |
| RAG pipeline | Grounded answers matter | Inspectable evidence path | Retrieval quality dominates outcomes |
| Tool-using agent | External actions are essential | Can act, not just answer | Latency and failure modes multiply |
| Async background stage | Heavy enrichment helps later | Keeps user path responsive | More observability work |
| Multi-agent split | Distinct specialists clearly pay off | Role separation can help reasoning | Coordination cost explodes |
Default rule: pick the simplest design that can win. Complexity is not a badge. Complexity is rent. You keep paying it in debugging, latency, and explanation effort. If two designs seem equal, pick the one that makes Saturday-night debugging easier.
2.3 Draw the component diagram
         +---------------------+
         |      Frontend       |
         |  chat / form / UI   |
         +----------+----------+
                    |
                    v
         +---------------------+
         | API / Orchestrator  |
         | auth, routing, trace|
         +-----+-----------+---+
               |           |
     fast path |           | action path
               v           v
+----------------+    +----------------+
| Retrieval Layer|    |  Tool Gateway  |
| embed / search |    |  APIs, safety  |
+-------+--------+    +--------+-------+
        |                      |
        v                      v
  +-----------------------------------+
  |   Prompt Builder / Policy Layer   |
  |  context shaping, guards, cache   |
  +-----------------+-----------------+
                    |
                    v
        +-----------------------+
        |    Model Inference    |
        | generation / routing  |
        +-----------+-----------+
                    |
                    v
        +-----------------------+
        |   Eval + Telemetry    |
        |   score, logs, cost   |
        +-----------+-----------+
                    |
                    v
               user output
Frontend. Purpose: Collects the task clearly. Main danger: Confusing UX creates misleading system signals. Watch metric: task start rate.
API / Orchestrator. Purpose: Owns routing, auth, and trace ids. Main danger: Vague orchestration hides ownership. Watch metric: request latency.
Retrieval layer. Purpose: Finds supporting knowledge. Main danger: Bad chunks poison the answer. Watch metric: retrieval hit quality.
Tool gateway. Purpose: Talks to external systems safely. Main danger: Unbounded retries hurt users. Watch metric: tool success rate.
Policy layer. Purpose: Shapes context and rules. Main danger: Hidden rules inside prompts become brittle. Watch metric: policy violation rate.
Model inference. Purpose: Produces the answer or plan. Main danger: Slow or costly path dominates experience. Watch metric: p95 generation latency.
Eval + telemetry. Purpose: Inspects the full chain. Main danger: No system view means false confidence. Watch metric: task success rate.
2.4 Define API contracts
| Contract | Must define | What breaks if vague? |
|---|---|---|
| Request contract | user, task, context, limits | Personalization and tracing become unreliable |
| Retrieval contract | query, filters, top-k, returned schema | Context becomes noisy and bloated |
| Tool contract | args, timeout, retry, approval rules | Actions become unsafe or inconsistent |
| Response contract | answer, citations, confidence, refusal reason | UI and eval logic disagree |
| Telemetry contract | trace id, latency, tokens, cost, error code | Debugging becomes guesswork |
Contracts reduce silent breakage. They turn hand-waving into engineering. They also make architecture reviews faster. Good boundaries make good conversations.
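A minimal sketch of two of the contracts above as typed, self-validating objects. The class names, field defaults, and the top-k ceiling of 20 are assumptions chosen for illustration. Your own limits will differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RetrievalRequest:
    """Retrieval contract: query, filters, top-k."""
    query: str
    top_k: int = 5
    filters: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if not 1 <= self.top_k <= 20:  # hypothetical cap to stop context bloat
            raise ValueError("top_k must be between 1 and 20")


@dataclass(frozen=True)
class Response:
    """Response contract: answer, citations, confidence, refusal reason."""
    answer: str
    citations: tuple = ()
    confidence: float = 0.0
    refusal_reason: Optional[str] = None

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # An empty answer must carry a refusal reason; a real answer must not.
        if (self.answer == "") != (self.refusal_reason is not None):
            raise ValueError("provide exactly one of answer or refusal_reason")
```

Failing loudly at the boundary is the point. A bad top-k or a refusal without a reason gets caught before the UI and the eval logic can disagree about it.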
2.5 Plan data flow and failure isolation
- User input arrives.
- API validates and tags the request.
- Retrieval or tools fetch supporting context.
- Policy shapes the prompt and rules.
- The model generates the answer or action.
- Telemetry logs the outcome.
- The UI returns the result and stores evidence.
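The steps above can be sketched as one request walk. Every function here is a hypothetical stub standing in for a real component. The only thing the sketch tries to show is a single trace id following the request from validation to telemetry.

```python
import uuid


def validate(request: dict) -> dict:
    if not request.get("task"):
        raise ValueError("task is required")
    # Tag the request so every later step can be linked back to it.
    return {**request, "trace_id": request.get("trace_id") or uuid.uuid4().hex}


def fetch_context(request: dict) -> list[str]:
    # Stub: a real system would call the retriever or a tool here.
    return [f"stub passage about {request['task']}"]


def build_prompt(request: dict, context: list[str]) -> str:
    return "Context:\n" + "\n".join(context) + f"\n\nTask: {request['task']}"


def generate(prompt: str) -> str:
    # Stub standing in for the model call.
    return f"(stub answer based on {len(prompt)} prompt characters)"


def handle(request: dict, log: list[dict]) -> dict:
    tagged = validate(request)
    context = fetch_context(tagged)
    prompt = build_prompt(tagged, context)
    answer = generate(prompt)
    # Telemetry and the returned result share the same trace id.
    log.append({"trace_id": tagged["trace_id"], "n_context": len(context), "ok": True})
    return {"trace_id": tagged["trace_id"], "answer": answer, "evidence": context}
```

Replacing one stub at a time keeps the system runnable while the real retrieval or model path is wired in.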
| Dependency fails | Graceful behavior |
|---|---|
| Retriever slow | Serve cached or smaller-context fallback |
| Retriever empty | Return explicit uncertainty and ask a narrower question |
| Tool timeout | Abort safely and preserve user trust |
| Model overloaded | Route to a smaller model or async path |
| Judge unavailable | Use rule-based checks and sample later review |
| Cost budget breached | Skip optional steps immediately |
| UI crash | Persist server-side result for recovery |
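One way to sketch the "retriever slow" and "tool timeout" rows is a small wrapper that gives the primary path a deadline and falls back otherwise. The names `with_fallback` and `_POOL` are hypothetical. Note the limitation: a Python thread that is already running cannot be interrupted, so the slow call finishes in the background while the user gets the fallback.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool (size chosen arbitrarily for the sketch).
_POOL = ThreadPoolExecutor(max_workers=4)


def with_fallback(primary, fallback, timeout_s: float):
    """Run primary(); if it is too slow or raises, return fallback() instead.

    Returns (result, degraded) so the caller can count fallbacks.
    """
    future = _POOL.submit(primary)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        future.cancel()  # best effort; a running thread is not interrupted
        return fallback(), True
    except Exception:
        return fallback(), True


def slow_retriever():
    time.sleep(0.3)  # simulate a slow dependency
    return "fresh results"


result, degraded = with_fallback(slow_retriever, lambda: "cached results", timeout_s=0.05)
print(result, degraded)
```

Returning the `degraded` flag matters as much as the fallback itself. Without it, the fallback rate cannot be measured.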
| Step | Target latency | Notes |
|---|---|---|
| API validation | 100 ms | Mostly deterministic |
| Retrieval | 300 ms | Cache aggressively |
| Prompt assembly | 100 ms | Watch context bloat |
| Model generation | 2200 ms | Usually dominates |
| Optional tool call | 700 ms | Keep off hot path if possible |
| Eval + logging | 200 ms | Async where safe |
| Total | 3600 ms | Leaves small contingency |
Design with latency and cost budgets from day one. That is the blueprint becoming real. That is also what senior engineers sound like.
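The budget table can live in code so it is checked instead of remembered. This is a minimal sketch using the targets from the table. The stage keys and the `over_budget` helper are names invented here.

```python
# Per-stage targets in milliseconds, copied from the table above.
BUDGET_MS = {
    "api_validation": 100,
    "retrieval": 300,
    "prompt_assembly": 100,
    "model_generation": 2200,
    "optional_tool_call": 700,
    "eval_logging": 200,
}


def total_budget_ms(budget_ms: dict[str, int]) -> int:
    return sum(budget_ms.values())


def over_budget(measured_ms: dict[str, float], budget_ms: dict[str, int]) -> dict[str, float]:
    """Return {stage: overshoot_ms} for every measured stage above its target."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if measured_ms.get(stage, 0) > limit
    }


print(total_budget_ms(BUDGET_MS))
print(over_budget({"retrieval": 450, "model_generation": 2000}, BUDGET_MS))
```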
Chapter 3: Implementation strategy
3.1 MVP first
Do not build the final version first. Build the narrowest useful path first. MVP does not mean careless. MVP means smart sequencing. You prove the user journey before adding sophistication. Every day should end with a runnable system. Runnable beats elegant-but-unrun.
| Stage | Goal | Include now | Postpone |
|---|---|---|---|
| Stage 1 | Happy path | basic UI, one request flow, one model path | secondary personas, fancy retries |
| Stage 2 | Replace critical stubs | real retrieval or tool integration | nonessential branches |
| Stage 3 | Inspection | gold set, latency logs, failure notes | full dashboards |
| Stage 4 | Packaging | container, scripts, README skeleton | infra gold plating |
| Stage 5 | Presentation | demo recording, diagrams, narrative | nice-to-have experiments |
3.2 Build vs buy
| Component | Default choice | Why | Build yourself when |
|---|---|---|---|
| Foundation model | Use an API first | Fastest iteration | Model choice is core to the artifact |
| Vector store | Managed or open-source | Do not spend the week on plumbing | Search itself is the thesis |
| Auth | Buy or stub lightly | Commodity work can eat the week | Identity complexity is central |
| Prompt versioning | Build lightly in repo | Tiny investment, high clarity | Always worth doing |
| Eval harness | Build | This is your proof of rigor | Always worth doing |
| Observability | Reuse existing tooling | Leverage proven pieces | Custom views are truly needed |
Build what reveals product judgment. Buy what is commodity or time sink. That is not laziness. That is scope intelligence.
3.3 Integration testing
         /\
        /  \
       / e2e \
      /-------\
     / replay  \
    /-----------\
   /  contracts  \
  /---------------\
 /  unit + smoke   \
/___________________\
| Test layer | Main question | Example |
|---|---|---|
| Unit + smoke | Does each box run at all? | Retriever returns chunks |
| Contract tests | Do components agree on shape? | Tool output matches schema |
| Replay tests | Does yesterday's scenario still work? | Known query still cites correctly |
| End-to-end tests | Can the full job succeed? | User gets grounded answer within budget |
Save every important bug as a replay case. That turns pain into leverage. Also test negative paths. Test empty retrieval. Test malformed input. Test tool timeout. Test cost budget breach.
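A minimal sketch of "save every important bug as a replay case", assuming cases live in a JSON Lines file and the system under test is any callable that maps a query to an answer string. The file layout and the `must_contain` check are assumptions. A real harness would likely check citations and budgets too.

```python
import json
from pathlib import Path


def save_replay_case(path: Path, query: str, must_contain: list[str], note: str) -> None:
    """Append one regression scenario to a JSON Lines file."""
    case = {"query": query, "must_contain": must_contain, "note": note}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


def run_replay(path: Path, system) -> list[dict]:
    """Run every saved case through system(query) and return the failures."""
    failures = []
    for line in path.read_text(encoding="utf-8").splitlines():
        case = json.loads(line)
        answer = system(case["query"])
        missing = [s for s in case["must_contain"] if s not in answer]
        if missing:
            failures.append({"query": case["query"], "missing": missing})
    return failures
```

Substring checks are crude, but they are cheap and deterministic. That makes them a reasonable first layer before adding a judge.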
3.4 Honest admission
Perfect is the enemy of shipped. This sentence becomes brutal during capstones. You will want one more reranker. You will want one more tool. You will want one more dashboard panel. Sometimes that helps. Often that is scope creep wearing a lab coat. Scope creep sounds responsible. It says, "I am improving the system." Sometimes it is delaying the system. Ask four questions before adding anything. Is this needed for the main demo? Can I measure the benefit this week? Does it reduce user risk or only satisfy curiosity? If I skip it, can I name it honestly as future work? If the answers are weak, postpone it. That is not failure. That is leadership.
Chapter 4: Evaluation and monitoring
4.1 System-level evals
Component evals ask local questions. System evals ask whether the whole job worked. That distinction matters more than most people expect. A retriever can score well and still poison the answer. A tool can succeed technically and still confuse the user. A judge can approve style and still miss business correctness. Build the inspection suite around tasks, not modules.
| Metric | What it answers | Why it matters |
|---|---|---|
| Task success rate | Did the user complete the job? | Core usefulness metric |
| Citation faithfulness | Are claims supported by evidence? | Critical for grounded systems |
| Action success rate | Did the tool action complete safely? | Critical for agentic systems |
| p50 / p95 latency | How long did the whole flow take? | Users feel the slow tail |
| Cost per successful task | What did one good outcome cost? | Better than raw cost per request |
| Fallback rate | How often did the system degrade? | Reveals brittleness |
| Human escalation rate | How often did people step in? | Shows true readiness |
| Weak eval statement | Strong eval statement |
|---|---|
| It usually works. | 24 of 30 gold scenarios passed. |
| Responses feel fast. | p95 end-to-end latency is 3.8 seconds. |
| Costs are manageable. | $0.018 per successful task on the default path. |
| It handles errors. | Retriever failures trigger fallback in 300 ms. |
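Strong eval statements come from run records, not impressions. The sketch below assumes each end-to-end run is logged as a dict with `success`, `latency_s`, `cost_usd`, and `fallback` keys. Those field names are hypothetical. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: list[dict]) -> dict:
    """Turn per-run records into the headline system metrics."""
    successes = [r for r in runs if r["success"]]
    latencies = [r["latency_s"] for r in runs]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "task_success_rate": len(successes) / len(runs),
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        # Failed runs still cost money, so divide total spend by successes only.
        "cost_per_success_usd": total_cost / len(successes) if successes else float("inf"),
        "fallback_rate": sum(1 for r in runs if r["fallback"]) / len(runs),
    }
```

Dividing total spend by successful tasks is what makes a cheap-but-failing path look as expensive as it really is.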
4.2 End-to-end latency budget
| Segment | Budget | Watch-out |
|---|---|---|
| Validation | 100 ms | Routing logic growing complex |
| Retrieval | 250 ms | top-k too high, cold index |
| Tool selection | 100 ms | agent deliberation too long |
| Tool execution | 600 ms | external APIs slow or flaky |
| Generation | 2200 ms | prompt too long or model too heavy |
| Post-processing + logging | 150 ms | sync logging on hot path |
| Total | 3400 ms | user starts feeling lag |
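To compare real runs against these segments, each stage needs its own timer. Here is a small sketch using a context manager. The `StageTimer` class is hypothetical, and in a real system the measurements would be attached to the request's trace id and shipped to telemetry.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Record wall-clock time per stage and report which stages broke budget."""

    def __init__(self, budgets_ms: dict[str, float]):
        self.budgets_ms = budgets_ms
        self.elapsed_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures still show up in timing.
            self.elapsed_ms[name] = (time.perf_counter() - start) * 1000

    def breaches(self) -> list[str]:
        return [
            name for name, ms in self.elapsed_ms.items()
            if ms > self.budgets_ms.get(name, float("inf"))
        ]
```

Usage is one `with timer.stage("retrieval"):` block per segment. Stages without a budget entry are timed but never flagged.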
4.3 Cost tracking
| Cost source | What inflates it | First lever to try |
|---|---|---|
| Input tokens | Huge context and repeated history | Compress context, cache prefixes |
| Output tokens | Verbose answers | Cap max tokens |
| Extra model calls | Judge on every request | Sample or batch judging |
| Retrieval overhead | Too many embedding calls | Cache embeddings |
| Tool usage | Redundant validations | Bound retries and approvals |
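Two of the levers above fit in a few lines: pricing a request from its token counts, and sampling the judge instead of calling it every time. The prices below are placeholders, not any provider's real rates. The hash-based sampling is one possible approach, chosen here because the same trace id always gets the same decision.

```python
import hashlib

# Hypothetical prices in USD per 1,000 tokens. Replace with your provider's rates.
PRICE_PER_1K_USD = {"input": 0.0005, "output": 0.0015}


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token cost of one model call under the placeholder prices above."""
    return (
        input_tokens / 1000 * PRICE_PER_1K_USD["input"]
        + output_tokens / 1000 * PRICE_PER_1K_USD["output"]
    )


def should_judge(trace_id: str, sample_pct: int) -> bool:
    """Deterministically select roughly sample_pct% of requests for the judge call."""
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < sample_pct
```

With these placeholder rates, a 4,000-token prompt costs more than a 500-token answer even though output tokens are priced three times higher per token. That is why context size is usually the first lever to try.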
4.4 User-facing quality metrics
Measure what the user feels, not only what is easy. Ask four questions. Was the answer correct? Was the answer useful? Was the answer timely? Was the answer trustworthy? If your metrics ignore one of these, the picture is incomplete.
+------------------------------------------------------+
| Success rate | p95 latency | $ / success | fallbacks |
+------------------------------------------------------+
| 81%          | 3.8 s       | $0.018      | 7%        |
+------------------------------------------------------+
| Common failures: empty retrieval | tool timeout      |
| Highest cost path: long-context legal query          |
+------------------------------------------------------+
4.5 Retrieval prompts
- What system design decision most strongly controls cost in this capstone, and why?
- Which integration failure would bypass component-level evals but break the user experience?
- What three dashboard signals would tell you the system is degrading before users complain?
- Which part of this project best demonstrates Lead-level judgment rather than isolated implementation skill?
These prompts retrieve judgment, not trivia. That is why they matter. Module 16 will ask for exactly this kind of memory.
Chapter 5: Deployment and presentation
5.1 Containerization
Now comes move-in day. The house may look complete on your machine. But can someone else enter it safely? That is deployment. You do not need a giant platform this week. You do need reproducibility.
| Item | Why it matters |
|---|---|
| Dockerfile or equivalent | Reproducibility |
| Config template | Faster setup and safer secrets |
| Health endpoint | Quick operational sanity |
| Startup script | Reviewers can run it without guessing |
| Model and prompt version notes | Behavior can be reproduced |
| Logs with trace id | Debugging during demo becomes possible |
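A health endpoint does not need a framework. This is a standard-library sketch that also reports model and prompt versions, so a reviewer can see what is running. The `/health` path, the default port, and the version labels are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical version labels; record whatever identifies your model and prompt.
VERSIONS = {"model": "example-model-v1", "prompt": "prompt-v3"}


def health_payload(versions: dict) -> dict:
    return {"status": "ok", **versions}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(health_payload(VERSIONS)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocking call; intended as the container's startup command."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

The server only starts when `serve()` is called, so the module can be imported in tests without opening a port.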
5.2 CI/CD for ML
git push
|
v
CI: lint / test / build image
|
v
run replay evals
|
v
publish image or artifact
|
v
deploy to staging-like target
|
v
smoke test + manual demo check
CI/CD for ML must include behavior checks. Unit tests alone protect only the deterministic code paths. Replay evals protect model-facing behavior. That difference matters a lot.
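The "run replay evals" step can be a small gate that turns pass/fail results into a process exit code, which is what most CI systems key on. The 90% threshold and the `gate` function are assumptions. Pick a bar your gold set can defend.

```python
import sys


def gate(results: list[bool], min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 if the replay pass rate meets the bar, else 1."""
    if not results:
        print("no replay cases ran; failing the build")
        return 1
    pass_rate = sum(results) / len(results)
    print(f"replay pass rate: {pass_rate:.0%} (minimum {min_pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


# In a CI script the last line would be: sys.exit(gate(results))
```

Treating an empty result set as a failure is deliberate. A replay step that silently ran nothing should not turn the build green.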
5.3 Demo preparation
- State the problem in one sentence.
- Show the user input.
- Flash the architecture quickly.
- Run the system.
- Show the output with evidence.
- Show one failure mode and mitigation.
- Show one quality metric.
- Show one latency or cost metric.
- End with one production improvement.
Also rehearse failure. Keep a recorded backup. Keep screenshots of metrics. That is not cheating. That is professional risk management.
5.4 Technical writing for the portfolio
| README section | What it should answer |
|---|---|
| Problem | Who hurts today and why |
| Solution | What the system does in one paragraph |
| Architecture | Diagram plus core components |
| Decisions | Why this approach over alternatives |
| Evaluation | Gold set, metrics, headline numbers |
| Failure modes | Top risks and current mitigations |
| Future work | What production or Module 16 would add |
5.5 LinkedIn showcase strategy
Do not write vague victory poetry. Write like an engineer who shipped something real. Good post structure: Problem. User. Stack. One hard trade-off. Repo and demo link. Honest lesson learned.
Foundation-gap audit
| What Module 16 assumes | Diagnostic question | Evidence you should now have |
|---|---|---|
| Full system building experience | Can you walk one request from UI to telemetry without hand-waving? | End-to-end MVP, diagram, run instructions |
| Integration challenges | Have you actually seen contracts break or latency compound? | Bug notes, replay tests, fallback logic |
| Cost and latency trade-offs in practice | Can you name the slowest and costliest steps with numbers? | Budget table, logs, dashboard |
| Deployment basics | Can another engineer run or review the system quickly? | Container, startup command, health check, README |
If any row feels weak, write it down honestly. Weakly understood pain becomes fake principle later. Lived pain becomes durable principle later.
Bridge to Module 16
Next module — 20_engineering_leadership_judgment — formalizes the engineering judgment you developed here into principles: how to make technical decisions, manage complexity, and lead AI teams. This week gave you raw experience. Next week turns that experience into reusable judgment.
Chapter 6: Recap
6.1 Failure-fix chain
| Failure | Fix |
|---|---|
| Component evals missed user failure | Add end-to-end gold scenarios with task-level pass criteria |
| Latency exploded after integration | Budget each stage before adding new boxes |
| Costs blew past plan | Track cost per successful task and route cheaper where possible |
| Architecture became hard to explain | Reduce moving parts and redraw the component diagram |
| Contracts broke silently | Write explicit request, retrieval, tool, and response schemas |
| One dependency failure killed everything | Add graceful degradation and local fallbacks |
| Team optimized local metrics only | Review system metrics first, component metrics second |
| Demo felt magical, not trustworthy | Show evidence, metrics, and one failure mitigation |
| Scope kept growing | Protect the MVP path and postpone unmeasured extras |
6.2 Interview questions
| Interview question | What a strong answer should include |
|---|---|
| Why this architecture over multi-agent design? | Defend simplicity, coordination cost, and observability. |
| What was the first system-level failure you discovered? | Tell the integration story, not a component anecdote. |
| How did you budget latency across the pipeline? | Show stepwise targets and the slowest segment. |
| How did you decide what to build versus buy? | Tie the answer to learning value and time limits. |
| What metrics convinced you the system was useful? | Use task success, trust, latency, and cost together. |
| What would you change before real production traffic? | Name auth, monitoring depth, rollout, and incident response improvements. |
6.3 Production experience
If you built this capstone seriously, you now have production-shaped experience. You translated a user problem into a system design. You sequenced work for fast learning. You defined boundaries between probabilistic and deterministic components. You instrumented latency and cost. You wrote system-level evals. You prepared a reproducible demo and deployment path. You explained trade-offs to other engineers. That is not full production mastery. Be honest about that. But it is real production-shaped practice.
6.4 Exercises
| Exercise | Prompt |
|---|---|
| Exercise 1 | Delete one unnecessary component and write the before/after trade-off. |
| Exercise 2 | Write your exact latency budget with one contingency plan. |
| Exercise 3 | Create five end-to-end gold scenarios, including one timeout case. |
| Exercise 4 | Write the contracts for your most fragile interface. |
| Exercise 5 | Record a two-minute demo and note where the story gets muddy. |
| Exercise 6 | List the top three cost drivers and the first lever for each. |
| Exercise 7 | Write the Module 16 principle that emerged from your hardest bug. |
| Exercise 8 | Ask a reviewer what still feels magical or unclear in the README. |
Final recap in one breath. The blueprint is the system design. The foundation is the infrastructure choice. The plumbing is the data path. The inspection is the eval suite. The move-in day is deployment. If any one is weak, the house feels unsafe. If they work together, people can finally live there. That is the capstone. Not isolated brilliance. Integrated judgment.
Appendix — Rapid-fire reminders
- Start from the user job.
- Simplicity is a feature.
- Trace ids are not optional.
- A cheaper success beats an expensive maybe.
- Save failing cases immediately.
- Fallbacks protect trust.
- Long prompts are hidden latency.
- Replay tests are memory with teeth.
- Budgets create discipline.
- Good README writing is engineering clarity.
- Scope creep often sounds responsible.
- Trust is a system property.
- Shipping teaches faster than polishing forever.
- Local optimizations can hurt the whole chain.
- Measure what the user feels.
- A dashboard should tell a story.
- Contracts reduce silent breakage.
- Architecture is choosing pain deliberately.
- Observability starts with good boundaries.
- Boring systems are easier to defend.