02. Capstone Project — Narrative Explainer

Companion to 03_study_material.md. This file builds the system picture in your head. The study material gives the operating checklists. Read both.

Table of contents

ELI5 — Building the whole house
Chapter 1: Opening failure
Chapter 2: System design
Chapter 3: Implementation strategy
Chapter 4: Evaluation and monitoring
Chapter 5: Deployment and presentation
Foundation-gap audit
Bridge to Module 16
Chapter 6: Recap

ELI5 — Building the whole house

Imagine you learned many house skills one by one. Carpentry is transformers. Plumbing is RAG pipelines. Electrical wiring is agents and tool use. Painting is generation quality and UX polish. Very nice. Each skill matters. But nobody lives inside a pile of separate skills. People live inside a complete house. Your capstone is that house. Now the job changes completely. You are no longer proving one isolated craft. You are proving the whole building can stand.

Keep five placeholders in your mind. The blueprint means system design. The foundation means infrastructure choices. The plumbing means data pipelines. The inspection means the eval suite. The move-in day means deployment. If the blueprint is vague, rooms connect badly. If the foundation is weak, the house shakes. If the plumbing leaks, dirty water spreads everywhere. If inspection is missing, hidden cracks stay invisible. If move-in day is ignored, nobody can live there.

That is the capstone challenge. The hard part is not one clever component. The hard part is making everything cooperate. The hard part is handling surprise. The hard part is shipping something people can trust. That is why this module feels different. It asks for integration, restraint, and judgment. Now picture the house.

                 /\
                /  \
               / roof\
              /______\
             |  app   |
             | flows  |
             |--------|
             | evals  |
             |--------|
             | data   |
             |--------|
             | infra  |
             |________|

A beautiful kitchen means nothing if drainage fails. A strong wall means nothing if wiring is unsafe. A polished exterior means nothing if the roof leaks. Same story here. A smart model is not the product. A high retrieval score is not the product. A fancy agent loop is not the product. A reliable, useful, explainable system is the product. That is the picture for the rest of this explainer.

Chapter 1: Opening failure

1.1 The opening disaster

You built every component separately. The retriever works in a notebook. The prompt works in a playground. The tool wrapper works in a script. The evaluator works on saved examples. Naturally, you feel confident. Then you connect everything. Latency explodes from 5 seconds to 30 seconds. Costs jump to 10x the budget. Users wait in silence. Some answers arrive late and wrong. Some answers arrive expensive and still wrong. Worse, the component evals stay green. Why? Because those evals never checked the whole journey. They checked local correctness. They missed system failure. That is the opening lesson. A system can fail while its parts pass.

Symptom	Local story	System truth
Latency grew sixfold	Each box added only a little time	Serial composition stacked every delay end to end
Cost spiked	More context should improve quality	Prompt bloat amplified every downstream call
Quality drifted	The prompt still looked good	Bad retrieval and stale context poisoned generation
Logs existed	Each service emitted events	No shared trace linked the request end-to-end
Evals passed	Unit checks were healthy	No end-to-end task measurement existed

Now look at the integration path.

user
  |
  v
router -> retriever -> prompt builder -> model -> tool -> judge -> user

Each extra arrow looks small. Together, the arrows become real pain. Each team optimizes a local metric. Retrieval wants higher recall. So top-k increases. Prompt assembly wants more context. So the prompt grows longer. Safety wants more certainty. So another call gets added. The system becomes slow, expensive, and harder to explain. That is the capstone shock.
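
To make the compounding concrete, here is a minimal sketch of the serial path above. The stage names mirror the diagram, but every number (stage latencies, chunk size, base prompt size) is an illustrative assumption, not a measurement from a real system.

```python
# Illustrative per-stage latencies in milliseconds (assumed values, not measurements).
STAGE_LATENCY_MS = {
    "router": 200,
    "retriever": 1500,
    "prompt_builder": 300,
    "model": 6000,
    "tool": 2000,
    "judge": 4000,
}


def serial_latency_ms(stages: dict) -> int:
    """In a strictly serial chain, end-to-end latency is the sum of every stage."""
    return sum(stages.values())


def prompt_tokens(top_k: int, chunk_tokens: int = 400, base_tokens: int = 600) -> int:
    """Prompt size grows linearly with top-k: each retrieved chunk adds tokens."""
    return base_tokens + top_k * chunk_tokens


total_ms = serial_latency_ms(STAGE_LATENCY_MS)         # sum of all arrows
share_of_model = STAGE_LATENCY_MS["model"] / total_ms  # model is under half the wait here
small_prompt = prompt_tokens(top_k=4)                  # modest context
large_prompt = prompt_tokens(top_k=20)                 # "higher recall" context
```

With these assumed numbers the model call is the largest single stage yet still less than half of the total wait, and raising top-k from 4 to 20 roughly quadruples the prompt that every downstream call has to pay for.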

1.2 Why this matters for Lead-level roles

Lead-level interviews reward system thinking, not isolated cleverness. Panels listen for second-order effects. They want to hear how you cut complexity. They want to hear how you contained failure. They want to hear how you measured reality honestly. Anybody can say, "I used a vector database." Far fewer can say, "I capped top-k because context expansion hurt quality and cost." The second sentence sounds like lived engineering. That is what this module is trying to produce.

Chapter 2: System design

2.1 Start with the job, not the model

Do not start with a capability. Do not say, "I want to use agents." Do not say, "I want a reasoning model." Start with the user job. Who is the user? What painful recurring task do they have? What outcome would make them grateful? What failure would make them lose trust? Only then should the architecture appear. This is the blueprint stage. Do not skip it.

Blueprint question	Good answer shape
Who is this for?	One persona, one pain, one workflow
What does the system do?	One primary job-to-be-done
How will we know it works?	Quality, latency, and cost targets
What must never happen?	One explicit trust-breaking failure

2.2 Choose the right architecture

Architecture option	Best when	Benefit	Main risk
Single pipeline	One narrow workflow	Simplest ownership and tracing	Can become rigid later
RAG pipeline	Grounded answers matter	Inspectable evidence path	Retrieval quality dominates outcomes
Tool-using agent	External actions are essential	Can act, not just answer	Latency and failure modes multiply
Async background stage	Heavy enrichment helps later	Keeps user path responsive	More observability work
Multi-agent split	Distinct specialists clearly pay off	Role separation can help reasoning	Coordination cost explodes

Default rule: pick the simplest design that can win. Complexity is not a badge. Complexity is rent. You keep paying it in debugging, latency, and explanation effort. If two designs seem equal, pick the one that makes Saturday-night debugging easier.

2.3 Draw the component diagram

                    +---------------------+
                    |      Frontend       |
                    |   chat / form / UI  |
                    +----------+----------+
                               |
                               v
                    +---------------------+
                    | API / Orchestrator  |
                    | auth, routing, trace|
                    +-----+-----------+---+
                          |           |
               fast path  |           | action path
                          v           v
               +----------------+   +----------------+
               | Retrieval Layer|   | Tool Gateway   |
               | embed / search |   | APIs, safety   |
               +-------+--------+   +--------+-------+
                       |                     |
                       v                     v
                 +-----------------------------------+
                 | Prompt Builder / Policy Layer     |
                 | context shaping, guards, cache    |
                 +-----------------+-----------------+
                                   |
                                   v
                         +-----------------------+
                         |   Model Inference     |
                         | generation / routing  |
                         +-----------+-----------+
                                     |
                                     v
                         +-----------------------+
                         | Eval + Telemetry      |
                         | score, logs, cost     |
                         +-----------+-----------+
                                     |
                                     v
                                  user output

Frontend. Purpose: Collects the task clearly. Main danger: Confusing UX creates misleading system signals. Watch metric: task start rate.

API / Orchestrator. Purpose: Owns routing, auth, and trace ids. Main danger: Vague orchestration hides ownership. Watch metric: request latency.

Retrieval layer. Purpose: Finds supporting knowledge. Main danger: Bad chunks poison the answer. Watch metric: retrieval hit quality.

Tool gateway. Purpose: Talks to external systems safely. Main danger: Unbounded retries hurt users. Watch metric: tool success rate.

Policy layer. Purpose: Shapes context and rules. Main danger: Hidden rules inside prompts become brittle. Watch metric: policy violation rate.

Model inference. Purpose: Produces the answer or plan. Main danger: Slow or costly path dominates experience. Watch metric: p95 generation latency.

Eval + telemetry. Purpose: Inspects the full chain. Main danger: No system view means false confidence. Watch metric: task success rate.

2.4 Define API contracts

Contract	Must define	What breaks if vague?
Request contract	user, task, context, limits	Personalization and tracing become unreliable
Retrieval contract	query, filters, top-k, returned schema	Context becomes noisy and bloated
Tool contract	args, timeout, retry, approval rules	Actions become unsafe or inconsistent
Response contract	answer, citations, confidence, refusal reason	UI and eval logic disagree
Telemetry contract	trace id, latency, tokens, cost, error code	Debugging becomes guesswork

Contracts reduce silent breakage. They turn hand-waving into engineering. They also make architecture reviews faster. Good boundaries make good conversations.
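
One lightweight way to make such contracts executable is to encode them as validated data classes. The sketch below covers the retrieval and response rows of the table only. The class names, field names, and the `MAX_TOP_K` cap are hypothetical choices for illustration, not part of any prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_TOP_K = 20  # assumed cap; choose your own from measured quality and cost


@dataclass(frozen=True)
class RetrievalRequest:
    trace_id: str
    query: str
    top_k: int = 5
    filters: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject requests that would silently bloat or empty the context.
        if not self.query.strip():
            raise ValueError("query must be non-empty")
        if not 1 <= self.top_k <= MAX_TOP_K:
            raise ValueError(f"top_k must be between 1 and {MAX_TOP_K}")


@dataclass(frozen=True)
class AnswerResponse:
    trace_id: str
    answer: Optional[str] = None
    citations: tuple = ()
    confidence: Optional[float] = None
    refusal_reason: Optional[str] = None

    def __post_init__(self) -> None:
        # A response is either an answer or an explained refusal, never neither.
        if self.answer is None and not self.refusal_reason:
            raise ValueError("a response without an answer needs a refusal_reason")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")
```

Because validation runs at construction time, a component that violates the contract fails loudly at the boundary instead of passing a malformed object downstream.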

2.5 Plan data flow and failure isolation

User input arrives.
API validates and tags the request.
Retrieval or tools fetch supporting context.
Policy shapes the prompt and rules.
The model generates the answer or action.
Telemetry logs the outcome.
The UI returns the result and stores evidence.

Dependency fails	Graceful behavior
Retriever slow	Serve cached or smaller-context fallback
Retriever empty	Return explicit uncertainty and ask a narrower question
Tool timeout	Abort safely and preserve user trust
Model overloaded	Route to a smaller model or async path
Judge unavailable	Use rule-based checks and sample later review
Cost budget breached	Skip optional steps immediately
UI crash	Persist server-side result for recovery
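
The "retriever slow" and "tool timeout" rows share one pattern: give the primary path a deadline, then degrade. The sketch below shows that pattern with the standard library only. `call_with_fallback`, `slow_retriever`, and `cached_retriever` are hypothetical names, and the sleep simply simulates a dependency that misses its budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_fallback(primary, fallback, timeout_s: float):
    """Run primary with a deadline; on timeout or error, return the fallback result.

    Returns (value, path) so telemetry can record which path served the request.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary)
    try:
        return future.result(timeout=timeout_s), "primary"
    except FutureTimeout:
        return fallback(), "fallback:timeout"
    except Exception:
        return fallback(), "fallback:error"
    finally:
        # Do not block the user on the abandoned call; the worker thread
        # still finishes in the background.
        executor.shutdown(wait=False)


def slow_retriever():
    time.sleep(0.3)  # simulates a retriever that blows its budget
    return ["fresh chunk"]


def cached_retriever():
    return ["cached chunk"]
```

Returning the path label alongside the value is a deliberate choice: it lets the fallback rate be counted directly from request logs.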

Step	Target latency	Notes
API validation	100 ms	Mostly deterministic
Retrieval	300 ms	Cache aggressively
Prompt assembly	100 ms	Watch context bloat
Model generation	2200 ms	Usually dominates
Optional tool call	700 ms	Keep off hot path if possible
Eval + logging	200 ms	Async where safe
Total	3600 ms	Leaves small contingency

Design with latency and cost budgets from day one. That is the blueprint becoming real. That is also what senior engineers sound like.
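
The budget table above can live in the repository as data, so measured timings are checked against it automatically. This is a minimal sketch: the step keys are hypothetical identifiers for the table rows, and `over_budget` is an illustrative helper.

```python
# Per-step latency budget in milliseconds, copied from the table above.
BUDGET_MS = {
    "api_validation": 100,
    "retrieval": 300,
    "prompt_assembly": 100,
    "model_generation": 2200,
    "optional_tool_call": 700,
    "eval_logging": 200,
}

TOTAL_BUDGET_MS = sum(BUDGET_MS.values())


def over_budget(measured_ms: dict, budget_ms: dict = BUDGET_MS) -> dict:
    """Return {step: overshoot_ms} for every measured step that exceeded its budget."""
    return {
        step: measured_ms[step] - limit
        for step, limit in budget_ms.items()
        if measured_ms.get(step, 0) > limit
    }
```

For example, `over_budget({"retrieval": 450, "model_generation": 2000})` flags only retrieval, with a 150 ms overshoot.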

Chapter 3: Implementation strategy

3.1 MVP first

Do not build the final version first. Build the narrowest useful path first. MVP does not mean careless. MVP means smart sequencing. You prove the user journey before adding sophistication. Every day should end with a runnable system. Runnable beats elegant-but-unrun.

Stage	Goal	Include now	Postpone
Stage 1	Happy path	basic UI, one request flow, one model path	secondary personas, fancy retries
Stage 2	Replace critical stubs	real retrieval or tool integration	nonessential branches
Stage 3	Inspection	gold set, latency logs, failure notes	full dashboards
Stage 4	Packaging	container, scripts, README skeleton	infra gold plating
Stage 5	Presentation	demo recording, diagrams, narrative	nice-to-have experiments

3.2 Build vs buy

Component	Default choice	Why	Build yourself when
Foundation model	Use an API first	Fastest iteration	Model choice is core to the artifact
Vector store	Managed or open-source	Do not spend the week on plumbing	Search itself is the thesis
Auth	Buy or stub lightly	Commodity work can eat the week	Identity complexity is central
Prompt versioning	Build lightly in repo	Tiny investment, high clarity	Always worth doing
Eval harness	Build	This is your proof of rigor	Always worth doing
Observability	Reuse existing tooling	Leverage proven pieces	Custom views are truly needed

Build what reveals product judgment. Buy what is commodity or time sink. That is not laziness. That is scope intelligence.

3.3 Integration testing

            /\
           /  \
          / e2e \
         /-------\
        / replay  \
       /-----------\
      / contracts   \
     /---------------\
    / unit + smoke    \
   /___________________\

Test layer	Main question	Example
Unit + smoke	Does each box run at all?	Retriever returns chunks
Contract tests	Do components agree on shape?	Tool output matches schema
Replay tests	Does yesterday's scenario still work?	Known query still cites correctly
End-to-end tests	Can the full job succeed?	User gets grounded answer within budget

Save every important bug as a replay case. That turns pain into leverage. Also test negative paths. Test empty retrieval. Test malformed input. Test tool timeout. Test cost budget breach.
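
A replay suite can start as a list of saved cases plus one loop. The sketch below assumes the system under test is a callable that takes a query and returns a dict with `answer` and `citations` keys. That interface, and the `ReplayCase` fields, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ReplayCase:
    name: str
    query: str
    must_cite: Optional[str] = None  # document id that must appear in citations
    expect_refusal: bool = False     # negative path: the system should decline


def run_replay(cases: list, system: Callable[[str], dict]) -> list:
    """Run every saved case and return a list of (case name, reason) failures."""
    failures = []
    for case in cases:
        out = system(case.query)
        refused = out.get("answer") is None
        if case.expect_refusal != refused:
            failures.append((case.name, "unexpected refusal state"))
        elif case.must_cite and case.must_cite not in out.get("citations", ()):
            failures.append((case.name, "missing citation"))
    return failures
```

Each bug you hit becomes one more `ReplayCase`, and negative paths such as empty retrieval are expressed with `expect_refusal=True` rather than skipped.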

3.4 Honest admission

Perfect is the enemy of shipped. This sentence becomes brutal during capstones. You will want one more reranker. You will want one more tool. You will want one more dashboard panel. Sometimes that helps. Often that is scope creep wearing a lab coat. Scope creep sounds responsible. It says, "I am improving the system." Sometimes it is delaying the system. Ask four questions before adding anything. Is this needed for the main demo? Can I measure the benefit this week? Does it reduce user risk or only satisfy curiosity? If I skip it, can I name it honestly as future work? If the answers are weak, postpone it. That is not failure. That is leadership.

Chapter 4: Evaluation and monitoring

4.1 System-level evals

Component evals ask local questions. System evals ask whether the whole job worked. That distinction matters more than most people expect. A retriever can score well and still poison the answer. A tool can succeed technically and still confuse the user. A judge can approve style and still miss business correctness. Build the inspection suite around tasks, not modules.

Metric	What it answers	Why it matters
Task success rate	Did the user complete the job?	Core usefulness metric
Citation faithfulness	Are claims supported by evidence?	Critical for grounded systems
Action success rate	Did the tool action complete safely?	Critical for agentic systems
p50 / p95 latency	How long did the whole flow take?	Users feel the slow tail
Cost per successful task	What did one good outcome cost?	Better than raw cost per request
Fallback rate	How often did the system degrade?	Reveals brittleness
Human escalation rate	How often did people step in?	Shows true readiness

Weak eval statement	Strong eval statement
It usually works.	24 of 30 gold scenarios passed.
Responses feel fast.	p95 end-to-end latency is 3.8 seconds.
Costs are manageable.	$0.018 per successful task on the default path.
It handles errors.	Retriever failures trigger fallback in 300 ms.
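
Strong eval statements like these come from a small amount of arithmetic over per-run records. The sketch below assumes each run is a dict with `success`, `latency_ms`, and `cost_usd` keys (a hypothetical record shape) and uses the nearest-rank method for p95. Other percentile definitions exist and give slightly different values on small samples.

```python
def p95(values: list) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def summarize(runs: list) -> dict:
    """Turn raw gold-set runs into the headline system metrics."""
    if not runs:
        raise ValueError("summarize needs at least one run")
    successes = [r for r in runs if r["success"]]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "task_success_rate": len(successes) / len(runs),
        "p95_latency_ms": p95([r["latency_ms"] for r in runs]),
        # Failed runs still cost money, so they are charged to the successes.
        "cost_per_success_usd": total_cost / len(successes) if successes else None,
    }
```

Dividing total cost by successes, not by requests, is what makes an unreliable but cheap path look as expensive as it really is.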

4.2 End-to-end latency budget

Segment	Budget	Watch-out
Validation	100 ms	Routing logic growing complex
Retrieval	250 ms	top-k too high, cold index
Tool selection	100 ms	agent deliberation too long
Tool execution	600 ms	external APIs slow or flaky
Generation	2200 ms	prompt too long or model too heavy
Post-processing + logging	150 ms	sync logging on hot path
Total	3400 ms	user starts feeling lag

4.3 Cost tracking

Cost source	What inflates it	First lever to try
Input tokens	Huge context and repeated history	Compress context, cache prefixes
Output tokens	Verbose answers	Cap max tokens
Extra model calls	Judge on every request	Sample or batch judging
Retrieval overhead	Too many embedding calls	Cache embeddings
Tool usage	Redundant validations	Bound retries and approvals
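
The effect of two of these levers, shorter context and sampled judging, can be estimated before touching the system. This is a rough sketch: the function names are hypothetical, and any per-token prices you pass in are placeholders to be replaced with your provider's actual rates.

```python
def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Cost of one model call, priced separately for input and output tokens."""
    return (
        input_tokens / 1000 * usd_per_1k_input
        + output_tokens / 1000 * usd_per_1k_output
    )


def expected_request_cost_usd(
    generation_cost_usd: float,
    judge_cost_usd: float,
    judge_sample_rate: float,
) -> float:
    """Average cost per request when only a fraction of requests is judged."""
    if not 0.0 <= judge_sample_rate <= 1.0:
        raise ValueError("judge_sample_rate must be within [0, 1]")
    return generation_cost_usd + judge_sample_rate * judge_cost_usd
```

Judging every request corresponds to a sample rate of 1.0. Dropping it to 0.1 removes 90% of the judge spend while still producing a steady stream of scored traffic.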

4.4 User-facing quality metrics

Measure what the user feels, not only what is easy. Ask four questions. Was the answer correct? Was the answer useful? Was the answer timely? Was the answer trustworthy? If your metrics ignore one of these, the picture is incomplete.

+------------------------------------------------------+
| Success rate | p95 latency | $ / success | fallbacks |
+------------------------------------------------------+
| 81%          | 3.8 s       | $0.018      | 7%        |
+------------------------------------------------------+
| Common failures: empty retrieval | tool timeout      |
| Highest cost path: long-context legal query          |
+------------------------------------------------------+

4.5 Retrieval prompts

What system design decision most strongly controls cost in this capstone, and why?
Which integration failure would bypass component-level evals but break the user experience?
What three dashboard signals would tell you the system is degrading before users complain?
Which part of this project best demonstrates Lead-level judgment rather than isolated implementation skill?

These prompts retrieve judgment, not trivia. That is why they matter. Module 16 will ask for exactly this kind of memory.

Chapter 5: Deployment and presentation

5.1 Containerization

Now comes move-in day. The house may look complete on your machine. But can someone else enter it safely? That is deployment. You do not need a giant platform this week. You do need reproducibility.

Item	Why it matters
Dockerfile or equivalent	Reproducibility
Config template	Faster setup and safer secrets
Health endpoint	Quick operational sanity
Startup script	Reviewers can run it without guessing
Model and prompt version notes	Behavior can be reproduced
Logs with trace id	Debugging during demo becomes possible
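
For the last row, one simple way to get a trace id onto every log line is Python's standard `logging.LoggerAdapter`. The sketch below is illustrative. The logger name `capstone`, the log format, and the in-memory stream are assumptions, and a real service would attach the handler to stdout or a log shipper instead.

```python
import io
import logging
import uuid

LOG_FORMAT = "%(levelname)s trace=%(trace_id)s %(message)s"


def new_trace_id() -> str:
    """Generate one id per incoming request."""
    return uuid.uuid4().hex


def request_logger(base: logging.Logger, trace_id: str) -> logging.LoggerAdapter:
    """Wrap a logger so every record it emits carries the request's trace id."""
    return logging.LoggerAdapter(base, {"trace_id": trace_id})


# Example wiring with an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(LOG_FORMAT))
base_logger = logging.getLogger("capstone")
base_logger.setLevel(logging.INFO)
base_logger.addHandler(handler)
base_logger.propagate = False

log = request_logger(base_logger, new_trace_id())
log.info("retrieval finished")
log.info("generation finished")
```

Both lines carry the same `trace=` value, so one search reconstructs the request end to end. Records sent through `base_logger` directly would lack `trace_id` and fail to format, which is a useful nudge to always log through the adapter.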

5.2 CI/CD for ML

git push
   |
   v
CI: lint / test / build image
   |
   v
run replay evals
   |
   v
publish image or artifact
   |
   v
deploy to staging-like target
   |
   v
smoke test + manual demo check

CI/CD for ML must include behavior checks. Unit tests alone only protect deterministic code paths. Replay evals protect model-driven behavior. That difference matters a lot.
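
The "run replay evals" step needs a pass/fail rule the pipeline can act on. A minimal sketch, assuming replay results arrive as a mapping from case name to a boolean and that a 90% pass rate is the chosen bar (both are assumptions for illustration):

```python
def replay_gate(results: dict, min_pass_rate: float = 0.9) -> int:
    """Map replay results {case_name: passed} to a process exit code for CI."""
    if not results:
        return 1  # an empty eval run should never pass silently
    passed = sum(1 for ok in results.values() if ok)
    return 0 if passed / len(results) >= min_pass_rate else 1
```

A CI script would pass the return value to `sys.exit`, so a behavior regression blocks the image from being published the same way a failing unit test would.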

5.3 Demo preparation

State the problem in one sentence.
Show the user input.
Flash the architecture quickly.
Run the system.
Show the output with evidence.
Show one failure mode and mitigation.
Show one quality metric.
Show one latency or cost metric.
End with one production improvement.

Also rehearse failure. Keep a recorded backup. Keep screenshots of metrics. That is not cheating. That is professional risk management.

5.4 Technical writing for the portfolio

README section	What it should answer
Problem	Who hurts today and why
Solution	What the system does in one paragraph
Architecture	Diagram plus core components
Decisions	Why this approach over alternatives
Evaluation	Gold set, metrics, headline numbers
Failure modes	Top risks and current mitigations
Future work	What production or Module 16 would add

5.5 LinkedIn showcase strategy

Do not write vague victory poetry. Write like an engineer who shipped something real. Good post structure: Problem. User. Stack. One hard trade-off. Repo and demo link. Honest lesson learned.

Foundation-gap audit

What Module 16 assumes	Diagnostic question	Evidence you should now have
Full system building experience	Can you walk one request from UI to telemetry without hand-waving?	End-to-end MVP, diagram, run instructions
Integration challenges	Have you actually seen contracts break or latency compound?	Bug notes, replay tests, fallback logic
Cost and latency trade-offs in practice	Can you name the slowest and costliest steps with numbers?	Budget table, logs, dashboard
Deployment basics	Can another engineer run or review the system quickly?	Container, startup command, health check, README

If any row feels weak, write it down honestly. Weakly understood pain becomes a fake principle later. Lived pain becomes a durable principle.

Bridge to Module 16

Next module — 20_engineering_leadership_judgment — formalizes the engineering judgment you developed here into principles: how to make technical decisions, manage complexity, and lead AI teams. This week gave you raw experience. Next week turns that experience into reusable judgment.

Chapter 6: Recap

6.1 Failure-fix chain

Failure	Fix
Component evals missed user failure	Add end-to-end gold scenarios with task-level pass criteria
Latency exploded after integration	Budget each stage before adding new boxes
Costs blew past plan	Track cost per successful task and route cheaper where possible
Architecture became hard to explain	Reduce moving parts and redraw the component diagram
Contracts broke silently	Write explicit request, retrieval, tool, and response schemas
One dependency failure killed everything	Add graceful degradation and local fallbacks
Team optimized local metrics only	Review system metrics first, component metrics second
Demo felt magical, not trustworthy	Show evidence, metrics, and one failure mitigation
Scope kept growing	Protect the MVP path and postpone unmeasured extras

6.2 Interview questions

Interview question	What a strong answer should include
Why this architecture over multi-agent design?	Defend simplicity, coordination cost, and observability.
What was the first system-level failure you discovered?	Tell the integration story, not a component anecdote.
How did you budget latency across the pipeline?	Show stepwise targets and the slowest segment.
How did you decide what to build versus buy?	Tie the answer to learning value and time limits.
What metrics convinced you the system was useful?	Use task success, trust, latency, and cost together.
What would you change before real production traffic?	Name auth, monitoring depth, rollout, and incident response improvements.

6.3 Production experience

If you built this capstone seriously, you now have production-shaped experience. You translated a user problem into a system design. You sequenced work for fast learning. You defined boundaries between probabilistic and deterministic components. You instrumented latency and cost. You wrote system-level evals. You prepared a reproducible demo and deployment path. You explained trade-offs to other engineers. That is not full production mastery. Be honest about that. But it is real production-shaped practice.

6.4 Exercises

Exercise	Prompt
Exercise 1	Delete one unnecessary component and write the before/after trade-off.
Exercise 2	Write your exact latency budget with one contingency plan.
Exercise 3	Create five end-to-end gold scenarios, including one timeout case.
Exercise 4	Write the contracts for your most fragile interface.
Exercise 5	Record a two-minute demo and note where the story gets muddy.
Exercise 6	List the top three cost drivers and the first lever for each.
Exercise 7	Write the Module 16 principle that emerged from your hardest bug.
Exercise 8	Ask a reviewer what still feels magical or unclear in the README.

Final recap in one breath. The blueprint is the system design. The foundation is the infrastructure choice. The plumbing is the data path. The inspection is the eval suite. The move-in day is deployment. If any one is weak, the house feels unsafe. If they work together, people can finally live there. That is the capstone. Not isolated brilliance. Integrated judgment.

Appendix — Rapid-fire reminders

Start from the user job.
Simplicity is a feature.
Trace ids are not optional.
A cheaper success beats an expensive maybe.
Save failing cases immediately.
Fallbacks protect trust.
Long prompts are hidden latency.
Replay tests are memory with teeth.
Budgets create discipline.
Good README writing is engineering clarity.
Scope creep often sounds responsible.
Trust is a system property.
Shipping teaches faster than polishing forever.
Local optimizations can hurt the whole chain.
Measure what the user feels.
A dashboard should tell a story.
Contracts reduce silent breakage.
Architecture is choosing pain deliberately.
Observability starts with good boundaries.
Boring systems are easier to defend.