11. Presentation and Portfolio: Demo, Writeup, What Interviewers Actually Want

~11 min read. A capstone that is not communicated well does not exist — the presentation is part of the project.

Built on the ELI5 in 00-eli5.md. The move-in day is over. The house is built. Now you take the interviewer on a tour. How you conduct the tour determines whether they want to move in.

What interviewers are actually evaluating

See. Most candidates think interviewers are judging the model choice. "I used GPT-4 with RAG" sounds impressive. It is not.

Interviewers evaluate three things.

1. Problem clarity. Did you start with a user problem or a technology? Someone who started with a clear user job and worked backward to the model shows product thinking. Someone who started with "I wanted to try agents" shows technology-first thinking.

2. Engineering rigour. Did you measure anything? Precision@3 = 0.78. Latency p95 = 720 ms. Cost per call = $0.00049. Numbers show you treated it as engineering, not as a demo.

3. Honest failure. What went wrong and what did you learn? Every interviewer knows that demos are polished. The candidate who says "my first retrieval approach gave precision@3 of 0.61 and here is what I changed" is more credible than the one whose demo always worked.

Simple, no? Problem → measurement → honest failure. That is the tour.
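The three kinds of numbers quoted above can each be computed in a few lines. What follows is a minimal sketch, not the course's eval harness: the function names, the sample queries, the latency samples, and the token prices are all illustrative assumptions. The p95 uses the nearest-rank definition; other percentile definitions interpolate and give slightly different values.

```python
import math

def precision_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]

def cost_per_call(input_tokens, output_tokens, usd_per_1k_input, usd_per_1k_output):
    """Dollar cost of one LLM call, given per-1k-token prices."""
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)

# Illustrative data only: (retrieved ids in rank order, set of relevant ids).
queries = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5", "d6"], {"d4", "d5", "d6"}),
    (["d7", "d8", "d9"], {"d9"}),
]
mean_p3 = sum(precision_at_k(r, rel) for r, rel in queries) / len(queries)

# Illustrative end-to-end latencies in milliseconds.
latencies_ms = [410, 520, 480, 700, 650, 390, 880, 560, 610, 450]
p95_ms = percentile_nearest_rank(latencies_ms, 95)

# Hypothetical token counts and prices; substitute your provider's real rates.
call_cost = cost_per_call(1200, 300, usd_per_1k_input=0.0005, usd_per_1k_output=0.0015)

print(f"precision@3 = {mean_p3:.2f}")
print(f"latency p95 = {p95_ms} ms")
print(f"cost per call = ${call_cost:.5f}")
```

Whichever definitions you pick, state them in the writeup. A p95 means little if the reader cannot tell how many samples sit behind it.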

The five-minute demo structure

A good capstone demo fits in five minutes without rushing.

┌─────────────┐
│ Problem     │
└──────┬──────┘
       ▼
┌─────────────┐
│ Architecture│
└──────┬──────┘
       ▼
┌─────────────┐
│ Live demo   │
└──────┬──────┘
       ▼
┌─────────────┐
│ Eval table  │
└──────┬──────┘
       ▼
┌─────────────┐
│ Next steps  │
└─────────────┘

Minute 1: Context
  "The system solves [user job statement].
   Here is who uses it and what they needed before this existed."

Minute 2: Architecture
  Draw or show the system diagram. One slide.
  "This is the blueprint. Here is the foundation. Here is the plumbing."
  Point at the diagram as you talk.

Minute 3: Live demo or recorded walkthrough
  Show one vertical slice working end-to-end.
  Show a real query. Show the retrieved chunks. Show the response.
  Show the latency number live.

Minute 4: Eval results
  Show the eval dashboard. Show the numbers.
  "Precision@3 is 0.78 against our target of 0.75."
  "The one metric we missed: latency p95 was 860 ms, just over the 800 ms SLA.
   Here is what I would fix next."

Minute 5: What I would do differently
  One honest reflection. One concrete next step.

Look. Five minutes. No padding. No live coding. The live demo takes exactly one minute. Numbers appear in minute four, not as an afterthought.
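For minute three, "show the latency number live" can be as small as a timing wrapper around the call you are demonstrating. A minimal sketch, assuming a hypothetical `answer_query` function that stands in for your real retrieval and generation pipeline (here it only sleeps and returns canned data):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Measure wall-clock time of the enclosed block, store it, and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        results[label] = elapsed_ms
        print(f"{label}: {elapsed_ms:.0f} ms")

def answer_query(query):
    """Hypothetical stand-in for the real pipeline: retrieve chunks, then generate."""
    time.sleep(0.05)  # simulated work
    return {
        "chunks": ["chunk A", "chunk B", "chunk C"],
        "response": f"Answer to: {query}",
    }

timings = {}
with timed("end-to-end latency", timings):
    result = answer_query("How do I reset my password?")

# Show the three things the demo calls for: query result, chunks, latency.
print("retrieved:", result["chunks"])
print("response:", result["response"])
```

A single timed call is a demo number, not a p95. Say so out loud, then point at the eval table for the distribution.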

The written portfolio entry

A written portfolio entry supplements the demo. It lives in your GitHub README or a personal site. Target: 600–900 words. Not more.

Structure:

1. Problem (50 words)
   User job statement. Who benefits. What problem existed before.

2. System Design (100 words + one diagram)
   Blueprint summary. Architecture chosen and why.
   Key constraints that drove decisions.

3. Key Technical Decisions (200 words)
   Chunking strategy and why. Embedding model and why.
   Prompt structure. Caching decisions.
   Two or three decisions, each with "I chose X over Y because Z."

4. Eval Results (100 words + one table)
   The core metrics. Numbers. Comparison to target.
   The one metric you missed and your plan.

5. What I Learned (100 words)
   One technical insight. One process insight.
   Written honestly. Not "everything went great."

6. What I Would Do Next (50 words)
   One concrete next step. Shows you are thinking beyond the demo.

See. The structure forces honesty. You cannot pad "what I learned" with generalities. You must name something specific that surprised you.
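The word budgets in the outline can be checked mechanically before you publish. A minimal sketch, assuming you have already split your README into a dict of section name to text; the `check_budgets` helper and the 25% tolerance are illustrative choices, not part of the outline itself.

```python
# Word budgets copied from the six-section outline above.
SECTION_BUDGETS = {
    "Problem": 50,
    "System Design": 100,
    "Key Technical Decisions": 200,
    "Eval Results": 100,
    "What I Learned": 100,
    "What I Would Do Next": 50,
}

def check_budgets(sections, budgets=SECTION_BUDGETS, tolerance=0.25):
    """Return (name, word_count, budget, ok) for each expected section.

    A section is ok when it is non-empty and no more than `tolerance`
    over its budget. The tolerance value is an assumption; tune it.
    """
    report = []
    for name, budget in budgets.items():
        words = len(sections.get(name, "").split())
        ok = 0 < words <= budget * (1 + tolerance)
        report.append((name, words, budget, ok))
    return report

# Illustrative draft: one section on budget, one bloated, the rest missing.
draft = {
    "Problem": "word " * 48,
    "Key Technical Decisions": "word " * 340,
}

for name, words, budget, ok in check_budgets(draft):
    flag = "ok" if ok else "FIX"
    print(f"{name:<26} {words:>4} / {budget:<4} {flag}")
```

The budgets sum to 600 words, the lower edge of the 600–900 target, so there is some headroom. Missing sections are flagged as well as long ones.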

Worked example: eval results table for a portfolio

| Metric              | Target   | Achieved | Status |
|---------------------|----------|----------|--------|
| Precision@3         | ≥ 0.75   | 0.78     | ✓      |
| Grounding score     | ≥ 0.90   | 0.93     | ✓      |
| Format compliance   | ≥ 0.95   | 0.96     | ✓      |
| Latency p95         | ≤ 800 ms | 860 ms   | ✗      |
| Cost per call       | ≤ $0.002 | $0.00049 | ✓      |

The ✗ on latency is not a failure to hide. It is evidence of a real engineering problem you identified. Explain: "p95 latency exceeded the SLA. Root cause: LLM call latency spikes under concurrent load. Fix: implement response streaming to cut perceived latency, and evaluate an on-premises model option to cut actual generation time."

This explanation is worth more to an interviewer than a polished table with all ticks. It shows you know the difference between demo-quality and production-quality.
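One way to keep the table honest is to generate it from the targets and the measured values, so the ✓ or ✗ is computed rather than typed. A minimal sketch using the numbers from the table above; the `Metric` class and `render_table` function are illustrative names, not an existing tool.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    target: float
    achieved: float
    higher_is_better: bool
    fmt: str  # display format for both target and achieved values

    def passed(self):
        """Compare against the target in the direction that counts as good."""
        if self.higher_is_better:
            return self.achieved >= self.target
        return self.achieved <= self.target

# Values taken from the worked example table.
METRICS = [
    Metric("Precision@3", 0.75, 0.78, True, "{:.2f}"),
    Metric("Grounding score", 0.90, 0.93, True, "{:.2f}"),
    Metric("Format compliance", 0.95, 0.96, True, "{:.2f}"),
    Metric("Latency p95", 800, 860, False, "{:.0f} ms"),
    Metric("Cost per call", 0.002, 0.00049, False, "${:g}"),
]

def render_table(metrics):
    """Render a pipe table with a computed Status column."""
    lines = [
        "| Metric | Target | Achieved | Status |",
        "|---|---|---|---|",
    ]
    for m in metrics:
        sign = "≥" if m.higher_is_better else "≤"
        status = "✓" if m.passed() else "✗"
        lines.append(
            f"| {m.name} | {sign} {m.fmt.format(m.target)} "
            f"| {m.fmt.format(m.achieved)} | {status} |"
        )
    return "\n".join(lines)

table = render_table(METRICS)
print(table)
```

Because the status is derived, a missed target cannot quietly turn into a tick when you copy numbers into the README.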

Common portfolio mistakes and how to avoid them

Mistake 1: Leading with the technology. "I built a system using GPT-4, RAG, LangChain, Pinecone, and FastAPI." This is a dependency list, not a problem statement. Lead with the user job. The tech is supporting evidence.

Mistake 2: No numbers. "The system performs well with good accuracy and fast responses." This is marketing language. It says nothing. Replace every adjective with a number.

Mistake 3: No failure. "Everything worked as expected." Interviewers know this is false. It makes them distrust everything else you say. Include one genuine failure and what you learned.

Mistake 4: Demo-only validation. "Here are example outputs that look good." Example outputs are not evaluation. Show the eval table with the held-out test set results.

Mistake 5: No "what next." Ending with "the project is complete" signals you are done thinking about it. End instead with one concrete improvement and why it matters.

Where this lives in the wild

Chip Huyen's ML interview guidance — problem framing before tech choices; numbers at every stage; honest failure analysis.
Eugene Yan's portfolio posts — show the decision, the alternative considered, and the measurement that justified the choice.
Hamel Husain's capstone projects — eval-first: always lead with the eval results, not the demo outputs.
Andrej Karpathy's project writeups — short, dense, honest; one clear diagram; no padding.
Stripe engineering blog — technical posts show what failed in the first version and what the learning was; trusted because they are honest.

Pause and recall

What are the three things interviewers are actually evaluating?
Name the five segments of the five-minute demo structure.
In the portfolio structure, what goes in the "Key Technical Decisions" section?
Why should you include a failed metric in the eval table rather than hiding it?

Interview Q&A

Q: "Walk me through a technical project you built from scratch."

A: I start with the user job. "Users needed X. Before my system, they had to Y." Then I describe the blueprint: the constraints that drove the architecture choice. Then I share one key technical decision: what I chose, what I rejected, and why. Then I share the eval results, including the metric I did not hit. Then I describe what I would do next.

Common wrong answer to avoid: "I built a RAG system using GPT-4 and Pinecone." This starts with the technology and never mentions the user, the constraints, or the measurements.

Q: "What metric did you miss in your capstone and what would you fix?"

A: This question is a gift. Answer it honestly and specifically. "My p95 latency was 860 ms against a target of 800 ms. The bottleneck was LLM call latency spiking under concurrent load. I would implement response streaming to reduce perceived latency and evaluate a local model to reduce actual generation time."

Common wrong answer to avoid: "All my metrics met the target." The interviewer knows this is almost certainly false. Claiming perfection costs you credibility.

Q: "Why did you choose RAG over fine-tuning for your project?"

A: Because the knowledge base changes weekly and the dataset was too small for effective fine-tuning. Fine-tuning typically needs at least hundreds of labelled examples and would not capture new articles until the next training cycle. RAG refreshes the index nightly and cites sources, which the users needed for trust.

Common wrong answer to avoid: "RAG is better than fine-tuning." Neither is better. The choice depends on the data freshness requirement and the volume of training data available.

Q: "If you had one more week, what would you change about your capstone?"

A: Answer with the specific engineering gap you identified. "I would implement semantic caching. I measured that 23% of queries are paraphrases of previous queries. Semantic caching would reduce LLM cost by approximately 20% and cut latency for those queries by 90%."

Common wrong answer to avoid: "I would add more features." This shows you are thinking about adding complexity, not about engineering quality.
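The semantic caching answer is stronger if you can show the arithmetic behind "approximately 20%". A rough sketch under stated assumptions: every paraphrased query becomes a cache hit (an upper bound), a hit still costs about 10% of a full LLM call for the embedding lookup (an assumed figure), and a hit takes about 10% of the normal latency (which is what "cut by 90%" means). Only the 23% comes from the answer above.

```python
def semantic_cache_estimate(hit_rate, lookup_cost_fraction, hit_latency_fraction):
    """Estimate savings from a semantic cache.

    hit_rate:              share of queries served from cache
    lookup_cost_fraction:  cost of a cache hit relative to a full LLM call
    hit_latency_fraction:  latency of a cache hit relative to a full call
    """
    cost_saving = hit_rate * (1 - lookup_cost_fraction)
    latency_saving_on_hits = 1 - hit_latency_fraction
    mean_latency_saving = hit_rate * latency_saving_on_hits
    return cost_saving, latency_saving_on_hits, mean_latency_saving

cost_saving, hit_saving, mean_saving = semantic_cache_estimate(
    hit_rate=0.23,              # measured share of paraphrased queries
    lookup_cost_fraction=0.10,  # assumption: embedding + lookup overhead
    hit_latency_fraction=0.10,  # assumption: a hit takes 10% of the usual time
)

print(f"LLM cost saving:          {cost_saving:.1%}")
print(f"latency saving on hits:   {hit_saving:.0%}")
print(f"mean latency saving:      {mean_saving:.1%}")
```

Note what the estimate does not say. The 90% applies only to cache hits, and a p95 is driven by the slowest requests, which are likely to be cache misses, so this change alone may not move the p95 at all.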

Apply now (5 min)

Write the five-minute demo script for your capstone. Time yourself. Cut anything that runs over. Write the eval results table. Include at least one metric with a ✗. Write the "what I would do next" section in 50 words.

Sketch from memory: Draw the six-section portfolio outline. Fill in one specific detail from your own capstone in each section.

Bridge. You can present the capstone. But what do you do when the whole system breaks in the demo? Debugging under pressure is a skill. We teach it next. → 12-integration-debugging.md