
05. Implementation Strategy — Build Order, Vertical Slices, Iteration Discipline

~11 min read. The order you build things in determines what you learn and when — get it wrong and you debug the wrong thing for weeks.

Built on the ELI5 in 00-eli5.md. The foundation — the infrastructure choices everything rests on — must be laid before walls go up. Build order matters as much as build quality.


The wrong build order and why teams do it

See. Almost every first-time capstone team builds in this order:

  1. Build and tune the LLM prompt until it looks great.
  2. Add retrieval.
  3. Add the UI.
  4. Add evaluation.
  5. Try to deploy.

Then at step 5 they discover the retrieval was wrong all along, and the beautiful prompt was compensating for bad context. The evaluation at step 4 arrives too late: it catches failures that were built in at step 2, after the prompt has already been tuned around them. Everything must be rebuilt.

Simple, no? Build order is not a matter of preference. It is a matter of which feedback arrives earliest. Bad news that arrives late is expensive.


The right build order: outside-in

Build from the user backward, not from the model forward.

Phase 1: End-to-end skeleton (no intelligence)
─────────────────────────────────────────────
User input → [stub retriever returns fixed text] → [LLM call] → UI display

Purpose: Test the interface contracts before any real components exist.
Time: 1 day.

Phase 2: Retrieval (the plumbing)
─────────────────────────────────
Real retriever → measure retrieval recall on 50 test queries.
Iterate on chunking and embedding until recall ≥ target.
Do NOT touch the prompt yet.
Time: 2–4 days.

Phase 3: Generation (the model)
────────────────────────────────
Connect real retriever to prompt. Measure end-to-end quality on 50 test cases.
Iterate on prompt and context assembly.
Time: 2–3 days.

Phase 4: Evaluation (the inspection)
──────────────────────────────────────
Build the full eval suite against the working system.
Establish a baseline. Lock it.
Time: 1–2 days.

Phase 5: Hardening
───────────────────
Add error handling, retries, logging. Test edge cases.
Time: 1–2 days.

Phase 6: Deployment (move-in day)
───────────────────────────────────
Deploy. Run canary. Monitor. Iterate.
Time: 2–3 days.

Look. Phase 1 exists only to test interfaces before you invest in components. The foundation is Phase 1 — the skeleton. Do not skip it.
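A minimal sketch of what the Phase 1 skeleton could look like in Python. All names here (`stub_retriever`, `stub_llm`, `render`, `answer_query`) are hypothetical, and the fixed passage is invented for illustration. To keep the sketch runnable offline, the LLM call is also stubbed; in a real skeleton you would pass in a function that calls your actual model.

```python
from typing import Callable, List

# Interface contracts: these two signatures are what Phase 1 is testing.
Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], str]


def stub_retriever(query: str) -> List[str]:
    """Phase 1 stub: ignores the query and returns fixed text."""
    return ["Returns are accepted within 30 days of delivery."]


def stub_llm(query: str, passages: List[str]) -> str:
    """Offline stand-in for the LLM call (assumption: swap in a real call)."""
    context = " ".join(passages)
    return f"[stub answer] Q: {query} | context: {context}"


def render(answer: str) -> str:
    """Plain-text UI layer."""
    return f"Assistant: {answer}"


def answer_query(query: str, retriever: Retriever, generator: Generator) -> str:
    """User input -> retriever -> generator -> UI display."""
    passages = retriever(query)
    # Fail loudly if a component breaks the contract.
    if not all(isinstance(p, str) for p in passages):
        raise TypeError("retriever must return a list of strings")
    return render(generator(query, passages))


skeleton_output = answer_query("What is the return policy?", stub_retriever, stub_llm)
```

Because `answer_query` only depends on the two signatures, replacing `stub_retriever` with a real retriever in Phase 2 should not require touching the generator or the UI.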


Vertical slices: the key technique

A vertical slice is a thin path through all layers for one narrow user scenario.

Bad (horizontal):
  Build entire retrieval system ──▶ then entire prompt system ──▶ then entire UI

Good (vertical):
  Slice 1: User asks "What is the return policy?"
  ──▶ retriever handles this one query type
  ──▶ prompt handles this one query type
  ──▶ UI displays this one result
  All the way through. Working. Tested. Ship-ready.

  Slice 2: User asks "How do I track my order?"
  ──▶ Same stack, new query type. Extend retriever, prompt, UI.

Why vertical slices? Each slice is independently demonstrable. Each slice is independently testable. Progress is visible to stakeholders at the end of every day. If you run out of time, you have a working system — just fewer slices.

Horizontal builds give you nothing shippable until the very end. Vertical slices give you something shippable after day two.
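One way to make "shippable per slice" concrete is to keep an end-to-end check for each slice and run them all against the current system. The sketch below is illustrative only: `SliceCheck`, `run_slices`, and `toy_assistant` are hypothetical names, and the acceptance criterion (a substring that must appear in the answer) is a deliberately crude assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SliceCheck:
    name: str          # slice identifier
    query: str         # the one user scenario this slice covers
    must_contain: str  # hypothetical acceptance criterion


def toy_assistant(query: str) -> str:
    """Stand-in for the full stack; only slice 1 is implemented so far."""
    if "return policy" in query.lower():
        return "Per 'Returns and Refunds': items can be returned within 30 days."
    return "Sorry, I cannot help with that yet."


def run_slices(answer_fn: Callable[[str], str], slices: List[SliceCheck]) -> Dict[str, bool]:
    """Run every slice end-to-end and report which ones pass."""
    return {
        s.name: s.must_contain.lower() in answer_fn(s.query).lower()
        for s in slices
    }


SLICES = [
    SliceCheck("return_policy", "What is the return policy?", "30 days"),
    SliceCheck("order_tracking", "How do I track my order?", "tracking"),
]

status = run_slices(toy_assistant, SLICES)
```

Here `status` marks `return_policy` as passing and `order_tracking` as not yet built, which is the "working system, just fewer slices" state described above.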


Worked example: slice schedule for support assistant

Day 1-2:  Skeleton. Stub retriever, real LLM call, plain text UI.
          Goal: full interface chain working. No intelligence yet.

Day 3-4:  Retrieval slice. Connect real KB. Test 50 queries.
          Measure: Precision@3. Target: ≥ 0.75.
          Current: 0.61. Problem: chunks too large.
          Fix: reduce chunk size from 600 to 350 tokens.
          Re-test: 0.78. ✓

Day 5-6:  Generation slice. Connect retriever to prompt.
          Test 50 end-to-end cases. Measure: correct resolution rate.
          Target: ≥ 0.70. Current: 0.66.
          Fix: add explicit instruction "cite the article title."
          Re-test: 0.73. ✓

Day 7:    Evaluation slice. Build eval suite. Baseline locked.
          Precision@3 = 0.78. Resolution rate = 0.73. Latency p95 = 720 ms.

Day 8-9:  Edge case slice. Empty results, ambiguous queries, off-topic queries.
Day 10:   Deployment slice. Canary to 5% of agents. Monitor.

See. Every day ends with something that works. Nobody waits until day 10 to find out if retrieval works.
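The Precision@3 figure in the schedule can be computed with a few lines of Python. This is a sketch under stated assumptions: document IDs are strings, relevance labels come from a hand-labelled set, and the function names and toy data are hypothetical. Precision@k is taken as hits in the top k divided by k, so a retriever that returns fewer than k results is penalised for the missing slots.

```python
from typing import List, Set, Tuple


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of the top-k retrieved IDs that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mean_precision_at_k(runs: List[Tuple[List[str], Set[str]]], k: int = 3) -> float:
    """Average Precision@k over all test queries."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Toy data: (ranked retrieved IDs, set of relevant IDs) per query.
runs = [
    (["a", "b", "c"], {"a", "c"}),  # 2 of 3 relevant
    (["d", "e", "f"], {"d"}),       # 1 of 3 relevant
]

score = mean_precision_at_k(runs, k=3)
TARGET = 0.75  # threshold from the schedule above
meets_target = score >= TARGET
```

With the two toy queries the mean is 0.5, which would miss the 0.75 target and send you back to chunking, as on day 3-4.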


Iteration discipline: when to stop tuning

A common failure mode: endless prompt tuning. You try variation after variation on 10 test queries. Your score goes from 7/10 to 9/10. You feel done.

But the eval suite has 200 queries. Run it. Score: 58/200 = 0.29. The 10 queries you hand-tuned were not representative.
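One way to keep tuning queries out of the reported score is to split the query pool once, deterministically, before any tuning starts. The sketch below hashes each query so the split is stable across runs and machines; `split_queries` and the 80% held-out fraction are assumptions for illustration, not a prescribed ratio.

```python
import hashlib
from typing import List, Tuple


def split_queries(queries: List[str], held_out_fraction: float = 0.8) -> Tuple[List[str], List[str]]:
    """Assign each query to the tuning set or the held-out set by hash.

    The same query always lands in the same set, so adding new queries
    later never moves an old query from one set to the other.
    """
    tuning, held_out = [], []
    for q in queries:
        digest = hashlib.sha256(q.encode("utf-8")).digest()
        bucket = digest[0] / 256  # deterministic value in [0, 1)
        if bucket < held_out_fraction:
            held_out.append(q)
        else:
            tuning.append(q)
    return tuning, held_out


all_queries = [f"query {i}" for i in range(200)]
tuning_set, held_out_set = split_queries(all_queries)
```

Tune prompts only against `tuning_set`, and report scores only from `held_out_set`.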

Iteration discipline rules:

  1. Always evaluate on a held-out set, not the training/tuning set.
  2. Set a threshold before you start tuning. Stop when you hit it.
  3. Track every experiment. Record the change and the delta in score.
  4. Two failed experiments in a row: revisit the architecture, not the prompt.

The inspection — the eval suite — is the only honest feedback. Do not substitute your intuition for it.
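The stopping rules can be made mechanical with a small experiment log. This is a sketch, not a prescribed tool: `ExperimentLog` and `min_gain` are hypothetical, a "failed" experiment is assumed to mean a gain below `min_gain` over the best score so far, and every score passed in is assumed to come from the held-out set.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentLog:
    threshold: float         # set before tuning starts
    baseline: float          # held-out score before any change
    min_gain: float = 0.01   # assumption: smaller gains count as failures
    entries: List[Dict] = field(default_factory=list)

    def record(self, change: str, score: float) -> str:
        """Log one experiment and return what to do next."""
        best = max([self.baseline] + [e["score"] for e in self.entries])
        self.entries.append({"change": change, "score": score, "delta": score - best})
        return self.decision()

    def decision(self) -> str:
        if self.entries and self.entries[-1]["score"] >= self.threshold:
            return "stop: threshold reached"
        last_two = self.entries[-2:]
        if len(last_two) == 2 and all(e["delta"] < self.min_gain for e in last_two):
            return "revisit architecture"
        return "continue"


# Numbers from the day 5-6 generation slice in the schedule.
log = ExperimentLog(threshold=0.70, baseline=0.66)
first = log.record('add instruction "cite the article title"', 0.73)

# A run where prompt changes stop helping (hypothetical scores).
stuck = ExperimentLog(threshold=0.70, baseline=0.29)
step1 = stuck.record("rephrase system prompt", 0.29)
step2 = stuck.record("add two few-shot examples", 0.28)
```

In the first log the single change clears the threshold, so tuning stops. In the second, two consecutive experiments fail to beat the best score, which triggers the architecture review.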


Where this lives in the wild

  • Spotify — ML teams use a "model-in-the-middle" vertical slice: the recommendation model slice covers one genre before expanding to others.
  • Airbnb — ML features launched in vertical slices by geography; one country at a time to catch interface bugs early.
  • Stripe — fraud model shipped in shadow mode first (the equivalent of the Phase 1 skeleton) before it made any real decisions.
  • GitHub Copilot — retrieval (code context) was validated separately from generation quality before the two were connected.
  • Linear — AI triage built skeleton with random labelling before any real model, to test the UI and pipeline interfaces.

Pause and recall

  1. What is wrong with the common build order (prompt first, retrieval second)?
  2. Define "vertical slice" in one sentence without looking it up.
  3. In the schedule example, what problem was found on day 3-4 and how was it fixed?
  4. What are the four iteration discipline rules?

Interview Q&A

Q: "How do you structure the build of a new AI feature to avoid wasting time?"

A: I use vertical slices. First I build a skeleton that tests all interface contracts with stub components. Then I replace one stub at a time, verifying each slice end-to-end before moving to the next. I evaluate on a held-out set, not on the queries I tuned on.

Common wrong answer to avoid: "I build each component fully before connecting them." Horizontal builds hide interface bugs until the end, when they are most expensive to fix.


Q: "When should you stop tuning a prompt and change the architecture instead?"

A: When two consecutive prompt experiments produce no meaningful improvement on the held-out eval set. At that point, the bottleneck is no longer the prompt — it is the architecture, the retrieval quality, or the data. Continuing to tune prompts against a structural failure wastes time.

Common wrong answer to avoid: "Keep tuning until it works." Diminishing returns on prompt tuning are a signal, not a challenge.


Q: "How do you show progress on an AI project to non-technical stakeholders?"

A: By demoing vertical slices. Each slice is a complete user scenario that works end-to-end. At the end of each sprint, I demo the slice. Stakeholders can see tangible progress and provide feedback on real functionality, not on mock-ups.

Common wrong answer to avoid: "I show a Jupyter notebook with example outputs." Notebook demos do not represent production system behaviour.


Q: "You have two weeks to build a capstone. How do you prioritise?"

A: Phase 1: skeleton (2 days). Phase 2: retrieval slice for the most important query type (3 days). Phase 3: generation for that slice (2 days). Phase 4: eval baseline (1 day). Phase 5: one more slice if time allows (3 days). Final day: cleanup and demo prep.

Common wrong answer to avoid: "I build everything and then see what works." Full-system builds with no intermediate validation almost always fail at integration.


Apply now (5 min)

Map your capstone to the six-phase build order. Write a day-by-day schedule for your specific project. Identify your first vertical slice: which one user scenario will you build end-to-end first? Define "done" for each day — a measurable outcome, not "worked on it."

Sketch from memory: Draw the horizontal vs. vertical slice diagrams. Label which is which and write one disadvantage of horizontal under the diagram.


Bridge. With the build order locked, you write the system prompt and few-shot examples. The blueprint specified what to build. Now we write what the model actually reads. → 06-prompt-engineering-project.md