Designing Agents: Capstone Assignment

Module 16 covered twenty-four chapters of architectural decisions. This folder is where you exercise them on a single running build, end to end. The capstone is one agent shipped through every phase of the lifecycle — design, build, launch, run.

What you will build

A refund-handling agent for a fictional Indian fintech, NimbusPay, that serves two tenants (Acme and Initech) with very different plans and risk profiles. The agent answers customer messages, looks up orders, checks refund policy, decides eligibility, gates large refunds for human review, and ships through shadow → canary → ramp → 100% with a working kill switch.
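The eligibility and review-gate decision described above can be sketched as a small routing function. This is a minimal illustration, not the capstone's reference solution: the `RefundRequest` fields, the `Decision` names, and the `decide` function are hypothetical, and the ₹50,000 threshold is the one the Phase 2 description uses for its approval gate.

```python
from dataclasses import dataclass
from enum import Enum

# Threshold in rupees; the value is the one the Phase 2 description names.
APPROVAL_THRESHOLD_INR = 50_000


class Decision(Enum):
    DENY = "deny"
    AUTO_APPROVE = "auto_approve"
    NEEDS_HUMAN = "needs_human"


@dataclass(frozen=True)
class RefundRequest:
    tenant_id: str
    order_id: str
    amount_inr: int
    eligible: bool  # outcome of the policy check (hypothetical field)


def decide(request: RefundRequest) -> Decision:
    """Route a refund: deny it, issue it automatically, or queue it for review."""
    if not request.eligible:
        return Decision.DENY
    if request.amount_inr > APPROVAL_THRESHOLD_INR:
        return Decision.NEEDS_HUMAN  # large refunds wait for a human
    return Decision.AUTO_APPROVE
```

Keeping the decision a pure function makes it easy to unit-test before any model or tool is wired in.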

The same agent, with the same model and the same toolbelt, is extended across four phases, each phase adding one architectural layer on top of the previous phase's code. By the end, you have one agent that exercises every chapter's discipline and a folder of artifacts (traces, evals, runbooks) you can show in an architecture review.

The refund domain is chosen because it is concrete, high-stakes, and threads naturally through every chapter — leash choice, schema design, blast radius, approval gates, multi-tenancy, observability, eval gates, rollout. You will recognise the running examples from the chapter prose; the hands_on_lab lets you operate them.

Why one agent across four phases, not four separate exercises

Most agent tutorials show a different toy for each concept. The cost is that the concepts never compose. In production, every architectural decision interacts with every other — your blast radius decision constrains your approval-gate decision, which constrains your latency budget, which constrains your model-routing decision. The only way to feel those interactions is to build one agent across all the decisions.

So the capstone is deliberately cumulative. Phase 1's loop becomes Phase 2's safeguarded loop. Phase 2's safeguards become Phase 3's observable safeguards. Phase 3's observability is what Phase 4's eval gates and rollout depend on. Each phase reads the previous phase's code and adds one layer; nothing gets thrown away.

The four phases

phase-1-build-the-loop.md — covers chapters 01–07. Build the agent loop, the typed tool schemas, the descriptions, and expose the tools through an MCP server. Output: a working ReAct refund agent that talks to one tenant.
phase-2-bound-the-blast.md — covers chapters 08–15. Classify every tool by blast radius, add idempotency keys, wire stopping rules, design the parallel-vs-chain schedule, add an approval gate above ₹50,000, set the cost-and-latency budget, and introduce per-tenant isolation. Output: the same agent, now safe for production.
phase-3-survive-production.md — covers chapters 16–18. Add state-recovery checkpoints, audit the topology for its first crack, wire the observability spans an on-call engineer would need at 3 AM. Output: the same agent, now debuggable.
phase-4-ship-with-discipline.md — covers chapters 19–24. Build the six-gate yardstick, write the rollout plan (shadow → 1% → 10% → 50% → 100%), wire the kill switch, set up the prompt/tool/model version tuple, and run the 3 AM kill drill. Output: the same agent, now shippable.
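One way to picture the Phase 4 rollout ladder and kill switch is deterministic session bucketing. The sketch below assumes a hash-based bucket and a boolean flag; the stage names, `bucket`, and `serves_new_version` are illustrative assumptions, not names the phase files define.

```python
import hashlib

# Hypothetical stage names -> percentage of live traffic on the new version.
# Shadow serves 0%: the new version runs alongside, but its output is not sent.
ROLLOUT_STAGES = {"shadow": 0, "canary": 1, "ramp_10": 10, "ramp_50": 50, "full": 100}


def bucket(session_id: str) -> int:
    """Map a session to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_new_version(session_id: str, stage: str, kill_switch_on: bool) -> bool:
    """Return True if this session should be answered by the new version."""
    if kill_switch_on:
        return False  # the kill switch overrides every rollout stage
    return bucket(session_id) < ROLLOUT_STAGES[stage]
```

Because the bucket is derived from the session ID, a session that saw the new version at 1% keeps seeing it at 10% and 50%, and flipping the flag moves every session back at once.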

The four phases together are roughly 12–20 hours of focused work. Splitting the work across multiple sessions is fine; each phase markdown is written so you can stop at a phase boundary and pick up cleanly.

Shorter exercises

The capstone is the main artifact. The companion file exercises.md has three shorter exercises — read a trace and find the failure, audit a colleague's proposed agent design, write the rollout plan for an existing agent — that take 30–60 minutes each and exercise specific chapters in isolation. Useful when you want a quick rep on one decision rather than a full build.

Prerequisites

Python 3.11+ with the anthropic or openai SDK installed, and an API key for a model endpoint that supports tool use.
Optional but recommended: langchain or langgraph (used in Phase 3 for checkpointer comparison), mcp Python SDK (used in Phase 1 for the MCP server), Postgres or SQLite (Phase 3 checkpointing), and any tracing tool that supports OpenTelemetry-style spans (Langfuse, Helicone, Datadog, or hand-rolled JSONL works).
Familiarity with at least chapters 01, 02, 03, 08, and 13 before starting Phase 1. Each phase points back to the chapters it draws on.
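If you take the hand-rolled JSONL route for tracing, a span writer can be very small. The sketch below is one possible shape, assuming one JSON object per line; `write_span`, `REQUIRED_FIELDS`, and the placeholder attribute values are hypothetical, while the five required field names are the ones the capstone's trace deliverable lists.

```python
import io
import json
import time
import uuid

# Field names taken from the capstone's trace deliverable.
REQUIRED_FIELDS = ("tenant_id", "trace_id", "cost_usd", "model_version", "prompt_version")


def write_span(sink, name: str, **attrs) -> dict:
    """Append one span as a JSON line; reject spans missing required fields."""
    missing = [field for field in REQUIRED_FIELDS if field not in attrs]
    if missing:
        raise ValueError(f"span {name!r} is missing {missing}")
    span = {"name": name, "span_id": uuid.uuid4().hex, "ts": time.time(), **attrs}
    sink.write(json.dumps(span) + "\n")
    return span


buffer = io.StringIO()  # stands in for an open trace file
write_span(
    buffer,
    "policy_check",
    tenant_id="acme",
    trace_id="t-001",
    cost_usd=0.0021,
    model_version="model-snapshot-placeholder",
    prompt_version="prompt-hash-placeholder",
)
```

Failing loudly on a missing field is a deliberate choice here: a span without `tenant_id` is hard to attribute after the fact.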

You do not need a cloud account or a real banking integration; every "external" tool is a local mock that returns shaped data. The discipline being practiced is architectural, not infrastructural.

What "done" looks like for the capstone

At the end of Phase 4 you should have, sitting in a single repo, the artifacts a real production architect would carry into a launch review:

An agent that handles the running refund task end to end on at least two tenants.
A toolbelt with typed schemas, blast-radius classes named per tool, idempotency keys on every write, and an MCP server exposing them.
A stopping rule wired as an OR-gate across iterations, tokens, time, and cost — with an honest give-up message.
An approval-gate spec for refunds over ₹50,000, with a packet, timeout, and resume policy.
A scratchpad schema with five named keys and a checkpoint between policy-check and refund-issue.
A trace from a representative session with every span carrying tenant_id, trace_id, cost_usd, model_version, and prompt_version.
A six-gate yardstick spec with the threshold and sign-off owner for each gate, and at least 30 labelled eval cases against the capability gate.
A rollout plan with shadow duration, canary percentage, hold times, and three auto-rollback triggers.
A working kill switch — flag-based, under 60 seconds to flip — that you have personally drilled.
A versioning manifest pinning prompt-hash, tool-set-version, and model-snapshot-ID per release.

If you ship the full artifact set, you have what you need to defend the design in a senior architecture review. That is the point.

How to use the phase markdowns

Each phase markdown has the same shape:

What you will add this phase — the layer being introduced, why it sits where it does.
Chapters to read first — the chapters the phase draws on; skim what you forget.
The build — concrete numbered steps, each one a small commit's worth of work.
Worked example — what one decision looks like in code or in the artifact you produce.
Acceptance check — a question you should be able to answer plainly before moving to the next phase. If you can't answer it, the layer isn't really built.
Common stumbles — three or four places previous learners got stuck and how to recover.
Reflection prompts — short questions whose answers should make their way into your design notes.

Treat the phase as a punch list, not a script. You should make decisions, not follow recipes. When the phase says "pick a stopping rule combination," you pick — the reflection prompts are there precisely to test whether the choice was deliberate or default.

Honest expectations

You will hit edge cases the chapters did not name. That is the point; production has more failure modes than any module covers.
Your first pass on each phase will be wrong somewhere. The acceptance checks exist to catch the most common mistakes; deeper bugs surface during the kill drill and the rollout simulation in Phase 4.
The capstone is harder than it looks because it makes you commit. Chapters let you read about leash length; the capstone makes you pick one and defend it.

If at any point you feel the hands_on_lab is going in circles, re-read the relevant chapter's Operational memory section. That section is what the chapter promised you would carry away; if you can't apply it in the build, the chapter didn't land, and re-reading it is the right move.

Start with phase-1-build-the-loop.md.