Shorter exercises
Three 30–60 minute exercises that rep one chapter's discipline in isolation. Use these when you want a quick drill on a specific decision rather than the full capstone. Each one is self-contained — no shared state with the capstone — and produces a written artifact you can save next to your design notes.
Exercise 1 — Read a trace and find the failure
~45 min. Exercises chapters 17 (failure mode by topology) and 18 (observability).
A customer-support agent at a different company (not NimbusPay) ran a refund flow that ended badly. The customer received a confirmation email for a refund that never completed. The trace is below; your job is to read it and locate the failure layer.
The trace
trace_id=t-991 tenant_id=initech session_id=s-44 prompt_version=v17 model=claude-3-5-sonnet-latest
[step_0] agent_step
├── llm_call input_tokens=420 output_tokens=85 cost_usd=0.003
│ decision: "look up customer for support+initech@example.com"
├── tool_call name=find_customer_by_email latency_ms=140 status=success
│ args: {email: "support+initech@example.com"}
│ result: {customer_id: 882, tier: "standard", region: "IN"}
│
[step_1] agent_step
├── llm_call input_tokens=620 output_tokens=110 cost_usd=0.004
│ decision: "list orders, retrieve policy in parallel"
├── tool_call name=list_orders latency_ms=210 status=success
│ args: {customer_id: 882, since_days: 90}
│ result: [{order_id: "448100", placed_at: "2026-05-08", amount_inr: 6400, status: "delivered"}]
├── tool_call name=retrieve_refund_policy latency_ms=190 status=success
│ args: {query: "refund window", top_k: 4}
│ result: 4 chunks returned, region_filter_applied=ANY, confidence=[0.71, 0.68, 0.66, 0.65]
│
[step_2] agent_step
├── llm_call input_tokens=1840 output_tokens=140 cost_usd=0.012
│ decision: "policy says 7-day window; order is 14 days; outside window; deny politely"
├── tool_call name=send_customer_email latency_ms=380 status=success
│ args: {customer_id: 882, body: "...your refund cannot be processed; outside window..."}
│ result: {status: "sent"}
[step_3] agent_step
├── llm_call input_tokens=2100 output_tokens=60 cost_usd=0.014
│ decision: "wait — re-checking. apologise, override and refund"
├── tool_call name=issue_refund latency_ms=280 status=success
│ args: {order_id: "448100", amount_inr: 6400, reason: "delay",
│ idempotency_key: "refund_448100_delay_v2"}
│ result: {refund_id: "rf_993", status: "issued"}
├── tool_call name=send_customer_email latency_ms=410 status=success
│ args: {customer_id: 882, body: "your refund of ₹6,400 has been issued..."}
│ result: {status: "sent"}
[final]
total_cost_usd: 0.033 total_latency_ms: 3120
Your task
The customer received two emails — first a denial, then a confirmation — and the refund did fire. But the customer's complaint to support says the refund never arrived. The bank shows the refund as initiated but in a "pending" state for the past 72 hours.
Find the failure. There are at least three architectural problems in this trace. Write each one down, name the chapter it violates, and propose the fix.
Hints if you get stuck
- Look at the retrieve_refund_policy call's region_filter_applied value.
- The two emails are not equivalent. Read both bodies. What did the customer see in order?
- The model in the trace is pinned by alias, not snapshot. What does chapter 21 say about that?
Expected findings (don't read until you've tried)
**Finding 1 — wrong region retrieved.** The retrieve call carries `region_filter_applied=ANY`. The agent is running for `tenant=initech, region=IN` but pulled chunks from any region. The top chunks have confidence in the 0.65–0.71 range, which is borderline. The agent paraphrased a 7-day window that almost certainly came from a non-Indian policy supplement. The correct Indian policy is 21 days, and the order was 14 days old — well within window. The whole denial was based on the wrong policy. Violates chapter 11b (retrieval region filter must be required and enforced).
**Finding 2 — no rollback on the contradiction.** Step 2 fired the denial email. Step 3 reversed the decision and fired the refund and a second email. The customer received both emails in sequence. The denial email was never explicitly retracted — the customer only saw two messages and concluded the second one might be a system error. The agent should have detected its own contradiction (Step 3 reading Step 2's tool call in its own scratchpad) and either escalated, or sent a single corrective message that explicitly retracted the prior denial. Violates chapter 11 (chained writes should not contradict prior writes without a corrective frame).
**Finding 3 — `-latest` model alias in production.** The trace's `model=claude-3-5-sonnet-latest` is an alias. The provider may have rolled the alias to a new snapshot between Step 2 and Step 3, which would explain why the decision logic flipped mid-trajectory. Even if the alias didn't roll mid-trace, the trace cannot be reproduced because there's no snapshot ID logged. Violates chapter 21 (pin snapshots, never `-latest`).
**Stretch finding — pending refund status.** The tool call returned `status: "issued"`, but the customer's bank shows the refund still in `pending`. The agent treated `issued` as terminal; in reality, the refund processor has a separate `pending → completed` step that the agent never verifies. The agent should have either polled for completion or surfaced the pending state to the customer with a 72-hour ETA. Violates chapter 11b/Phase 2 acceptance check (verify side effects, don't assume they completed).
Reflection
Write down: which of the three (or four) failures would your Phase 3 agent have prevented architecturally? Which would have shipped through your Phase 3? Add the gaps to your regression set if any are missing.
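One way to add these gaps to a regression set is to turn each finding into a mechanical check over the trace. The sketch below is illustrative only: the dictionary schema is an assumption (the trace above is plain text, not structured data), and `get_refund_status` is a hypothetical tool name standing in for whatever completion check your refund processor exposes.

```python
from typing import Any


def audit_trace(trace: dict[str, Any]) -> list[str]:
    """Return human-readable findings for one trace (hypothetical schema)."""
    findings: list[str] = []

    # Finding 3: a floating alias means the run cannot be reproduced later.
    if trace.get("model", "").endswith("-latest"):
        findings.append("model pinned by alias, not snapshot")

    calls = [c for step in trace.get("steps", []) for c in step.get("tool_calls", [])]

    # Finding 1: retrieval ran without a region filter.
    for call in calls:
        if (call["name"] == "retrieve_refund_policy"
                and call.get("region_filter_applied") == "ANY"):
            findings.append("retrieval ran with region_filter_applied=ANY")

    # Finding 2: more than one customer-facing email in a single trace.
    emails = [c for c in calls if c["name"] == "send_customer_email"]
    if len(emails) > 1:
        findings.append(f"{len(emails)} customer emails sent in one trace")

    # Stretch finding: a refund was issued but its status was never re-checked.
    names = [c["name"] for c in calls]
    if "issue_refund" in names and "get_refund_status" not in names:
        findings.append("refund issued without a completion check")

    return findings


# Trace t-991 from above, reduced to the fields the checks need.
trace_t991 = {
    "model": "claude-3-5-sonnet-latest",
    "steps": [
        {"tool_calls": [{"name": "find_customer_by_email"}]},
        {"tool_calls": [
            {"name": "list_orders"},
            {"name": "retrieve_refund_policy", "region_filter_applied": "ANY"},
        ]},
        {"tool_calls": [{"name": "send_customer_email"}]},
        {"tool_calls": [{"name": "issue_refund"}, {"name": "send_customer_email"}]},
    ],
}

for finding in audit_trace(trace_t991):
    print(finding)
```

Checks like these only catch the symptoms visible in a trace. They do not replace the architectural fixes, such as making the region filter a required argument of the retrieval tool.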
Exercise 2 — Architecture review on a proposed agent
~60 min. Exercises chapter 23 (architect's checklist) and chapter 24 (honest admission).
A colleague brings the following proposal to your design review.
The proposal (verbatim from the design doc)
AI agent for IT support ticket triage
We want to build an agent that reads incoming IT support tickets from our internal queue, classifies them, looks up the requester's account, checks recent system status, and auto-resolves common categories — password resets, VPN-config requests, license assignments, software-install approvals. Estimated time savings: 40% of IT team capacity. Initial scope: 3 ticket categories, expandable.
Tools available to the agent:
- read_ticket(ticket_id) — fetches the ticket
- lookup_user(email) — internal directory
- reset_password(user_id) — issues a reset email
- assign_license(user_id, license_type) — assigns a software license
- approve_install(ticket_id, software_name) — auto-approves a software install request
- update_ticket(ticket_id, status, note) — updates the ticket
- escalate_to_human(ticket_id, reason) — routes to the human queue
Loop: ReAct, max 5 iterations. System prompt covers the three categories and the policy.
Evals: we have 50 tickets we'll use as a gold set. Looking for 80% match against expected outcomes.
Rollout: we'll feature-flag it on, monitor, and expand the categories over time.
Kill switch: we can disable the feature flag in our config service if needed.
Your task
Run the chapter-23 twenty-item checklist against this proposal. For each item, mark green (clearly covered), yellow (mentioned but underspecified), or red (missing or wrong). Also run the chapter-23 five-question smell test.
Write a one-page review with:
- The four-column scoreboard (design / build / launch / run, with status per row).
- The hard blockers (chapter 23 says: any single one stops the launch — no kill switch, no eval gate, irreversible action without approval, no observability on writes).
- The soft gaps that can be compensated by smaller blast radius (1% canary, shadow mode, hard gateway caps).
- The chapter-24 admission section — which of the five open architectural questions does this team have no defensible answer for, and how would you ask them to compensate operationally?
Hints if you get stuck
- assign_license and approve_install are not trivially reversible — what does chapter 8 say about Class 3 vs Class 4?
- update_ticket writes to a shared ticket queue. What does chapter 10 say about parallel writes to the same target?
- "We have 50 tickets" — chapter 19 has an opinion on dataset size for the capability gate. Does 50 cases mean anything statistically?
- The proposal says "kill switch: disable the feature flag." What does chapter 20 say about kill-switch contracts? Has this team drilled the flip?
- The proposal doesn't mention multi-tenancy or per-team isolation. Is the IT queue tenant-shared? Chapter 15 has thoughts.
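To put a number on the 50-ticket hint, one option is a confidence interval around the proposal's 80% target. The sketch below uses the Wilson score interval at 95% confidence. That choice is an assumption made here for illustration, not necessarily the method chapter 19 prescribes.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 40 of 50 gold tickets matching the expected outcome is the proposal's 80% bar.
low, high = wilson_interval(40, 50)
print(f"observed 0.80, plausible true rate {low:.2f} to {high:.2f}")
```

With 40 of 50 matches the interval runs from roughly 0.67 to 0.89, so an agent that "passes" at 80% could plausibly be wrong on about one ticket in three. That is the kind of concrete statement a review can attach to a yellow or red mark on the eval gate.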
Expected outcome
You should end with at least four hard blockers and two-to-three soft gaps. The proposal as written is not shippable; it's a prototype description with launch language attached. Your review's job is to surface that without being uncharitable — every item you mark red should come with a concrete suggestion the team can act on. The chapter-23 worked example for the Jira-auto-resolve proposal is the model for tone and density.
Exercise 3 — Write a rollout plan for an existing agent
~30 min. Exercises chapter 20 (rollout and kill switch).
You have just inherited an agent that someone else built and shipped. The agent is live at 100% of traffic for one tenant; it serves roughly 5,000 turns per day on customer-support FAQs. It has no rollout staircase, no kill switch beyond "redeploy with the agent disabled," no auto-rollback triggers, and a single dashboard nobody on-call looks at.
Your task is to retroactively build the rollout discipline the agent should have had at launch.
Step 1 — The kill switch
Write the kill-switch contract for this agent. Specifically:
- Feature flag mechanism (which config service, what file, what API).
- Fallback option (chapter 20 A/B/C — pick one and justify).
- Propagation time target (sub-60-second per the chapter).
- Audit logging on flip (who flipped, when, why).
- Drill schedule (initial drill before any change, then quarterly).
Write the contract as if it were going on the team wiki.
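As a companion to the written contract, the flip-and-audit part can be sketched in a few lines. Everything here is hypothetical: `KillSwitch`, `FlipRecord`, and `handle_turn` are illustrative names, and the in-memory flag stands in for whichever config service the contract names. An in-process flag says nothing about propagation time, which still has to be measured against the real service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class FlipRecord:
    """One audit entry: who flipped the switch, to what state, when, and why."""
    actor: str
    reason: str
    enabled: bool
    at: datetime


@dataclass
class KillSwitch:
    enabled: bool = True
    audit_log: list[FlipRecord] = field(default_factory=list)

    def flip(self, *, enabled: bool, actor: str, reason: str) -> FlipRecord:
        # Refuse anonymous or unexplained flips so the audit trail stays useful.
        if not actor or not reason:
            raise ValueError("a flip needs both an actor and a reason")
        record = FlipRecord(actor, reason, enabled, datetime.now(timezone.utc))
        self.enabled = enabled
        self.audit_log.append(record)
        return record


def handle_turn(
    switch: KillSwitch,
    message: str,
    agent: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Route a turn to the agent, or to the chosen fallback when the switch is off."""
    return agent(message) if switch.enabled else fallback(message)


switch = KillSwitch()
switch.flip(enabled=False, actor="oncall@example.com", reason="quarterly drill")
print(handle_turn(switch, "Where is my order?", lambda m: "agent reply", lambda m: "static FAQ link"))
```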
Step 2 — The retrofit ramp-down-then-ramp-up
Since the agent is already at 100% with no rollout discipline, the right move is not to leave it at 100%. Propose a retrofit: ramp down to 50%, re-canary to 1% on a new version with the rollout discipline, then ramp back up using the chapter-20 staircase.
Write the ramp-down-and-re-canary plan with hold times and rollback triggers. The honest framing is that this team is paying the retrofit cost because the original launch skipped Phase 4.
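A ramp plan needs a stable way to decide which sessions fall inside a given percentage. One common approach, sketched here under assumptions, is deterministic hash bucketing: the same session always lands in the same bucket, and raising the percentage only adds sessions. The function name and the `salt` value are illustrative, not part of any particular feature-flag product.

```python
import hashlib


def in_rollout(session_id: str, percent: float, salt: str = "agent-retrofit") -> bool:
    """Return True if this session falls inside the first `percent` of traffic."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100  # percent=1 covers buckets 0..99


# The 50% and 1% figures come from the retrofit described above.
sessions = [f"s-{i}" for i in range(10_000)]
for pct in (50, 1):
    share = sum(in_rollout(s, pct) for s in sessions) / len(sessions)
    print(f"target {pct}% -> observed {share:.1%}")
```

Because the bucket depends only on the salt and the session ID, any session in the 1% canary is also inside every later, larger step. That keeps a user's experience from flapping between versions as the ramp climbs.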
Step 3 — Auto-rollback triggers
Pick three concrete triggers and their thresholds:
- One on safety (error rate, latency, eval score).
- One on quality (HITL escalation, customer-complaint signal).
- One on cost (budget overshoot).
For each, decide whether it's an auto-rollback (safety-class) or a freeze-ramp (budget-class). Chapter 20's Operational memory is the rubric.
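The trigger table can be written down as data plus a small evaluation rule, which makes the auto-rollback versus freeze-ramp decision explicit. This is a sketch: the metric names, thresholds, and the class assigned to each trigger are placeholders invented for illustration, and your own answers to this step should replace them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    AUTO_ROLLBACK = "auto_rollback"  # safety-class: revert immediately
    FREEZE_RAMP = "freeze_ramp"      # budget-class: hold the current percentage


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float
    action: Action


# Placeholder values only; pick thresholds from your own baseline data.
TRIGGERS = [
    Trigger("error_rate", 0.05, Action.AUTO_ROLLBACK),
    Trigger("hitl_escalation_rate", 0.15, Action.FREEZE_RAMP),
    Trigger("daily_cost_usd", 200.0, Action.FREEZE_RAMP),
]


def decide(metrics: dict[str, float]) -> Optional[Action]:
    """Return the strongest action demanded by the current metrics, if any."""
    # A metric missing from `metrics` is treated as 0.0 here; a production
    # version should probably alert on missing data instead of passing silently.
    fired = [t for t in TRIGGERS if metrics.get(t.metric, 0.0) > t.threshold]
    if any(t.action is Action.AUTO_ROLLBACK for t in fired):
        return Action.AUTO_ROLLBACK
    if fired:
        return Action.FREEZE_RAMP
    return None


print(decide({"error_rate": 0.02, "daily_cost_usd": 260.0}))
```

Rollback outranks freeze in `decide`, so a simultaneous safety and budget breach is handled as the safety incident it is.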
Step 4 — The 3 AM kill drill
Write the drill script — what the on-call sees, what they do, how it's timed, who reviews the drill afterward. Include a failure-recovery branch: what if the flag doesn't propagate within 60 seconds? What's the backup?
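The timed part of the drill, including the failure-recovery branch, can be sketched as a small harness. All names are hypothetical: `flip_off`, `agent_is_serving`, and `backup` are callables you would wire to your real flag API, a health probe, and the backup procedure. The clock and sleep functions are injectable so the harness can be rehearsed without waiting.

```python
import time
from typing import Callable


def run_kill_drill(
    flip_off: Callable[[], None],
    agent_is_serving: Callable[[], bool],
    backup: Callable[[], None],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Flip the switch, time propagation, and fall back if the target is missed."""
    start = clock()
    flip_off()
    while clock() - start < timeout_s:
        if not agent_is_serving():
            return {"propagated": True, "elapsed_s": clock() - start, "backup_used": False}
        sleep(poll_s)
    # Propagation missed the target: take the backup path and record that we did.
    backup()
    return {"propagated": False, "elapsed_s": clock() - start, "backup_used": True}


# Simulated rehearsal: the agent stops serving on the third poll, no real waiting.
fake_now = [0.0]
polls = iter([True, True, False])
result = run_kill_drill(
    flip_off=lambda: None,
    agent_is_serving=lambda: next(polls),
    backup=lambda: None,
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
print(result)
```

The returned dictionary is the raw material for the post-drill review: elapsed time against the 60-second target, and whether the backup path had to be used.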
Expected outcome
A one-page rollout-retrofit plan that the inheriting team could execute next week. The discipline is to retrofit what should have shipped at launch; the cost of doing it now is higher than doing it at launch would have been (because the team has to ramp down and back up rather than ramping up from scratch), but the cost of not doing it is the next 3 AM incident.
Where to go after these
The three exercises rep three different architecture-review skills: reading a trace forensically, auditing a colleague's proposal, and retrofitting discipline onto inherited code. Together they're roughly two-thirds the work of one phase of the capstone, but they exercise specific muscles the capstone doesn't drill as hard.
If you're preparing for a senior or staff-level interview, do all three. The trace-debug exercise simulates the kind of "walk me through this incident" question that comes up in operations-heavy interviews. The architecture-review exercise simulates the kind of "review this proposal" question that comes up in design-heavy interviews. The rollout-retrofit exercise simulates the kind of "you inherited X, how do you stabilise it" question that comes up at the staff level.
If you're working on a real production agent, do the exercises in the order that matches your current pressure. New agent in design → exercise 2. Live agent without rollout discipline → exercise 3. Mysterious recurring incident → exercise 1.
Return to README.md or revisit a specific phase.