
Phase 1 — Build the loop

Covers chapters 01–07. By the end, you have a ReAct refund agent talking to one tenant through an MCP server with typed tool schemas and discriminating descriptions. The agent works on the happy path and exits cleanly on the stopping rule. Phase 2 will harden everything; Phase 1 is about getting the bones right.


What you will add this phase

The agent loop and the tool surface. Specifically:

  • A think-act-observe loop with iteration cap and clean exit conditions.
  • A typed tool schema for each of five refund-domain tools.
  • Tool descriptions sharp enough to route the right tool on ambiguous prompts.
  • An MCP server exposing those tools, plus one resource and one prompt primitive.
  • An MCP client wrapping the model and the loop.

Nothing about blast radius, approval gates, multi-tenancy, observability, or eval gates appears in this phase. Those layers come later. Phase 1 is the smallest agent that genuinely deserves the name — and it is already enough work to expose how many decisions sit beneath a "simple" loop.


Chapters to read first

If you have read the module straight through, skim. If you have come back to the capstone after time away, re-read the Operational memory sections of chapters 01–07, which this phase covers.

The acceptance check at the end leans on these. If you can defend the three-question audit from chapter 01 and the categorisation rule from chapter 06, you are ready.


The build

Step 1 — Define the domain

Write down, in plain prose, the agent's job in three sentences. Name NimbusPay as the operator and state the constraints: customer messages are in English with the occasional Hindi or Tamil greeting, refunds are in INR, and orders are identified by a six-digit numeric ID.

The acceptance criterion for this step is that a stranger reading your three sentences could write the toolbelt without further input from you. If the toolbelt is ambiguous from your description, sharpen the description.

Step 2 — Pick the leash

Apply the chapter-01 three-question audit to your prospective agent:

  1. What happens after the first tool result?
  2. What happens if a tool fails or returns partial data?
  3. What stops repeated bad attempts?

Write the answers as one paragraph. The leash you land on should be a ReAct loop — single-call cannot answer a refund flow that requires an order lookup then a policy check then a write — but if you have a justification for picking something else (a fixed pipeline, multi-agent), write that down too. The discipline is to pick deliberately, not to default.

Commit your three-sentence domain and your leash justification to a design-notes.md you will keep updating across all four phases. The design notes are part of what you defend in the final review.

Step 3 — Specify the tools

Five tools, narrowly scoped:

  • find_customer_by_email(email: str) -> {customer_id, tier, region}
  • list_orders(customer_id: str, since_days: int = 90) -> [{order_id, placed_at, amount_inr, status, days_since_placed}]
  • get_refund_policy(region: str, tier: str) -> {window_days, requires_approval_above_inr, escalation_team}
  • issue_refund(order_id: str, amount_inr: int, reason: str, idempotency_key: str) -> {refund_id, status}
  • send_customer_email(customer_id: str, body: str, idempotency_key: str) -> {status}

For each tool, write a JSON schema following chapter 03's eight-point template. Hit every point: name, description, properties with types, required, enums where relevant (reason is ["delay", "duplicate", "courtesy"]), bounds (amount_inr as an integer number of rupees, 1 to 20_000_000), nesting where the input is structured, and additionalProperties: false.

The enum on reason matters: as a free-text string, reason would let the model produce "late_delivery" when your policy enum accepts only "delay". Force the categorisation at schema time and the model has nowhere to drift.
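As a rough illustration of the points above, here is one way the issue_refund schema could look as a Python dict, with a small hand-rolled checker. The "input_schema" wrapper key, the six-digit pattern on order_id, the minLength on idempotency_key, and the helper check_issue_refund_args are assumptions made for this sketch, not part of chapter 03's template. amount_inr is treated as whole rupees, matching the fixtures. The inputs here are flat, so the nesting point does not apply to this tool.

```python
import re

# Hypothetical schema for issue_refund. The inner keys are standard JSON Schema;
# the "input_schema" wrapper is an assumption about your model API.
ISSUE_REFUND_SCHEMA = {
    "name": "issue_refund",
    "description": "Issue a refund against one order. (Full text comes from Step 4.)",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "pattern": "^[0-9]{6}$"},
            "amount_inr": {"type": "integer", "minimum": 1, "maximum": 20_000_000},
            "reason": {"type": "string", "enum": ["delay", "duplicate", "courtesy"]},
            "idempotency_key": {"type": "string", "minLength": 1},
        },
        "required": ["order_id", "amount_inr", "reason", "idempotency_key"],
        "additionalProperties": False,
    },
}


def check_issue_refund_args(args: dict) -> list[str]:
    """Return a list of violations; an empty list means the arguments pass.

    Hand-rolled for illustration only; a real project would more likely
    use a JSON Schema validator library.
    """
    schema = ISSUE_REFUND_SCHEMA["input_schema"]
    props = schema["properties"]
    errors = [f"missing required field: {k}" for k in schema["required"] if k not in args]
    errors += [f"unexpected field: {k}" for k in args if k not in props]

    order_id = args.get("order_id")
    if "order_id" in args and not (
        isinstance(order_id, str) and re.fullmatch(props["order_id"]["pattern"], order_id)
    ):
        errors.append("order_id must be a six-digit string")

    if "amount_inr" in args:
        amount = args["amount_inr"]
        lo, hi = props["amount_inr"]["minimum"], props["amount_inr"]["maximum"]
        if type(amount) is not int or not lo <= amount <= hi:
            errors.append(f"amount_inr must be an integer in [{lo}, {hi}]")

    if "reason" in args and args["reason"] not in props["reason"]["enum"]:
        errors.append(f"reason must be one of {props['reason']['enum']}")
    return errors
```

With this in place, a call carrying reason="late_delivery" is rejected before it reaches the backend, which is the drift the enum is meant to prevent.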

Step 4 — Write the descriptions

For each tool, write a description following chapter 04's four-part template:

[verb] [object] [qualifier].
Use when: <triggers>.
Do not use when: <neighbour-defer>.
Returns: <fields>.

Cap at 120 tokens per description. Worked example for list_orders:

List orders placed by a known customer over a bounded time window.
Use when: customer_id is already known from a prior lookup AND you need the order history.
Do not use when: you do not yet have customer_id — call find_customer_by_email first;
or when you have an exact order_id — call list_orders only for history, not for one record.
Returns: array of {order_id, placed_at, amount_inr, status, days_since_placed} for orders in the window.

The "do not use when" line is the high-leverage part. Without it, the model will sometimes call list_orders when it has only an email, or it will keep calling it after it already has a single order's details. Write the boundary explicitly.
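A description lint can catch a missing boundary before the model does. The sketch below is one possible check, not chapter 04's audit: lint_description is a hypothetical helper, the list_orders text is abridged from the worked example, and the word limit is only a crude stand-in for the 120-token cap (roughly 0.75 words per token is a common rule of thumb for English, and real tokenizers differ).

```python
import re

TOOL_NAMES = [
    "find_customer_by_email", "list_orders", "get_refund_policy",
    "issue_refund", "send_customer_email",
]

# Abridged from the worked example above.
LIST_ORDERS_DESCRIPTION = """\
List orders placed by a known customer over a bounded time window.
Use when: customer_id is already known from a prior lookup AND you need the order history.
Do not use when: you do not yet have customer_id; call find_customer_by_email first.
Returns: array of {order_id, placed_at, amount_inr, status, days_since_placed} for orders in the window."""


def lint_description(tool_name: str, text: str, max_words: int = 90) -> list[str]:
    """Return a list of problems found in a four-part tool description."""
    problems = []
    # Each labelled part must start its own line.
    for prefix in ("Use when:", "Do not use when:", "Returns:"):
        if not re.search(rf"^\s*{re.escape(prefix)}", text, re.M):
            problems.append(f"missing line starting with {prefix!r}")

    # The boundary text runs from "Do not use when:" up to the "Returns:" line.
    boundary = re.search(
        r"^\s*Do not use when:(.*?)(?=^\s*Returns:|\Z)", text, re.M | re.S
    )
    neighbours = [name for name in TOOL_NAMES if name != tool_name]
    if boundary and not any(name in boundary.group(1) for name in neighbours):
        problems.append("'Do not use when' names no neighbour tool")

    # Word count is a rough proxy for the token cap, not a real tokenizer.
    if len(text.split()) > max_words:
        problems.append(f"longer than {max_words} words")
    return problems
```

Running the lint over all five descriptions as part of your test suite keeps the boundary lines from quietly disappearing during later edits.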

Step 5 — Mock the tool backends

The agent should not need real APIs to develop against. Build a mock backend for each tool — a Python module that returns shaped data for a small set of seeded customers and orders. Three test fixtures cover the cases you will hit through every phase:

  • Priya, region IN, tier standard, order 448100 for ₹6,400 placed 14 days ago, status delivered.
  • Suresh, region IN, tier standard, order 882741 for ₹4,27,000 placed 9 days ago, status delivered, plus a dispute history flag for a double-charge on 12-Apr.
  • Karthik, region IN, tier standard, order 339182 for ₹6,400 placed 25 days ago, status delivered.

The Indian refund policy for standard-tier digital orders: 21-day window for refunds; refunds above ₹50,000 require approval; below that, auto-process. The mock get_refund_policy("IN", "standard") returns {window_days: 21, requires_approval_above_inr: 50_000, escalation_team: "finance_ops"}.

These three customers correspond to the running examples used in chapters 11b, 13, and 22. They will reappear in later phases.
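A mock backend along these lines could be as small as the sketch below. The email addresses, the customer_id values, and the {"error": ...} return shape for misses are assumptions for illustration; the order IDs, amounts, ages, and policy values come from the fixtures above. Suresh's dispute-history flag is left out because none of the Step 3 return shapes has a field for it.

```python
from datetime import date, timedelta

# Seeded fixtures. Emails and customer_id values are hypothetical.
CUSTOMERS = {
    "priya@example.com": {"customer_id": "cust_priya", "tier": "standard", "region": "IN"},
    "suresh@example.com": {"customer_id": "cust_suresh", "tier": "standard", "region": "IN"},
    "karthik@example.com": {"customer_id": "cust_karthik", "tier": "standard", "region": "IN"},
}

ORDERS = {
    "cust_priya": [{"order_id": "448100", "amount_inr": 6_400,
                    "status": "delivered", "days_since_placed": 14}],
    "cust_suresh": [{"order_id": "882741", "amount_inr": 427_000,
                     "status": "delivered", "days_since_placed": 9}],
    "cust_karthik": [{"order_id": "339182", "amount_inr": 6_400,
                      "status": "delivered", "days_since_placed": 25}],
}

POLICIES = {
    ("IN", "standard"): {"window_days": 21, "requires_approval_above_inr": 50_000,
                         "escalation_team": "finance_ops"},
}


def find_customer_by_email(email: str) -> dict:
    customer = CUSTOMERS.get(email.strip().lower())
    # Misses are returned as data so the agent can observe them (assumed shape).
    return dict(customer) if customer else {"error": "customer_not_found"}


def list_orders(customer_id: str, since_days: int = 90) -> list[dict]:
    today = date.today()
    orders = []
    for order in ORDERS.get(customer_id, []):
        if order["days_since_placed"] <= since_days:
            # placed_at is derived so the fixture ages stay fixed on any run date.
            placed_at = today - timedelta(days=order["days_since_placed"])
            orders.append({**order, "placed_at": placed_at.isoformat()})
    return orders


def get_refund_policy(region: str, tier: str) -> dict:
    policy = POLICIES.get((region, tier))
    return dict(policy) if policy else {"error": "policy_not_found"}
```

Keeping days_since_placed fixed and deriving placed_at from today's date means the three fixtures behave the same whenever the tests run.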

Step 6 — Build the ReAct loop

Implement the chapter-02 three-beat loop in Python. Skeleton:

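import json

def run_agent(user_message: str, customer_email: str, max_iterations: int = 6) -> str:
    # model, TOOLS, SYSTEM_PROMPT, and run_tool are defined elsewhere in your project.
    # Include the email so the model can start with find_customer_by_email.
    messages = [{"role": "user",
                 "content": f"Customer email: {customer_email}\n\n{user_message}"}]
    for iteration in range(max_iterations):
        response = model.respond(messages, tools=TOOLS, system=SYSTEM_PROMPT)
        if response.stop_reason == "end_turn":
            return response.text
        if response.stop_reason != "tool_use":
            # Any other stop reason: exit rather than resend identical messages.
            return f"stopped: unexpected stop_reason {response.stop_reason!r}"
        tool_block = response.tool_call
        tool_result = run_tool(tool_block.name, tool_block.input)
        # Feed the assistant turn and the tool result back for the next iteration.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": tool_block.id,
             "content": json.dumps(tool_result, default=str)}
        ]})
    return "stopped: max iterations reached"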

That is the smallest honest loop. It has an iteration cap, an explicit stop on end_turn, and a tool-result feedback path. It has no give-up message yet, no cost tracking, no observability — those come next.
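The skeleton calls run_tool without defining it. One possible shape is sketched below, assuming a plain name-to-function registry and assuming that failures are returned to the model as data rather than raised. The register decorator, the error dict shape, and the stand-in get_refund_policy body are all hypothetical.

```python
from typing import Any, Callable

TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}


def register(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Add a function to the registry under its own name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn


@register
def get_refund_policy(region: str, tier: str) -> dict:
    # Stand-in body so the sketch runs on its own; use your mock backend here.
    return {"window_days": 21, "requires_approval_above_inr": 50_000,
            "escalation_team": "finance_ops"}


def run_tool(name: str, tool_input: dict) -> Any:
    """Dispatch one tool call; turn failures into observations, not crashes."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return {"error": "unknown_tool", "tool": name}
    try:
        return fn(**tool_input)
    except TypeError as exc:
        # Usually missing or unexpected arguments; a TypeError raised
        # inside the tool body lands here too.
        return {"error": "bad_arguments", "detail": str(exc)}
    except Exception as exc:
        return {"error": "tool_failed", "detail": str(exc)}
```

Returning errors as observations gives the loop a concrete answer to the second audit question from Step 2: the model sees the failure on the next iteration and can decide what to do, while the iteration cap bounds how many times it can retry.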

Step 7 — Write the system prompt

The system prompt is the operational spec for the loop. Keep it short — under 800 tokens. State the agent's identity, the goal, the available tools, the refund-policy summary, the format of the final reply to the customer, and the explicit stopping rule the model should respect when it believes the task is done (emit a final assistant message; do not call any more tools).

Avoid stuffing the citation policy or other static context into the prompt. That belongs in an MCP resource (next step).

Step 8 — Expose the tools through an MCP server

Move the tool implementations behind a FastMCP server named nimbuspay-refund-tools. Each tool becomes an @mcp.tool() decorated function. Add one resource — policy://nimbuspay/refund-standard-IN returning the policy text. Add one prompt — compose_customer_reply(refund_id, amount_inr, reason) returning a reusable reply template.

The agent then connects to the server as a client, lists capabilities, and dispatches calls. Three concrete benefits the hands_on_lab will exercise: the same server can be consumed by a second client in Phase 2 (multi-tenancy); the capability list becomes the inspectable permission surface in Phase 3 (observability); and Phase 4's kill switch can drop the entire server out of the toolbelt without redeploy.

Use stdio as the transport for now. Phase 2 will keep it; Phase 4's rollout exercise will discuss switching to HTTP for the remote case.

Step 9 — Run the three test fixtures

Expected behaviour for each fixture:

  • Priya: the agent should fetch the order, check the policy (14 < 21 days), issue a refund of ₹6,400, and email confirmation.
  • Karthik: the agent should fetch the order, check the policy (25 > 21 days), and either escalate or return a courteous denial with a citation. There is no auto-refund.
  • Suresh: the agent should hit the policy's approval threshold (₹4,27,000 > ₹50,000) and not fire the refund tool. In Phase 1 it should stop with "this needs approval, escalating to finance_ops." Phase 2 will wire the actual approval gate.

Save the three traces in runs/phase-1/{priya,karthik,suresh}.json with the full message history. You will compare against these in later phases.
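The expected outcomes for the three fixtures reduce to two comparisons, which can be written down as a reference check to compare the agent's behaviour against. This is a sketch for your test harness, not logic the agent itself runs. The outcome labels, the choice to test the window before the threshold, and treating day 21 as still inside the window are assumptions; save_trace is a hypothetical helper that writes to the path given above.

```python
import json
from pathlib import Path

POLICY = {"window_days": 21, "requires_approval_above_inr": 50_000,
          "escalation_team": "finance_ops"}


def expected_outcome(days_since_placed: int, amount_inr: int, policy: dict) -> str:
    """Reference outcome for one order under one policy."""
    # Assumption: an order placed exactly window_days ago is still eligible.
    if days_since_placed > policy["window_days"]:
        return "deny_or_escalate"
    if amount_inr > policy["requires_approval_above_inr"]:
        return "needs_approval"
    return "auto_refund"


def save_trace(name: str, messages: list, root: str = "runs/phase-1") -> Path:
    """Write the full message history for one fixture as JSON."""
    path = Path(root) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps the rupee sign and any Hindi or Tamil text readable.
    path.write_text(json.dumps(messages, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

For the fixtures, expected_outcome(14, 6_400, POLICY) gives "auto_refund" for Priya, expected_outcome(25, 6_400, POLICY) gives "deny_or_escalate" for Karthik, and expected_outcome(9, 427_000, POLICY) gives "needs_approval" for Suresh.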


Worked example

Here is what one iteration looks like in the trace JSON.

{
  "iteration": 2,
  "think": "Need refund policy for IN region, standard tier, to compare against days_since_placed.",
  "act": {
    "tool": "get_refund_policy",
    "input": {"region": "IN", "tier": "standard"}
  },
  "observe": {
    "window_days": 21,
    "requires_approval_above_inr": 50000,
    "escalation_team": "finance_ops"
  },
  "check": "Priya's order is 14 days old, within the 21-day window, and ₹6,400 is below the ₹50k threshold. Proceed to issue refund."
}

Notice the four fields — think, act, observe, check. The observe field carries the payload, not just a success flag. The check field carries the decision, not just a "got it." This is the level of structure the trace file should encode for every iteration; it is what makes the trace inspectable in Phase 3.
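One way to keep every iteration in that shape is to build trace entries through a small record type instead of ad hoc dicts. The TraceStep class below is a hypothetical sketch. Its validation rules (act must name a tool and carry its input, check must not be empty) are assumptions about what "inspectable" should mean here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class TraceStep:
    iteration: int
    think: str
    act: dict          # {"tool": ..., "input": {...}}
    observe: Any       # the full tool payload, not just a success flag
    check: str         # the decision taken after observing

    def __post_init__(self) -> None:
        if "tool" not in self.act or "input" not in self.act:
            raise ValueError("act must contain 'tool' and 'input'")
        if not self.check.strip():
            raise ValueError("check must state the decision, not be empty")

    def to_json(self) -> str:
        # ensure_ascii=False keeps the rupee sign readable in the trace file.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


step = TraceStep(
    iteration=2,
    think="Need refund policy for IN region, standard tier.",
    act={"tool": "get_refund_policy", "input": {"region": "IN", "tier": "standard"}},
    observe={"window_days": 21, "requires_approval_above_inr": 50000,
             "escalation_team": "finance_ops"},
    check="14 days is within the 21-day window and ₹6,400 is below the threshold. Proceed.",
)
```

Calling step.to_json() yields an object with the same keys, in the same order, as the trace entry shown above.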


Acceptance check

Before moving to Phase 2, you should be able to answer each of the following plainly, without re-reading the chapter prose:

  1. Why is the agent a ReAct loop rather than a fixed pipeline? Your answer should reference the dependency between order-lookup and refund-issue — the second step's argument is the first step's output, and the agent must observe the result before committing to the second step.
  2. Show me the schema for issue_refund. You should be able to recite the eight points and explain why each one matters: amount_inr rather than a bare amount (the unit is in the name), reason as an enum, idempotency_key as required (even though we won't use it until Phase 2; reserving it now avoids a breaking schema change later).
  3. What's the "do not use when" line on list_orders? If it isn't on the page, the description is incomplete.
  4. Where does the citation policy live — in the system prompt, in a tool's return value, or in an MCP resource? The answer is resource, and you should be able to defend that choice using chapter 06's categorisation rule.
  5. What happens if the model calls issue_refund with amount_inr=4_27_000 on Suresh's case in this phase? The answer is "the tool fires" — there's no gate yet. That is the failure Phase 2 will fix. Make sure you have noticed it.

If any of these answers feels uncertain, the layer isn't really built. Re-open the matching chapter's Operational memory section before moving on.


Common stumbles

Stumble 1 — the model picks the wrong tool because two descriptions overlap. Symptom: the agent calls list_orders when it should have called find_customer_by_email (or vice versa). Diagnosis: your descriptions don't include explicit boundaries against neighbour tools. Fix: rewrite each tool's "do not use when" line to name the neighbour by tool name (Do not use when: ... — call find_customer_by_email first).

Stumble 2: strings where enums would be safer. Symptom: reason arrives as "late_delivery" or "refund_for_delay" instead of one of the three accepted enum values. Fix: revisit chapter 03; a closed set of values stays closed only if the schema declares the enum. The model invents wording inside any field typed as "string".

Stumble 3 — system prompt bloat. Symptom: by the time you've added "here are the tools, here is the policy, here is the format, here is the refusal rule, here are some examples," the system prompt is 2,500 tokens. Fix: cut everything that should be a resource (policy), keep only operational spec. Resources live in MCP; the system prompt should be the loop's contract, not the agent's knowledge base.

Stumble 4 — no iteration cap. Symptom: a buggy tool causes the agent to loop indefinitely on a development machine, and you discover the cost only when the API bill arrives. Fix: enforce the cap in your Python for loop, not in prose to the model. Numbers in prompts are suggestions; numbers in code are contracts.


Reflection prompts

Answer these in design-notes.md before moving on. They are not graded; they are the questions Phase 2 will pick up from.

  • Why did you pick a 6-iteration cap (or whichever number you chose)? What evidence do you have that 6 is enough? What's the cost in tokens if every iteration runs?
  • Walk through what would happen if list_orders returned an empty list for a customer with valid orders. Does the agent recover, or does it confidently tell the customer they have no refundable orders?
  • Your MCP server exposes five tools, one resource, one prompt. If a second agent (the Acme support agent) wanted to consume the same server, what would have to change? Hold this thought; Phase 2 makes you actually do it.
  • Take any one description you wrote in Step 4. Imagine the description was lost and only the schema remained. Would the model route correctly? This is the chapter-04 audit applied to your own work.
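For the token-cost question in the first prompt, a back-of-envelope estimate is enough to start with. The sketch below assumes every iteration resends a fixed overhead (system prompt plus tool definitions) together with the whole message history, and that each iteration appends a roughly constant number of tokens. The function name and all numbers in the usage note are illustrative assumptions, not measurements.

```python
def worst_case_input_tokens(iterations: int, fixed_overhead: int,
                            first_message: int, growth_per_step: int) -> int:
    """Total input tokens if every iteration up to the cap runs.

    fixed_overhead:  system prompt plus tool definitions, resent on each call
    first_message:   tokens in the customer's opening message
    growth_per_step: tokens appended per iteration (tool call plus tool result)
    """
    total = 0
    history = first_message
    for _ in range(iterations):
        total += fixed_overhead + history  # one model call sees everything so far
        history += growth_per_step         # the next call sees one more exchange
    return total
```

As an illustration, taking the 800-token system prompt cap plus five descriptions at the 120-token cap gives a fixed overhead of 1,400 (ignoring schema tokens), and guessing a 100-token first message and 300 tokens of growth per step, worst_case_input_tokens(6, 1400, 100, 300) returns 13500. The history term grows with each iteration, so raising the cap costs more than proportionally.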

Continue to phase-2-bound-the-blast.md.