
02. Agents & Tool Calling — Narrative Explainer

Companion to 03_study_material.md. The study material gives the terms. This file gives the moving picture.

Table of contents

  • ELI5 — the handyman with a toolbelt (start here)
  • Chapter 1: The first production embarrassment
  • 1.1 One good tool call is not an agent
  • 1.2 Why the stakes are higher now
  • Chapter 2: The ReAct loop
  • 2.1 Think → Act → Observe
  • 2.2 Why the loop works
  • 2.3 Loop implementation patterns
  • 2.4 ReAct vs chain-of-thought alone
  • Chapter 3: Tool design
  • 3.1 Tools are interfaces, not wishes
  • 3.2 Schema definition
  • 3.3 Descriptions guide selection
  • 3.4 Error handling and retries
  • 3.5 Idempotency
  • 3.6 Pydantic schemas
  • Chapter 4: Failure modes & guardrails
  • 4.1 Infinite loops
  • 4.2 Wrong tool selection
  • 4.3 Hallucinated arguments
  • 4.4 Stopping rules and give-up rules
  • 4.5 Human-in-the-loop gates
  • 4.6 Cost controls
  • Chapter 5: Advanced patterns
  • 5.1 Parallel tool calls
  • 5.2 Tool chaining
  • 5.3 Dynamic tool selection
  • 5.4 Memory and state across turns
  • 5.5 Retrieval prompts
  • 5.6 Honest admission
  • 5.7 Foundation-gap audit for module 10
  • 5.8 Bridge to next module
  • Chapter 6: Recap & application
  • 6.1 The failure-fix table
  • 6.2 Key points to remember
  • 6.3 Important interview questions
  • 6.4 Production experience
  • 6.5 Apply now — graded exercises
  • 6.6 Final retrieval

ELI5 — the handyman with a toolbelt

Imagine a handyman visiting your house. You say, "My sink is leaking." The handyman does not randomly swing a hammer. He first looks. He thinks. He checks which tool fits the problem. That is an agent. Keep these names in your head:

- the toolbelt = the available tools
- the think step = reasoning and planning
- the try = the tool call
- the check = reading the result and judging it
- the give-up rule = max iterations or a confidence threshold

Now picture the full loop.

problem arrives
think step
pick from the toolbelt
try
check
worked?
  ↙      ↘
yes       no
↓         ↓
finish    think again
        give-up rule?
         ↙        ↘
       yes         no
       ↓           ↓
  call specialist  try again
Suppose the sink is leaking because a nut is loose. The handyman picks a wrench. That is a good try. Suppose the leak continues. Now the handyman checks again. Maybe the washer is torn. Now plumber's tape or a replacement washer helps. Please notice the pattern. The value is not the wrench alone. The value is the loop.

A language model with one tool call is like a handyman who uses one tool once. Sometimes that is enough. Often it is not. Real problems are messy. They need a sequence. They need checking. They need correction. That is why agents matter.

If the handyman never checks, he may tighten the wrong joint. The leak may worsen. If the handyman never gives up, he may keep poking the same pipe forever. That is a bad agent loop. If the handyman has ten confusing tools, all named badly, he may pick a saw for a plumbing issue. That is bad tool design. If the handyman must touch the building's main water line, he should stop and ask the owner. That is human-in-the-loop. If the handyman keeps opening expensive instruments, he is wasting time and money. That is cost control.

So, in child words, an agent is simply this:

1. Understand the problem.
2. Pick a tool.
3. Use it.
4. Check the result.
5. Repeat carefully.
6. Stop when fixed or when the give-up rule triggers.

Very simple idea. Very difficult engineering. Keep the handyman picture with you. Every chapter later is just this picture, made precise.


Chapter 1: The first production embarrassment

1.1 One good tool call is not an agent

You give an LLM access to a calculator tool. You test it with a simple prompt. "What is 37 × 18?" The model calls the calculator correctly. You feel smart. You feel safe. You think, "Done. Tool calling works."

Then the real query arrives. "A warehouse has 17 boxes. Each box has 24 batteries. 13 batteries are defective. The rest are packed into cartons of 7. How many full cartons can we ship?" This problem has multiple steps.

1. Multiply 17 × 24
2. Subtract 13
3. Divide by 7
4. Take the floor

Now watch the embarrassing trace.

User:
A warehouse has 17 boxes. Each box has 24 batteries.
13 batteries are defective. The rest are packed into cartons of 7.
How many full cartons can we ship?
Assistant tool call:
calculator({"expression": "17 * 24"})
Tool result:
408
Assistant final answer:
After removing 13 defective batteries, 398 remain.
That makes 56 full cartons.
Two things went wrong. First, the model skipped the tool for the subtraction and got it wrong: 408 - 13 is 395, not 398. Second, the error was masked. 395 / 7 is 56 remainder 3, and 398 / 7 is 56 remainder 6, so the floor is 56 either way. The final answer happened to stay 56, and the wrong intermediate number got hidden.

This is the worst kind of failure. The final answer looks plausible. The stepwise state is broken. One correct tool call gave you false confidence. The model used the tool for step 1, then hallucinated step 2. This happens often. Why? Because tool access alone is not enough. The model needs a policy. It needs a loop. It needs a habit of checking.

A single round of tool calling says, "Here is one moment when the model can act." An agent says, "Keep acting and observing until the task is truly done." That difference is everything.

Tiny log, big lesson

Let us write the same flow properly.

User asks multi-step math question
Model thinks: I need exact arithmetic for every step
Tool call 1: 17 * 24
Observe: 408
Tool call 2: 408 - 13
Observe: 395
Tool call 3: 395 // 7
Observe: 56
Answer: 56 full cartons
Now the tool is not a garnish. It is part of the workflow.
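The corrected trace can be replayed deterministically. Here is a minimal sketch, assuming a hypothetical `calculator` tool that accepts only basic arithmetic. The `ast`-based evaluator is an illustrative choice, not something the trace above prescribes.

```python
import ast
import operator

# Hypothetical calculator tool: supports +, -, *, /, // on plain numbers only.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
}

def calculator(expression: str):
    """Evaluate a small arithmetic expression without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)

# Replay the trace: every step goes through the tool, nothing is guessed.
total = calculator("17 * 24")
good = calculator(f"{total} - 13")
cartons = calculator(f"{good} // 7")
print(total, good, cartons)  # 408 395 56
```

Each observation feeds the next call, so the wrong intermediate value (398) never has a chance to appear.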

Why this failure generalizes

Math is the easy example. The same failure shows up everywhere.

- The agent searches a knowledge base once, then invents a second fact.
- The agent reads the customer profile, then guesses the subscription tier.
- The agent checks the calendar, then hallucinates available slots.
- The agent creates a draft email, then claims it already sent it.

If the next step depends on external truth, you must usually check again. That is the deeper principle. Agents exist because the world is outside the model.

1.2 Why the stakes are higher now

For 2025-2026, agents are arguably the #1 production pattern. That sentence is not hype. It is workflow economics. Why are teams shipping agents? Because many valuable tasks are not one-shot generation tasks. They are:

- look something up
- compare two sources
- calculate something exactly
- write into a system
- verify the write succeeded
- ask a follow-up question
- escalate when confidence is low

That is a loop. Customer support is a loop. Operations is a loop. Data cleanup is a loop. Developer assistants are loops. Research assistants are loops. The model is strong at language. The tools are strong at reality. The loop is what marries them. Without the loop, you get demo reliability. With the loop, you at least have a chance at production reliability.

Still, please do not romanticize agents. An agent is not magic. It is just careful iteration around uncertainty. That is why the basics matter. If you do not understand the loop, module 10 will feel decorative. MCP, multi-agent systems, and orchestration layers all assume one thing first: that you can run one agent safely. So this module is not optional groundwork. It is the foundation.


Chapter 2: The ReAct loop

2.1 Think → Act → Observe

The standard agent loop is called ReAct. It comes from a very simple insight. Reasoning alone is not enough. Acting alone is not enough. You need both, interleaved. The canonical pattern is:

1. Think: decide what to do next
2. Act: call a tool or produce an answer
3. Observe: read the result
4. Repeat until done

In many papers, this is written as Thought → Action → Observation. In practical software, you should think of it as:

plan a tiny next move
do the move
read what actually happened
update your internal state
plan the next move
This is why ReAct feels powerful. It lets the model alternate between language and environment.

The minimal loop

MAX_STEPS = 6

def run_agent(user_query):
    # SYSTEM_PROMPT, TOOLS, call_model, and execute_tool are defined elsewhere.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
    for step in range(MAX_STEPS):
        response = call_model(messages=messages, tools=TOOLS)
        if response.final_text:
            return response.final_text
        for tool_call in response.tool_calls:
            tool_result = execute_tool(tool_call)
            # Record the call and its result so the next step can see both.
            messages.append(tool_call.as_message())
            messages.append(tool_result.as_message())
    # Step cap reached without a final answer.
    return "I could not complete this safely."
This code looks small. Its behavior is not small. The whole game sits inside execute_tool, MAX_STEPS, and what you store in messages.
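The loop above leans on helpers that are not shown. Here is a minimal runnable sketch, assuming a scripted stand-in for `call_model` that replays canned responses. `ModelResponse`, `make_scripted_model`, and the `multiply` tool are hypothetical names for illustration, not any provider's API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelResponse:
    final_text: Optional[str] = None
    tool_calls: list = field(default_factory=list)

def make_scripted_model(script):
    """Return a stand-in for call_model that replays canned responses in order."""
    responses = iter(script)
    def call_model(messages, tools):
        return next(responses)
    return call_model

def run_agent(user_query, call_model, tool_registry, max_steps=6):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        response = call_model(messages=messages, tools=list(tool_registry))
        if response.final_text:
            return response.final_text, messages
        for call in response.tool_calls:
            result = tool_registry[call["name"]](**call["arguments"])
            # Store both the action and the observation for the next step.
            messages.append({"role": "assistant", "tool_call": call})
            messages.append({"role": "tool", "name": call["name"], "content": result})
    return "I could not complete this safely.", messages

tools = {"multiply": lambda a, b: a * b}
script = [
    ModelResponse(tool_calls=[{"name": "multiply", "arguments": {"a": 17, "b": 24}}]),
    ModelResponse(final_text="17 boxes hold 408 batteries."),
]
answer, history = run_agent("How many batteries?", make_scripted_model(script), tools)
print(answer)
print(len(history))  # user message + one tool call + one observation = 3
```

Swapping the scripted model for a real client changes nothing else in the loop. That is the point of keeping the loop this plain.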

The full picture

user request
model reads request + prior observations + tools
think: what is the smallest useful next action?
act: call tool or answer directly
observe: parse tool output
state update
stop?
  ↙      ↘
yes       no
↓         ↓
answer     next ReAct step

2.2 Why the loop works

It works for four boring reasons. Boring reasons build great systems.

Reason 1: It breaks large uncertainty into small uncertainty

"Solve the whole task now" is hard. "Find the ticket status first" is easier. "Now check the billing plan" is easier. "Now draft the answer" is easier. Agents reduce cognitive load per step.

Reason 2: It grounds the model in fresh observations

The model's pretraining is old. Your database is current. Your filesystem is current. Your ticket state is current. Every observation pulls the model back toward reality.

Reason 3: It supports recovery

If a tool fails, the model can react. It can retry. It can choose another tool. It can ask a clarifying question. It can escalate. Without the loop, one failure often ends the run.

Reason 4: It creates inspectable traces

When an agent fails, you can inspect the loop. You can ask:

- Did it pick the wrong tool?
- Did the tool return a bad result?
- Did the model ignore the result?
- Did it stop too early?
- Did it continue too long?

This makes debugging possible. Not easy. But possible.

A small support example

User says, "My invoice is wrong and I cannot log in." A decent ReAct trace may look like this:

Thought: This needs account lookup and recent invoice details.
Action: lookup_account(user_id="u_145")
Observation: account is active; last login failed due to MFA reset
Thought: Need billing details too.
Action: lookup_invoice(invoice_id="inv_882")
Observation: duplicate charge detected; refund not issued yet
Thought: High-impact billing issue. Escalate and explain.
Action: escalate_to_human(category="billing", summary="duplicate charge + login issue")
Observation: escalation accepted, ticket #T-9921
Final answer: explain login fix, mention billing escalation, share ticket number
Now you see why one-shot prompting loses. The task changes shape after each observation.

2.3 Loop implementation patterns

There are three common implementation styles.

Pattern A: Manual loop

You write the while-loop yourself.

Pros:
- maximum control
- easy to log every step
- easiest way to learn

Cons:
- you must manage message state
- retries are your job
- stop logic is your job

This is the best learning path.

Pattern B: Framework-managed loop

Frameworks like LangGraph, agent SDKs, and orchestration libraries can manage the loop.

Pros:
- quicker to ship
- built-in state graphs
- built-in middleware hooks

Cons:
- easier to hide mistakes
- harder to know where the bug lives
- abstractions feel magical until they break

Use frameworks. But first understand the raw loop.

Pattern C: Hybrid loop

The framework manages messaging. You still own tool execution, state design, and guardrails. This is where most production teams land.

The state you usually need

At minimum, track these things:

from dataclasses import dataclass, field
from typing import Any
@dataclass
class AgentState:
    user_query: str
    step_count: int = 0
    tool_history: list[dict[str, Any]] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    total_cost_usd: float = 0.0
    done: bool = False
This is not glamorous. This is adult engineering. If you do not track state explicitly, you will later guess why the agent failed. Guessing is expensive.

The tool execution boundary

The model should not directly call Python functions. Your application should mediate.

def execute_tool(tool_call):
    name = tool_call["name"]
    args = tool_call["arguments"]
    validated_args = validate_arguments(name, args)
    result = TOOL_REGISTRY[name](**validated_args)
    return normalize_tool_result(name, result)
That boundary is important. It is where validation, authorization, logging, and error normalization belong.

2.4 ReAct vs chain-of-thought alone

Chain-of-thought alone means, "Think harder inside the model." ReAct means, "Think, then interact with the world, then think again." The difference is not cosmetic.

Chain-of-thought alone is good for:

  • planning
  • decomposition
  • internal reasoning
  • summarization
  • choosing between options already in context

ReAct is needed for:

  • fetching current facts
  • exact computation
  • reading databases
  • executing side effects
  • verifying side effects
  • multi-step workflows with external state

Short comparison table

| Pattern | Can use fresh external facts? | Can verify side effects? | Typical failure |
| --- | --- | --- | --- |
| Chain-of-thought only | No | No | plausible hallucination |
| Single tool call | Once | Rarely | stops too early |
| ReAct loop | Yes | Yes | loops badly if guardrails are weak |
### Example: the same question under both patterns
Question: "What is the customer's current plan, and does the refund policy allow a full refund?" Chain-of-thought alone can only reason over what is already in context. If the plan is not in context, it must guess or refuse. ReAct can:
1. read the customer record
2. read the policy document
3. combine them
4. answer with evidence
That is the practical edge.
### One more subtle point
ReAct is not always better. If the task is, "Rewrite this paragraph more clearly," then tools are unnecessary. If the task is, "Calculate payroll using latest attendance logs and tax tables," then tools are necessary. Use the cheapest pattern that achieves reliability. That sentence will save you money.
---
## Chapter 3: Tool design
### 3.1 Tools are interfaces, not wishes
Beginners design tools like prayers.

- do_everything()
- handle_user_request()
- database_tool()

These names are useless. A tool is not for your convenience alone. A tool is also for model selection. The model reads the tool name, description, and schema. Then it guesses, "Is this the right action?" So tool design is partly API design, and partly prompt design. That is the trick.
### Good tool design rules
| Rule | Why it matters |
| --- | --- |
| Narrow scope | Easier for the model to pick correctly |
| Specific verb | lookup_invoice beats billing_tool |
| Clear description | The model uses descriptions as routing hints |
| Typed schema | Prevents argument drift |
| Structured errors | Lets the model recover intelligently |
| Idempotency where possible | Duplicate calls do not create duplicate damage |
| Explicit side-effect level | Helps decide whether human approval is needed |
### Bad vs good examples
Bad:
def customer_tool(payload: dict) -> dict:
    ...
The model must now guess:
- what this tool really does
- what keys belong in payload
- whether it reads or writes
- whether it is safe to retry
Good:
def lookup_account(user_id: str) -> dict:
    ...
def search_kb(query: str, product_area: str) -> dict:
    ...
def escalate_to_human(category: str, summary: str) -> dict:
    ...
Now the intent is visible.
### 3.2 Schema definition
A tool schema tells the model what arguments exist, which are required, and what values are valid. Think of schema as lane markings on a road. Without lane markings, vehicles still move. Crashes just increase.
### JSON-style tool schema
TOOLS = [
    {
        "name": "lookup_account",
        "description": (
            "Read-only lookup for one customer's account status, plan, "
            "and recent authentication flags. Use when the user asks "
            "about account state or login issues."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string"}
            },
            "required": ["user_id"]
        }
    }
]
Two things guide selection here. First, the name is precise. Second, the description says when to use it, not only what it returns. That second part is underrated.
### Descriptions should answer four questions
1. What does this tool do?
2. When should the model use it?
3. When should the model avoid it?
4. Is it read-only or side-effectful?
### Example of a stronger description
{
    "name": "create_refund_request",
    "description": (
        "Creates a refund request for an existing paid invoice. "
        "Use only after invoice lookup confirms eligibility. "
        "Do not use for policy questions or speculative refunds. "
        "This tool has side effects and may create duplicate requests if misused."
    ),
    ...
}
See how much routing help the model gets.
### 3.3 Descriptions guide selection
Tool choice is a classification problem in disguise. The model is matching the current sub-task against tool descriptions. If descriptions are vague, routing becomes vague. If descriptions overlap heavily, routing becomes noisy. That is why five narrow tools often beat one mega-tool.
### A small comparison
#### Vague set
- customer_tool
- support_tool
- admin_tool
#### Clear set
- lookup_account
- lookup_invoice
- search_help_center
- reset_mfa
- escalate_to_human
Which set would you trust a model to pick from? The second set. Always.
### Tool overlap is dangerous
Suppose you expose both tools below.
search_docs(query: str)
search_policy_docs(query: str)
If policy docs are a subset of docs, the model must guess the boundary. You created ambiguity. Either merge them cleanly, or distinguish them sharply. For example:
search_help_center(query: str)
search_legal_policy(query: str)
Now the domains are clearer.
### 3.4 Error handling and retries
Tools fail. Networks fail. Validation fails. Databases time out. Users provide wrong IDs. If your tool throws raw stack traces, your model receives chaos. Return structured errors instead.
def lookup_invoice(invoice_id: str) -> dict:
    record = db.find(invoice_id)
    if record is None:
        return {
            "ok": False,
            "error_code": "INVOICE_NOT_FOUND",
            "message": f"No invoice exists for id={invoice_id}",
            "retryable": False,
        }
    return {
        "ok": True,
        "invoice": record,
    }
Now the model can reason.
- retryable=False means do not keep hammering.
- error_code=INVOICE_NOT_FOUND means ask user for the right ID.
That is much better than, "KeyError on line 82."
### Retry logic belongs in the application layer
The model can decide whether to retry. The application should decide how retries work. For example:
def execute_with_retry(fn, *args, **kwargs):
    for attempt in range(3):
        result = fn(*args, **kwargs)
        if result.get("ok"):
            return result
        if not result.get("retryable"):
            return result
    return {
        "ok": False,
        "error_code": "RETRY_EXHAUSTED",
        "message": "Tool failed after 3 attempts",
        "retryable": False,
    }
This keeps retry storms under control.
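Immediate retries can still hammer a struggling service. One common refinement is to wait longer between attempts. Below is a minimal sketch of exponential backoff, a technique swapped in here for illustration. `execute_with_backoff`, `flaky_lookup`, and the delay values are assumptions, and `sleep` is injectable so the example runs without actually waiting.

```python
import time

def execute_with_backoff(fn, *args, attempts=3, base_delay=0.5, sleep=time.sleep, **kwargs):
    """Retry a tool that returns the structured {"ok", "retryable"} shape."""
    for attempt in range(attempts):
        result = fn(*args, **kwargs)
        # Success and non-retryable failures both end the retry loop.
        if result.get("ok") or not result.get("retryable"):
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 0.5s, then 1.0s, ...
    return {
        "ok": False,
        "error_code": "RETRY_EXHAUSTED",
        "message": f"Tool failed after {attempts} attempts",
        "retryable": False,
    }

# Hypothetical flaky tool: two retryable timeouts, then success.
calls = {"n": 0}
def flaky_lookup(invoice_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"ok": False, "error_code": "TIMEOUT", "retryable": True}
    return {"ok": True, "invoice": {"id": invoice_id}}

delays = []  # record the waits instead of sleeping
result = execute_with_backoff(flaky_lookup, "inv_882", sleep=delays.append)
print(result["ok"], delays)  # True [0.5, 1.0]
```

The model still sees one clean result. How long to wait, and how many times, stays an application decision.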
### 3.5 Idempotency
Idempotency means, "Calling this again does not create a second disaster." Read tools are naturally idempotent.
- search_kb
- lookup_account
- get_weather
Write tools are often not idempotent.
- send_email
- charge_card
- create_ticket
- issue_refund
Agents can and do repeat calls. So write tools need protection.
### Three common idempotency patterns
#### Pattern 1: Client request ID
def create_ticket(subject: str, description: str, request_id: str) -> dict:
    ...
If the same request_id arrives twice, return the original ticket. Do not create a new one.
#### Pattern 2: Read-before-write
Before creating a new refund, check whether a refund already exists.
#### Pattern 3: Human approval on high-risk writes
If the effect is expensive, make duplicates impossible through approval gates.
### 3.6 Pydantic schemas
Pydantic is very useful here. It gives you:
- field types
- validation
- defaults
- enums
- descriptions
- clean JSON schemas
### Example: support agent tools
from typing import Literal
from pydantic import BaseModel, Field
class SearchKBArgs(BaseModel):
    query: str = Field(
        ...,
        description="Question or symptom to search in the help center."
    )
    product_area: Literal["billing", "auth", "workspace", "general"] = Field(
        ...,
        description="Which product area best matches the user's issue."
    )
class LookupAccountArgs(BaseModel):
    user_id: str = Field(
        ...,
        description="Stable user identifier like u_145."
    )
class EscalateArgs(BaseModel):
    category: Literal["billing", "outage", "security", "legal"] = Field(
        ...,
        description="Escalation queue to notify."
    )
    summary: str = Field(
        ...,
        min_length=20,
        description="Compact explanation of the issue and attempted steps."
    )
Very nice. Very readable. Very hard for the model to misunderstand.
### Tool wrapper with validation
SCHEMAS = {
    "search_kb": SearchKBArgs,
    "lookup_account": LookupAccountArgs,
    "escalate_to_human": EscalateArgs,
}
def validate_arguments(tool_name: str, raw_args: dict) -> dict:
    schema = SCHEMAS[tool_name]
    parsed = schema.model_validate(raw_args)
    return parsed.model_dump()
If validation fails, convert it into a structured tool result. Do not crash the loop blindly.
def safe_execute(tool_name: str, raw_args: dict) -> dict:
    try:
        args = validate_arguments(tool_name, raw_args)
    except Exception as exc:
        return {
            "ok": False,
            "error_code": "VALIDATION_ERROR",
            "message": str(exc),
            "retryable": False,
        }
    return TOOL_REGISTRY[tool_name](**args)
Now the agent sees the failure cleanly.
### One subtle design rule
Schema cannot fix bad ontology. If your categories are wrong, validation just makes the wrong shape cleaner. So choose arguments that reflect real decisions. Bad:
- mode
- type
- option
Better:
- invoice_id
- refund_reason
- severity
- requires_manager_approval
Name the world, not your implementation convenience.
---
## Chapter 4: Failure modes & guardrails
### 4.1 Infinite loops
The most famous agent failure is simple. It keeps going. It searches, then searches again, then slightly rephrases, then searches again, then apologizes, then searches again. This burns tokens, latency, and user trust.
### Why loops happen
- tool returns weak results
- model does not know when enough is enough
- stop condition is vague
- failure signal is ambiguous
- tool descriptions encourage overuse
### Minimum protection
Always set a max-step cap.
MAX_STEPS = 6
That cap is not optional. It is your seatbelt.
### Better protection: reason-specific stopping
Stop when any of these is true:
- answer is ready
- required tool failed permanently
- human approval is required
- budget is exhausted
- repeated observation pattern detected
### Example stop function
def should_stop(state: AgentState) -> bool:
    if state.step_count >= 6:
        return True
    if state.total_cost_usd > 0.25:
        return True
    if repeated_same_call(state.tool_history):
        return True
    return state.done
This is the give-up rule in code.
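The stop function calls `repeated_same_call` without defining it. One possible sketch, assuming each `tool_history` entry is a dict with `name` and `arguments` keys, and that three identical calls in a row count as a loop. Both are assumptions.

```python
import json

def repeated_same_call(tool_history: list, window: int = 3) -> bool:
    """True when the last `window` calls share the same tool name and arguments."""
    if len(tool_history) < window:
        return False
    recent = tool_history[-window:]
    # Serialize arguments with sorted keys so key order does not matter.
    fingerprints = {
        (call["name"], json.dumps(call["arguments"], sort_keys=True))
        for call in recent
    }
    return len(fingerprints) == 1

stuck = [{"name": "search_kb", "arguments": {"query": "refund policy"}}] * 3
varied = stuck[:2] + [{"name": "lookup_invoice", "arguments": {"invoice_id": "inv_882"}}]
print(repeated_same_call(stuck), repeated_same_call(varied))  # True False
```

This only catches exact repeats. A model that slightly rephrases the query each time needs a fuzzier check, such as comparing observations instead of arguments.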
### 4.2 Wrong tool selection
Sometimes the agent picks the wrong tool. Not because it is stupid. Because your tool menu is confusing. Common cases:
- search tool chosen instead of direct lookup
- write tool chosen before validation
- general web search chosen instead of internal policy docs
- escalation chosen too early
### How to reduce wrong selection
1. Improve tool names
2. Improve descriptions
3. Remove overlapping tools
4. Add examples in the system prompt
5. Route first, then expose only a smaller tool subset

That fifth trick is powerful. If the user asks about billing, why expose five engineering tools at all?
### 4.3 Hallucinated arguments
The model may choose the right tool, but invent the wrong arguments. Examples:
- fake invoice ID
- made-up product area
- unsupported enum value
- empty summary for escalation
This is why validation matters. Hallucinated arguments are not rare edge cases. They are default behavior under pressure.
### A safer pattern
If required arguments are missing, do not let the tool guess. Have the model ask the user. For instance:
User: Please refund my last invoice.
Agent: I can help, but I need the invoice ID or email on the account.
That is much better than, "Let me refund invoice inv_999."
### 4.4 Stopping rules and give-up rules
A strong agent knows when to stop. A mature product knows when to give up. These are related, but different.
#### Stop rule
"I have enough information. I can answer now."
#### Give-up rule
"I do not have enough confidence, or I hit a cap, or this action is too risky." You need both.
### Confidence thresholds
Confidence is slippery. Models are overconfident. So do not rely only on self-reported confidence. Combine signals instead:
- number of successful tool observations
- whether required fields are present
- whether retrieved evidence agrees
- whether tool errors remain unresolved
- whether the action is high risk
### A simple policy table
| Situation | Preferred action |
| --- | --- |
| Exact answer with verified evidence | answer directly |
| Missing required ID | ask user a clarifying question |
| Write action above risk threshold | human approval gate |
| Repeated retryable failure | apologize and escalate |
| Step cap reached | return partial progress + safe next step |
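The policy table can be written as a small function over explicit signals, so the decision does not rest on self-reported confidence. This is a sketch under stated assumptions: the `RunSignals` fields, the thresholds, and the check order are illustrative choices, not a standard.

```python
from dataclasses import dataclass

@dataclass
class RunSignals:
    has_verified_evidence: bool
    missing_required_id: bool
    is_high_risk_write: bool
    retryable_failures: int
    step_count: int

def preferred_action(s: RunSignals, max_steps: int = 6, max_retryable_failures: int = 3) -> str:
    # Order matters: hard limits and safety checks come before answering.
    if s.step_count >= max_steps:
        return "return_partial_progress"
    if s.retryable_failures >= max_retryable_failures:
        return "apologize_and_escalate"
    if s.missing_required_id:
        return "ask_clarifying_question"
    if s.is_high_risk_write:
        return "request_human_approval"
    if s.has_verified_evidence:
        return "answer_directly"
    return "continue_loop"

print(preferred_action(RunSignals(True, False, False, 0, 2)))   # answer_directly
print(preferred_action(RunSignals(False, True, False, 0, 1)))   # ask_clarifying_question
print(preferred_action(RunSignals(True, False, True, 0, 3)))    # request_human_approval
```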
### 4.5 Human-in-the-loop gates
Do not let the agent freely perform high-risk writes. Good HITL candidates:
- refunds
- account deletion
- contract changes
- legal communication
- sending emails outside the company
- production infrastructure changes
### ASCII picture
user request
agent plans action
high-risk tool?
  ↙         ↘
yes          no
↓            ↓
request      execute tool
approval         ↓
↓                observe
approved?        ↓
↙     ↘       continue loop
no      yes
↓        ↓
explain   execute tool
This pattern feels slower. It is safer. And safety is part of product quality.
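The gate in the picture fits in a few lines at the tool execution boundary. A minimal sketch, assuming a hypothetical `request_approval` callback that blocks until a reviewer answers. The tool names and the 50-unit threshold are invented for the demo.

```python
HIGH_RISK_TOOLS = {"issue_refund", "delete_account"}  # hypothetical tool names

def execute_with_gate(name, args, tool_registry, request_approval):
    """Run a tool, pausing for human approval when the tool is high risk."""
    if name in HIGH_RISK_TOOLS:
        approved = request_approval(name, args)
        if not approved:
            # Denial comes back as a structured result the model can explain.
            return {
                "ok": False,
                "error_code": "APPROVAL_DENIED",
                "message": f"A human reviewer declined {name}.",
                "retryable": False,
            }
    return tool_registry[name](**args)

# Stand-in tools and a stand-in reviewer that rejects refunds above 50.
registry = {
    "lookup_invoice": lambda invoice_id: {"ok": True, "invoice": {"id": invoice_id}},
    "issue_refund": lambda invoice_id, amount: {"ok": True, "refunded": amount},
}
def reviewer(name, args):
    return args.get("amount", 0) <= 50

read_result = execute_with_gate("lookup_invoice", {"invoice_id": "inv_882"}, registry, reviewer)
small = execute_with_gate("issue_refund", {"invoice_id": "inv_882", "amount": 20}, registry, reviewer)
large = execute_with_gate("issue_refund", {"invoice_id": "inv_882", "amount": 900}, registry, reviewer)
print(read_result["ok"], small["ok"], large["error_code"])  # True True APPROVAL_DENIED
```

Read-only tools never touch the reviewer, so the slowdown applies only where the risk is.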
### 4.6 Cost controls
Agents are expensive in two ways. First, they make many model calls. Second, they often cause long prompts to be resent.
### Where cost comes from
- repeated tool loops
- growing message history
- expensive model for trivial routing
- large retrieved context every step
- unnecessary parallel tool fan-out
### Common cost controls
1. Step caps
2. Budget caps
3. Cheaper router model
4. Prompt caching
5. Observation summarization
6. Small tool subset per route
7. Parallelize only read-only work
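Step caps and budget caps are easy to combine. The sketch below tracks spend per step and shows how a growing message history drives cost. The token prices and token counts are made-up numbers for illustration, not any provider's pricing.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_usd: float
    usd_per_1k_input: float   # hypothetical price
    usd_per_1k_output: float  # hypothetical price
    spent_usd: float = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.spent_usd += (input_tokens / 1000) * self.usd_per_1k_input
        self.spent_usd += (output_tokens / 1000) * self.usd_per_1k_output

    @property
    def exhausted(self) -> bool:
        return self.spent_usd >= self.max_usd

budget = Budget(max_usd=0.25, usd_per_1k_input=0.003, usd_per_1k_output=0.015)
steps_taken = 0
for step in range(1, 11):
    # The prompt is resent every step, so input tokens grow with history.
    budget.charge(input_tokens=4000 * step, output_tokens=500)
    steps_taken = step
    if budget.exhausted:
        break
print(steps_taken, round(budget.spent_usd, 4))  # 6 0.297
```

Output tokens are constant here. The growth comes entirely from resending history, which is why observation summarization and prompt caching are on the list.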
### Latency is part of cost
Users experience time, not token elegance. A five-step agent that answers in nine seconds may be worse than a one-shot answer in one second, if accuracy barely improves. So the right metric is not, "Did the agent feel smart?" It is, "Did the extra steps improve outcome enough to justify cost and latency?"
### Logging is the hidden guardrail
If you log these fields, you can actually debug production:
- user query
- tool calls
- raw arguments
- validation results
- tool outputs
- step count
- latency per step
- tokens per step
- total cost
- final outcome label
Without logs, you are storytelling. With logs, you are engineering.
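One simple way to capture those fields is a JSON-lines log with one record per step. A minimal sketch, assuming an in-memory stream stands in for a file or log shipper. The field names and values are illustrative.

```python
import io
import json

REQUIRED_FIELDS = (
    "run_id", "step", "tool_name", "raw_args",
    "validation_ok", "latency_ms", "tokens", "cost_usd",
)

def log_step(stream, **record):
    """Append one agent step to a JSON-lines stream."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing log fields: {missing}")
    stream.write(json.dumps(record, sort_keys=True) + "\n")

log = io.StringIO()  # stands in for a file or log shipper
log_step(log, run_id="run-1", step=1, tool_name="lookup_account",
         raw_args={"user_id": "u_145"}, validation_ok=True,
         latency_ms=120, tokens=850, cost_usd=0.004)
log_step(log, run_id="run-1", step=2, tool_name="lookup_invoice",
         raw_args={"invoice_id": "inv_882"}, validation_ok=True,
         latency_ms=310, tokens=1400, cost_usd=0.007)

# Reading the log back is what makes a failed run debuggable later.
records = [json.loads(line) for line in log.getvalue().splitlines()]
total_cost = sum(r["cost_usd"] for r in records)
print(len(records), round(total_cost, 3))  # 2 0.011
```

Rejecting incomplete records at write time is a design choice. It keeps later analysis from silently skipping steps.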
---
## Chapter 5: Advanced patterns
### 5.1 Parallel tool calls
Sometimes the next best step is not one tool. It is several independent read tools. Example:
- read account status
- read last invoice
- search status page
These can happen together, if they do not depend on each other.
### ASCII fan-out
           /→ lookup_account ──\
user issue → lookup_invoice  ───→ join observations → answer
           \→ search_status  ──/
Parallelization reduces latency. It does not reduce complexity. You still need:
- per-tool timeouts
- argument validation
- merge logic
- partial failure handling
### Example with asyncio.gather
import asyncio
async def execute_parallel(calls):
    tasks = [async_execute(call) for call in calls]
    return await asyncio.gather(*tasks, return_exceptions=True)
Use this mainly for read-only tools. Parallel writes are much riskier.
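Per-tool timeouts and partial failure handling can sit on top of `asyncio.gather`. Here is a sketch, assuming three hypothetical read-only tools: one succeeds, one raises, and one is too slow. Each outcome is normalized into the structured result shape so the merge step never sees a raw exception.

```python
import asyncio

async def call_with_timeout(name, coro_fn, timeout_s):
    """Run one tool with a timeout and normalize every outcome."""
    try:
        value = await asyncio.wait_for(coro_fn(), timeout=timeout_s)
        return {"tool": name, "ok": True, "value": value}
    except asyncio.TimeoutError:
        return {"tool": name, "ok": False, "error_code": "TIMEOUT", "retryable": True}
    except Exception as exc:
        return {"tool": name, "ok": False, "error_code": "TOOL_ERROR",
                "message": str(exc), "retryable": False}

async def fan_out(calls, timeout_s=0.05):
    tasks = [call_with_timeout(name, fn, timeout_s) for name, fn in calls.items()]
    return await asyncio.gather(*tasks)  # results keep the input order

# Hypothetical read-only tools with three different outcomes.
async def lookup_account():
    return {"status": "active"}

async def lookup_invoice():
    raise RuntimeError("billing service unavailable")

async def search_status():
    await asyncio.sleep(1.0)  # slower than the timeout
    return {"incidents": []}

results = asyncio.run(fan_out({
    "lookup_account": lookup_account,
    "lookup_invoice": lookup_invoice,
    "search_status": search_status,
}))
print([(r["tool"], r["ok"]) for r in results])
```

The agent can now answer from the one good observation and decide separately what to do about the two failures.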
### When not to parallelize
Do not parallelize when:
- step 2 depends on step 1's output
- two tools may race on the same record
- side effects must happen in order
- you cannot merge conflicting observations cleanly
Parallelism is a performance tool. Not a maturity badge.
### 5.2 Tool chaining
Tool chaining means, one tool's result feeds the next tool. Example chain:
search_help_center
read_best_article
summarize_fix_steps
create_reply_draft
Or, for an operations agent:
search_logs
identify_failing_service
read_runbook
open_incident
Chains are common. They are also where hidden assumptions live.
### The chain must remain inspectable
If you compress the whole chain into one mega-tool, you lose visibility. If you split every tiny micro-step into a tool, you lose efficiency. So choose meaningful boundaries. A good question is: "Where would I want logs if this failed?" That often tells you the right tool boundary.
### 5.3 Dynamic tool selection
Exposing all tools to all queries is rarely optimal. Dynamic tool selection means, choosing a smaller relevant tool menu first.
### Simple routing example
def select_tool_subset(user_query: str) -> list[dict]:
    if looks_like_billing(user_query):
        return BILLING_TOOLS
    if looks_like_auth(user_query):
        return AUTH_TOOLS
    return GENERAL_TOOLS
This can be rule-based. It can be classifier-based. It can be another model. The point is not sophistication. The point is reducing confusion.
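The rule-based version can be as plain as keyword matching. A sketch, assuming made-up keyword sets and tool menus. A real router would be tuned on actual traffic.

```python
import re

# Hypothetical keyword sets and tool menus, for illustration only.
BILLING_KEYWORDS = {"invoice", "refund", "charge", "charged", "billing", "payment"}
AUTH_KEYWORDS = {"login", "password", "mfa", "locked"}

BILLING_TOOL_NAMES = ["lookup_invoice", "create_refund_request", "escalate_to_human"]
AUTH_TOOL_NAMES = ["lookup_account", "reset_mfa", "escalate_to_human"]
GENERAL_TOOL_NAMES = ["search_help_center", "escalate_to_human"]

def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def looks_like_billing(user_query: str) -> bool:
    return bool(_words(user_query) & BILLING_KEYWORDS)

def looks_like_auth(user_query: str) -> bool:
    return bool(_words(user_query) & AUTH_KEYWORDS)

def select_tool_names(user_query: str) -> list:
    if looks_like_billing(user_query):
        return BILLING_TOOL_NAMES
    if looks_like_auth(user_query):
        return AUTH_TOOL_NAMES
    return GENERAL_TOOL_NAMES

print(select_tool_names("I was charged twice, please refund me"))
print(select_tool_names("I forgot my password"))
print(select_tool_names("How do I export my data?"))
```

Keyword rules are brittle ("log in" as two words slips past the auth set above), which is one reason teams move to a classifier or a small router model once traffic grows.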
### Dynamic selection also helps safety
If the user asks a documentation question, you can hide all write tools. That alone prevents many accidents.
### 5.4 Memory and state across turns
Single-turn demos are easy. Real users come back. They ask follow-ups. They change goals. State now matters. You usually need three layers.
#### Layer 1: Short-term conversation state
What happened in the current run?
- user goal
- recent tool observations
- unresolved questions
- current plan
#### Layer 2: Structured working state
This is not raw chat history. This is clean machine state.
from pydantic import BaseModel
class SupportState(BaseModel):
    user_id: str | None = None
    invoice_id: str | None = None
    account_status: str | None = None
    escalation_ticket: str | None = None
    pending_questions: list[str] = []
Structured state is much easier to reuse than prose.
#### Layer 3: Long-term memory
Only store what is worth remembering. Examples:
- stable user preferences
- prior successful resolutions
- account facts that are safe to retain
- summary of previous long session
Do not dump everything forever. That becomes noise, and sometimes risk.
### A useful rule
Messages are for conversation. State is for control. Memory is for continuity. Do not mix all three blindly.
### 5.5 Retrieval prompts
These are good prompts to keep handy. They help an agent retrieve only what matters.
#### Retrieval prompt 1 — choose the next tool
You are selecting the next tool for a support agent.
Retrieve only the minimum policy or account context needed
for deciding the next safe action.
Prefer exact policy passages over summaries.
Return no more than 5 short snippets.
Use this when the agent is uncertain which tool to call next.
#### Retrieval prompt 2 — prepare for a write action
Before any write action, retrieve the latest state of the entity,
all approval constraints, and any recent conflicting actions.
If evidence is missing, say what is missing clearly.
Use this before refunds, updates, or ticket creation.
#### Retrieval prompt 3 — compress previous work
Summarize the previous turns into structured working state.
Keep unresolved questions, verified facts, failed tool calls,
and the latest safe next step.
Remove repetition.
Use this when conversation history is growing.
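The target shape of that summary can be sketched without a model at all. This example assumes the history has already been tagged into `(kind, text)` events, which is a simplification; in practice the tagging is the part the model does. The kind labels and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompressedState:
    verified_facts: list = field(default_factory=list)
    unresolved_questions: list = field(default_factory=list)
    failed_tool_calls: list = field(default_factory=list)
    next_step: Optional[str] = None


def compress(events: list) -> CompressedState:
    """Fold tagged (kind, text) events into structured state, dropping repeats."""
    state = CompressedState()
    buckets = {
        "fact": state.verified_facts,
        "question": state.unresolved_questions,
        "tool_failure": state.failed_tool_calls,
    }
    for kind, text in events:
        if kind == "next_step":
            state.next_step = text  # the latest safe next step wins
        elif kind in buckets and text not in buckets[kind]:
            buckets[kind].append(text)
    return state


history = [
    ("fact", "invoice INV-1 is paid"),
    ("tool_failure", "get_refund_status: timeout"),
    ("fact", "invoice INV-1 is paid"),  # repetition, removed
    ("question", "was the card charged twice?"),
    ("next_step", "retry get_refund_status once"),
]
print(compress(history))
```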
#### Retrieval prompt 4 — debug a failed run
Retrieve past runs with the same error code,
the last successful tool before failure,
and the smallest policy difference between success and failure.
Use this for ops and evaluation workflows.

These prompts matter because retrieval is not just search. It is search with intent.
### 5.6 Honest admission
Agents are brittle. They are hard to debug. They are non-deterministic. They fail in ways that look intelligent. A demo can work ten times, and fail on the eleventh for a tiny wording change. That is normal. Not pleasant. But normal. Please keep three truths together.
#### Truth 1
Agents can unlock workflows that one-shot LLM calls cannot handle well.
#### Truth 2
Most agent failures are systems problems, not purely model problems. Bad schemas, bad prompts, bad retries, bad state design, and bad stop rules cause a lot of pain.
#### Truth 3
A workflow should not become agentic by default. If prompt + retrieval solves the problem, stop there. If a deterministic backend workflow solves it, stop there. Use agents when uncertainty and branching are real. This honesty is senior behavior.
### 5.7 Foundation-gap audit for module 10
Module 10 assumes you already own these basics. If any item below feels fuzzy, fix it now.
#### Audit checklist
- Single-agent loop mechanics
  - Can you explain think → act → observe without notes?
  - Can you code a basic loop manually?
- Tool schema design
  - Can you name fields, enums, and descriptions that reduce misuse?
- Error handling in loops
  - Can you describe validation errors, retryable errors, and terminal errors?
- When to stop / give up
  - Can you explain max-step caps, budget caps, and escalation thresholds?
- State across turns
  - Can you separate message history, structured state, and memory?

If these are weak, MCP and multi-agent orchestration will feel fancy but hollow.
### 5.8 Bridge to next module
The next module, 16_multi_agent_coordination, scales this pattern: multiple agents coordinating, standardized tool protocols (MCP), and orchestration patterns for complex workflows.

See the order clearly.
- First, one agent learns to think, call, observe, and stop safely. Then, multiple agents coordinate.
- First, you design one clean tool schema. Then, you think about standardized tool protocols.
- First, you manage state in one loop. Then, you manage state across many loops.

That is why module 09 comes first.
---
## Chapter 6: Recap & application
### 6.1 The failure-fix table
Visual recap. Every row is one common failure, and the corresponding fix.
| Failure | Fix |
| --- | --- |
| Model uses a tool once, then hallucinates later steps | Use a ReAct loop that checks every critical step |
| Agent keeps searching forever | Add max-step caps and repeated-call detection |
| Agent picks the wrong tool | Narrow tool scopes and improve descriptions |
| Agent invents arguments | Validate with strict schemas |
| Tool crashes leak raw exceptions | Return structured error objects |
| Duplicate side effects happen | Add idempotency keys and approval gates |
| Prompt grows too large | Summarize observations into structured state |
| Expensive model handles trivial routing | Use a smaller router or rule-based front door |
| Parallel calls create conflicts | Parallelize only independent read tools |
| Agent answers without evidence | Require observations before high-confidence answers |
| Write action is risky | Add human-in-the-loop approval |
Sketch this table from memory. That alone will reveal your weak spots.
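One row from the table, structured error objects in place of raw exceptions, can be sketched as a thin wrapper around every tool. The error categories follow the validation / retryable / terminal split used in this module. The field names and the `lookup_invoice` mock are assumptions, not a fixed contract.

```python
def safe_call(tool, **kwargs) -> dict:
    """Run a tool and always return a structured observation, never a raw traceback."""
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except (TypeError, ValueError) as exc:
        # Bad or missing arguments: the model should fix its input, not retry blindly.
        return {"ok": False, "error_type": "validation", "retryable": False, "message": str(exc)}
    except TimeoutError:
        # Transient failure: a bounded retry is reasonable.
        return {"ok": False, "error_type": "transient", "retryable": True, "message": "tool timed out"}
    except Exception as exc:
        # Anything else is terminal for this run; report only the exception type.
        return {"ok": False, "error_type": "terminal", "retryable": False, "message": type(exc).__name__}


def lookup_invoice(invoice_id: str) -> dict:
    """Hypothetical mock tool, for illustration only."""
    if not invoice_id.startswith("INV-"):
        raise ValueError("invoice_id must start with 'INV-'")
    return {"invoice_id": invoice_id, "status": "paid"}


print(safe_call(lookup_invoice, invoice_id="INV-1"))
print(safe_call(lookup_invoice, invoice_id="42"))
```

The `retryable` flag is what makes recovery possible: the loop can branch on it without parsing error text.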
### 6.2 Key points to remember
- A tool call is not an agent.
- The loop is the product.
- ReAct means think, act, observe, repeat.
- Tool names and descriptions are routing signals.
- Schemas reduce argument hallucination.
- Structured errors make recovery possible.
- Idempotency matters because agents repeat themselves.
- Stop rules and give-up rules are different.
- State should be explicit, not accidental.
- Parallelism helps latency only when dependencies allow it.
- Agents are useful, but they are not the default answer.
### 6.3 Important interview questions
Q1. What is the difference between tool calling and an agent?
Tool calling can be one step. An agent is a loop with state, observation, and stopping logic.

Q2. Why do narrow tools outperform one mega-tool?
Because tool selection is a classification problem. Narrow tools reduce ambiguity, argument confusion, and unsafe overlap.

Q3. Why is structured error handling important?
Because the model can recover only if failure shape is legible. Raw exceptions are bad observations. Typed errors are useful observations.

Q4. When should you add human approval?
When the action is high risk, expensive, irreversible, or externally visible.

Q5. How do you stop an agent safely?
Use step caps, budget caps, repeated-failure detection, and task-specific completion checks.

Q6. When is chain-of-thought alone enough?
When the task is internal reasoning over already-present context, with no need for fresh external state or side effects.

Q7. What usually breaks first in production agents?
Tool design, not frontier-model intelligence. Bad schemas, weak descriptions, and missing guardrails usually show up early.
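For Q5, it helps to have a concrete picture ready. The sketch below combines step caps, budget caps, and repeated-failure detection in one small policy object. The thresholds, the `failure_signature` string, and the class name are assumptions for illustration; task-specific completion checks are left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopPolicy:
    max_steps: int = 5
    max_cost_usd: float = 0.50
    max_repeated_failures: int = 2
    steps: int = 0
    cost_usd: float = 0.0
    _last_failure: Optional[str] = None
    _failure_streak: int = 0

    def record(self, cost_usd: float, failure_signature: Optional[str] = None) -> None:
        """Call once per loop step with its cost and, on failure, a short signature."""
        self.steps += 1
        self.cost_usd += cost_usd
        if failure_signature is None:
            self._last_failure, self._failure_streak = None, 0
        elif failure_signature == self._last_failure:
            self._failure_streak += 1
        else:
            self._last_failure, self._failure_streak = failure_signature, 1

    def should_stop(self) -> Optional[str]:
        """Return the reason to stop, or None to keep going."""
        if self.steps >= self.max_steps:
            return "max_steps"
        if self.cost_usd >= self.max_cost_usd:
            return "budget"
        if self._failure_streak >= self.max_repeated_failures:
            return "repeated_failure"
        return None


policy = StopPolicy()
policy.record(0.01, failure_signature="get_invoice:not_found")
policy.record(0.01, failure_signature="get_invoice:not_found")
print(policy.should_stop())
```

Returning a reason instead of a bare boolean lets the loop decide between a polite give-up message and an escalation.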
### 6.4 Production experience: latency, cost, and debugging
In production, you will care about three numbers fast.
1. success rate
2. latency
3. cost per task
The mistake is optimizing only one. A very cheap agent that fails is useless. A very accurate agent that takes twenty seconds is often useless. A very fast agent that issues duplicate refunds is dangerous.
### Practical lessons
#### Lesson 1: Measure per step
Do not log only total latency. Log latency per tool and per model call. That reveals whether search, retrieval, or the model is the bottleneck.
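A small sketch of per-step timing using only the standard library. The `step_log` list and the step names are hypothetical, and the `time.sleep` calls stand in for real tool and model calls.

```python
import time
from contextlib import contextmanager

step_log = []  # one record per tool call or model call


@contextmanager
def timed_step(kind: str, name: str):
    """Record wall-clock latency for one step, even if the step raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        step_log.append({"kind": kind, "name": name, "ms": elapsed_ms})


with timed_step("tool", "search_docs"):
    time.sleep(0.01)  # stand-in for a real tool call

with timed_step("model", "plan_next_action"):
    time.sleep(0.02)  # stand-in for a real model call

for record in step_log:
    print(record["kind"], record["name"], round(record["ms"], 1))
```

Because the record is written in a `finally` block, failed steps are timed too. Slow failures are often the bottleneck worth finding.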
#### Lesson 2: Cap the blast radius
Early versions should have:
- read-heavy toolsets
- low step caps
- narrow user cohorts
- human review on writes
Expand after evidence, not before.
#### Lesson 3: Evaluate the trace, not only the final answer
Sometimes the final answer looks correct, but the path was unsafe. That path will later fail on a harder case. So evaluate:
- correct tool chosen?
- correct order?
- correct arguments?
- correct stop behavior?
- correct escalation?
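Some of these checks can be automated against a recorded trace. A minimal sketch, assuming each trace step is a dict with `tool` and `args` keys; that trace format and the tool names are hypothetical.

```python
def evaluate_trace(trace: list, expected: list, max_steps: int) -> dict:
    """Compare a recorded trace against the expected tool sequence."""
    actual_tools = [step["tool"] for step in trace]
    expected_tools = [step["tool"] for step in expected]
    return {
        # Same tools were used, ignoring order.
        "correct_tools": sorted(actual_tools) == sorted(expected_tools),
        # Same tools in the same order.
        "correct_order": actual_tools == expected_tools,
        # Arguments are compared position by position, so this also fails on reordering.
        "correct_arguments": [s["args"] for s in trace] == [s["args"] for s in expected],
        # Stop behavior: the run stayed within its step cap.
        "within_step_cap": len(trace) <= max_steps,
    }


expected = [
    {"tool": "get_invoice", "args": {"invoice_id": "INV-1"}},
    {"tool": "escalate_to_human", "args": {"reason": "duplicate_charge"}},
]
# A run that escalated before looking anything up.
swapped = list(reversed(expected))
print(evaluate_trace(swapped, expected, max_steps=5))
```

The swapped run uses the right tools with the right arguments and could still produce an acceptable final answer. Only the order check catches that it escalated before it had evidence.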
#### Lesson 4: Budget for retries consciously
Retries are not free. A retry policy can quietly double cost. If retries exist, label them. Track them. Audit them.
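A sketch of a retry wrapper that labels and counts every retry, so the extra calls show up in your metrics. The `retry_counts` counter and the `flaky` tool are hypothetical, and only `TimeoutError` is treated as retryable here.

```python
from collections import Counter

retry_counts = Counter()  # retries per tool name, for later auditing


def call_with_retries(name: str, fn, max_retries: int = 2):
    """Call fn, retrying on TimeoutError up to max_retries times."""
    attempt = 0
    while True:
        try:
            return fn()
        except TimeoutError:
            if attempt >= max_retries:
                raise  # budget exhausted: surface the failure
            attempt += 1
            retry_counts[name] += 1  # every retry is an extra billed call


calls = {"n": 0}


def flaky():
    """Mock tool that times out twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retries("search_docs", flaky)
print(result, calls["n"], retry_counts["search_docs"])
```

In this run one logical call became three real calls. That multiplier is the number to watch.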
#### Lesson 5: Your best debugging artifact is a library of bad traces
Keep examples of ugly failures.
- loop storms
- wrong-tool picks
- good final answer via bad intermediate state
- duplicate writes
- fake IDs
These traces become gold for prompts, regression tests, and interviews.
### 6.5 Apply now — graded exercises
#### Easy (5 minutes)
For each user request below, write the next best action. Do not solve the whole task. Just write the next action.
1. "I think I was charged twice."
2. "How do I reset MFA?"
3. "Please delete my company workspace right now."
Check yourself. Did you choose a tool, a clarifying question, or a human gate correctly?
#### Medium (15 minutes)
Design three tool schemas for a support agent. Requirements:
- one read-only lookup tool
- one knowledge-base search tool
- one escalation tool
For each, write:
- tool name
- description
- arguments
- one structured error example
- whether it is idempotent
If your tool names still sound vague, redo them.
#### Hard (30 minutes)
Implement a tiny manual ReAct loop. You may mock the tools.
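# Mock tools for this exercise.
# eval() is tolerable only in a throwaway mock; never run it on model-generated input.
TOOLS = {
    "calculator": lambda expression: {"ok": True, "result": eval(expression)},
    "lookup_price": lambda sku: {"ok": True, "price": 19.99},
}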
Now support this flow:
- ask user for a multi-step shopping calculation
- force tool use for each exact arithmetic step
- stop after MAX_STEPS = 5
- log every observation
Then intentionally break one tool. Confirm the agent exits safely.
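The calculator mock in this exercise uses `eval`, which is fine for a throwaway run but unsafe once a model writes the expression. If you want a safer drop-in, here is one hedged option: walk the parsed expression with the standard `ast` module and allow only numbers and basic arithmetic. The function names are hypothetical, and the supported operator set is deliberately tiny.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_arithmetic(expression: str):
    """Evaluate + - * / on numeric literals only; reject everything else."""

    def walk(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


def calculator(expression: str) -> dict:
    """Tool wrapper that returns a structured result instead of raising."""
    try:
        return {"ok": True, "result": safe_arithmetic(expression)}
    except (ValueError, SyntaxError, ZeroDivisionError) as exc:
        return {"ok": False, "error": type(exc).__name__}


print(calculator("2 * 3 + 1"))
print(calculator("__import__('os').getcwd()"))
```

This also gives you a ready-made broken case for the last step of the exercise: feed it a bad expression and confirm the loop handles `{"ok": False, ...}` without crashing.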
#### Very hard (45 minutes)
Take your Week 9 hands_on_lab plan. Design the evaluation sheet before implementation. Include columns for:
- query type
- expected tool sequence
- expected escalation or refusal
- correctness of final answer
- latency bucket
- notes on failure mode
This is how mature agent work begins.
### 6.6 Final retrieval
Without looking, answer these from memory.
1. What are the five named placeholders from the handyman analogy?
2. Why is one good tool call not enough?
3. What are the three ReAct steps?
4. Give one good tool description, and one bad one.
5. Name three guardrails.
6. When should an agent give up?
7. What five basics does module 10 assume from module 09?
If you can answer these cleanly, you are ready. Then open 16_multi_agent_coordination. You will see the scaling story much more clearly.