# 02. Agents & Tool Calling: Narrative Explainer
Companion to 03_study_material.md. The study material gives the terms. This file gives the moving picture.
## Table of contents
- ELI5: the handyman with a toolbelt (start here)
- Chapter 1: The first production embarrassment
  - 1.1 One good tool call is not an agent
  - 1.2 Why the stakes are higher now
- Chapter 2: The ReAct loop
  - 2.1 Think → Act → Observe
  - 2.2 Why the loop works
  - 2.3 Loop implementation patterns
  - 2.4 ReAct vs chain-of-thought alone
- Chapter 3: Tool design
  - 3.1 Tools are interfaces, not wishes
  - 3.2 Schema definition
  - 3.3 Descriptions guide selection
  - 3.4 Error handling and retries
  - 3.5 Idempotency
  - 3.6 Pydantic schemas
- Chapter 4: Failure modes & guardrails
  - 4.1 Infinite loops
  - 4.2 Wrong tool selection
  - 4.3 Hallucinated arguments
  - 4.4 Stopping rules and give-up rules
  - 4.5 Human-in-the-loop gates
  - 4.6 Cost controls
- Chapter 5: Advanced patterns
  - 5.1 Parallel tool calls
  - 5.2 Tool chaining
  - 5.3 Dynamic tool selection
  - 5.4 Memory and state across turns
  - 5.5 Retrieval prompts
  - 5.6 Honest admission
  - 5.7 Foundation-gap audit for module 10
  - 5.8 Bridge to next module
- Chapter 6: Recap & application
  - 6.1 The failure-fix table
  - 6.2 Key points to remember
  - 6.3 Important interview questions
  - 6.4 Production experience
  - 6.5 Apply now: graded exercises
  - 6.6 Final retrieval
## ELI5: the handyman with a toolbelt
Imagine a handyman visiting your house. You say, "My sink is leaking." The handyman does not randomly swing a hammer. He first looks. He thinks. He checks which tool fits the problem. That is an agent.
Keep these names in your head:
- the toolbelt = the available tools
- the think step = reasoning and planning
- the try = the tool call
- the check = reading the result and judging it
- the give-up rule = max iterations or a confidence threshold
Now picture the full loop.
        problem arrives
              ↓
          think step
              ↓
     pick from the toolbelt
              ↓
             try
              ↓
            check
              ↓
           worked?
          ↙       ↘
       yes         no
        ↓           ↓
     finish     think again
                    ↓
              give-up rule?
               ↙       ↘
            yes         no
             ↓           ↓
     call specialist  try again
## Chapter 1: The first production embarrassment
### 1.1 One good tool call is not an agent
You give an LLM access to a calculator tool. You test it with a simple prompt: "What is 37 × 18?" The model calls the calculator correctly. You feel smart. You feel safe. You think, "Done. Tool calling works."
Then the real query arrives: "A warehouse has 17 boxes. Each box has 24 batteries. 13 batteries are defective. The rest are packed into cartons of 7. How many full cartons can we ship?"
This problem has multiple steps.
1. Multiply 17 × 24
2. Subtract 13
3. Divide by 7
4. Take the floor
Now watch the embarrassing trace.
User:
A warehouse has 17 boxes. Each box has 24 batteries.
13 batteries are defective. The rest are packed into cartons of 7.
How many full cartons can we ship?
Assistant tool call:
calculator({"expression": "17 * 24"})
Tool result:
408
Assistant final answer:
After removing 13 defective batteries, 398 remain.
That makes 56 full cartons.
Two things went wrong. First, 408 - 13 is not 398. It is 395. Second, 395 / 7 is 56 remainder 3, and 398 / 7 is 56 remainder 6, so the answer happened to stay 56. The first wrong number got hidden.
This is the worst kind of failure. The final answer looks plausible. The stepwise state is broken. One correct tool call gave you false confidence. The model used the tool for step 1, then hallucinated step 2.
This happens constantly. Why? Because tool access alone is not enough. The model needs a policy. It needs a loop. It needs a habit of checking. A single round of tool calling says, "Here is one moment when the model can act." An agent says, "Keep acting and observing until the task is truly done." That difference is everything.
### Tiny log, big lesson
Let us write the same flow properly.
User asks multi-step math question
↓
Model thinks: I need exact arithmetic for every step
↓
Tool call 1: 17 * 24
↓
Observe: 408
↓
Tool call 2: 408 - 13
↓
Observe: 395
↓
Tool call 3: 395 // 7
↓
Observe: 56
↓
Answer: 56 full cartons
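The corrected flow above can be sketched in Python. This is a minimal illustration, not the course's reference code: `calculator` is a hypothetical stand-in for the calculator tool (the source does not show its implementation), and `full_cartons` plays the agent's role by sending every arithmetic step through the tool instead of doing math "in its head".

```python
import ast
import operator

# Operators the hypothetical calculator tool is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
}


def calculator(expression: str):
    """Evaluate a small arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)


def full_cartons(boxes, per_box, defective, carton_size):
    """Run each step as its own tool call and keep a trace of observations."""
    trace = []

    def step(expression):
        result = calculator(expression)
        trace.append((expression, result))  # observation is stored, not guessed
        return result

    total = step(f"{boxes} * {per_box}")
    good = step(f"{total} - {defective}")
    cartons = step(f"{good} // {carton_size}")
    return cartons, trace


cartons, trace = full_cartons(17, 24, 13, 7)
for expression, result in trace:
    print(f"{expression} -> {result}")
```

Because each later expression is built from the previous observation, a wrong intermediate number cannot be silently invented between steps.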
### Why this failure generalizes
Math is the easy example. The same failure shows up everywhere.
- The agent searches a knowledge base once, then invents a second fact.
- The agent reads the customer profile, then guesses the subscription tier.
- The agent checks the calendar, then hallucinates available slots.
- The agent creates a draft email, then claims it already sent it.
If the next step depends on external truth, you must usually check again. That is the deeper principle. Agents exist because the world is outside the model.
### 1.2 Why the stakes are higher now
For 2025-2026, agents are the #1 production pattern. That sentence is not hype. It is workflow economics. Why are teams shipping agents? Because many valuable tasks are not one-shot generation tasks. They are:
- look something up
- compare two sources
- calculate something exactly
- write into a system
- verify the write succeeded
- ask a follow-up question
- escalate when confidence is low
That is a loop. Customer support is a loop. Operations is a loop. Data cleanup is a loop. Developer assistants are loops. Research assistants are loops.
The model is strong at language. The tools are strong at reality. The loop is what marries them. Without the loop, you get demo reliability. With the loop, you at least have a chance at production reliability.
Still, please do not romanticize agents. An agent is not magic. It is just careful iteration around uncertainty. That is why the basics matter. If you do not understand the loop, module 10 will feel decorative. MCP, multi-agent systems, and orchestration layers all assume one thing first: that you can run one agent safely. So this module is not optional groundwork. It is the foundation.
## Chapter 2: The ReAct loop
### 2.1 Think → Act → Observe
The standard agent loop is called ReAct. It comes from a very simple insight. Reasoning alone is not enough. Acting alone is not enough. You need both, interleaved. The canonical pattern is:
1. Think: decide what to do next
2. Act: call a tool or produce an answer
3. Observe: read the result
4. Repeat until done
In many papers, this is written as Thought → Action → Observation. In practical software, you should think of it as:
plan a tiny next move
↓
do the move
↓
read what actually happened
↓
update your internal state
↓
plan the next move
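### The minimal loop
MAX_STEPS = 6  # hard cap on loop iterations

def run_agent(user_query):
    # SYSTEM_PROMPT, TOOLS, call_model, and execute_tool come from your application.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
    for step in range(MAX_STEPS):
        response = call_model(messages=messages, tools=TOOLS)
        if response.final_text:
            # The model answered directly, so the loop is done.
            return response.final_text
        for tool_call in response.tool_calls:
            tool_result = execute_tool(tool_call)
            # Record the call and its result so the next step can see both.
            messages.append(tool_call.as_message())
            messages.append(tool_result.as_message())
    # Step cap reached without a final answer: give up safely.
    return "I could not complete this safely."
Three parts of this loop deserve the most attention: execute_tool, MAX_STEPS, and what you store in messages.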
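To watch the loop run without a real model, here is a self-contained sketch. Everything in it is an assumption made for illustration: `ModelResponse`, `scripted_model`, and the two arithmetic tools are hypothetical, and the "model" simply replays a fixed script. The point is only to show the think, act, observe cycle and the step cap doing their jobs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelResponse:
    """Hypothetical response shape: either a final answer or tool calls."""
    final_text: Optional[str] = None
    tool_calls: list = field(default_factory=list)


def scripted_model(script):
    """Return a fake call_model that replays a fixed list of responses."""
    responses = iter(script)

    def call_model(messages, tools):
        return next(responses)

    return call_model


def run_agent(user_query, call_model, tools, max_steps=6):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        response = call_model(messages=messages, tools=tools)
        if response.final_text is not None:
            return response.final_text, messages
        for call in response.tool_calls:
            result = tools[call["name"]](**call["arguments"])
            # Store both the action and the observation for the next step.
            messages.append({"role": "assistant", "tool_call": call})
            messages.append({"role": "tool", "name": call["name"], "content": result})
    # Give-up rule: the cap was hit before the model produced an answer.
    return "I could not complete this safely.", messages


tools = {"multiply": lambda a, b: a * b, "subtract": lambda a, b: a - b}
script = [
    ModelResponse(tool_calls=[{"name": "multiply", "arguments": {"a": 17, "b": 24}}]),
    ModelResponse(tool_calls=[{"name": "subtract", "arguments": {"a": 408, "b": 13}}]),
    ModelResponse(final_text="395 batteries remain."),
]
answer, messages = run_agent("How many good batteries?", scripted_model(script), tools)
print(answer)
```

Swapping `scripted_model` for a real client changes nothing in `run_agent`, which is one reason to keep the loop this small while learning.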
### The full picture
user request
↓
model reads request + prior observations + tools
↓
think: what is the smallest useful next action?
↓
act: call tool or answer directly
↓
observe: parse tool output
↓
state update
↓
stop?
↙ ↘
yes no
↓ ↓
answer next ReAct step
### 2.2 Why the loop works
It works for four boring reasons. Boring reasons build great systems.
#### Reason 1: It breaks large uncertainty into small uncertainty
"Solve the whole task now" is hard. "Find the ticket status first" is easier. "Now check the billing plan" is easier. "Now draft the answer" is easier. Agents reduce cognitive load per step.
#### Reason 2: It grounds the model in fresh observations
The model's pretraining is old. Your database is current. Your filesystem is current. Your ticket state is current. Every observation pulls the model back toward reality.
#### Reason 3: It supports recovery
If a tool fails, the model can react. It can retry. It can choose another tool. It can ask a clarifying question. It can escalate. Without the loop, one failure often ends the run.
#### Reason 4: It creates inspectable traces
When an agent fails, you can inspect the loop. You can ask:
- Did it pick the wrong tool?
- Did the tool return a bad result?
- Did the model ignore the result?
- Did it stop too early?
- Did it continue too long?
This makes debugging possible. Not easy. But possible.
### A small support example
User says, "My invoice is wrong and I cannot log in." A decent ReAct trace may look like this:
Thought: This needs account lookup and recent invoice details.
Action: lookup_account(user_id="u_145")
Observation: account is active; last login failed due to MFA reset
Thought: Need billing details too.
Action: lookup_invoice(invoice_id="inv_882")
Observation: duplicate charge detected; refund not issued yet
Thought: High-impact billing issue. Escalate and explain.
Action: escalate_to_human(category="billing", summary="duplicate charge + login issue")
Observation: escalation accepted, ticket #T-9921
Final answer: explain login fix, mention billing escalation, share ticket number
### 2.3 Loop implementation patterns
There are three common implementation styles.
#### Pattern A: Manual loop
You write the while-loop yourself.
Pros:
- maximum control
- easy to log every step
- easiest way to learn
Cons:
- you must manage message state
- retries are your job
- stop logic is your job
This is the best learning path.
#### Pattern B: Framework-managed loop
Frameworks like LangGraph, agent SDKs, and orchestration libraries can manage the loop.
Pros:
- quicker to ship
- built-in state graphs
- built-in middleware hooks
Cons:
- easier to hide mistakes
- harder to know where the bug lives
- abstractions feel magical until they break
Use frameworks. But first understand the raw loop.
#### Pattern C: Hybrid loop
The framework manages messaging. You still own tool execution, state design, and guardrails. This is where most production teams land.
### The state you usually need
At minimum, track these things:
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentState:
    user_query: str
    step_count: int = 0
    tool_history: list[dict[str, Any]] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    total_cost_usd: float = 0.0
    done: bool = False
### The tool execution boundary
The model should not directly call Python functions. Your application should mediate.
def execute_tool(tool_call):
    name = tool_call["name"]
    args = tool_call["arguments"]
    # Validate model-supplied arguments before anything runs.
    validated_args = validate_arguments(name, args)
    # Only the application looks up and invokes the real function.
    result = TOOL_REGISTRY[name](**validated_args)
    # Hand the loop a uniform result shape.
    return normalize_tool_result(name, result)
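The helpers used above are not defined in this file, so here is one way the boundary could be filled in using only the standard library. Treat it as a sketch under assumptions: the registry contents, the signature-based `validate_arguments`, and the result shape are all hypothetical choices, not the author's implementation.

```python
import inspect


def lookup_invoice(invoice_id: str) -> dict:
    """Hypothetical read-only tool backed by an in-memory table."""
    invoices = {"inv_882": {"status": "duplicate_charge"}}
    return invoices[invoice_id]


# Only functions listed here can ever be reached by the model.
TOOL_REGISTRY = {"lookup_invoice": lookup_invoice}


def validate_arguments(name, args):
    """Reject unknown tools, unexpected keys, and missing required keys."""
    if name not in TOOL_REGISTRY:
        raise ValueError(f"unknown tool: {name}")
    params = inspect.signature(TOOL_REGISTRY[name]).parameters
    unexpected = set(args) - set(params)
    required = {p for p, spec in params.items() if spec.default is inspect.Parameter.empty}
    missing = required - set(args)
    if unexpected or missing:
        raise ValueError(
            f"bad arguments: unexpected={sorted(unexpected)}, missing={sorted(missing)}"
        )
    return dict(args)


def normalize_tool_result(name, result):
    return {"tool": name, "ok": True, "data": result}


def execute_tool(tool_call):
    name = tool_call["name"]
    try:
        validated = validate_arguments(name, tool_call["arguments"])
        return normalize_tool_result(name, TOOL_REGISTRY[name](**validated))
    except (ValueError, KeyError) as exc:
        # Failures become observations instead of crashing the loop.
        return {"tool": name, "ok": False, "error": str(exc)}


print(execute_tool({"name": "lookup_invoice", "arguments": {"invoice_id": "inv_882"}}))
print(execute_tool({"name": "delete_everything", "arguments": {}}))
```

The design choice worth noticing is that the model only ever supplies a name and a dictionary; what actually runs is decided by the registry.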
### 2.4 ReAct vs chain-of-thought alone
Chain-of-thought alone means, "Think harder inside the model." ReAct means, "Think, then interact with the world, then think again." The difference is not cosmetic.
#### Chain-of-thought alone is good for:
- planning
- decomposition
- internal reasoning
- summarization
- choosing between options already in context
#### ReAct is needed for:
- fetching current facts
- exact computation
- reading databases
- executing side effects
- verifying side effects
- multi-step workflows with external state
### Short comparison table
| Pattern | Can use fresh external facts? | Can verify side effects? | Typical failure |
|---|---|---|---|
| Chain-of-thought only | No | No | plausible hallucination |
| Single tool call | Once | Rarely | stops too early |
| ReAct loop | Yes | Yes | loops badly if guardrails are weak |
### Example: the same question under both patterns
Question: "What is the customer's current plan, and does the refund policy allow a full refund?"
Chain-of-thought alone can only reason over what is already in context. If the plan is not in context, it must guess or refuse. ReAct can:
1. read the customer record
2. read the policy document
3. combine them
4. answer with evidence
That is the practical edge.
### One more subtle point
ReAct is not always better. If the task is, "Rewrite this paragraph more clearly," then tools are unnecessary. If the task is, "Calculate payroll using latest attendance logs and tax tables," then tools are necessary. Use the cheapest pattern that achieves reliability. That sentence will save you money.
---
## Chapter 3: Tool design
### 3.1 Tools are interfaces, not wishes
Beginners design tools like prayers: do_everything(), handle_user_request(), database_tool(). These names are useless. A tool is not for your convenience alone. A tool is also for model selection. The model reads the tool name, description, and schema. Then it guesses, "Is this the right action?" So tool design is partly API design, and partly prompt design. That is the trick.
### Good tool design rules
| Rule | Why it matters |
|---|---|
| Narrow scope | Easier for the model to pick correctly |
| Specific verb | lookup_invoice beats billing_tool |
| Clear description | The model uses descriptions as routing hints |
| Typed schema | Prevents argument drift |
| Structured errors | Lets the model recover intelligently |
| Idempotency where possible | Duplicate calls do not create duplicate damage |
| Explicit side-effect level | Helps decide whether human approval is needed |
### Bad vs good examples
Bad:
The model must now guess:
- what this tool really does
- what keys belong in payload
- whether it reads or writes
- whether it is safe to retry
Good:
Now the intent is visible.
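To make the contrast concrete, here is a hypothetical pair of tool definitions and a tiny checker. None of this comes from the source: `BAD_TOOL`, `GOOD_TOOL`, and the keyword checks in `open_questions` are illustrative assumptions that mirror the four guesses listed above.

```python
# Hypothetical "bad" tool: vague name, free-form payload, no side-effect hint.
BAD_TOOL = {
    "name": "billing_tool",
    "description": "Handles billing.",
    "parameters": {"type": "object", "properties": {"payload": {"type": "object"}}},
}

# Hypothetical "good" tool: specific verb, typed argument, stated side-effect level.
GOOD_TOOL = {
    "name": "lookup_invoice",
    "description": "Read-only. Fetch one invoice by its ID. Safe to retry.",
    "parameters": {
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}},
        "required": ["invoice_id"],
    },
}


def open_questions(tool):
    """List what the model would still have to guess about a tool definition."""
    questions = []
    description = tool["description"].lower()
    params = tool["parameters"]
    if not any(word in description for word in ("read-only", "writes", "side effect")):
        questions.append("does it read or write?")
    if "retry" not in description:
        questions.append("is it safe to retry?")
    if not params.get("required"):
        questions.append("which arguments are required?")
    if any(spec.get("type") == "object" for spec in params["properties"].values()):
        questions.append("what keys belong in the nested object?")
    return questions


print(open_questions(BAD_TOOL))
print(open_questions(GOOD_TOOL))
```

A keyword check like this is crude, but running something similar over a tool catalog is a cheap way to spot definitions that leave the model guessing.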
### 3.2 Schema definition
A tool schema tells the model what arguments exist, which are required, and what values are valid. Think of schema as lane markings on a road. Without lane markings, vehicles still move. Crashes just increase.
### JSON-style tool schema
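As an illustration, a JSON-style schema for a hypothetical `lookup_invoice` tool might look like the dictionary below. The outer layout (`name`, `description`, `parameters`) is an assumption: providers wrap the JSON Schema part differently, so check your model API's own format. The field names are invented for this example.

```python
import json

# Hypothetical tool definition; the "parameters" value is plain JSON Schema.
LOOKUP_INVOICE_TOOL = {
    "name": "lookup_invoice",
    "description": (
        "Fetch a single invoice by ID. Use when the user mentions a charge, "
        "a refund, or an invoice number. Read-only."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_id": {
                "type": "string",
                "description": "Invoice ID, for example inv_882.",
            },
            "include_line_items": {
                "type": "boolean",
                "description": "Return individual line items as well as totals.",
                "default": False,
            },
        },
        "required": ["invoice_id"],
        "additionalProperties": False,  # reject arguments the schema does not name
    },
}

print(json.dumps(LOOKUP_INVOICE_TOOL, indent=2))
```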
Two things guide selection here. First, the name is precise. Second, the description says when to use it, not only what it returns. That second part is underrated.
### Descriptions should answer four questions
1. What does this tool do?
2. When should the model use it?
3. When should the model avoid it?
4. Is it read-only or side-effectful?
### Example of a stronger description
See how much routing help the model gets.
### 3.3 Descriptions guide selection
Tool choice is a classification problem in disguise. The model is matching the current sub-task against tool descriptions. If descriptions are vague, routing becomes vague. If descriptions overlap heavily, routing becomes noisy. That is why five narrow tools often beat one mega-tool.
### A small comparison
#### Vague set
- customer_tool
- support_tool
- admin_tool
#### Clear set
- lookup_account
- lookup_invoice
- search_help_center
- reset_mfa
- escalate_to_human
Which set would you trust a model to pick from? The second set. Always.
### Tool overlap is dangerous
Suppose you expose both tools below.
If policy docs are a subset of docs, the model must guess the boundary. You created ambiguity. Either merge them cleanly, or distinguish them sharply. For example:
Now the domains are clearer.
### 3.4 Error handling and retries
Tools fail. Networks fail. Validation fails. Databases timeout. Users provide wrong IDs. If your tool throws raw stack traces, your model receives chaos. Return structured errors instead.
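A structured error can be as small as a few named fields. The sketch below is an assumption about shape, not a standard: `ToolResult`, its field names, and the in-memory `INVOICES` table are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    """Hypothetical uniform result shape for every tool."""
    ok: bool
    data: Optional[Any] = None
    error_code: Optional[str] = None
    message: Optional[str] = None
    retryable: bool = False


INVOICES = {"inv_882": {"amount_usd": 49.0, "status": "paid"}}


def lookup_invoice(invoice_id: str) -> dict:
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        # A legible failure: a stable code, a hint, and an explicit retry flag.
        return asdict(ToolResult(
            ok=False,
            error_code="INVOICE_NOT_FOUND",
            message=f"No invoice with ID {invoice_id!r}. Ask the user to confirm the ID.",
            retryable=False,
        ))
    return asdict(ToolResult(ok=True, data=invoice))


print(lookup_invoice("inv_999"))
```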
Now the model can reason.
- retryable=False means do not keep hammering.
- error_code=INVOICE_NOT_FOUND means ask user for the right ID.
That is much better than, "KeyError on line 82."
### Retry logic belongs in the application layer
The model can decide whether to retry. The application should decide how retries work. For example:
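One possible application-side policy is a capped retry with exponential backoff that only fires when the tool itself says the failure is retryable. The function name, defaults, and result keys below are assumptions for illustration.

```python
import time


def call_with_retries(tool, args, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Retry a tool only while it reports a retryable failure.

    Assumes tools return dicts with "ok" and "retryable" keys (hypothetical shape)
    and that max_attempts is at least 1.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        result = tool(**args)
        if result.get("ok") or not result.get("retryable"):
            return result
        if attempt < max_attempts:
            # Exponential backoff: 0.5s, 1s, 2s, ... between attempts.
            sleep(base_delay_s * 2 ** (attempt - 1))
    return result  # still failing after the cap; the caller decides what next


attempts = {"count": 0}


def flaky_search(query):
    """Hypothetical tool that times out twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        return {"ok": False, "error_code": "TIMEOUT", "retryable": True}
    return {"ok": True, "data": f"results for {query}"}


# A no-op sleep keeps the demo instant; production code would use time.sleep.
print(call_with_retries(flaky_search, {"query": "refund policy"}, sleep=lambda s: None))
```

Passing `sleep` in as a parameter is a small design choice that makes the policy testable without waiting.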
This keeps retry storms under control.
### 3.5 Idempotency
Idempotency means, "Calling this again does not create a second disaster." Read tools are naturally idempotent.
- search_kb
- lookup_account
- get_weather
Write tools are often not idempotent.
- send_email
- charge_card
- create_ticket
- issue_refund
Agents can and do repeat calls. So write tools need protection.
### Three common idempotency patterns
#### Pattern 1: Client request ID
If the same request_id arrives twice, return the original ticket. Do not create a new one.
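A minimal in-memory sketch of the request-ID pattern follows. `TicketStore` and its ticket ID format are hypothetical; a real system would keep the mapping in a database with a uniqueness constraint on the request ID.

```python
import itertools


class TicketStore:
    """Hypothetical ticket backend that deduplicates on a client request ID."""

    def __init__(self):
        self._by_request_id = {}
        self._ids = itertools.count(1)

    def create_ticket(self, request_id: str, summary: str) -> dict:
        existing = self._by_request_id.get(request_id)
        if existing is not None:
            # Same request seen before: return the original, create nothing.
            return existing
        ticket = {"ticket_id": f"T-{next(self._ids)}", "summary": summary}
        self._by_request_id[request_id] = ticket
        return ticket


store = TicketStore()
first = store.create_ticket("req-1", "duplicate charge")
repeat = store.create_ticket("req-1", "duplicate charge")  # agent retried the call
print(first["ticket_id"], repeat["ticket_id"])
```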
#### Pattern 2: Read-before-write
Before creating a new refund, check whether a refund already exists.
#### Pattern 3: Human approval on high-risk writes
If the effect is expensive, make duplicates impossible through approval gates.
### 3.6 Pydantic schemas
Pydantic is very useful here. It gives you:
- field types
- validation
- defaults
- enums
- descriptions
- clean JSON schemas
### Example: support agent tools
Very nice. Very readable. Very hard for the model to misunderstand.
### Tool wrapper with validation
If validation fails, convert it into a structured tool result. Do not crash the loop blindly.
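Here is one way a validated wrapper could look with Pydantic. It is a sketch under assumptions: the `EscalateArgs` model, the `Category` values, the fixed ticket ID, and the error shape are all hypothetical. Only `BaseModel`, `Field`, and `ValidationError` are used, since those behave the same way in Pydantic 1.x and 2.x.

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class Category(str, Enum):
    billing = "billing"
    login = "login"


class EscalateArgs(BaseModel):
    """Hypothetical argument schema for an escalate_to_human tool."""
    category: Category
    summary: str = Field(..., min_length=10, description="One-sentence reason for escalation.")


def escalate_to_human(args: EscalateArgs) -> dict:
    # Stand-in for the real side effect; the ticket ID is a fixed placeholder.
    return {"ok": True, "data": {"ticket_id": "T-9921", "category": args.category.value}}


def run_escalation(raw_args: dict) -> dict:
    """Validate model-supplied arguments, then call the tool."""
    try:
        args = EscalateArgs(**raw_args)
    except ValidationError as exc:
        # Turn the exception into an observation the model can act on.
        fields = sorted({str(err["loc"][0]) for err in exc.errors()})
        return {"ok": False, "error_code": "INVALID_ARGUMENTS", "fields": fields, "retryable": False}
    return escalate_to_human(args)


print(run_escalation({"category": "refunds", "summary": ""}))
```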
Now the agent sees the failure cleanly.
### One subtle design rule
Schema cannot fix bad ontology. If your categories are wrong, validation just makes the wrong shape cleaner. So choose arguments that reflect real decisions.
Bad:
- mode
- type
- option
Better:
- invoice_id
- refund_reason
- severity
- requires_manager_approval
Name the world, not your implementation convenience.
---
## Chapter 4: Failure modes & guardrails
### 4.1 Infinite loops
The most famous agent failure is simple. It keeps going. It searches, then searches again, then slightly rephrases, then searches again, then apologizes, then searches again. This burns tokens, latency, and user trust.
### Why loops happen
- tool returns weak results
- model does not know when enough is enough
- stop condition is vague
- failure signal is ambiguous
- tool descriptions encourage overuse
### Minimum protection
Always set a max-step cap.
That cap is not optional. It is your seatbelt.
### Better protection: reason-specific stopping
Stop when any of these is true:
- answer is ready
- required tool failed permanently
- human approval is required
- budget is exhausted
- repeated observation pattern detected
### Example stop function
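A minimal sketch of such a function is below. The `RunState` fields, the default limits, and the reason strings are assumptions made for illustration. The "repeated observation pattern" check is approximated here as the same call string appearing several times in a row, which is simpler than comparing observations.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunState:
    """Hypothetical per-run state the loop updates after every step."""
    step_count: int = 0
    total_cost_usd: float = 0.0
    answer_ready: bool = False
    permanent_failure: bool = False
    needs_human_approval: bool = False
    recent_calls: list = field(default_factory=list)  # e.g. "search_kb:refund"


def should_stop(state, max_steps=6, budget_usd=0.50, repeat_window=3) -> Optional[str]:
    """Return the reason to stop, or None to keep looping."""
    if state.answer_ready:
        return "answer_ready"
    if state.permanent_failure:
        return "tool_failed_permanently"
    if state.needs_human_approval:
        return "human_approval_required"
    if state.total_cost_usd >= budget_usd:
        return "budget_exhausted"
    if state.step_count >= max_steps:
        return "step_cap_reached"
    window = state.recent_calls[-repeat_window:]
    if len(window) == repeat_window and len(set(window)) == 1:
        return "repeated_call_pattern"
    return None


print(should_stop(RunState(step_count=2, recent_calls=["search_kb:refund"] * 3)))
```

Returning a reason string rather than a bare boolean keeps the stop decision visible in logs.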
This is the give-up rule in code.
### 4.2 Wrong tool selection
Sometimes the agent picks the wrong tool. Not because it is stupid. Because your tool menu is confusing. Common cases:
- search tool chosen instead of direct lookup
- write tool chosen before validation
- general web search chosen instead of internal policy docs
- escalation chosen too early
### How to reduce wrong selection
1. Improve tool names
2. Improve descriptions
3. Remove overlapping tools
4. Add examples in the system prompt
5. Route first, then expose only a smaller tool subset
That fifth trick is powerful. If the user asks about billing, why expose five engineering tools at all?
### 4.3 Hallucinated arguments
The model may choose the right tool, but invent the wrong arguments. Examples:
- fake invoice ID
- made-up product area
- unsupported enum value
- empty summary for escalation
This is why validation matters. Hallucinated arguments are not rare edge cases. They are default behavior under pressure.
### A safer pattern
If required arguments are missing, do not let the tool guess. Have the model ask the user. For instance:
That is much better than, "Let me refund invoice inv_999."
### 4.4 Stopping rules and give-up rules
A strong agent knows when to stop. A mature product knows when to give up. These are related, but different.
#### Stop rule
"I have enough information. I can answer now."
#### Give-up rule
"I do not have enough confidence, or I hit a cap, or this action is too risky."
You need both.
### Confidence thresholds
Confidence is slippery. Models are overconfident. So do not rely only on self-reported confidence. Combine signals instead:
- number of successful tool observations
- whether required fields are present
- whether retrieved evidence agrees
- whether tool errors remain unresolved
- whether the action is high risk
### A simple policy table
| Situation | Preferred action |
|---|---|
| Exact answer with verified evidence | answer directly |
| Missing required ID | ask user a clarifying question |
| Write action above risk threshold | human approval gate |
| Repeated retryable failure | apologize and escalate |
| Step cap reached | return partial progress + safe next step |
### 4.5 Human-in-the-loop gates
Do not let the agent freely perform high-risk writes. Good HITL candidates:
- refunds
- account deletion
- contract changes
- legal communication
- sending emails outside the company
- production infrastructure changes
### ASCII picture
This pattern feels slower. It is safer. And safety is part of product quality.
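The gate described above can be sketched as a check that runs before any tool executes. Everything here is a hypothetical illustration: the `HIGH_RISK_TOOLS` set, the `approvals` dictionary standing in for a review queue, and the decision strings.

```python
# Hypothetical list of tools that must never run without a human decision.
HIGH_RISK_TOOLS = {"issue_refund", "delete_account", "send_external_email"}


def gate_tool_call(tool_call, approvals):
    """Decide whether a tool call may run now or must wait for a human.

    `approvals` maps a call ID to "pending", "approved", or "rejected".
    """
    name = tool_call["name"]
    if name not in HIGH_RISK_TOOLS:
        return {"decision": "execute"}
    call_id = tool_call["id"]
    status = approvals.get(call_id)
    if status == "approved":
        return {"decision": "execute"}
    if status == "rejected":
        return {"decision": "abort", "reason": "rejected by reviewer"}
    # First sighting (or still pending): park the call and pause the agent.
    approvals[call_id] = "pending"
    return {"decision": "wait_for_human", "call_id": call_id}


approvals = {}
refund = {"id": "call_7", "name": "issue_refund", "arguments": {"invoice_id": "inv_882"}}
print(gate_tool_call(refund, approvals))   # parked for review
approvals["call_7"] = "approved"           # a reviewer signs off
print(gate_tool_call(refund, approvals))   # now allowed to run
```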
### 4.6 Cost controls
Agents are expensive in two ways. First, they make many model calls. Second, they often cause long prompts to be resent.
### Where cost comes from
- repeated tool loops
- growing message history
- expensive model for trivial routing
- large retrieved context every step
- unnecessary parallel tool fan-out
### Common cost controls
1. Step caps
2. Budget caps
3. Cheaper router model
4. Prompt caching
5. Observation summarization
6. Small tool subset per route
7. Parallelize only read-only work
### Latency is part of cost
Users experience time, not token elegance. A five-step agent that answers in nine seconds may be worse than a one-shot answer in one second, if accuracy barely improves. So the right metric is not, "Did the agent feel smart?" It is, "Did the extra steps improve outcome enough to justify cost and latency?"
### Logging is the hidden guardrail
If you log these fields, you can actually debug production:
- user query
- tool calls
- raw arguments
- validation results
- tool outputs
- step count
- latency per step
- tokens per step
- total cost
- final outcome label
Without logs, you are storytelling. With logs, you are engineering.
---
## Chapter 5: Advanced patterns
### 5.1 Parallel tool calls
Sometimes the next best step is not one tool. It is several independent read tools. Example:
- read account status
- read last invoice
- search status page
These can happen together, if they do not depend on each other.
### ASCII fan-out
Parallelization reduces latency. It does not reduce complexity. You still need:
- per-tool timeouts
- argument validation
- merge logic
- partial failure handling
### Example with asyncio.gather
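A minimal fan-out sketch with `asyncio.gather` follows, covering two items from the list above: per-tool timeouts and partial failure handling. The three read tools are hypothetical stubs that sleep instead of doing I/O, and the slow one is deliberately slower than the timeout so the failure path is visible.

```python
import asyncio


async def read_account_status(user_id):
    await asyncio.sleep(0.01)  # stands in for a fast network call
    return {"user_id": user_id, "status": "active"}


async def read_last_invoice(user_id):
    await asyncio.sleep(0.01)
    return {"invoice_id": "inv_882", "status": "paid"}


async def search_status_page(query):
    await asyncio.sleep(1.0)  # deliberately slower than the timeout below
    return {"incidents": []}


async def fan_out(calls, timeout_s=0.2):
    """Run independent read tools together; one slow tool cannot sink the rest."""

    async def guarded(name, coro):
        try:
            data = await asyncio.wait_for(coro, timeout_s)
            return name, {"ok": True, "data": data}
        except asyncio.TimeoutError:
            return name, {"ok": False, "error_code": "TIMEOUT", "retryable": True}

    pairs = await asyncio.gather(*(guarded(n, c) for n, c in calls.items()))
    return dict(pairs)  # merge step: one entry per tool, success or failure


async def main():
    return await fan_out({
        "account": read_account_status("u_145"),
        "invoice": read_last_invoice("u_145"),
        "status_page": search_status_page("login"),
    })


results = asyncio.run(main())
print(results)
```

Wrapping each call in its own guard, rather than putting one timeout around the whole `gather`, is what lets the two fast results survive the slow one.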
Use this mainly for read-only tools. Parallel writes are much riskier.
### When not to parallelize
Do not parallelize when:
- step 2 depends on step 1's output
- two tools may race on the same record
- side effects must happen in order
- you cannot merge conflicting observations cleanly
Parallelism is a performance tool. Not a maturity badge.
### 5.2 Tool chaining
Tool chaining means one tool's result feeds the next tool. Example chain:
Or, for an operations agent:
Chains are common. They are also where hidden assumptions live.
### The chain must remain inspectable
If you compress the whole chain into one mega-tool, you lose visibility. If you split every tiny micro-step into a tool, you lose efficiency. So choose meaningful boundaries. A good question is: "Where would I want logs if this failed?" That often tells you the right tool boundary.
### 5.3 Dynamic tool selection
Exposing all tools to all queries is rarely optimal. Dynamic tool selection means choosing a smaller, relevant tool menu first.
### Simple routing example
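A rule-based router can be a few lines. The sketch below is an assumption-heavy illustration: the route names, keyword lists, and tool groupings are invented, and substring matching is far cruder than what a classifier or a small model would do.

```python
# Hypothetical mapping from route to the tools worth exposing for it.
TOOLS_BY_ROUTE = {
    "billing": ["lookup_account", "lookup_invoice", "escalate_to_human"],
    "login": ["lookup_account", "reset_mfa", "escalate_to_human"],
    "docs": ["search_help_center"],  # read-only fallback route
}

# Hypothetical trigger words per route (substring match on lowercased text).
ROUTE_KEYWORDS = {
    "billing": ("invoice", "charge", "refund"),
    "login": ("log in", "login", "password", "mfa"),
}


def select_tools(user_query):
    """Return the smaller tool menu to expose for this query."""
    text = user_query.lower()
    selected = []
    for route, keywords in ROUTE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            for tool in TOOLS_BY_ROUTE[route]:
                if tool not in selected:  # keep order, avoid duplicates
                    selected.append(tool)
    # No route matched: expose only the read-only documentation tool.
    return selected or TOOLS_BY_ROUTE["docs"]


print(select_tools("My invoice is wrong and I cannot log in."))
print(select_tools("What are your support hours?"))
```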
This can be rule-based. It can be classifier-based. It can be another model. The point is not sophistication. The point is reducing confusion.
### Dynamic selection also helps safety
If the user asks a documentation question, you can hide all write tools. That alone prevents many accidents.
### 5.4 Memory and state across turns
Single-turn demos are easy. Real users come back. They ask follow-ups. They change goals. State now matters. You usually need three layers.
#### Layer 1: Short-term conversation state
What happened in the current run?
- user goal
- recent tool observations
- unresolved questions
- current plan
#### Layer 2: Structured working state
This is not raw chat history. This is clean machine state.
Structured state is much easier to reuse than prose.
#### Layer 3: Long-term memory
Only store what is worth remembering. Examples:
- stable user preferences
- prior successful resolutions
- account facts that are safe to retain
- summary of previous long session
Do not dump everything forever. That becomes noise, and sometimes risk.
### A useful rule
Messages are for conversation. State is for control. Memory is for continuity. Do not mix all three blindly.
### 5.5 Retrieval prompts
These are good prompts to keep handy. They help an agent retrieve only what matters.
#### Retrieval prompt 1: choose the next tool
Use this when the agent is uncertain which tool to call next.
#### Retrieval prompt 2: prepare for a write action
Use this before refunds, updates, or ticket creation.
#### Retrieval prompt 3: compress previous work
Use this when conversation history is growing.
#### Retrieval prompt 4: debug a failed run
Use this for ops and evaluation workflows.
These prompts matter because retrieval is not just search. It is search with intent.
### 5.6 Honest admission
Agents are brittle. They are hard to debug. They are non-deterministic. They fail in ways that look intelligent. A demo can work ten times, and fail on the eleventh for a tiny wording change. That is normal. Not pleasant. But normal. Please keep three truths together.
#### Truth 1
Agents can unlock workflows that one-shot LLM calls cannot handle well.
#### Truth 2
Most agent failures are systems problems, not purely model problems. Bad schemas, bad prompts, bad retries, bad state design, and bad stop rules cause a lot of pain.
#### Truth 3
A workflow should not become agentic by default. If prompt + retrieval solves the problem, stop there. If a deterministic backend workflow solves it, stop there. Use agents when uncertainty and branching are real. This honesty is senior behavior.
### 5.7 Foundation-gap audit for module 10
Module 10 assumes you already own these basics. If any item below feels fuzzy, fix it now.
#### Audit checklist
- Single-agent loop mechanics
  - Can you explain think → act → observe without notes?
  - Can you code a basic loop manually?
- Tool schema design
  - Can you name fields, enums, and descriptions that reduce misuse?
- Error handling in loops
  - Can you describe validation errors, retryable errors, and terminal errors?
- When to stop / give up
  - Can you explain max-step caps, budget caps, and escalation thresholds?
- State across turns
  - Can you separate message history, structured state, and memory?
If these are weak, MCP and multi-agent orchestration will feel fancy but hollow.
### 5.8 Bridge to next module
The next module (16_multi_agent_coordination) scales this pattern. Multiple agents coordinating. Standardized tool protocols (MCP). Orchestration patterns for complex workflows.
See the order clearly. First, one agent learns to think, call, observe, and stop safely. Then, multiple agents coordinate. First, you design one clean tool schema. Then, you think about standardized tool protocols. First, you manage state in one loop. Then, you manage state across many loops. That is why module 09 comes first.
---
## Chapter 6: Recap & application
### 6.1 The failure-fix table
Visual recap. Every row is one common failure, and the corresponding fix.
| Failure | Fix |
|---|---|
| Model uses a tool once, then hallucinates later steps | Use a ReAct loop that checks every critical step |
| Agent keeps searching forever | Add max-step caps and repeated-call detection |
| Agent picks the wrong tool | Narrow tool scopes and improve descriptions |
| Agent invents arguments | Validate with strict schemas |
| Tool crashes leak raw exceptions | Return structured error objects |
| Duplicate side effects happen | Add idempotency keys and approval gates |
| Prompt grows too large | Summarize observations into structured state |
| Expensive model handles trivial routing | Use a smaller router or rule-based front door |
| Parallel calls create conflicts | Parallelize only independent read tools |
| Agent answers without evidence | Require observations before high-confidence answers |
| Write action is risky | Add human-in-the-loop approval |
Sketch this table from memory. That alone will reveal your weak spots.
### 6.2 Key points to remember
- A tool call is not an agent.
- The loop is the product.
- ReAct means think, act, observe, repeat.
- Tool names and descriptions are routing signals.
- Schemas reduce argument hallucination.
- Structured errors make recovery possible.
- Idempotency matters because agents repeat themselves.
- Stop rules and give-up rules are different.
- State should be explicit, not accidental.
- Parallelism helps latency only when dependencies allow it.
- Agents are useful, but they are not the default answer.
### 6.3 Important interview questions
Q1. What is the difference between tool calling and an agent?
Tool calling can be one step. An agent is a loop with state, observation, and stopping logic.
Q2. Why do narrow tools outperform one mega-tool?
Because tool selection is a classification problem. Narrow tools reduce ambiguity, argument confusion, and unsafe overlap.
Q3. Why is structured error handling important?
Because the model can recover only if failure shape is legible. Raw exceptions are bad observations. Typed errors are useful observations.
Q4. When should you add human approval?
When the action is high risk, expensive, irreversible, or externally visible.
Q5. How do you stop an agent safely?
Use step caps, budget caps, repeated-failure detection, and task-specific completion checks.
Q6. When is chain-of-thought alone enough?
When the task is internal reasoning over already-present context, with no need for fresh external state or side effects.
Q7. What usually breaks first in production agents?
Tool design, not frontier-model intelligence. Bad schemas, weak descriptions, and missing guardrails usually show up early.

### 6.4 Production experience: latency, cost, and debugging

In production, three numbers start to matter very quickly.

1. success rate
2. latency
3. cost per task

The mistake is optimizing only one. A very cheap agent that fails is useless. A very accurate agent that takes twenty seconds is often useless. A very fast agent that issues duplicate refunds is dangerous.

### Practical lessons

#### Lesson 1: Measure per step

Do not log only total latency. Log latency per tool and per model call. That reveals whether search, retrieval, or the model is the bottleneck.
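
One lightweight way to get per-step numbers is a timing context manager wrapped around each model call and each tool call. This is a minimal sketch using only the standard library; the names `timed`, `step_latencies`, and `slowest_step`, and the step labels, are assumptions for illustration. The `time.sleep` calls stand in for real requests.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Step name -> list of durations in seconds, one entry per call.
step_latencies = defaultdict(list)


@contextmanager
def timed(step_name):
    """Record how long the wrapped block takes under step_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the step raises, so failed calls are measured too.
        step_latencies[step_name].append(time.perf_counter() - start)


def slowest_step():
    """Return the step name with the largest total recorded time."""
    totals = {name: sum(durations) for name, durations in step_latencies.items()}
    return max(totals, key=totals.get)


# Usage: wrap each model call and each tool call separately.
with timed("model_call"):
    time.sleep(0.01)   # stand-in for a real model request
with timed("tool:search_kb"):
    time.sleep(0.03)   # stand-in for a real tool call
```

In a real system the durations would go to your logging or metrics pipeline rather than a module-level dictionary; the point is only that each step gets its own label.
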

#### Lesson 2: Cap the blast radius

Early versions should have:

- read-heavy toolsets
- low step caps
- narrow user cohorts
- human review on writes

Expand after evidence, not before.

#### Lesson 3: Evaluate the trace, not only the final answer

Sometimes the final answer looks correct, but the path was unsafe. That path will later fail on a harder case. So evaluate:

- correct tool chosen?
- correct order?
- correct arguments?
- correct stop behavior?
- correct escalation?
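
The five questions above can be turned into a small grader that compares a recorded trace against an expected one. The sketch below assumes a trace is stored as an ordered list of `(tool_name, args)` pairs plus a stop reason and an escalation flag; the `Trace` shape and `grade_trace` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical record of one run (also used for the expected run)."""
    tool_calls: list   # [(tool_name, args_dict), ...] in call order
    stop_reason: str   # e.g. "completed" or "step_cap"
    escalated: bool


def grade_trace(actual: Trace, expected: Trace) -> dict:
    """Answer each trace-level question with True or False."""
    actual_names = [name for name, _ in actual.tool_calls]
    wanted_names = [name for name, _ in expected.tool_calls]
    return {
        # Same set of tools, ignoring order and repeats.
        "correct_tools": set(actual_names) == set(wanted_names),
        "correct_order": actual_names == wanted_names,
        # Exact match on names and arguments, call by call.
        "correct_arguments": actual.tool_calls == expected.tool_calls,
        "correct_stop": actual.stop_reason == expected.stop_reason,
        "correct_escalation": actual.escalated == expected.escalated,
    }
```

Exact argument matching is deliberately strict here. For free-text arguments such as search queries you would likely need a looser comparison.
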

#### Lesson 4: Budget for retries consciously

Retries are not free. A retry policy can quietly double cost. If retries exist, label them. Track them. Audit them.
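
A minimal sketch of labeled retries: every retry increments a per-tool counter, so the extra cost is visible instead of hidden inside a wrapper. `call_with_retries` and `retry_counts` are hypothetical names for this example, and a real version would likely add backoff and retry only on error types known to be transient.

```python
from collections import Counter

# Tool name -> number of retries (first attempts are not counted).
retry_counts = Counter()


def call_with_retries(tool_name, fn, max_retries=2):
    """Call fn(); on an exception, retry up to max_retries times.

    Each retry is counted under tool_name so it can be audited later.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # budget exhausted: surface the failure
            attempt += 1
            retry_counts[tool_name] += 1
```

The worst case is `1 + max_retries` calls per step, so even `max_retries=1` can double the cost of a failing step. Reviewing `retry_counts` per tool shows where that is happening.
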

#### Lesson 5: Your best debugging artifact is a library of bad traces

Keep examples of ugly failures:

- loop storms
- wrong-tool picks
- good final answer via bad intermediate state
- duplicate writes
- fake IDs

These traces become gold for prompts, regression tests, and interviews.

### 6.5 Apply now — graded exercises

#### Easy (5 minutes)

For each user request below, write the next best action. Do not solve the whole task. Just write the next action.

1. "I think I was charged twice."
2. "How do I reset MFA?"
3. "Please delete my company workspace right now."

Check yourself. Did you choose a tool, a clarifying question, or a human gate correctly?

#### Medium (15 minutes)

Design three tool schemas for a support agent. Requirements:

- one read-only lookup tool
- one knowledge-base search tool
- one escalation tool

For each, write:

- tool name
- description
- arguments
- one structured error example
- whether it is idempotent

If your tool names still sound vague, redo them.
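
If you are unsure what shape to write the answers in, here is one possible layout for a single read-only lookup tool, expressed as plain Python dictionaries with a JSON-Schema-style argument block. Everything in it is an assumption for illustration: the tool name `lookup_order_status`, the `idempotent` key (our own metadata, not a field any provider requires), and the error envelope. Treat it as a format reference and design your own three tools.

```python
import json

# Hypothetical tool definition; names and fields are illustrative only.
lookup_order_status = {
    "name": "lookup_order_status",
    "description": (
        "Read-only. Return the current status of one order, given its order ID. "
        "Does not modify the order."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order ID exactly as shown to the customer.",
            },
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
    "idempotent": True,  # our own metadata: repeated calls have no side effects
}

# One structured error the tool could return instead of a raw exception.
order_not_found = {
    "ok": False,
    "error": {
        "code": "ORDER_NOT_FOUND",
        "message": "No order matches the given order_id.",
        "retryable": False,
    },
}

print(json.dumps(lookup_order_status, indent=2))
```
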

#### Hard (30 minutes)

Implement a tiny manual ReAct loop. You may mock the tools.

Now support this flow:

- ask user for a multi-step shopping calculation
- force tool use for each exact arithmetic step
- stop after MAX_STEPS = 5
- log every observation

Then intentionally break one tool. Confirm the agent exits safely.
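
If you want a scaffold to start from, the sketch below shows one possible shape for the loop. It makes several assumptions: `scripted_policy` is a hard-coded stand-in for the model (3 items at 4.50 plus 2.00 shipping), the action format is a plain dictionary, and failures are reported as `{"ok": False, ...}` observations. Replacing the scripted policy with a real model call, and reading the calculation from the user, is the actual exercise.

```python
MAX_STEPS = 5


def multiply(a, b):
    return a * b


def add(a, b):
    return a + b


TOOLS = {"multiply": multiply, "add": add}


def scripted_policy(observations):
    """Stand-in for the model: decide the next action from past observations."""
    done = [o for o in observations if o["ok"]]
    if len(done) == 0:
        return {"tool": "multiply", "args": {"a": 3, "b": 4.5}}
    if len(done) == 1:
        return {"tool": "add", "args": {"a": done[0]["result"], "b": 2.0}}
    return {"final": f"Total: {done[1]['result']:.2f}"}


def run_agent(policy, tools, max_steps=MAX_STEPS):
    observations = []
    for step in range(max_steps):
        action = policy(observations)              # think
        if "final" in action:
            return action["final"], observations
        try:                                       # act
            result = tools[action["tool"]](**action["args"])
            obs = {"ok": True, "tool": action["tool"], "result": result}
        except Exception as exc:
            # A broken or unknown tool becomes a structured observation.
            obs = {"ok": False, "tool": action["tool"], "error": type(exc).__name__}
        print(f"step {step + 1}: {obs}")           # log every observation
        observations.append(obs)                   # observe
    return "Stopped: step cap reached.", observations
```

To check the safe exit, pass a tool table in which `multiply` raises. This naive policy retries the same call every time, so the run should end on the step cap with five failed observations rather than looping forever.
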

#### Very hard (45 minutes)

Take your Week 9 hands_on_lab plan. Design the evaluation sheet before implementation. Include columns for:

- query type
- expected tool sequence
- expected escalation or refusal
- correctness of final answer
- latency bucket
- notes on failure mode

This is how mature agent work begins.
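
One way to make the sheet concrete is to fix its columns in code and generate an empty CSV you fill in later. This sketch uses the standard `csv` module; the column identifiers, `build_sheet`, and the example row (including its tool names) are assumptions for illustration.

```python
import csv
import io

# Column identifiers mirror the list above.
COLUMNS = [
    "query_type",
    "expected_tool_sequence",
    "expected_escalation_or_refusal",
    "final_answer_correct",
    "latency_bucket",
    "failure_mode_notes",
]


def build_sheet(rows):
    """Return CSV text with a header row; missing fields are left blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Expectations are written before implementation; results are filled in after runs.
sheet = build_sheet([
    {
        "query_type": "billing_dispute",
        "expected_tool_sequence": "lookup_invoice>escalate_to_human",
        "expected_escalation_or_refusal": "escalate",
    },
])
print(sheet)
```

`csv.DictWriter` rejects keys that are not in `COLUMNS`, which helps keep the sheet's schema from drifting as more rows are added.
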

### 6.6 Final retrieval

Without looking, answer these from memory.

1. What are the five named placeholders from the handyman analogy?
2. Why is one good tool call not enough?
3. What are the three ReAct steps?
4. Give one good tool description, and one bad one.
5. Name three guardrails.
6. When should an agent give up?
7. What five basics does module 10 assume from module 09?

If you can answer these cleanly, you are ready. Then open 16_multi_agent_coordination. You will see the scaling story much more clearly.