05. Assignment 4: Customer Support Agent with Tool Calling
Week 9. ReAct loop. Tool schemas. Guardrails. Evaluation discipline.
Goal
Build a customer support agent for a hypothetical SaaS product. The agent should answer FAQ-style questions, look up account information, and escalate safely when needed.
This hands-on lab is not about building the fanciest chatbot. It is about showing that you understand a production-shaped loop.
Scenario
Your SaaS product has these common issues:
- login and MFA problems
- billing confusion
- workspace setup questions
- outage/status checks
- requests that should be escalated to humans
Your support agent should behave sensibly across all five.
Required tools (3+)
- search_kb(query, product_area): Read-only search over a mock help center or FAQ set.
- lookup_account(user_id): Read-only lookup for plan, account status, and recent auth flags.
- escalate_to_human(category, summary): Mock escalation tool that records a handoff.
Optional tools:
- lookup_invoice(invoice_id)
- create_ticket(subject, description, request_id)
- reset_mfa(user_id) (only if you also add HITL or a strict safety policy)
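As one possible starting point, the three required tools can be sketched as plain functions over in-memory fixtures. The fixture contents, the `{"ok": ..., "data": ...}` result envelope, and the `TOOLS` registry name are assumptions made for illustration; the assignment only fixes the tool names and parameters.

```python
from typing import Any, Callable

# Hypothetical fixtures standing in for a real help center and account store.
KB = [
    {"product_area": "auth", "title": "Reset MFA", "body": "Use a recovery code to sign in."},
    {"product_area": "billing", "title": "Update card", "body": "Go to Settings > Billing."},
]
ACCOUNTS = {"u_123": {"plan": "pro", "status": "active", "auth_flags": ["mfa_locked"]}}
HANDOFFS: list[dict[str, str]] = []


def search_kb(query: str, product_area: str) -> dict[str, Any]:
    """Read-only keyword search over the mock KB, filtered by product area."""
    words = query.lower().split()
    hits = [
        article for article in KB
        if article["product_area"] == product_area
        and any(w in (article["title"] + " " + article["body"]).lower() for w in words)
    ]
    return {"ok": True, "data": hits}


def lookup_account(user_id: str) -> dict[str, Any]:
    """Read-only lookup; a missing account is a structured error, not an exception."""
    account = ACCOUNTS.get(user_id)
    if account is None:
        return {"ok": False, "error": {"code": "not_found", "message": f"no account {user_id}"}}
    return {"ok": True, "data": account}


def escalate_to_human(category: str, summary: str) -> dict[str, Any]:
    """Mock escalation: records the handoff in memory and returns its id."""
    HANDOFFS.append({"category": category, "summary": summary})
    return {"ok": True, "data": {"handoff_id": len(HANDOFFS)}}


# Registry the loop can dispatch on by tool name.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "search_kb": search_kb,
    "lookup_account": lookup_account,
    "escalate_to_human": escalate_to_human,
}
```

Returning the same envelope from every tool keeps the loop's observation step uniform, whether the call succeeded or failed.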
Required architecture
You may use a framework, or write the loop manually. But your design must clearly show the ReAct pattern.
User query
↓
Agent loop
- think
- choose tool or answer
- execute tool
- observe result
- update state
- stop or repeat
↓
Final answer / clarification / escalation
Minimum loop rules
- MAX_STEPS cap required
- tool arguments must be validated
- tool errors must be structured, not raw exceptions
- at least one explicit stop rule
- at least one give-up rule
- at least one human-in-the-loop or safe escalation policy
Cross-ref:
- explainer chapter 2 for loop mechanics
- explainer chapter 3 for schemas
- explainer chapter 4 for guardrails
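The minimum loop rules can be sketched as a small manual loop. This is an illustration under stated assumptions, not a reference implementation: `decide` is a hypothetical stand-in for the model call, the action dict shape is invented for the example, and `MAX_TOOL_ERRORS` is an assumed give-up threshold.

```python
from typing import Any, Callable

MAX_STEPS = 6        # hard cap on loop iterations
MAX_TOOL_ERRORS = 2  # assumed give-up threshold


def run_agent(query: str,
              decide: Callable[[dict[str, Any]], dict[str, Any]],
              tools: dict[str, Callable[..., dict[str, Any]]]) -> dict[str, Any]:
    state: dict[str, Any] = {"query": query, "observations": [], "errors": 0}
    for step in range(1, MAX_STEPS + 1):
        action = decide(state)  # think, then choose a tool or an answer
        if action["type"] == "answer":
            # Explicit stop rule: the policy produced a final answer.
            return {"answer": action["text"], "stop_reason": "answered", "steps": step}
        tool = tools.get(action["tool"])
        if tool is None:
            result = {"ok": False, "error": {"code": "unknown_tool", "message": action["tool"]}}
        else:
            try:
                result = tool(**action["args"])
            except Exception as exc:
                # Convert raw exceptions into a structured error the model can read.
                result = {"ok": False, "error": {"code": "tool_exception", "message": str(exc)}}
        state["observations"].append({"tool": action["tool"], "result": result})
        if not result["ok"]:
            state["errors"] += 1
            if state["errors"] >= MAX_TOOL_ERRORS:
                # Give-up rule: repeated failures end the run; the caller can escalate.
                return {"answer": None, "stop_reason": "gave_up", "steps": step}
    return {"answer": None, "stop_reason": "max_steps", "steps": MAX_STEPS}


def scripted_decide(state: dict[str, Any]) -> dict[str, Any]:
    """Deterministic stand-in for the model: look up the account once, then answer."""
    if not state["observations"]:
        return {"type": "tool", "tool": "lookup_account", "args": {"user_id": "u_123"}}
    return {"type": "answer", "text": "Your plan is pro."}
```

Keeping `answered`, `gave_up`, and `max_steps` as separate stop reasons makes the stop versus give-up distinction visible in traces and eval output.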
Required deliverables
- agent.py or equivalent
  - agent definition
  - tool registry
  - loop implementation
- tools.py or equivalent
  - tool implementations
  - structured error returns
  - validation logic
- evals/gold_queries.json
  - 30 gold queries minimum
- evals/run.py
  - runs all queries, records traces, reports metrics
- README.md
  - architecture diagram
  - tool descriptions
  - guardrails
  - eval results
  - top failure modes
Gold query mix
Minimum 30 queries:
- 10 should be answered from KB
- 8 should require account lookup
- 4 should require invoice or billing reasoning
- 4 should escalate
- 4 should refuse or ask for clarification
Do not make the eval set too easy. Include messy language. Include incomplete IDs. Include two-intent queries.
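A quick way to keep the mix honest is to check it in code before running evals. The sketch below assumes each gold query carries a `type` field with the hypothetical labels shown; the field names and the three sample entries (messy language, an incomplete ID, a two-intent query) are illustrative, not a required format.

```python
from collections import Counter

# Required counts per query type, taken from the mix above (type labels are assumed names).
REQUIRED_MIX = {
    "kb": 10,
    "account": 8,
    "billing": 4,
    "escalate": 4,
    "refuse_or_clarify": 4,
}

# Three illustrative entries; a real evals/gold_queries.json needs at least 30.
SAMPLE_GOLD = [
    {"id": "q01", "type": "kb",
     "query": "cant log in, mfa code never shows up??",
     "expected_tools": ["search_kb"]},
    {"id": "q02", "type": "refuse_or_clarify",
     "query": "check my account, id is u_12",  # incomplete ID: ask for the full one
     "expected_tools": []},
    {"id": "q03", "type": "escalate",
     "query": "I was double charged and I also want my workspace deleted",  # two intents
     "expected_tools": ["lookup_account", "escalate_to_human"]},
]


def mix_shortfalls(gold: list[dict]) -> dict[str, int]:
    """Return how many queries are still missing per type; empty dict means the mix is met."""
    counts = Counter(q["type"] for q in gold)
    return {t: need - counts[t] for t, need in REQUIRED_MIX.items() if counts[t] < need}
```

Calling `mix_shortfalls` on the loaded gold set at the top of `evals/run.py` would fail fast if the set drifts away from the required mix.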
What to evaluate
For each query, record at least these columns:
- query type
- expected tool sequence
- actual tool sequence
- argument validity
- final answer correctness
- escalation correctness
- latency
- notes
Core metrics
- tool-selection accuracy
- argument-validation pass rate
- final-answer correctness
- safe-escalation rate
- refusal / clarification correctness
- p50 / p95 latency
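Most of these metrics reduce to simple ratios over the per-query records. The sketch below makes several assumptions: the row field names are hypothetical, tool selection is scored as an exact match of the whole tool sequence, and latency percentiles use the nearest-rank method. Refusal / clarification correctness would follow the same pattern, restricted to rows of that query type.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(rows: list[dict]) -> dict[str, float]:
    """Aggregate per-query eval rows into core metrics (field names are assumptions)."""
    n = len(rows)
    latencies = [r["latency_ms"] for r in rows]
    return {
        # Exact sequence match is a strict choice; a looser set-based match is also defensible.
        "tool_selection_accuracy": sum(r["expected_tools"] == r["actual_tools"] for r in rows) / n,
        "argument_validation_pass_rate": sum(r["args_valid"] for r in rows) / n,
        "final_answer_correctness": sum(r["answer_correct"] for r in rows) / n,
        "safe_escalation_rate": sum(r["escalation_correct"] for r in rows) / n,
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }
```

With only 30 queries, p95 is effectively the second-slowest run, so it is worth reporting the raw slow cases alongside the number.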
Success criteria
- Tool selection accuracy ≥ 85%
- Final answer correctness ≥ 75%
- Safe escalation behavior ≥ 95%
- No infinite loops
- No duplicate side effects
- Clear logs for every failed run
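These criteria can be turned into an automated gate at the end of an eval run. The sketch below is one hedged interpretation: the metric keys, the trace shape, and the `WRITE_TOOLS` set are assumptions, a run that hit the step cap is treated as a suspected runaway loop, and a duplicate side effect is defined as the same write tool called twice with identical arguments within one trace.

```python
import json

# Thresholds copied from the success criteria; the metric key names are assumptions.
THRESHOLDS = {
    "tool_selection_accuracy": 0.85,
    "final_answer_correctness": 0.75,
    "safe_escalation_rate": 0.95,
}
# Tools assumed to have side effects in this sketch.
WRITE_TOOLS = {"escalate_to_human", "create_ticket", "reset_mfa"}


def check_success(metrics: dict[str, float], traces: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = [
        f"{name} {metrics[name]:.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics[name] < minimum
    ]
    for trace in traces:
        if trace["stop_reason"] == "max_steps":
            failures.append(f"{trace['query_id']}: hit MAX_STEPS (suspected runaway loop)")
        seen = set()
        for call in trace["tool_calls"]:
            if call["tool"] not in WRITE_TOOLS:
                continue
            # Canonical key so argument order does not hide a duplicate.
            key = (call["tool"], json.dumps(call["args"], sort_keys=True))
            if key in seen:
                failures.append(f"{trace['query_id']}: duplicate {call['tool']} call")
            seen.add(key)
    return failures
```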
Required engineering choices
1. Use typed schemas
Use Pydantic or an equivalent validation layer.
Do not accept free-form dict payloads everywhere.
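For illustration, here is a minimal typed-argument layer built on stdlib dataclasses, standing in for Pydantic as an "equivalent validation layer". The ID format, the allowed product areas, and the `validate_args` helper are assumptions for this sketch; a Pydantic model per tool would play the same role.

```python
import re
from dataclasses import dataclass
from typing import Any

USER_ID_RE = re.compile(r"u_\d{3,}")                        # assumed ID format
PRODUCT_AREAS = {"auth", "billing", "workspace", "status"}  # assumed areas


@dataclass(frozen=True)
class LookupAccountArgs:
    user_id: str

    def __post_init__(self) -> None:
        if not isinstance(self.user_id, str) or not USER_ID_RE.fullmatch(self.user_id):
            raise ValueError("user_id must be 'u_' followed by at least 3 digits")


@dataclass(frozen=True)
class SearchKbArgs:
    query: str
    product_area: str

    def __post_init__(self) -> None:
        if not isinstance(self.query, str) or not self.query.strip():
            raise ValueError("query must be a non-empty string")
        if self.product_area not in PRODUCT_AREAS:
            raise ValueError(f"product_area must be one of {sorted(PRODUCT_AREAS)}")


SCHEMAS = {"lookup_account": LookupAccountArgs, "search_kb": SearchKbArgs}


def validate_args(tool_name: str, raw: Any) -> dict[str, Any]:
    """Validate model-proposed arguments; failures become structured errors."""
    schema = SCHEMAS.get(tool_name)
    if schema is None:
        return {"ok": False, "error": {"code": "unknown_tool", "message": tool_name}}
    try:
        # TypeError covers missing or unexpected fields; ValueError covers bad values.
        return {"ok": True, "args": schema(**raw)}
    except (TypeError, ValueError) as exc:
        return {"ok": False, "error": {"code": "invalid_arguments", "message": str(exc)}}
```

An incomplete ID such as `u_12` is rejected before any tool runs, which gives the loop a clean signal to ask the user for clarification.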
2. Make write behavior safe
If you include any write tool, show one of these protections:
- idempotency key
- read-before-write policy
- human approval gate
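As an example of the first protection, the optional `create_ticket` tool can deduplicate on its `request_id` parameter. Treating `request_id` as the idempotency key is an assumption about the intended design, and the in-memory `TICKETS` store and result envelope are hypothetical.

```python
from typing import Any

# Hypothetical in-memory store, keyed by idempotency key (request_id).
TICKETS: dict[str, dict[str, str]] = {}


def create_ticket(subject: str, description: str, request_id: str) -> dict[str, Any]:
    """Create a ticket once per request_id; a retry returns the original ticket."""
    existing = TICKETS.get(request_id)
    if existing is not None:
        # Same key seen before: no new side effect, just echo the earlier result.
        return {"ok": True, "data": existing, "deduplicated": True}
    ticket = {
        "ticket_id": f"T-{len(TICKETS) + 1}",
        "subject": subject,
        "description": description,
    }
    TICKETS[request_id] = ticket
    return {"ok": True, "data": ticket, "deduplicated": False}
```

If the loop retries a write after a timeout or a repeated model step, the second call with the same `request_id` does not create a second ticket.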
3. Log the trace
Keep a per-query trace with:
- each model step
- each tool call
- arguments
- tool result
- stop reason
That trace will help you debug quickly.
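One lightweight shape for such a trace is a dataclass that accumulates steps and serializes to JSON. The class name, method names, and step fields below are assumptions chosen for the sketch.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Trace:
    """Per-query trace: model steps, tool calls with arguments and results, stop reason."""
    query: str
    steps: list[dict[str, Any]] = field(default_factory=list)
    stop_reason: Optional[str] = None

    def log_model_step(self, thought: str) -> None:
        self.steps.append({"kind": "model", "thought": thought, "ts": time.time()})

    def log_tool_call(self, tool: str, args: dict[str, Any], result: dict[str, Any]) -> None:
        self.steps.append(
            {"kind": "tool", "tool": tool, "args": args, "result": result, "ts": time.time()}
        )

    def finish(self, stop_reason: str) -> None:
        self.stop_reason = stop_reason

    def to_json(self) -> str:
        """Serialize the whole trace, for example one JSON object per line in a log file."""
        return json.dumps(asdict(self))


trace = Trace(query="why is my account locked?")
trace.log_model_step("Need account status before answering.")
trace.log_tool_call("lookup_account", {"user_id": "u_123"},
                    {"ok": True, "data": {"status": "active"}})
trace.finish("answered")
```

Writing `trace.to_json()` once per query gives the eval runner a record it can diff against the expected tool sequence.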
4. Show at least one failure honestly
In your README, include one failure where the loop was wrong, and explain the fix you would try next.
Common pitfalls
- Too many vague tools
- No MAX_STEPS cap
- Tool descriptions that only say what, not when
- Raw exceptions bubbling into the loop
- Hidden writes with no safeguards
- Eval set with only easy cases
- No distinction between stop and give-up behavior
Stretch goals
These are optional, but useful if you want a stronger bridge to module 10:
- dynamic tool subset selection
- parallel read-only tool calls
- structured working state object
- lightweight memory between turns
- tiny cost tracker per query
Why this hands-on lab matters
Customer support remains one of the clearest production agent use cases. If you ship this well, you show more than prompt skill. You show workflow engineering.
And that is exactly the bridge to module 10, where the same pattern scales into MCP and multi-agent systems.