05. Assignment 4: Customer Support Agent with Tool Calling

Week 9. ReAct loop. Tool schemas. Guardrails. Evaluation discipline.

Goal

Build a customer support agent for a hypothetical SaaS product. The agent should answer FAQ-style questions, look up account information, and escalate safely when needed.

This assignment is not about building the fanciest chatbot. It is about showing that you understand a production-shaped loop.

Scenario

Your SaaS product has these common issues:

login and MFA problems
billing confusion
workspace setup questions
outage/status checks
requests that should be escalated to humans

Your support agent should behave sensibly across all five.

Required tools (3+)

search_kb(query, product_area): read-only search over a mock help center or FAQ set.
lookup_account(user_id): read-only lookup for plan, account status, and recent auth flags.
escalate_to_human(category, summary): mock escalation tool that records a handoff.

Optional tools:

lookup_invoice(invoice_id)
create_ticket(subject, description, request_id)
reset_mfa(user_id) (only if you also add a human-in-the-loop (HITL) approval step or a strict safety policy)
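
As a starting point, the three required tools could be mocked and registered roughly as follows. This is a minimal sketch, not a reference solution: the in-memory KB and ACCOUNTS fixtures, the product area names, the "u_123" id format, and the handoff_id field are all assumptions made for illustration.

```python
from typing import Any, Callable, Optional

# Hypothetical fixtures; a real submission would load its own mock data.
KB = [
    {"area": "auth", "title": "Reset MFA",
     "body": "Use a backup code, then re-enroll your device."},
    {"area": "billing", "title": "Read your invoice",
     "body": "Invoices list seats, plan, and tax."},
]
ACCOUNTS = {"u_123": {"plan": "pro", "status": "active", "auth_flags": ["mfa_locked"]}}
HANDOFFS: list[dict[str, str]] = []


def search_kb(query: str, product_area: str) -> list[dict[str, str]]:
    """Read-only keyword search over the mock help center."""
    words = query.lower().split()
    return [
        article for article in KB
        if article["area"] == product_area
        and any(w in (article["title"] + " " + article["body"]).lower() for w in words)
    ]


def lookup_account(user_id: str) -> Optional[dict[str, Any]]:
    """Read-only lookup; returns None for unknown users."""
    return ACCOUNTS.get(user_id)


def escalate_to_human(category: str, summary: str) -> dict[str, str]:
    """Mock escalation: records the handoff in memory instead of paging anyone."""
    handoff = {
        "handoff_id": f"h_{len(HANDOFFS) + 1}",
        "category": category,
        "summary": summary,
    }
    HANDOFFS.append(handoff)
    return handoff


# Tool registry: the loop looks tools up by name, never by free-form code.
TOOLS: dict[str, Callable[..., Any]] = {
    "search_kb": search_kb,
    "lookup_account": lookup_account,
    "escalate_to_human": escalate_to_human,
}
```

Keeping the registry as a plain name-to-function mapping makes it easy to reject unknown tool names before anything executes.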

Required architecture

You may use a framework or write the loop manually, but your design must clearly show the ReAct pattern.

User query
   ↓
Agent loop
   - think
   - choose tool or answer
   - execute tool
   - observe result
   - update state
   - stop or repeat
   ↓
Final answer / clarification / escalation
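
The diagram above can be sketched as a hand-written loop. This is one possible shape under stated assumptions: the Decision and State classes, the decide callback (standing in for the model call), and the MAX_STEPS value of 6 are all hypothetical, and exception handling around the tool call is left out to keep the control flow visible.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

MAX_STEPS = 6  # assumed cap; pick a value that fits your tools


@dataclass
class Decision:
    """What the model chose this step: call a tool, or give a final answer."""
    kind: str  # "tool" or "final"
    tool: str = ""
    args: dict[str, Any] = field(default_factory=dict)
    text: str = ""


@dataclass
class State:
    query: str
    observations: list[dict[str, Any]] = field(default_factory=list)


def run_agent(query: str,
              decide: Callable[[State], Decision],
              tools: dict[str, Callable[..., Any]]) -> dict[str, Any]:
    state = State(query=query)
    for step in range(MAX_STEPS):
        decision = decide(state)  # think, then choose tool or answer
        if decision.kind == "final":  # explicit stop rule
            return {"answer": decision.text, "stop_reason": "final_answer",
                    "tool_calls": step}
        tool = tools.get(decision.tool)
        if tool is None:
            result = {"ok": False, "error": "unknown_tool"}
        else:
            result = tool(**decision.args)  # execute tool
        # observe result and update state
        state.observations.append({"tool": decision.tool, "result": result})
    # give-up rule: the step budget ran out without a final answer
    return {"answer": "I could not resolve this; handing off to a human.",
            "stop_reason": "max_steps", "tool_calls": MAX_STEPS}


def scripted_decide(state: State) -> Decision:
    """Stand-in for the model: look up the account once, then answer."""
    if not state.observations:
        return Decision(kind="tool", tool="lookup_account", args={"user_id": "u_123"})
    plan = state.observations[-1]["result"]["plan"]
    return Decision(kind="final", text=f"Your workspace is on the {plan} plan.")


result = run_agent(
    "what plan am I on?",
    scripted_decide,
    {"lookup_account": lambda user_id: {"ok": True, "plan": "pro"}},
)
```

Returning a stop_reason alongside the answer keeps the difference between stopping and giving up visible in the eval output.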

Minimum loop rules

MAX_STEPS cap required
tool arguments must be validated
tool errors must be structured, not raw exceptions
at least one explicit stop rule
at least one give-up rule
at least one human-in-the-loop or safe escalation policy

Cross-ref: explainer chapter 2 for loop mechanics, chapter 3 for schemas, chapter 4 for guardrails.
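
To illustrate the rule that tool errors must be structured, here is a small wrapper that turns exceptions into data the loop can reason about. The error codes, the retryable flag, and the mapping from exception types to codes are assumptions for this sketch; the assignment does not prescribe a particular error shape.

```python
from typing import Any, Callable


def tool_error(code: str, message: str, retryable: bool = False) -> dict[str, Any]:
    """Build an error payload the model can read, instead of a stack trace."""
    return {"ok": False,
            "error": {"code": code, "message": message, "retryable": retryable}}


def safe_call(tool: Callable[..., Any], args: dict[str, Any]) -> dict[str, Any]:
    """Run a tool and always return a dict with an 'ok' flag."""
    try:
        return {"ok": True, "data": tool(**args)}
    except TypeError as exc:
        # Usually wrong or missing keyword arguments in the tool call.
        return tool_error("invalid_arguments", str(exc))
    except KeyError as exc:
        return tool_error("not_found", f"no record for {exc}")
    except TimeoutError as exc:
        return tool_error("timeout", str(exc), retryable=True)
    except Exception as exc:
        # Last resort: report the exception type only, not internal details.
        return tool_error("internal_error", type(exc).__name__)


def lookup_account(user_id: str) -> dict[str, str]:
    """Hypothetical tool that raises KeyError for unknown users."""
    accounts = {"u_123": {"plan": "pro"}}
    return accounts[user_id]


found = safe_call(lookup_account, {"user_id": "u_123"})
missing = safe_call(lookup_account, {"user_id": "u_999"})
bad_args = safe_call(lookup_account, {"uid": "u_123"})
```

With this shape, the loop can branch on error.code and error.retryable rather than parsing exception text.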

Required deliverables

agent.py or equivalent: agent definition, tool registry, loop implementation
tools.py or equivalent: tool implementations, structured error returns, validation logic
evals/gold_queries.json: 30 gold queries minimum
evals/run.py: runs all queries, records traces, reports metrics
README.md: architecture diagram, tool descriptions, guardrails, eval results, top failure modes

Gold query mix

Minimum 30 queries:

10 should be answered from KB
8 should require account lookup
4 should require invoice or billing reasoning
4 should escalate
4 should refuse or ask for clarification

Do not make the eval set too easy. Include messy language. Include incomplete IDs. Include two-intent queries.
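
One way to keep the mix honest is to tag each gold query with a type and check the counts in code. The record fields (id, type, query, expected_tools) and the type labels below are assumptions for illustration; the three sample queries show messy language, an incomplete ID, and a two-intent request.

```python
import json
from collections import Counter

# Required counts per query type, taken from the mix above (labels are assumed).
REQUIRED_MIX = {
    "kb": 10,
    "account": 8,
    "billing": 4,
    "escalate": 4,
    "refuse_or_clarify": 4,
}

# A few hypothetical records in the shape evals/gold_queries.json might use.
sample = json.loads("""
[
  {"id": "q01", "type": "kb",
   "query": "how do i add ppl to my workspce??",
   "expected_tools": ["search_kb"]},
  {"id": "q02", "type": "account",
   "query": "cant log in, my id is u_12 something",
   "expected_tools": ["lookup_account"]},
  {"id": "q03", "type": "escalate",
   "query": "I was double charged AND locked out, fix it now",
   "expected_tools": ["lookup_account", "escalate_to_human"]}
]
""")


def mix_shortfalls(queries: list[dict]) -> dict[str, int]:
    """Return how many queries are still missing for each under-filled type."""
    counts = Counter(q["type"] for q in queries)
    return {
        qtype: need - counts[qtype]
        for qtype, need in REQUIRED_MIX.items()
        if counts[qtype] < need
    }


shortfalls = mix_shortfalls(sample)
```

Running a check like this at the start of evals/run.py catches an unbalanced eval set before any agent calls are made.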

What to evaluate

For each query, record at least these columns:

query type
expected tool sequence
actual tool sequence
argument validity
final answer correctness
escalation correctness
latency
notes

Core metrics

tool-selection accuracy
argument-validation pass rate
final-answer correctness
safe-escalation rate
refusal / clarification correctness
p50 / p95 latency
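
Several of these metrics can be computed directly from the per-query rows. The sketch below makes two assumptions worth stating: tool-selection accuracy is scored as an exact match between expected and actual tool sequences, and percentiles use the nearest-rank method. The row field names are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(rows: list[dict]) -> dict[str, float]:
    """Aggregate per-query eval rows into headline metrics."""
    if not rows:
        raise ValueError("no eval rows to summarize")
    n = len(rows)
    latencies = [r["latency_ms"] for r in rows]
    return {
        # Exact sequence match; partial-credit scoring is another valid choice.
        "tool_selection_accuracy":
            sum(r["expected_tools"] == r["actual_tools"] for r in rows) / n,
        "final_answer_correctness": sum(r["answer_correct"] for r in rows) / n,
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }


# Hypothetical rows, as evals/run.py might record them.
rows = [
    {"expected_tools": ["search_kb"], "actual_tools": ["search_kb"],
     "answer_correct": True, "latency_ms": 100},
    {"expected_tools": ["lookup_account"], "actual_tools": ["lookup_account"],
     "answer_correct": True, "latency_ms": 200},
    {"expected_tools": ["lookup_account"], "actual_tools": ["search_kb"],
     "answer_correct": False, "latency_ms": 300},
    {"expected_tools": ["escalate_to_human"], "actual_tools": ["escalate_to_human"],
     "answer_correct": False, "latency_ms": 400},
]

summary = summarize(rows)
```

Whichever scoring rule you choose for tool selection, state it in the README so the reported number can be interpreted.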

Success criteria

Tool-selection accuracy ≥ 85%
Final-answer correctness ≥ 75%
Safe-escalation rate ≥ 95%
No infinite loops
No duplicate side effects
Clear logs for every failed run

Required engineering choices

1. Use typed schemas

Use Pydantic or an equivalent validation layer. Do not accept free-form dict payloads everywhere.
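
If you prefer to avoid a dependency, standard-library dataclasses can serve as the "equivalent validation layer". The sketch below is one such approach, not the required one; the allowed product areas and the "u_<digits>" user id format are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Any

PRODUCT_AREAS = {"auth", "billing", "workspace", "status"}  # assumed taxonomy
USER_ID_RE = re.compile(r"u_\d+")  # assumed id format


@dataclass(frozen=True)
class SearchKbArgs:
    query: str
    product_area: str

    def __post_init__(self) -> None:
        if not isinstance(self.query, str) or not self.query.strip():
            raise ValueError("query must be a non-empty string")
        if self.product_area not in PRODUCT_AREAS:
            raise ValueError(f"product_area must be one of {sorted(PRODUCT_AREAS)}")


@dataclass(frozen=True)
class LookupAccountArgs:
    user_id: str

    def __post_init__(self) -> None:
        if not isinstance(self.user_id, str) or not USER_ID_RE.fullmatch(self.user_id):
            raise ValueError("user_id must look like 'u_<digits>'")


SCHEMAS = {"search_kb": SearchKbArgs, "lookup_account": LookupAccountArgs}


def parse_args(tool_name: str, raw: dict[str, Any]) -> Any:
    """Turn a raw model-produced dict into a typed, validated argument object."""
    schema = SCHEMAS.get(tool_name)
    if schema is None:
        raise ValueError(f"unknown tool: {tool_name}")
    try:
        return schema(**raw)
    except TypeError as exc:
        # Unexpected or missing field names surface as TypeError from the constructor.
        raise ValueError(f"bad arguments for {tool_name}: {exc}") from exc


args = parse_args("lookup_account", {"user_id": "u_42"})
```

Parsing at the boundary means tool implementations only ever receive typed objects, so validation failures show up as one predictable error type.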

2. Make write behavior safe

If you include any write tool, show one of these protections:

idempotency key
read-before-write policy
human approval gate
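
As an example of the first protection, the optional create_ticket tool can treat its request_id argument as an idempotency key, so a retried call returns the existing ticket instead of creating a second one. The in-memory store and the returned field names are assumptions; the sketch also assumes the loop generates request_id once per user request and reuses it on retries.

```python
from typing import Any

# Hypothetical in-memory store, keyed by idempotency key (request_id).
TICKETS: dict[str, dict[str, str]] = {}


def create_ticket(subject: str, description: str, request_id: str) -> dict[str, Any]:
    """Create a ticket at most once per request_id."""
    existing = TICKETS.get(request_id)
    if existing is not None:
        # Replay of an earlier call: return the same ticket, create nothing.
        return {"ok": True, "ticket": existing, "created": False}
    ticket = {
        "ticket_id": f"t_{len(TICKETS) + 1}",
        "subject": subject,
        "description": description,
    }
    TICKETS[request_id] = ticket
    return {"ok": True, "ticket": ticket, "created": True}


first = create_ticket("MFA lockout", "User cannot pass MFA.", request_id="req_001")
retry = create_ticket("MFA lockout", "User cannot pass MFA.", request_id="req_001")
```

The created flag gives the eval harness a direct way to assert that no duplicate side effect occurred.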

3. Log the trace

Keep a per-query trace with:

each model step
each tool call
arguments
tool result
stop reason

That trace will help you debug quickly.
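
A per-query trace with those fields might look like the following. The class and field names are hypothetical, and serializing each trace as one JSON line is just one convenient storage choice.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TraceStep:
    step: int
    kind: str  # "model" or "tool"
    name: str  # model action or tool name
    arguments: dict[str, Any]
    result: Any


@dataclass
class Trace:
    query_id: str
    steps: list[TraceStep] = field(default_factory=list)
    stop_reason: str = ""

    def record(self, kind: str, name: str, arguments: dict[str, Any], result: Any) -> None:
        """Append one model step or tool call, numbered in order."""
        self.steps.append(TraceStep(len(self.steps) + 1, kind, name, arguments, result))

    def to_json_line(self) -> str:
        """Serialize the whole trace as a single JSON line for a log file."""
        # default=str keeps logging from failing on non-JSON values.
        return json.dumps(asdict(self), default=str)


trace = Trace(query_id="q02")
trace.record("model", "choose_tool", {}, {"tool": "lookup_account"})
trace.record("tool", "lookup_account", {"user_id": "u_123"}, {"ok": True, "plan": "pro"})
trace.stop_reason = "final_answer"
line = trace.to_json_line()
```

Writing one line per query makes failed runs easy to filter by stop_reason or tool name.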

4. Show at least one failure honestly

In your README, include at least one failure where the loop made the wrong decision, and explain the fix you would try next.

Common pitfalls

Too many vague tools
No MAX_STEPS cap
Tool descriptions that only say what, not when
Raw exceptions bubbling into the loop
Hidden writes with no safeguards
Eval set with only easy cases
No distinction between stop and give-up behavior

Stretch goals

These are optional, but useful if you want a stronger bridge to module 10:

dynamic tool subset selection
parallel read-only tool calls
structured working state object
lightweight memory between turns
tiny cost tracker per query

Why this assignment matters

Customer support remains one of the clearest production agent use cases. If you ship this well, you show more than prompt skill. You show workflow engineering.

And that is exactly the bridge to module 10, where the same pattern scales into MCP and multi-agent systems.