
05. Assignment 4 — Customer Support Agent with Tool Calling

Week 9. ReAct loop. Tool schemas. Guardrails. Evaluation discipline.

Goal

Build a customer support agent for a hypothetical SaaS product. The agent should answer FAQ-style questions, look up account information, and escalate safely when needed.

This assignment is not about building the fanciest chatbot. It is about showing that you understand a production-shaped agent loop: bounded, validated, logged, and evaluated.

Scenario

Your SaaS product has these common issues:

  • login and MFA problems
  • billing confusion
  • workspace setup questions
  • outage/status checks
  • requests that should be escalated to humans

Your support agent should behave sensibly across all five.

Required tools (3+)

  1. search_kb(query, product_area)
     Read-only search over a mock help center or FAQ set.
  2. lookup_account(user_id)
     Read-only lookup for plan, account status, and recent auth flags.
  3. escalate_to_human(category, summary)
     Mock escalation tool that records a handoff.

Optional tools:

  1. lookup_invoice(invoice_id)
  2. create_ticket(subject, description, request_id)
  3. reset_mfa(user_id) (only if you also add a human-in-the-loop approval step or a strict safety policy)
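
The three required tools can be sketched as plain functions plus a registry. This is a minimal sketch, not a reference solution: the mock data, the return shapes (`found`, `handoff_id`), and the `TOOL_REGISTRY` layout are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical mock data; a real submission would load its own FAQ set and accounts.
KB: Dict[str, str] = {
    "auth": "Reset your password from the sign-in page, then re-enroll MFA.",
    "billing": "Invoices are issued at the start of each billing cycle.",
}
ACCOUNTS: Dict[str, Dict[str, Any]] = {
    "u_123": {"plan": "pro", "status": "active", "auth_flags": ["mfa_locked"]},
}
HANDOFFS: List[Dict[str, str]] = []


def search_kb(query: str, product_area: str) -> Dict[str, Any]:
    """Read-only search. This mock ignores the query text and matches on area only."""
    article = KB.get(product_area)
    return {"found": article is not None, "article": article}


def lookup_account(user_id: str) -> Dict[str, Any]:
    """Read-only lookup for plan, account status, and recent auth flags."""
    account = ACCOUNTS.get(user_id)
    return {"found": account is not None, "account": account}


def escalate_to_human(category: str, summary: str) -> Dict[str, Any]:
    """Mock escalation: records a handoff in memory instead of paging anyone."""
    HANDOFFS.append({"category": category, "summary": summary})
    return {"handoff_id": len(HANDOFFS)}


# Each description says what the tool does and when the agent should pick it.
TOOL_REGISTRY: Dict[str, Dict[str, Any]] = {
    "search_kb": {
        "fn": search_kb,
        "read_only": True,
        "description": "Search help articles. Use for how-to and FAQ questions.",
    },
    "lookup_account": {
        "fn": lookup_account,
        "read_only": True,
        "description": "Fetch plan, status, auth flags. Use when a user_id is known.",
    },
    "escalate_to_human": {
        "fn": escalate_to_human,
        "read_only": False,
        "description": "Record a handoff. Use when policy or confidence requires a human.",
    },
}
```

Marking each tool as read-only or not in the registry makes it easy to apply extra safeguards to the write path later.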

Required architecture

You may use a framework or write the loop manually, but your design must clearly show the ReAct pattern.

User query
Agent loop
   - think
   - choose tool or answer
   - execute tool
   - observe result
   - update state
   - stop or repeat
Final answer / clarification / escalation
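
The loop above can be sketched in a few lines. This is one possible shape, assuming a hypothetical `policy` callable that stands in for the model call and returns either a tool action or a final answer; the action dict format and the `stop_reason` values are illustrative, not prescribed by the assignment.

```python
from typing import Any, Callable, Dict

MAX_STEPS = 5  # hard cap so the loop cannot run forever

Action = Dict[str, Any]
State = Dict[str, Any]


def run_agent(
    query: str,
    policy: Callable[[State], Action],
    tools: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    state: State = {"query": query, "observations": []}
    for step in range(MAX_STEPS):
        action = policy(state)  # think, then choose a tool or answer
        if action["type"] == "answer":  # explicit stop rule
            return {"stop_reason": "answered", "answer": action["text"], "steps": step + 1}
        tool = tools.get(action["tool"])
        if tool is None:
            result = {"ok": False, "error": "unknown_tool"}
        else:
            result = tool(**action["args"])  # execute tool
        # observe the result and update state before the next iteration
        state["observations"].append({"tool": action["tool"], "result": result})
    # give-up rule: the cap was hit without a final answer
    return {"stop_reason": "max_steps", "answer": None, "steps": MAX_STEPS}


def scripted_policy(state: State) -> Action:
    """Stand-in for the model: search once, then answer from the observation."""
    if not state["observations"]:
        return {"type": "tool", "tool": "search_kb", "args": {"query": state["query"]}}
    return {"type": "answer", "text": state["observations"][-1]["result"]["article"]}


def looping_policy(state: State) -> Action:
    """A misbehaving policy that never answers; the cap must stop it."""
    return {"type": "tool", "tool": "search_kb", "args": {"query": state["query"]}}


def fake_search_kb(query: str) -> Dict[str, Any]:
    return {"ok": True, "article": "Re-enroll MFA from the security page."}


TOOLS = {"search_kb": fake_search_kb}
good = run_agent("mfa broken", scripted_policy, TOOLS)
stuck = run_agent("mfa broken", looping_policy, TOOLS)
```

Here `good` stops with `stop_reason` set to `"answered"` after two steps, while `stuck` is cut off by the cap and reports `"max_steps"`. Keeping the two outcomes distinct is what lets the eval tell a stop from a give-up.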

Minimum loop rules

  • MAX_STEPS cap required
  • tool arguments must be validated
  • tool errors must be structured, not raw exceptions
  • at least one explicit stop rule
  • at least one give-up rule
  • at least one human-in-the-loop or safe escalation policy
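
Two of these rules, structured tool errors and a give-up rule, can be sketched as follows. The error codes, the `retryable` flag, and the helper names `safe_call` and `should_give_up` are assumptions for illustration; any consistent error shape would satisfy the requirement.

```python
from typing import Any, Callable, Dict, List

ACCOUNTS = {"u_123": {"plan": "pro", "status": "active"}}  # hypothetical mock data


def lookup_account(user_id: str) -> Dict[str, Any]:
    if user_id not in ACCOUNTS:
        raise LookupError(f"no account for {user_id}")
    return ACCOUNTS[user_id]


def error(code: str, message: str, retryable: bool) -> Dict[str, Any]:
    return {"ok": False, "error": {"code": code, "message": message, "retryable": retryable}}


def safe_call(tool: Callable[..., Any], args: Dict[str, Any]) -> Dict[str, Any]:
    """Run a tool and convert any exception into a structured result."""
    try:
        return {"ok": True, "result": tool(**args)}
    except TypeError as exc:  # wrong or missing keyword arguments
        return error("invalid_arguments", str(exc), retryable=False)
    except LookupError as exc:  # the record does not exist
        return error("not_found", str(exc), retryable=False)
    except Exception as exc:  # anything else: report it, do not crash the loop
        return error("tool_failure", str(exc), retryable=True)


def should_give_up(results: List[Dict[str, Any]], limit: int = 2) -> bool:
    """Give-up rule: stop after `limit` consecutive failed tool calls."""
    recent = results[-limit:]
    return len(recent) == limit and all(not r["ok"] for r in recent)


ok = safe_call(lookup_account, {"user_id": "u_123"})
missing = safe_call(lookup_account, {"user_id": "u_999"})
bad_args = safe_call(lookup_account, {"uid": "u_123"})
```

Because every outcome is a dict with an `ok` field, the loop can feed failures back to the model as observations instead of unwinding the stack.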

Cross-ref:

  • explainer chapter 2 for loop mechanics
  • explainer chapter 3 for schemas
  • explainer chapter 4 for guardrails

Required deliverables

  1. agent.py or equivalent
     • agent definition
     • tool registry
     • loop implementation
  2. tools.py or equivalent
     • tool implementations
     • structured error returns
     • validation logic
  3. evals/gold_queries.json
     • 30 gold queries minimum
  4. evals/run.py
     • runs all queries, records traces, reports metrics
  5. README.md
     • architecture diagram
     • tool descriptions
     • guardrails
     • eval results
     • top failure modes

Gold query mix

Minimum 30 queries:

  • 10 should be answered from KB
  • 8 should require account lookup
  • 4 should require invoice or billing reasoning
  • 4 should escalate
  • 4 should refuse or ask for clarification

Do not make the eval set too easy. Include messy language. Include incomplete IDs. Include two-intent queries.
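
A small check can confirm that a gold set meets this mix before any agent runs. The sketch below treats the counts as minimums and assumes each query carries a hypothetical `type` label; the label names and the example entries are illustrative only.

```python
from collections import Counter
from typing import Dict, List

# Assumed type labels; the counts mirror the mix listed above.
REQUIRED_MIX = {
    "kb": 10,
    "account": 8,
    "billing": 4,
    "escalate": 4,
    "refuse_or_clarify": 4,
}


def missing_by_type(queries: List[Dict[str, object]]) -> Dict[str, int]:
    """Return how many queries each type is short by; empty dict means the mix is met."""
    counts = Counter(q["type"] for q in queries)
    return {
        qtype: need - counts[qtype]
        for qtype, need in REQUIRED_MIX.items()
        if counts[qtype] < need
    }


# Hypothetical entries showing messy language, an incomplete ID, and two intents.
EXAMPLES = [
    {"type": "kb", "query": "cant get in, code thing keeps failing??"},
    {"type": "account", "query": "what plan am i on, id is u_12 something"},
    {"type": "escalate", "query": "I was double charged AND my workspace is gone"},
]

full_set = [{"type": t, "query": f"{t} query {i}"} for t, n in REQUIRED_MIX.items() for i in range(n)]
```

Calling `missing_by_type(EXAMPLES)` reports the shortfall per type, which is a quick way to see which categories still need queries.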

What to evaluate

For each query, record at least these columns:

  • query type
  • expected tool sequence
  • actual tool sequence
  • argument validity
  • final answer correctness
  • escalation correctness
  • latency
  • notes

Core metrics

  • tool-selection accuracy
  • argument-validation pass rate
  • final-answer correctness
  • safe-escalation rate
  • refusal / clarification correctness
  • p50 / p95 latency
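
Several of these metrics reduce to simple aggregates over the per-query rows. The sketch below assumes hypothetical row fields (`expected_tools`, `actual_tools`, `answer_correct`, `latency_ms`), scores tool selection by exact sequence match, and uses a nearest-rank percentile; each of those is a design choice, not a requirement.

```python
import math
from typing import Any, Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; adequate for a small eval set."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    n = len(rows)
    latencies = [r["latency_ms"] for r in rows]
    return {
        # exact match on the whole tool sequence (an assumption; partial credit is possible)
        "tool_selection_accuracy": sum(r["expected_tools"] == r["actual_tools"] for r in rows) / n,
        "final_answer_correctness": sum(bool(r["answer_correct"]) for r in rows) / n,
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }


# Made-up rows, only to exercise the function.
ROWS = [
    {"expected_tools": ["search_kb"], "actual_tools": ["search_kb"], "answer_correct": True, "latency_ms": 100},
    {"expected_tools": ["lookup_account"], "actual_tools": ["lookup_account"], "answer_correct": True, "latency_ms": 200},
    {"expected_tools": ["escalate_to_human"], "actual_tools": ["search_kb"], "answer_correct": False, "latency_ms": 300},
    {"expected_tools": [], "actual_tools": [], "answer_correct": True, "latency_ms": 400},
]
report = summarize(ROWS)
```

With only 30 queries, p95 is effectively one of the two slowest runs, so it is worth reading alongside the trace for that query.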

Success criteria

  • Tool selection accuracy ≥ 85%
  • Final answer correctness ≥ 75%
  • Safe escalation behavior ≥ 95%
  • No infinite loops
  • No duplicate side effects
  • Clear logs for every failed run

Required engineering choices

1. Use typed schemas

Use Pydantic or an equivalent validation layer. Do not accept free-form dict payloads everywhere.
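
As one example of an equivalent validation layer, the sketch below uses standard-library dataclasses so it runs without extra dependencies; with Pydantic the same constraints would be declared on model fields instead. The allowed product areas and the `u_<digits>` ID format are assumptions, as is the `parse_args` helper.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

ALLOWED_AREAS = {"auth", "billing", "workspace", "status"}  # assumed product areas


@dataclass(frozen=True)
class SearchKbArgs:
    query: str
    product_area: str

    def __post_init__(self) -> None:
        if not isinstance(self.query, str) or not self.query.strip():
            raise ValueError("query must be a non-empty string")
        if self.product_area not in ALLOWED_AREAS:
            raise ValueError(f"product_area must be one of {sorted(ALLOWED_AREAS)}")


@dataclass(frozen=True)
class LookupAccountArgs:
    user_id: str

    def __post_init__(self) -> None:
        # Assumed ID format; rejects incomplete or malformed IDs early.
        if not isinstance(self.user_id, str) or not re.fullmatch(r"u_\d+", self.user_id):
            raise ValueError("user_id must look like u_<digits>")


SCHEMAS = {"search_kb": SearchKbArgs, "lookup_account": LookupAccountArgs}


def parse_args(tool_name: str, payload: Dict[str, Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Return (typed_args, None) on success or (None, reason) on failure."""
    schema = SCHEMAS.get(tool_name)
    if schema is None:
        return None, f"unknown tool: {tool_name}"
    try:
        return schema(**payload), None
    except (TypeError, ValueError) as exc:  # unexpected fields or failed checks
        return None, str(exc)


valid, _ = parse_args("lookup_account", {"user_id": "u_42"})
invalid, reason = parse_args("lookup_account", {"user_id": "42"})
```

Validating before the tool runs also gives the argument-validation pass rate a clear definition: the share of tool calls where `parse_args` succeeds.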

2. Make write behavior safe

If you include any write tool, show one of these protections:

  • idempotency key
  • read-before-write policy
  • human approval gate

3. Log the trace

Keep a per-query trace with:

  • each model step
  • each tool call
  • arguments
  • tool result
  • stop reason

That trace will help you debug quickly.
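
One way to hold such a trace is a small record that serializes to JSON, one per query. The class and field names below are assumptions for illustration; the only requirement taken from the list above is that model steps, tool calls, arguments, results, and the stop reason are all captured.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Trace:
    query: str
    steps: List[Dict[str, Any]] = field(default_factory=list)
    stop_reason: str = ""

    def log_model_step(self, thought: str) -> None:
        self.steps.append({"kind": "model", "thought": thought})

    def log_tool_call(self, tool: str, args: Dict[str, Any], result: Dict[str, Any]) -> None:
        self.steps.append({"kind": "tool", "tool": tool, "args": args, "result": result})

    def finish(self, stop_reason: str) -> None:
        self.stop_reason = stop_reason

    def to_json(self) -> str:
        return json.dumps(asdict(self))


trace = Trace(query="why was I charged twice?")
trace.log_model_step("Billing question; look up the account first.")
trace.log_tool_call("lookup_account", {"user_id": "u_123"}, {"ok": True, "plan": "pro"})
trace.finish("escalated")
line = trace.to_json()
```

Writing one JSON line per query keeps failed runs easy to filter by `stop_reason` when reviewing eval results.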

4. Show at least one failure honestly

In your README, include one failure where the loop was wrong, and explain the fix you would try next.

Common pitfalls

  • Too many vague tools
  • No MAX_STEPS cap
  • Tool descriptions that only say what, not when
  • Raw exceptions bubbling into the loop
  • Hidden writes with no safeguards
  • Eval set with only easy cases
  • No distinction between stop and give-up behavior

Stretch goals

These are optional, but useful if you want a stronger bridge to module 10:

  • dynamic tool subset selection
  • parallel read-only tool calls
  • structured working state object
  • lightweight memory between turns
  • tiny cost tracker per query

Why this assignment matters

Customer support remains one of the clearest production agent use cases. If you ship this well, you show more than prompt skill. You show workflow engineering.

And that is exactly the bridge to module 10, where the same pattern scales into MCP and multi-agent systems.