08. Tool-layer bugs — where physical reality bites the agent¶

~14 min read. The second suspect in the lineup. Tools touch real APIs, real money, real users. When they lie quietly, the agent believes the lie.

Built on the ELI5 in 00-eli5.md. The suspects — five layers that could have caused the crime — face their second interrogation. Prompt walked free. Time to question the tool.

The picture before the diagnostics¶

The prompt only writes words, but the tool moves money, sends emails, and deletes rows. That is why a tool's quiet lie costs more than any other layer's. When the suspect is the tool, the damage has already touched a real system by the time you read the trace.

   prompt-side schema             tool-side reality
   (what the agent sees)          (what actually runs)
   ┌────────────────┐             ┌─────────────────────┐
   │  refund(       │   ① reads   │  refund(            │
   │   order_id:str │  ② picks    │   order_id:str      │
   │   amount:float)│  ③ validate │   amount:int cents) │◀ drift
   └────────┬───────┘  ④ runs     └──────────┬──────────┘
            │          ⑤ response back        │
            │          ⑥ agent reads response │
            ▼                                 ▼
       ┌─────────────────────────────────────────┐
       │ gap between these two sides = bug class │
       └─────────────────────────────────────────┘

The case file for a tool bug is this diagram, filled in for one trace. Each bug below strikes a different arrow.

The seven tool-bug patterns¶

Walk the lineup. Mark each suspect cleared or guilty.

1. Schema drift — arrow ①¶

Tool signature changed on the server. Prompt-side declaration did not. Agent sends old args. Tool ignores some, or 400s.

Signature. Recent tool deploy, mysterious 400s, weird defaults.
Test. Diff live spec (OpenAPI/MCP) vs prompt schema. Any name, type, required-flag mismatch is the confession.
Fix. Generate prompt schema from tool spec at build time. One source of truth.

2. Silent argument coercion — arrow ③¶

Agent emits "42". Pydantic coerces to 42. Or "2024-01-01" becomes a UTC datetime — Sydney users get off-by-one day.

Signature. Agent self-corrects later — "I notice the value differs."
Test. Log raw JSON model emitted AND parsed args. Any non-identity transform is suspect.
Fix. strict=True on Pydantic at tool boundary. Reject, don't coerce.

3. Hallucinated arguments — arrow ②¶

Model wrote customer_id="C-99999". Customer never existed. Schema said str — anything passed. Tool returned empty. Agent treated empty as "new customer" and made a duplicate. Most common production tool bug.

Signature. IDs look plausible but don't exist. First appearance of the ID is inside tool args, not in any earlier context.
Test. Did agent see this ID earlier in the trace, or invent it now? First-appearance-in-args = invented.
Fix. Two locks. Tighten schema: pattern ^C-[0-9]{6}$. And tool returns explicit not_found, not empty list.

4. Errors masked as success — arrow ⑤¶

return {"status": "ok", "result": null, "warning": "no match"}

HTTP 200. Agent reads green light, uses null as a value.

Signature. Agent confidently reports empty/zero/None. Downstream acts on missing value.
Test. Grep responses for truthy-status + falsy-payload combos.
Fix. Tools fail loudly — HTTP 4xx or raise. Never wrap errors in success envelopes.

5. Partial tool responses — arrow ⑤¶

Tool should return 12 fields, returned 7. Join dropped rows, or downstream degraded. Agent found enough to proceed.

Signature. Final answer has "unknown" in fields the user knows exist.
Test. Compare returned field set vs schema's expected set.
Fix. Return all expected fields or fail. Absent fields are explicit nulls with a reason — never omitted.

6. Idempotency violations — arrow ④¶

Tool timed out at 30s. Retry kicked in. Server processed both. Refund issued twice. Complaint slip filed by finance.

attempt 1: POST /refund {O1}  ──→ server processing
              ↓ 30s timeout
attempt 2: POST /refund {O1}  ──→ processes again — double charge

Signature. Duplicate writes with retry_count > 0 on a mutating span.
Test. For each retried mutation, check if first attempt completed server-side.
Fix. Mutating tools accept idempotency_key. Agent generates one per logical action, reuses on retry. Server dedupes. Non-negotiable for money.

7. Tool-side bugs — arrow ④ (downstream)¶

Downstream API deprecated v1. Auth token expired. Rate limit tightened. Agent did right. Tool layer broke under it.

Signature. Error spike across users, models, prompts. One tool only. Timestamp matches external change.
Test. Compare this week's tool error rate to last week's, by code. Check vendor status page.
Fix. Lives in the tool — circuit breakers, token refresh, vendor health monitoring. Detect fast, don't self-blame.

Compare-spec-vs-prompt diagnostic¶

Highest-yield tool-layer check. Do it first.

┌──────────────────────┐       ┌──────────────────────┐
│ tool server's spec   │ diff  │ schema in prompt     │
│ (OpenAPI/MCP/live)   │ ────→ │ (what model was told)│
└──────────┬───────────┘       └──────────┬───────────┘
           └────────────┬─────────────────┘
                        ▼
       mismatch in names, types, required-flag,
              enums, descriptions = bug

Automate it. Run in CI. Fail the build on any drift.

Worked example — refund-on-wrong-order¶

User complained: wrong order refunded. Complaint slip opened. Pull the case file.

tr_refund_881
├── tool.find_customer({"email":"a@x.com"})
│      → { id:"C-100234", orders:["O-9001","O-9002","O-9003"] }
├── tool.get_order({"order_id":"O-9091"})     ◀ ① 9091, not 9001
│      → { id:"O-9091", amount:200, status:"shipped" }
└── tool.issue_refund({"order_id":"O-9091"})
       → { status:"ok", refund_id:"R-77" }

Witness note ①. Customer's orders were O-9001, O-9002, O-9003. Agent called get_order with O-9091. Where from? Nowhere in this trace. Model transposed digits and hallucinated.

Schema check. order_id was plain str. No pattern. No server-side authorization. O-9091 happened to be a real order belonging to another customer. Refund went through. Confession — pattern 3 (hallucinated args) compounded with pattern 7 (no tool-side authorization).

Two locks. First, tighten schema: pattern: "^O-[0-9]{4}$" plus server-side check that the order belongs to the authenticated customer. Second, regression eval — given that customer state, assert every get_order uses an ID from the prior find_customer response. Add to the lock set so the same crime cannot return.

The tool-failure injector¶

Production crashes teach late. Inject failures in staging.

┌──────────────────────────────────────────────────┐
│       tool-failure injector (test harness)        │
├──────────────────────────────────────────────────┤
│  for each tool, replay agent eval set with:       │
│   • null result, status=ok                        │
│   • half the expected fields                      │
│   • a 500                                         │
│   • 200 after 31s (timeout then late success)     │
│   • same response twice (duplicate)               │
│   • coerced types (string for int)                │
│  score: did the agent notice? did it recover?     │
└──────────────────────────────────────────────────┘

Below 90% catch rate is a missing lock.

Tool-layer bug patterns across agent frameworks¶

Stripe API integrations — refund agents rely on idempotency keys; without them, a 30s timeout retry double-charges. Stripe's docs require idempotency keys on every POST for this reason.
Stripe agent toolkit — typed money tools (amount in integer cents, currency code as enum) push the coercion fight to schema-build time rather than runtime; the role is "make the unit ambiguity impossible to express."
GitHub Copilot Workspace — when MCP read_file gained an eof_truncated flag, older prompt-side declarations ignored it and Copilot agents silently truncated diffs in generated patches.
GitHub Copilot extensions — tool manifests are versioned per extension; a host upgrade without manifest refresh is the canonical schema-drift trigger Copilot's docs warn about.
Anthropic tool-use API — input_schema is strict JSON-schema, but models still emit stringified numbers occasionally; SDK coerces unless raw input is validated before execution. tool_use blocks are validated server-side, so a malformed call surfaces as an error block the agent can read and self-correct on.
OpenAI function-calling strict mode — strict: true on function definitions makes the model emit exactly the declared schema; the role is to close arrow ② (hallucinated arguments) at the decoding layer rather than at the tool boundary.
LangChain @tool decorator — wraps a Python function as a tool whose schema comes from type hints and the docstring; the role is fast iteration, the failure mode is silent schema/docstring drift when the function signature is edited without re-deploying the prompt.
LangChain StructuredTool — default Pydantic coercion has silently turned ISO date strings into UTC datetimes, shifting user inputs by a day across timezones in production agents. Pattern 2 in the wild.
LangGraph ToolNode — central tool executor that returns ToolMessage envelopes; when a tool raises, the envelope's status="error" is the only signal the loop sees, so a tool that swallows errors and returns status="ok" (pattern 4) defeats the whole graph.
Pydantic AI typed tools — RunContext-aware tools with strict validators reject coerced args at the boundary, making pattern 2 a build-time error rather than a production surprise.
BAML schemas — schema-first language whose generated client refuses to compile if the prompt schema and tool schema diverge; the role is to make pattern 1 (schema drift) impossible by construction.
Vercel AI SDK tool(...) — Zod-validated tools where invalid args surface as InvalidToolArgumentsError; the SDK forwards the error back to the model on the next turn so the agent can self-correct instead of crashing the loop.
MCP servers — when servers add optional args, older clients keep working but newer prompts include the field; cached client schema vs live server schema mismatch is a daily debugging pattern in Claude Desktop and Cursor.
Cursor's tool-error retry loops — Cursor's agent feeds tool errors back to the model with a bounded retry budget; the role is to make pattern 4 (errors masked as success) self-correcting only when errors are emitted loudly.
AWS Bedrock Converse API — toolConfig validates tool input against the declared schema before invocation; mismatches return a validationException that the agent must handle rather than a silently coerced call.
Salesforce Einstein Actions — actions are declared with input/output types in metadata; the role of the runtime is to refuse a call whose args do not match the declared types, surfacing pattern 1 and pattern 3 immediately.
Microsoft Copilot Studio plugins — manifest-driven plugin schemas with required-flag enforcement; a flipped required field across versions is the most reported schema-drift bug in the Copilot Studio forums.

Recall — name the tool bug from the trace¶

Difference between schema drift and silent argument coercion?
Why is status: "ok", result: null more dangerous than HTTP 500?
In the worked example, which two tool-bug patterns combined?
What does the tool-failure injector test that production tracing alone cannot?

Interview Q&A¶

Q: A retry doubled a charge. Where is the bug — agent, tool, or both? A: Both, but the fix lives in the tool. The agent's retry policy is fine; timeouts are real. The tool must dedupe by an idempotency_key the agent passes per logical action. Without that, any retry — agent, framework, load balancer — risks duplicate writes.

Common wrong answer to avoid: "Disable retries on mutating tools" — that trades duplicate writes for lost writes when timeouts are transient. Idempotency keys give you both safety and resilience.

Q: Hallucinated customer ID. Tool returns empty list. Agent proceeds. Where is the design failure? A: Two places. The tool's contract is ambiguous — empty list could mean "no records" or "bad ID." It should return explicit not_found. And the schema is too loose — if IDs follow a pattern, enforce via regex so invented IDs fail at validation, not semantics.

Common wrong answer to avoid: "Add a retry with a different ID" — retries don't fix invented data. The agent has no ground truth to correct toward. You need stricter validation and clearer error signals.

Q: Why isn't Pydantic's coercion a feature for tool calls? A: Coercion hides the gap between what the model said and what the tool ran. Convenient for humans, lossy for an LLM that may emit subtly wrong values. At tool boundaries you want strict validation so model mistakes surface as errors the agent can see and correct.

Common wrong answer to avoid: "Pydantic coercion follows the JSON spec" — it doesn't. JSON has distinct types; Pydantic adds string-to-number and date-string-to-datetime conversions that are lossy for audit.

Q: Schema drift vs model hallucinating a field name — how to tell? A: Compare live tool spec to prompt-side schema. If prompt says customer_id but tool now expects customerId, that's drift — your fault for not regenerating. If both say customer_id but the model emitted customer_identifier, that's hallucination. Different fixes: regenerate vs stricter validator.

Common wrong answer to avoid: "Look at the error message" — both produce similar validation errors. The diff between live spec and prompt schema is what separates them.

Apply now (10 min)¶

Step 1 — model the exercise. Take the refund-on-wrong-order trace from earlier in this chapter. Walk it through the seven-pattern lineup in order: pattern 1 (no recent deploy, schema unchanged — cleared), pattern 2 (raw args matched parsed args — cleared), pattern 3 (O-9091 never appeared in any earlier span — guilty), pattern 4 (response was a real success, not a masked error — cleared), pattern 5 (all expected fields returned — cleared), pattern 6 (no retry — cleared), pattern 7 (tool authorized the call because it had no tenant check — guilty). The confession is pattern 3 compounded by pattern 7. The two locks are a regex schema and a server-side ownership check.

Step 2 — your turn. Take any tool in your own agent. Print its live OpenAPI or MCP spec and the schema embedded in the prompt. Diff them — every mismatch in name, type, required-flag, enum, or description counts. Then pick one mutating tool: does it accept an idempotency key? If not, that is your highest-priority fix this week. Walk the same seven-pattern lineup on the most recent complaint slip your team filed against this tool.

Step 3 — reproduce from memory. Without scrolling up, draw the prompt → schema → tool call → response → next prompt flow with the six numbered arrows. Label which arrow each of the seven bug patterns strikes. Mark the compare-spec-vs-prompt diagnostic on arrow ①. Then connect this chapter to chapter 06's elimination order in one sentence: when the prompt lineup clears the prompt, the tool is the next suspect because its lies cost the most.

What you should remember¶

This chapter walked the second suspect in the agent debugging lineup — the tool layer, where words turn into refunds, emails, and deleted rows. You learned that every tool bug strikes one of six arrows between the prompt-side schema and the tool-side reality: schema drift on arrow ①, hallucinated arguments on arrow ②, silent coercion on arrow ③, idempotency violations on arrow ④, errors masked as success or partial responses on arrow ⑤, and tool-side downstream breakage on arrow ④ again. The case file for a tool bug is that diagram filled in for one trace.

The seven patterns are not seven unrelated bugs; they are the seven ways the contract between agent and tool can lie. The fix is almost always to make the lie loud — strict validation at the boundary, explicit not_found instead of empty lists, all expected fields or a hard failure, idempotency keys on every mutating POST. The diagnostic that catches the most bugs cheapest is the spec-vs-prompt diff run in CI, because it forces the contract to stay honest while the build still has a chance to fail.

Carry this diagnostic forward. When a complaint slip says "wrong refund, wrong email, wrong row deleted," open the case file and ask which of the six arrows the lie travelled along before asking which model produced it. The model almost never deletes the wrong row; the tool with a loose schema and a quiet failure mode does.

Remember:

The tool is the second suspect after the prompt, and its lies cost real money — interrogate it before climbing to the loop, memory, or model.
Six arrows, seven patterns: schema drift, hallucinated args, silent coercion, idempotency, masked errors, partial responses, downstream breakage. Each strikes a known arrow in the case file.
The highest-yield confession test is the live spec vs prompt schema diff — automate it in CI so drift fails the build, not the user.
Mutating tools without an idempotency key are a lock waiting to be set; treat the absence as a production incident scheduled for whenever the next 30s timeout hits.
Make every tool failure loud: explicit not_found, explicit null with reason, HTTP 4xx on errors, all expected fields returned or none. Silent success is the most expensive lie a tool can tell.

Bridge. The tool cleared the lineup — or confessed, and we set a lock. Either way the next suspect is the loop itself. The agent that calls tools without progress. That stops too early. That oscillates between two tools forever. → 09-loop-layer-bugs.md