03. Contracts between model and world — Schemas that type the action, descriptions that route the intent

~20 min read. A tool contract has two halves: the schema says what shape of argument the world accepts; the description says when the model should reach for the tool at all. One half without the other fails — tight schema with vague description means the right arguments fly at the wrong tool; sharp description with loose schema means the right tool receives garbage. Together they form the typed contract between model and world.

The heartbeat fires — but at what?

The ReAct loop is running. The heartbeat ticks. On each beat, the model must answer two questions in order:

1. Which tool? — a routing decision driven by names and descriptions.
2. What arguments? — a generation decision constrained by schemas.

Get (1) wrong, and perfect arguments go to the wrong endpoint. Get (2) wrong, and the right tool receives malformed input. Both questions live in the same token stream the model reads. Both are answered before a single byte crosses the network.

The heartbeat is only as good as the tools it can fire. This file makes the tools legible.

Two schemas, two descriptions, four outcomes

Watch the same user intent land in four different realities.

Loose schema, vague description.

{
  "name": "refund",
  "description": "Process a refund.",
  "parameters": {
    "type": "object",
    "properties": { "details": {"type": "string"} }
  }
}

The model invents: {"details": "Refund order 4481 for ₹1250, delay, notify"}. Your tool must parse a sentence. The model may not even pick this tool — two other tools say "process" in their descriptions. Routing is a coin flip. Execution is string parsing at runtime. Double failure.

Tight schema, vague description.

{
  "name": "issue_refund",
  "description": "Issue a refund.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {"type": "string", "pattern": "^[0-9]{4,10}$"},
      "amount_cents": {"type": "integer", "minimum": 1, "maximum": 20000000},
      "currency": {"type": "string", "enum": ["INR","USD","EUR"]},
      "reason": {"type": "string", "enum": ["delay","duplicate","courtesy"]}
    },
    "required": ["order_id","amount_cents","currency","reason"],
    "additionalProperties": false
  }
}

Arguments are perfect — if the model picks this tool. But "Issue a refund" is barely distinguishable from the neighbouring cancel_order, whose description is the equally terse "Process a cancellation." Routing still fails on ambiguous prompts.

Loose schema, sharp description.

{
  "name": "issue_refund",
  "description": "Refund money to a customer after policy_check approves. Use when the customer asks for money back on a delivered or late order. Do not use for subscription cancellations (use cancel_subscription). Returns refund_id and status.",
  "parameters": {
    "type": "object",
    "properties": { "details": {"type": "string"} }
  }
}

Routing works — the model now knows when to reach. But the call arrives as {"details": "order 4481, 1250 rupees, delay"}. The tool must parse a blob. Schema gave the model nowhere to be precise.

Tight schema, sharp description — the full contract.

{
  "name": "issue_refund",
  "description": "Refund money to a customer after policy_check approves. Use when the customer asks for money back on a delivered or late order. Do not use for subscription cancellations (use cancel_subscription). Returns refund_id, amount_cents, status.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id":     {"type": "string", "pattern": "^[0-9]{4,10}$"},
      "amount_cents": {"type": "integer", "minimum": 1, "maximum": 20000000},
      "currency":     {"type": "string", "enum": ["INR","USD","EUR"]},
      "reason":       {"type": "string", "enum": ["delay","duplicate","courtesy"]},
      "notify_customer": {"type": "boolean"}
    },
    "required": ["order_id","amount_cents","currency","reason","notify_customer"],
    "additionalProperties": false
  }
}

Routing is precise. Arguments are constrained. The worst valid call is a correctly-shaped refund that business logic can still gate on policy. That is the whole goal: push the failure surface from model imagination to business policy, where it belongs.
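To make the split between shape and policy concrete, here is a minimal sketch of the business gate that runs after schema validation has already passed. The names `APPROVED_ORDERS`, `RefundResult`, and `handle_issue_refund` are hypothetical and stand in for whatever your policy_check step actually records.

```python
from dataclasses import dataclass

# Hypothetical policy state: orders that policy_check has approved,
# mapped to the maximum refundable amount in minor units.
APPROVED_ORDERS = {"4481": 125000}


@dataclass
class RefundResult:
    status: str  # "issued" or "rejected"
    reason: str  # short explanation the model can read on the next turn


def handle_issue_refund(args: dict) -> RefundResult:
    """Business gate for a call that has ALREADY passed schema validation."""
    order_id = args["order_id"]
    amount = args["amount_cents"]

    if order_id not in APPROVED_ORDERS:
        return RefundResult("rejected", "order has no policy_check approval")
    if amount > APPROVED_ORDERS[order_id]:
        return RefundResult("rejected", "amount exceeds approved refund")
    return RefundResult("issued", "refund within approved amount")
```

The schema never needs to know which orders are approved. It only guarantees that `order_id` and `amount_cents` arrive in a form the gate can check.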

The contract anatomy — eight syntax pieces, four semantics pieces

A complete tool contract specifies twelve things across its two halves.

┌──────────────────────────────────────────────────────────────────┐
│  SCHEMA (syntax — what the tool takes)                           │
├──────────────────────────────────────────────────────────────────┤
│  1. properties        — field-by-field types                     │
│  2. required          — which fields must be set                 │
│  3. enums             — closed sets of values                    │
│  4. bounds / formats  — min, max, pattern, format                │
│  5. nesting           — objects inside arrays etc                │
│  6. additionalProperties = false                                 │
│  7. strict mode       — provider-level constrained decoding      │
│  8. field names       — teaching signals in the token stream     │
├──────────────────────────────────────────────────────────────────┤
│  DESCRIPTION (semantics — when to reach for it)                  │
├──────────────────────────────────────────────────────────────────┤
│  9.  tool name        — the routing headline                     │
│  10. one-line purpose — what the tool does                       │
│  11. use when / do not use when — trigger and boundary signals   │
│  12. returns          — output shape for next-step planning      │
└──────────────────────────────────────────────────────────────────┘

Drop any one, the error surface widens. Drop required — half-empty calls arrive. Drop enums — the model invents "late_delivery" when you accept only "delay". Drop bounds — someone refunds 2 crore by accident. Drop additionalProperties: false — hallucinated fields slip in. Drop "do not use when" — two similar tools route at coin-flip odds.
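The effect of each schema piece can be seen with a small hand-rolled checker. This is a sketch under stated assumptions, not a JSON Schema engine: `SCHEMA` restates the issue_refund constraints in a simplified Python form, and `validate` is a hypothetical helper that only understands the keywords used in this chapter.

```python
import re

# issue_refund constraints in a simplified form (Python types instead of
# JSON Schema type names). Illustrative only.
SCHEMA = {
    "properties": {
        "order_id": {"type": str, "pattern": r"^[0-9]{4,10}$"},
        "amount_cents": {"type": int, "minimum": 1, "maximum": 20_000_000},
        "currency": {"type": str, "enum": ["INR", "USD", "EUR"]},
        "reason": {"type": str, "enum": ["delay", "duplicate", "courtesy"]},
        "notify_customer": {"type": bool},
    },
    "required": ["order_id", "amount_cents", "currency", "reason", "notify_customer"],
    "additionalProperties": False,
}


def validate(args: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of violations; an empty list means the shape is valid."""
    errors = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {name}")
            continue
        rule = props[name]
        # bool is a subclass of int in Python, so reject it for integer fields.
        is_bool_as_int = rule["type"] is int and isinstance(value, bool)
        if not isinstance(value, rule["type"]) or is_bool_as_int:
            errors.append(f"{name}: wrong type")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} not in enum")
        if "pattern" in rule and not re.search(rule["pattern"], value):
            errors.append(f"{name}: does not match pattern")
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{name}: below minimum")
        if "maximum" in rule and value > rule["maximum"]:
            errors.append(f"{name}: above maximum")
    return errors
```

Deleting any one key from `SCHEMA` (the enum, the maximum, `additionalProperties`) makes the matching class of bad call pass silently, which is the widening described above.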

Schema is prompt — field names teach

The schema is not separate from the prompt. It is the prompt. Every property name, every enum value, every description string lands in the token stream the model reads.

"amt"           → model guesses what amt means
"amount"        → model writes float dollars, sometimes cents
"amount_cents"  → model writes integer cents far more reliably

The name teaches. Bad names invite bad calls. Good names make the right call the obvious call.
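Part of why integer minor units are the safer target is that binary floats cannot represent most decimal fractions exactly. A minimal sketch, assuming a two-decimal currency; `to_minor_units` is a hypothetical helper, not part of any tool API:

```python
from decimal import Decimal, ROUND_HALF_UP


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Convert a decimal amount string such as "1250.00" to integer minor units."""
    scaled = Decimal(amount) * (10 ** exponent)
    return int(scaled.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Float arithmetic drifts; integer minor units do not.
float_sum = 0.1 + 0.2  # not exactly 0.3 in binary floating point
int_sum = to_minor_units("0.10") + to_minor_units("0.20")  # exact integer arithmetic
```

A field named `amount_cents` with type integer asks the model for the already-converted value, so the tool never has to guess the unit.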

Watch the refund tool tighten in three drafts:

Draft 1 — vague

{"name": "refund", "parameters": {"type": "object", "properties": {"details": {"type": "string"}}}}

Model invents: {"details": "Refund order 4481 for ₹1250, delay, notify"}. Schema asked for a string. Model gave a string. Both did their job. The job was the wrong shape.

Draft 2 — typed but unconstrained

{
  "name": "issue_refund",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id":        {"type": "string"},
      "amount":          {"type": "number"},
      "currency":        {"type": "string"},
      "reason":          {"type": "string"},
      "notify_customer": {"type": "boolean"}
    },
    "required": ["order_id", "amount", "currency"]
  }
}

Model produces: {"order_id":"4481","amount":1250.00,"currency":"Rs", "reason":"late_delivery","notify_customer":true}. Still wrong in three places. 1250.00 is a float with no stated unit (rupees or paise?). "Rs" is not a currency code. "late_delivery" is not a reason your backend accepts. Types accept it. Business does not.

Draft 3 — constrained

{
  "name": "issue_refund",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id":     {"type": "string", "pattern": "^[0-9]{4,10}$"},
      "amount_cents": {"type": "integer", "minimum": 1, "maximum": 20000000},
      "currency":     {"type": "string", "enum": ["INR","USD","EUR"]},
      "reason":       {"type": "string", "enum": ["delay","duplicate","courtesy"]},
      "notify_customer": {"type": "boolean"}
    },
    "required": ["order_id","amount_cents","currency","reason","notify_customer"],
    "additionalProperties": false
  }
}

Now: {"order_id":"4481","amount_cents":125000,"currency":"INR", "reason":"delay","notify_customer":true}. Every field passes a machine check before money moves.

SCHEMA STRENGTH          →  TYPICAL INVALID-CALL RATE
─────────────────────────────────────────────────────
Draft 1 (vague string)      40-60%
Draft 2 (typed loose)       10-20%
Draft 3 (typed strict)      <2%

Description is routing — the label on the belt

Schema constrains after selection. Description determines whether selection happens at all. Look at this toolbelt:

┌────────────────────────────────────────────────────┐
│  get_user(id: str)                                 │
│  "Get a user."                                     │
│                                                    │
│  lookup_customer(email: str)                       │
│  "Look up a customer by email."                    │
└────────────────────────────────────────────────────┘

User says: "Find the account for priya@acme.com."

The model calls get_user(id="priya@acme.com"). Schema validates the string. Backend returns 404. The loop wastes a turn.

Sharp labels fix it:

get_user_by_internal_id(id: str)
"Fetch an internal user row by its internal user_id (e.g. u_8821).
 Use only when you already have the internal ID.
 Do not use for email, phone, or external lookups."

lookup_customer_by_email(email: str)
"Find a customer record by their email address.
 Use for any human-typed contact lookup.
 Returns customer_id, plan, created_at."

Same model. Different labels. Now the right tool is obvious.

The four-part description template

A description that earns its tokens has four parts:

One-line purpose — what the tool does, in plain language.
Use when — the trigger conditions that signal a fit.
Do not use when — boundaries against the neighbour tools.
Returns — output shape so the model can plan the next step.

[verb] [main object] [scope/qualifier].
Use when: <trigger 1>; <trigger 2>.
Do not use when: <anti-trigger> — use <neighbour> instead.
Returns: <field 1>, <field 2>, <field 3>.

That template runs 40–120 tokens. Under 20 tokens you lose disambiguation. Over 200 tokens you waste context and push other tools out of attention.
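The template can be assembled and measured mechanically. The sketch below is illustrative only: `build_description` and `rough_token_count` are hypothetical helpers, and the token estimate uses a crude rule of thumb (about four tokens per three English words) rather than a real tokenizer.

```python
def build_description(purpose: str, use_when: list[str], do_not_use: str,
                      neighbour: str, returns: list[str]) -> str:
    """Assemble the four parts in a fixed order, one part per line."""
    return "\n".join([
        purpose.rstrip(".") + ".",
        "Use when: " + "; ".join(use_when) + ".",
        f"Do not use when: {do_not_use}; use {neighbour} instead.",
        "Returns: " + ", ".join(returns) + ".",
    ])


def rough_token_count(text: str) -> int:
    """Crude proxy for token count; swap in your provider's tokenizer for real numbers."""
    return round(len(text.split()) * 4 / 3)


description = build_description(
    purpose="Refund money to a customer after policy_check approves",
    use_when=["the customer asks for money back on a delivered or late order"],
    do_not_use="the request is a subscription cancellation",
    neighbour="cancel_subscription",
    returns=["refund_id", "amount_cents", "status"],
)
```

Building descriptions from named parts makes a missing boundary line a visible gap in the call rather than an omission buried in free text.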

Mini-FAQ. "Do examples in the description help?" Yes, when they encode a pattern the schema cannot. "e.g. ord_4481" teaches the ID format. One example per tool is the sweet spot. Three examples per tool across 20 tools pushes real signal out of context.

Strict mode — the schema reliability switch

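OpenAI calls it strict: true. Anthropic enforces via tool-use validation. Gemini supports schemas with propertyOrdering. Shared idea: when constrained decoding is on, the provider guarantees that the output parses and matches the schema, within the subset of JSON Schema it supports. How strong that guarantee is differs by provider and changes between releases, so confirm it against the current documentation.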

NON-STRICT                       STRICT
─────────                        ──────
model suggests JSON       →      model emits guaranteed JSON
~92-95% call success             ~99.5%+ call success

Roughly a 5–7 point gap at the single-call level. Across a five-step loop, assuming each call succeeds independently, non-strict reaches roughly 66–77% end-to-end (0.92^5 ≈ 0.66, 0.95^5 ≈ 0.77) and strict about 97.5% (0.995^5). The math punishes laxity exponentially.
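The compounding is plain exponentiation. A small sketch, assuming each call in the loop succeeds independently with the per-call rates quoted in the table (`end_to_end` is a hypothetical helper):

```python
def end_to_end(per_call_success: float, steps: int) -> float:
    """Probability that every call in a loop of `steps` calls succeeds."""
    return per_call_success ** steps


rates = {"non-strict (low)": 0.92, "non-strict (high)": 0.95, "strict": 0.995}
five_step = {label: end_to_end(p, 5) for label, p in rates.items()}

for label, value in five_step.items():
    print(f"{label:18s} {value:.3f}")
```

Real loops retry and recover, so treat these as a first-order estimate of how per-call reliability scales, not a prediction for a specific agent.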

Catch: strict mode forbids some schema features (for example oneOf, overlapping anyOf, and optional fields without defaults; the exact list varies by provider) and adds roughly 50–200 ms of latency for constrained decoding. For state-mutating tools, that is cheap insurance.

Mid-content recall — predict before reading on

In draft 2, three values passed types but broke business. Which three?
Why does amount_cents teach better than amount?
What single description line would keep the model from confusing get_user_by_internal_id with lookup_customer_by_email?

Toolbelt size — where syntax and semantics compound

Every tool you add costs every request. Both schema tokens and description tokens stack linearly.

TOOLBELT SIZE    SCHEMA TOKENS    DESC TOKENS    ROUTING ACCURACY
─────────────    ─────────────    ───────────    ────────────────
5 tools          400-600          200-500        ~93%
10 tools         800-1500         500-1000       ~90%
30 tools         3000-5000        1500-3000      ~80%
50 tools         5000-8000        3000-5000      ~70%
80+ tools        10000+           5000+          ~65% (gambling)

Two architectural responses:

Sub-toolbelt routing. A classifier narrows 80 tools to 5–8 per turn. The model only sees the relevant subset. Schema cost drops. Routing accuracy stays high.
Hierarchical tools. A category-level tool dispatches to leaf tools. The model picks "email" from 5 categories, then picks "schedule_email" from 3 options — not from 80.

Past ~50 tools, the fix is architecture, not better descriptions.
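Sub-toolbelt routing can be sketched with a deliberately naive keyword scorer. Everything here is an assumption for illustration: the `TOOLS` keyword sets, the `narrow_toolbelt` helper, and the scoring rule. A production router would more likely use a trained classifier or embedding similarity.

```python
import re

# Hypothetical toolbelt: tool name -> keywords that hint at a fit.
TOOLS = {
    "schedule_email": {"email", "send", "schedule", "mail"},
    "search_inbox": {"email", "inbox", "find", "search", "mail"},
    "issue_refund": {"refund", "money", "order"},
    "cancel_subscription": {"cancel", "subscription", "plan"},
    "lookup_customer_by_email": {"customer", "email", "account", "find"},
}


def narrow_toolbelt(prompt: str, tools: dict[str, set[str]], k: int = 3) -> list[str]:
    """Return up to k tool names whose keywords overlap the prompt, best first."""
    words = set(re.findall(r"[a-z_]+", prompt.lower()))
    scored = [(len(words & keywords), name) for name, keywords in tools.items()]
    # Drop tools with no overlap; break ties alphabetically for determinism.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:k]]
```

Only the schemas and descriptions of the returned subset are sent to the model for that turn, which is where the token and accuracy savings come from.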

Failure modes — where contracts leak

Contracts break on both halves. Here are the recurring leaks, unified.

Schema leaks (syntax failures)

| Weak signal | Fix |
|---|---|
| `"string"` for typed concepts | enum / format / pattern |
| No maxLength on text | Set a real ceiling |
| No required fields | Declare them all explicitly |
| Free text where states exist | enum |
| Generic names (`id`, `value`) | Domain-specific names (`order_id`, `amount_cents`) |
| Silent defaults on money | Make the field required |
| One tool does five jobs | Split into five tools |

Description leaks (semantics failures)

| Weak signal | Fix |
|---|---|
| Overlapping descriptions | Name the neighbour tool to defer to |
| Missing "do not use when" | Add the negative boundary line |
| Too terse (< 20 tokens) | Minimum 30 tokens with trigger + boundary |
| Too long (> 200 tokens) | Cap at ~120, push detail into param descriptions |
| Inconsistent naming (`get_user`, `fetchOrder`, `lookup-item`) | One `verb_object_qualifier` pattern across the belt |
| Description drift (behaviour changed, label didn't) | Require description changes in same PR as tool changes |
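Several of these leaks can be caught by a lint step before a tool ships. A minimal sketch, with assumptions stated: `lint_tool` is a hypothetical helper, the name rule is one possible reading of `verb_object_qualifier`, and the word thresholds (15 and 150) are rough stand-ins for the 20 and 200 token limits.

```python
import re

# snake_case with at least two parts, e.g. get_user or lookup_customer_by_email.
NAME_PATTERN = re.compile(r"^[a-z]+(_[a-z0-9]+)+$")


def lint_tool(name: str, description: str) -> list[str]:
    """Return the description leaks found for one tool; empty list means clean."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name is not snake_case verb_object")
    word_count = len(description.split())
    if word_count < 15:
        problems.append("description too terse")
    if word_count > 150:
        problems.append("description too long")
    if "do not use" not in description.lower():
        problems.append('missing "do not use" boundary')
    return problems
```

A lint like this cannot judge whether two descriptions overlap in meaning. That still needs a routing eval against real prompts.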

The compound failure

description drift + loose schema = silent degradation

new tool behavior
       │
       ▼
old description stays  ──→  wrong tool picked sometimes
       │
       ▼
loose schema accepts garbage  ──→  no validation alarm
       │
       ▼
bad call reaches production  ──→  2 a.m. page
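One way to raise the alarm earlier is to record a fingerprint of the schema each time the description is reviewed, then fail a CI check when the two fall out of step. This is a sketch of that idea only: `schema_fingerprint`, `description_is_stale`, and the `reviewed_fingerprint` field are hypothetical, and the check detects schema changes, not behaviour changes that leave the schema untouched.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable short hash of a schema, independent of key order."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def description_is_stale(tool: dict) -> bool:
    """True when the schema changed after the description was last reviewed."""
    return tool["reviewed_fingerprint"] != schema_fingerprint(tool["parameters"])


# Example tool record: the fingerprint is stored when the description is reviewed.
tool = {
    "name": "issue_refund",
    "description": "Refund money to a customer after policy_check approves.",
    "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}},
}
tool["reviewed_fingerprint"] = schema_fingerprint(tool["parameters"])
```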

The tension — expressiveness vs correctness

Richer schemas let the model do more. Richer schemas also give it more ways to pass garbage. Better descriptions improve routing but can mislead if they promise what the tool can't deliver.

Schema side:
- Nested objects let you express complex inputs — but deep nesting confuses models and explodes token cost.
- Many enums give the model precise targets — but stale enums (a value removed from the backend but still in the schema) cause silent failures.
- Tight bounds prevent catastrophe — but over-tight bounds reject valid edge cases and force retry loops.

Description side:
- Rich trigger phrases route ambiguous prompts — but trigger phrases that over-promise (describing what the tool should do but doesn't yet) cause the model to call a tool that fails.
- Worked examples teach patterns — but examples that show happy-path only mislead the model about error cases.
- "Do not use when" is high-leverage — but incorrect negatives (telling the model not to use a tool in a case where it should) create blind spots.

The rule: tighten until the worst valid call is harmless, but no tighter. Validate your contract against the prompts that actually arrive, not the prompts you hope for.

Schema evolution — version, don't mutate

Tool schemas change. New business rules add fields. Old enum values die. The contract must evolve without breaking running agents that cached the old shape.

Rules:
- Additive only. Add optional fields, new enum values, broader bounds.
- Never rename, retype, or remove. A renamed field breaks every cached plan that uses the old name.
- For breaking changes, version the tool. issue_refund_v2 with a new schema. Migrate callers. Deprecate issue_refund with a "do not use" description update.

SAFE EVOLUTION                    BREAKING (version instead)
──────────────                    ──────────────────────────
add optional field                remove a field
add new enum value                rename a field
widen a bound (max 100→200)       change a field's type
add "returns" info to desc        narrow a bound (max 200→100)
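The safe/breaking split in the table can be checked mechanically when a schema changes. A minimal sketch, assuming flat schemas with `properties` and `required` only; `breaking_changes` is a hypothetical helper and ignores nesting, patterns, and formats.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List changes from old to new that would break callers of the old shape."""
    problems = []
    old_props, new_props = old["properties"], new["properties"]

    for name, old_rule in old_props.items():
        if name not in new_props:
            problems.append(f"{name}: removed or renamed")
            continue
        new_rule = new_props[name]
        if old_rule.get("type") != new_rule.get("type"):
            problems.append(f"{name}: type changed")
        new_enum = new_rule.get("enum")
        if new_enum is not None:
            old_enum = old_rule.get("enum")
            # Adding an enum where none existed, or dropping a value, narrows input.
            if old_enum is None or not set(old_enum) <= set(new_enum):
                problems.append(f"{name}: enum narrowed")
        if "maximum" in new_rule and new_rule["maximum"] < old_rule.get("maximum", float("inf")):
            problems.append(f"{name}: maximum narrowed")
        if "minimum" in new_rule and new_rule["minimum"] > old_rule.get("minimum", float("-inf")):
            problems.append(f"{name}: minimum narrowed")

    for name in sorted(set(new.get("required", [])) - set(old.get("required", []))):
        problems.append(f"{name}: newly required")
    return problems
```

An empty result means the change is additive in the sense used above. Any entry is a signal to ship a versioned tool instead of mutating the existing one.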

Where this lives in the wild

Same JSON Schema spine, same description-routing pattern. Many surfaces.

OpenAI function calling (strict: true) — constrained decoding on schema; description explicitly called out as the dominant routing lever.
Anthropic Claude tool use — each tool carries name, description, and input_schema; guidance recommends boundary statements.
Google Gemini function declarations — OBJECT schema with propertyOrdering; description is the routing signal.
Model Context Protocol (MCP) — every tool ships name, description, inputSchema; clients route purely on these strings.
OpenAI Agents SDK / Pydantic AI — schema from type hints, description from docstrings. Bad docstring = bad routing.
LangChain / LlamaIndex / Vercel AI SDK — description field is what the agent executor ranks against user intent.
BAML / Instructor — typed DSL or Pydantic models compile to provider schemas with validation native.
Cursor / GitHub Copilot / Replit Agent — codebase tools routed via description; ambiguity causes "wrong file edited" failures.
Berkeley Function-Calling Leaderboard (BFCL) — public benchmark for schema adherence and routing accuracy across models.

Pause and recall

Why is a schema part of the prompt, not separate from it?
Name the eight syntax pieces a strong schema specifies.
What is the four-part template for a tool description?
Which field name change taught the refund model to emit integers?
What is the typical reliability gap between strict and non-strict modes?
Why is "do not use when" the highest-leverage description line?
At what toolbelt size does flat routing start to collapse?
What is description drift, and where in your workflow do you catch it?

Interview Q&A

Q1. How do you prevent hallucinated tool arguments? A. Three layers. Schema — strict types, enums, bounds, additionalProperties: false reject shape errors. Provider — strict mode or constrained decoding guarantees schema-conformant output. Runtime — the tool rejects business policy violations the schema cannot encode. Common wrong answer: "Just use a stronger model." Stronger models still hallucinate when the schema lets them.

Q2. The model picked the wrong tool. Schema validation passed. Root cause? A. The description, not the schema. Schemas only see arguments after routing; routing is driven by names plus descriptions. Look for two tools with overlapping descriptions and missing "do not use when" boundaries. Fix by writing each description in terms of when it fits and which neighbour to defer to. Common wrong answer: Blaming model size or temperature.

Q3. When should you turn on strict mode? A. Always for tools that mutate state — money, permissions, code, deploys. The 5–7 point reliability gain compounds across multi-step loops. Trade-off is small latency cost and a few unsupported schema features. Common wrong answer: Assuming strict mode catches business errors. It only catches shape.

Q4. How long should a tool description be? A. 40–120 tokens for most tools. Under 20 loses disambiguation; over 200 wastes context and pushes other tools out of attention. Aim for one-line purpose, one use-when, one do-not-use-when, one returns. Common wrong answer: "As long as needed." It is bounded by attention budget.

Q5. You have 80 tools. Routing accuracy has collapsed. What do you change? A. Architecture, not descriptions. Introduce a router stage that selects a sub-toolbelt of 5–8, then run normal selection on the subset. Past ~50 tools, the issue is toolbelt shape. Common wrong answer: "Rewrite the descriptions harder."

Q6. How do you evolve a tool schema without breaking running agents? A. Additive only — add optional fields, new enum values, broader bounds. Never rename, retype, or remove. For breaking changes, version the tool (issue_refund_v2) and migrate callers. Common wrong answer: Renaming in place because the new name is "clearer."

Q7. Why are enums better than free-text for known state values? A. They collapse infinite string space to a finite set. The model hits the right value more often, the runtime validates faster, the business evolves without ambiguity. Free text invites synonyms — "urgent", "high", "P0" — that explode downstream. Common wrong answer: Adding enums for unstable sets that change weekly.

Q8. How would you measure description quality? A. Build a labelled set of 30–100 representative prompts with the correct first tool annotated. Run the agent, log first-call tool selection, compare. Description work without an eval set is taste, not engineering. Common wrong answer: "Ask the model if the description is clear."
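The eval described in Q8 fits in a few lines. In this sketch `first_call_accuracy` is a hypothetical harness, and `stub_select_tool` is a stand-in for running the real agent and reading its first tool call; the three cases are invented for illustration.

```python
from typing import Callable


def first_call_accuracy(
    cases: list[tuple[str, str]],
    select_tool: Callable[[str], str],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Return (accuracy, misses), where each miss is (prompt, expected, got)."""
    if not cases:
        raise ValueError("need at least one labelled case")
    misses = []
    for prompt, expected in cases:
        got = select_tool(prompt)
        if got != expected:
            misses.append((prompt, expected, got))
    return 1 - len(misses) / len(cases), misses


def stub_select_tool(prompt: str) -> str:
    """Stand-in router; replace with a call to your agent's first tool selection."""
    return "issue_refund" if "refund" in prompt.lower() else "cancel_subscription"


CASES = [
    ("I want a refund for order 4481", "issue_refund"),
    ("Stop my plan from renewing", "cancel_subscription"),
    ("Cancel my plan and refund this month", "cancel_subscription"),
]
accuracy, misses = first_call_accuracy(CASES, stub_select_tool)
```

The list of misses is the useful output: each entry names a prompt where two descriptions compete and a boundary line is probably missing.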

Apply now (10 min)

Step 1 — draft three schemas. Pick a cancel_subscription tool. Draft 1: {details: string}. Draft 2: subscription_id: string, cancel_at: string, reason: string. Draft 3: subscription_id: pattern ^sub_[A-Za-z0-9]{8,}$, cancel_at_unix: integer with minimum set to the current Unix time when the schema is built (JSON Schema bounds are static numbers, so re-check the value at runtime), reason: enum ["price","missing_feature","switching","other"], confirm: const true, additionalProperties: false. For each, write the worst valid call.

Step 2 — write the description. Using the four-part template, write the description for cancel_subscription that disambiguates it from issue_refund and pause_subscription. Include one concrete trigger phrase.

Step 3 — evaluate routing. Write 10 user prompts where the intent is ambiguous between the three tools. For each, predict which tool the model would select given your descriptions. Flag any prompt where two tools score equally — that is where you need a sharper "do not use when" line.

Step 4 — sketch from memory. Draw the full contract (schema + description) for one tool as a two-column diagram: constraints on the left, guaranteed behaviour on the right.

Operational memory

This chapter explained that a tool's schema and its description are two halves of one contract — the typed interface between model and world. The schema is the syntax (what shape of argument the world accepts); the description is the semantics (when the model should reach for the tool). Both land in the same token stream the model reads. Both are load-bearing.

You learned the eight-part schema anatomy (properties, required, enums, bounds, nesting, additionalProperties: false, strict mode, field names as teaching signals) and the four-part description template (purpose, use-when, do-not-use-when, returns). The key numbers: strict mode buys 5–7 points per call and compounds across loops; 40–120 tokens is the description sweet spot; routing degrades past ~50 tools regardless of description quality.

The central tension is expressiveness vs correctness — richer contracts let the model do more but give it more ways to fail. The rule is: tighten until the worst valid call is harmless, but no tighter.

Carry this diagnostic forward: when a tool fires with wrong arguments and validation passed, the failure is in the schema. When the wrong tool fires and schema validation passed, the failure is in the description. Fix the contract, not the model.

Remember:

Schema is prompt — field names, enums, and bounds teach as much as any instruction text.
Description is routing — "do not use when" is the highest-leverage line because routing is comparison and boundaries break ties.
Strict mode buys 5–7 points per call, compounds across multi-step loops; turn it on for state-mutating tools.
additionalProperties: false blocks hallucinated keys; required fields block half-empty calls; enums collapse infinite string space.
40–120 tokens per description is the sweet spot; under 20 loses disambiguation, over 200 wastes context.
Past ~50 tools, fix architecture (sub-toolbelt routing), not descriptions.
Evolve schemas additively only; for breaking changes, version the tool.
Descriptions must live in the same source file as the tool implementation; otherwise they drift silently.

Bridge. A single tool contract is clear — one schema, one description, one clean call. But what happens when one tool's output feeds another's input, or when three tools can fire at once? The contract between model and world gets harder when tools compose. Next: how tool calls chain, race, and interfere.

→ 04-tool-composition.md