
05. Error contracts the model can recover from

Idempotency makes retries safe. Errors decide whether a retry should happen at all — and what should happen instead if not. This chapter teaches you to shape errors as data the model can act on, not exceptions it can only parrot.


A senior support engineer at a Chennai logistics company runs a weekly review of agent transcripts and finds a pattern. When the agent tries to update a customer's delivery address and the customer ID is wrong, the user sees the message: "I'm sorry, I encountered an error: salesforce.exceptions.SalesforceMalformedRequest: INVALID_FIELD: No such column 'addr1' on entity 'Customer__c'. Please contact support." The customer's actual problem is that they typed the wrong order number into the chat. The fix is one clarifying question. Instead the agent ends the conversation, the customer escalates to a human, and a metric called "first-contact resolution" drops a percentage point. There are forty-seven of these incidents in the week's sample. None of them are model bugs. The model did the right thing with the information it had: it cannot recover from a structured error if no one gave it one.

The model is the client deciding whether to retry, ask for clarification, switch tools, or surface to the user. That decision is only as good as the error shape the contract returned. A stack trace gives the model nothing to act on. A structured error with a code, a retriability flag, a human-facing hint, and a model-facing corrective action turns the same failure into a recoverable moment.


What an error contract is, in one sentence

The error contract is the typed, enumerable set of failure shapes a tool can return, designed so the model can branch on the shape without parsing free text.

Four phrases in that sentence are doing the work:

  • Typed — each error has a code from a closed enum, not a free-form string.
  • Enumerable — the full set of errors is documented in the contract; the model is not surprised by a new shape at runtime.
  • Branch on shape — the model's next action is decidable from the code alone, not from interpreting a message.
  • Without parsing free text — the message is for the human reader of logs; the code is for the model.

If you only remember one rule from this chapter: codes are for the machine, messages are for the human. The model reads the code.
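
One way to make "typed" and "enumerable" concrete is a closed enum of codes plus a lookup that refuses anything outside it. The sketch below is illustrative only: the code names are borrowed from the refund examples later in this chapter, and parse_code and is_known_code are hypothetical helpers, not part of any library.

```python
from enum import Enum


class RefundErrorCode(str, Enum):
    """Closed set of codes; any other string is a contract violation."""
    PAYMENT_NOT_FOUND = "PAYMENT_NOT_FOUND"
    AMOUNT_EXCEEDS_REFUNDABLE = "AMOUNT_EXCEEDS_REFUNDABLE"
    UPSTREAM_TIMEOUT = "UPSTREAM_TIMEOUT"


def parse_code(raw: str) -> RefundErrorCode:
    # Enum lookup by value raises ValueError for strings outside the closed set.
    return RefundErrorCode(raw)


def is_known_code(raw: str) -> bool:
    try:
        parse_code(raw)
        return True
    except ValueError:
        return False
```

A raw exception name fails the lookup, which is the point: free text cannot slip through as a code.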


The four parts of an error response

Every error a contract returns must carry these four parts. They are the minimum.

| Part | Audience | What it is |
| --- | --- | --- |
| code | Model | Stable identifier from the contract's error enum |
| retriable | Contract layer + model | Explicit boolean: should anyone retry this call? |
| human_hint | The customer (via the model) | Plain sentence the model can surface or adapt |
| model_action | Model | What the model should do next: retry, ask user, switch tool, escalate |

Two further parts are common:

  • retry_policy (only when retriable=true) — the strategy: backoff, max attempts, idempotency-required
  • fields (optional) — structured detail useful for the model's next call: which field was invalid, what the upstream system suggested, etc.

The minimum response shape:

{
  "ok": false,
  "error": {
    "code": "AMOUNT_EXCEEDS_REFUNDABLE",
    "retriable": false,
    "human_hint": "The refund amount is more than what's left on this payment after prior refunds.",
    "model_action": "Use get_payment to read refundable_balance, then re-call issue_refund with amount_minor <= refundable_balance.",
    "fields": {
      "requested_amount_minor": 5000000,
      "refundable_balance_minor": 2300000
    }
  }
}

A model receiving this response knows: do not retry blindly; the right next step is to call get_payment; and the user-facing wording, if any, is the human hint.
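
A contract test can assert that every error response carries the four required parts with the right types. This is a minimal sketch under that assumption; shape_problems and REQUIRED_PARTS are hypothetical names, and a real contract layer might use a schema validator instead.

```python
REQUIRED_PARTS = {
    "code": str,
    "retriable": bool,
    "human_hint": str,
    "model_action": str,
}


def shape_problems(response: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape is valid."""
    if response.get("ok") is not False:
        return ["ok must be false on an error response"]
    error = response.get("error")
    if not isinstance(error, dict):
        return ["error must be an object"]
    problems = []
    for name, expected_type in REQUIRED_PARTS.items():
        if not isinstance(error.get(name), expected_type):
            problems.append(f"{name} must be a {expected_type.__name__}")
    # fields is optional, but when present it must be structured.
    if "fields" in error and not isinstance(error["fields"], (dict, type(None))):
        problems.append("fields must be an object when present")
    return problems
```

A response whose error is a bare exception string fails at the second check, before any part is inspected.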


The error taxonomy

Every error a tool can return falls into one of a small number of taxonomic categories. Use this taxonomy when you write the error enum for a new tool; it forces you to think about coverage.

| Category | What it means | Retriability | Typical model action |
| --- | --- | --- | --- |
| InputError | The arguments the model passed are invalid for the contract or for the underlying system | Not retriable with same args | Re-derive args; ask user for clarification; switch tool |
| NotFoundError | A resource referenced in the arguments does not exist | Not retriable with same args | Lookup; ask user for correct identifier |
| PermissionError | The call's scope or tenant binding does not authorise this operation | Not retriable | Surface; escalate; do not retry |
| PreconditionError | The system state does not allow this operation right now (e.g., "payment already refunded") | Not retriable with same args | Read current state; explain to user; switch flow |
| ConflictError | A concurrent change made the operation no-longer-valid (e.g., optimistic-lock failure) | Retriable after reading state | Refresh state; re-decide; potentially retry |
| RateLimitError | The caller has exceeded a quota; the contract or downstream system refused | Retriable after backoff | Wait; surface "please wait" to user if material |
| TransientError | A temporary failure unrelated to the request: network blip, timeout, dependency degraded | Retriable with backoff | Contract layer retries internally; surface only if persistent |
| UpstreamError | The downstream system returned an error the contract cannot classify further | Depends: contract author must decide and document | Usually surface and escalate |
| PolicyError | A platform policy refused the call (e.g., this scope is disabled, this tool is in cooldown) | Not retriable | Surface to user; log for ops |
| InternalError | The contract layer itself failed (bug in validation, dedup-store unreachable) | Treat as retriable but with caution | Contract layer retries; if persistent, escalate to on-call |

Coverage check: when you draft a new tool's errors, walk this table. For each category, ask: can this happen on this tool? If yes, document an error code in that category. If the enum skips a category whose failures are realistic, the tool will produce uncoded errors at runtime, and the model handles those badly.
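
The coverage walk can be mechanised as a small check run in CI. The sketch below assumes each error code is tagged with its taxonomy category and that the author lists categories deliberately marked "not applicable"; uncovered_categories is a hypothetical helper.

```python
TAXONOMY = (
    "InputError", "NotFoundError", "PermissionError", "PreconditionError",
    "ConflictError", "RateLimitError", "TransientError", "UpstreamError",
    "PolicyError", "InternalError",
)


def uncovered_categories(code_to_category: dict[str, str],
                         not_applicable: set[str]) -> list[str]:
    """Categories with neither an error code nor an explicit 'not applicable'."""
    covered = set(code_to_category.values())
    unknown = covered - set(TAXONOMY)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [c for c in TAXONOMY if c not in covered and c not in not_applicable]


# Example input: the send_sms codes from the interview Q&A in this chapter.
SEND_SMS = {
    "INVALID_PHONE_NUMBER": "InputError",
    "MESSAGE_TOO_LONG": "InputError",
    "RECIPIENT_OPTED_OUT": "PermissionError",
    "RECIPIENT_BLOCKED": "PreconditionError",
    "RATE_LIMIT_EXCEEDED": "RateLimitError",
    "PROVIDER_TIMEOUT": "TransientError",
    "PROVIDER_REJECTED": "UpstreamError",
    "PROVIDER_UNCLASSIFIED": "UpstreamError",
    "MESSAGE_BLOCKED_BY_POLICY": "PolicyError",
    "INTERNAL_ERROR": "InternalError",
}
```

Run against that mapping with NotFoundError marked not applicable, the check reports ConflictError, a category the draft never addressed. The author then has to either add a code or record an explicit "not applicable".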


Retriability is a property of the error, not the policy

A common confusion: "retriable" depends on whether the caller is in a hurry. It does not. Retriability is a property of the failure — does retrying the same operation with the same arguments stand any chance of producing a different outcome?

The rule:

A failure is retriable if and only if the cause is independent of the arguments and is plausibly transient.

By the rule:

  • NotFoundError is not retriable. The resource will not appear on its own.
  • InputError is not retriable. The validator's verdict will not change.
  • PreconditionError is not retriable. The system state did not become "ok" because you asked twice.
  • RateLimitError is retriable, with the wait time the contract specifies.
  • TransientError is retriable, with the backoff the contract specifies.
  • UpstreamError is the trap. By default, treat it as not retriable. Only mark it retriable if the contract author has read the upstream system's docs and verified the cause is genuinely transient. Marking unknown upstream errors retriable is how DDoS-yourself incidents start.

The contract layer enforces retriability for its own internal retries: it retries only errors with retriable=true. The model also reads retriable; a well-tuned agent loop does not re-call a tool with the same arguments after it returned retriable=false.
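
The contract layer's internal retry can be sketched as a loop that consults only the retriable flag. This is a minimal illustration, not a production retry policy: call_with_retries is a hypothetical helper, the defaults mirror the exp_backoff values used for UPSTREAM_TIMEOUT later in this chapter, and it assumes the wrapped call is idempotent.

```python
import time
from typing import Callable


def call_with_retries(call: Callable[[], dict], max_attempts: int = 3,
                      base_ms: int = 500,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry only responses that say retriable=true, with exponential backoff."""
    result = call()
    for attempt in range(1, max_attempts):
        # Success, or a failure the contract marks as not retriable: stop here.
        if result.get("ok") or not result["error"]["retriable"]:
            return result
        sleep(base_ms * 2 ** (attempt - 1) / 1000)  # 0.5 s, then 1 s, ...
        result = call()
    return result
```

The sleep function is injectable so tests can run without waiting. A persistent failure is returned to the caller after the last attempt rather than raised.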


human_hint and model_action are different

The most common mistake in error contracts is collapsing these two fields into one message. They serve different readers.

human_hint is the sentence that, if the model decides to surface anything to the user, gives the user a useful answer. It must be:

  • In the user's language register (not platform jargon).
  • Specific enough to be actionable ("the order number you gave does not match any of your recent orders" beats "lookup failed").
  • Free of internal identifiers, internal system names, or stack-trace fragments.

model_action is the instruction the model receives about what to do next. It must be:

  • Specific to the tools available on this agent. ("Use get_payment then re-call issue_refund.")
  • Bounded — not "try harder", but a concrete sequence.
  • Explicit about when not to retry. ("Do not retry without first reading the updated payment state.")

Side by side:

- code: ORDER_NOT_FOUND
  retriable: false
  human_hint: |
    I couldn't find that order number. Could you double-check it from your
    order confirmation email?
  model_action: |
    Ask the user to confirm the order number. Do not retry. If the user
    insists the number is correct, use search_orders_by_email to look up
    by the customer's email address.

The user sees "I couldn't find that order number." (the human hint, possibly adapted). The model executes the model_action. The two are designed independently for their two audiences.
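
The "free of internal identifiers" rule for human_hint can be linted. The sketch below is a rough heuristic under stated assumptions: the deny-list is hypothetical (a real one would come from your platform's own system and field names), and the pattern only catches obvious leaks such as error codes and snake_case identifiers.

```python
import re

# Hypothetical deny-list of internal system and field names.
INTERNAL_TERMS = ("salesforce", "payments-svc", "addr1", "lead_source")

# Traceback markers, SCREAMING_SNAKE codes, and snake_case identifiers.
LEAK_PATTERN = re.compile(
    r"Traceback|Exception"
    r"|\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b"
    r"|\b[a-z]+(?:_[a-z0-9]+)+\b"
)


def hint_leaks(human_hint: str) -> list[str]:
    """Return suspicious fragments found in a human_hint; empty means clean."""
    lowered = human_hint.lower()
    found = [term for term in INTERNAL_TERMS if term in lowered]
    found += [m.group(0) for m in LEAK_PATTERN.finditer(human_hint)]
    return found
```

A tool name such as get_payment is fine in model_action but trips this check in human_hint, which is the separation the two fields exist to enforce.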


The structured error in code

The contract layer's error type carries all four parts. A concise Python sketch:

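from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetryPolicy:
    strategy: str       # e.g. "exp_backoff"
    base_ms: int        # base delay between attempts, in milliseconds
    max_attempts: int   # upper bound on attempts


@dataclass(frozen=True)
class ToolError:
    code: str
    retriable: bool
    human_hint: str
    model_action: str
    fields: dict | None = None
    retry_policy: RetryPolicy | None = None

    def with_fields(self, fields: dict) -> ToolError:
        # Return a copy carrying call-specific detail; the shared enum value is untouched.
        return replace(self, fields=fields)

    def to_model(self) -> dict:
        # Serialise to the fixed JSON shape the model receives.
        return {
            "ok": False,
            "error": {
                "code": self.code,
                "retriable": self.retriable,
                "human_hint": self.human_hint,
                "model_action": self.model_action,
                "fields": self.fields,
            },
        }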

Tool-specific error enums extend this:

class RefundErrors:
    PAYMENT_NOT_FOUND = ToolError(
        code="PAYMENT_NOT_FOUND",
        retriable=False,
        human_hint="I couldn't find that payment.",
        model_action=(
            "Ask the user for their order number; use lookup_payment_by_order "
            "to find the correct payment_id."
        ),
    )

    AMOUNT_EXCEEDS_REFUNDABLE = ToolError(
        code="AMOUNT_EXCEEDS_REFUNDABLE",
        retriable=False,
        human_hint="The refund amount is more than what's left on this payment.",
        model_action=(
            "Use get_payment to read refundable_balance, then re-call with "
            "amount_minor <= refundable_balance."
        ),
    )

    UPSTREAM_TIMEOUT = ToolError(
        code="UPSTREAM_TIMEOUT",
        retriable=True,
        human_hint="The refund is processing. Please wait a moment.",
        model_action="Contract layer will retry. If retries fail, surface as pending.",
        retry_policy=RetryPolicy(strategy="exp_backoff", base_ms=500, max_attempts=3),
    )

The point: errors are constructed objects, not raised strings. The contract returns them by value. The model receives a JSON object with a fixed shape.


Translating downstream errors to the contract's enum

The biggest source of badly-shaped errors is naïve passthrough from the underlying system. The downstream system raises salesforce.exceptions.SalesforceMalformedRequest. The contract layer catches it, looks at the body, and produces one of the contract's enum values.

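# Assumes SalesforceMalformedRequest, ToolError, and a LeadErrors enum shaped like
# RefundErrors above are in scope, and that `log` is a structlog-style logger that
# accepts keyword arguments.
def _translate_salesforce_error(e: SalesforceMalformedRequest) -> ToolError:
    # The response body is a list of error dicts; the first entry carries the code.
    body = e.content[0] if e.content else {}
    code = body.get("errorCode")

    if code == "REQUIRED_FIELD_MISSING":
        # `or [""]` also covers an empty list, which would otherwise raise IndexError.
        field = (body.get("fields") or [""])[0]
        return LeadErrors.MISSING_FIELD.with_fields({"missing_field": field})

    if code == "DUPLICATE_VALUE":
        return LeadErrors.DUPLICATE_LEAD

    if code in {"INVALID_EMAIL_ADDRESS", "INVALID_FIELD"}:
        return LeadErrors.INVALID_INPUT.with_fields({"detail": body.get("message", "")})

    # The escape hatch, but it must be logged loudly.
    log.warning("unmapped_salesforce_error", code=code, body=body)
    return LeadErrors.UPSTREAM_UNCLASSIFIED.with_fields({"upstream_code": code})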

Two rules:

  1. Every downstream error code that has been seen in production gets a mapping. New codes are an opportunity to extend the enum.
  2. UPSTREAM_UNCLASSIFIED is a "to-do" code. It triggers a monitor; a high rate means the translator is out of date.
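
The monitor in rule 2 can be as simple as a sliding-window rate over recent error responses. The class below is a hypothetical in-process sketch; in practice this would more likely be a counter in your metrics system, and the window and threshold values are placeholders.

```python
from collections import deque


class UnclassifiedRateMonitor:
    """Tracks the share of recent error responses that were UPSTREAM_UNCLASSIFIED."""

    def __init__(self, window: int = 200, threshold: float = 0.02):
        self.recent = deque(maxlen=window)  # 1 = unclassified, 0 = mapped
        self.threshold = threshold

    def record(self, code: str) -> None:
        self.recent.append(1 if code == "UPSTREAM_UNCLASSIFIED" else 0)

    def should_alert(self) -> bool:
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) > self.threshold
```

Call record with every error code the translator returns, and page the contract owner when should_alert turns true.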

How to design the error enum for a new tool

When you draft a new tool, write the error enum before the schema. The reason is that the enum forces you to confront the failure modes, and the failure modes usually expose schema gaps.

Procedure:

  1. Walk the taxonomy table above. For each category, ask "can this happen on this tool?" Write a placeholder code if yes.
  2. For each downstream system call, list every error code the downstream documents. Map each to one of your enum codes.
  3. For each enum code, write retriable, human_hint, model_action. The first two are usually quick; model_action forces you to think about what the agent platform actually offers as a recovery path.
  4. Add the catch-all UPSTREAM_UNCLASSIFIED for codes you have not yet mapped, with retriable=false by default.
  5. Add INTERNAL_ERROR for contract-layer failures (validation crash, dedup store unreachable).

A common smell: the enum has fewer than four codes. Real systems usually have eight to fifteen.
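
Steps 3 to 5 of the procedure lend themselves to a review-time check on the draft enum. The sketch below assumes the draft is written as a plain dict keyed by code; enum_problems is a hypothetical helper and its rules are only the ones this chapter states.

```python
def enum_problems(errors: dict[str, dict]) -> list[str]:
    """Check a draft error enum against the design procedure; empty means clean."""
    problems = []
    for code, spec in errors.items():
        for part in ("human_hint", "model_action"):
            if not str(spec.get(part) or "").strip():
                problems.append(f"{code}: {part} is empty")
        if not isinstance(spec.get("retriable"), bool):
            problems.append(f"{code}: retriable must be an explicit boolean")
        if "retry_policy" in spec and spec.get("retriable") is not True:
            problems.append(f"{code}: retry_policy is only valid when retriable is true")
    # Steps 4 and 5: the catch-all and the contract-layer failure code must exist.
    for required in ("UPSTREAM_UNCLASSIFIED", "INTERNAL_ERROR"):
        if required not in errors:
            problems.append(f"{required} is missing")
    if errors.get("UPSTREAM_UNCLASSIFIED", {}).get("retriable") is True:
        problems.append("UPSTREAM_UNCLASSIFIED should default to retriable=false")
    return problems
```

Running it in the design review puts the gaps on the table before the schema is written.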


What the model sees, and how it reads it

The model platform receives the error response as the result of the tool call. The standard agent loop pattern (module 01 chapter 02) feeds the result back into the model's context. With a structured error, the next prompt the model sees includes:

Tool issue_refund returned:
{
  "ok": false,
  "error": {
    "code": "AMOUNT_EXCEEDS_REFUNDABLE",
    "retriable": false,
    "human_hint": "The refund amount is more than what's left on this payment.",
    "model_action": "Use get_payment to read refundable_balance, then re-call with amount_minor <= refundable_balance.",
    "fields": { "requested_amount_minor": 5000000, "refundable_balance_minor": 2300000 }
  }
}

A well-prompted agent will:

  1. Branch on code — recognise this is AMOUNT_EXCEEDS_REFUNDABLE.
  2. Read retriable: false — do not retry blindly.
  3. Read model_action — call get_payment first.
  4. Read fields — note the exact balance available.
  5. If surfacing to the user, adapt human_hint into the conversation tone.

The contract author cannot guarantee the model will do this — that is a prompt-engineering and eval responsibility (module 13, prompt lifecycle, and module 04_ai_product_evals). But the contract makes correct behaviour possible. With a stack trace, correct behaviour is not possible.
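
The agent loop can also enforce part of this reading order in code, so that correct behaviour does not rest on the prompt alone. The guard below is a hypothetical sketch; next_step and its four return labels are illustrative names, not part of any agent framework.

```python
def next_step(result: dict, retries_left: int) -> str:
    """Map a tool result to 'continue', 'retry', 'follow_model_action', or 'escalate'."""
    if result.get("ok"):
        return "continue"
    error = result.get("error")
    if not isinstance(error, dict) or "code" not in error:
        return "escalate"  # uncoded error: nothing safe to branch on
    if error.get("retriable") is True:
        return "retry" if retries_left > 0 else "escalate"
    # Non-retriable: hand the error to the model so it can follow model_action.
    return "follow_model_action"
```

A raw stack trace lands in the escalate branch immediately, because there is no code to branch on.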


How to recognise broken error contracts in the wild

  • The error response is a raw exception string from the downstream system
  • The model surfaces internal field names (addr1, lead_source) or system names (Salesforce, payments-svc) to users
  • The agent retries on failures that have no chance of succeeding (NotFound, InvalidInput)
  • The agent gives up on failures that should have been transparently retried (network timeouts)
  • A new failure shape from the downstream system breaks the agent silently
  • UPSTREAM_UNCLASSIFIED rate is non-trivial and no one is paged about it

If you see any of these, the error contract is missing or the translator is out of date.


Interview Q&A

Q1. Why is retriable an explicit field on every error rather than something the contract layer infers from the code?

Because retriability is the most consequential bit, and inference is where bugs hide. An explicit retriable: false is a contract; the model and the contract layer's retry logic both read it the same way. Inferring from the code means two readers (model, retry policy) might disagree, and inference logic drifts as new codes are added. Explicit is auditable; inferred is not.

Wrong-answer notes: "for clarity" is too vague. The specific value is making the retry decision machine-readable and consistent across consumers.

Q2. Walk through how you would design the error enum for a new send_sms tool.

Start with the taxonomy:

  • InputError → INVALID_PHONE_NUMBER, MESSAGE_TOO_LONG
  • NotFoundError → not applicable usually
  • PermissionError → RECIPIENT_OPTED_OUT
  • PreconditionError → RECIPIENT_BLOCKED
  • RateLimitError → RATE_LIMIT_EXCEEDED (often returned by SMS provider)
  • TransientError → PROVIDER_TIMEOUT
  • UpstreamError → PROVIDER_REJECTED (carrier rejected; non-retriable), PROVIDER_UNCLASSIFIED (catch-all)
  • PolicyError → MESSAGE_BLOCKED_BY_POLICY (e.g., your platform's content filter)
  • InternalError → INTERNAL_ERROR

Then for each, write retriable, human_hint, model_action.

Wrong-answer notes: jumping straight to "happy path validation errors" without walking the taxonomy is what produces incomplete enums.

Q3. A teammate suggests collapsing human_hint and model_action into one field "to keep the contract small." What is your pushback?

The two fields have two different audiences. The human hint must be in user-facing language, free of internal jargon, suitable for surfacing in chat. The model action must reference specific tools and concrete next steps that the model can execute. Collapsing them means either the user sees model-facing language ("call get_payment with...") or the model receives user-friendly text it cannot act on. The "small contract" argument trades clarity for byte count; the trade is not worth it because errors are infrequent relative to successes, so payload size on the error path is not a hot-path cost.

Wrong-answer notes: agreeing to collapse, or treating this as a stylistic preference, misses that the two fields serve mechanically different consumers.

Q4. The downstream system starts returning a new error code your translator doesn't know. What does the contract layer return, and what alerts fire?

The contract returns the catch-all (UPSTREAM_UNCLASSIFIED) with retriable=false, human_hint saying the operation could not be completed, and model_action saying to escalate or ask the user to retry later. Simultaneously, a monitor counts the rate of UPSTREAM_UNCLASSIFIED returns; a sustained non-zero rate pages the contract owner because it means the downstream has shipped a change. The fix is to extend the translator's mapping.

Wrong-answer notes: "let the error bubble" is the failure mode; without the catch-all, the model receives uncoded errors and the operator receives no signal.


What to do differently after reading this

  • For every existing tool, audit the error responses. Find the ones that pass through SDK exceptions; rewrite them as structured.
  • Build the catch-all UPSTREAM_UNCLASSIFIED for every tool, and wire a monitor on it.
  • Make retriable: false the default; the author has to argue for retriability case by case.
  • In design reviews, demand the error enum before the schema. The enum constrains the schema, not the other way around.

Bridge. Error contracts decide what the model sees when something fails. The next surface is what credentials the tool runs under — because the most expensive error is a successful call made under a credential broader than the operation required. The next chapter builds scope and credential isolation: one tool, one tenant, one purpose, no god-keys. → 06-scopes-and-credential-isolation.md