
04. Idempotency and retry safety

A tool's class decides whether you need idempotency. This chapter builds the mechanism: where the key comes from, where it lives, how long it lives, what happens on retry, and the edge cases that ruin naïve implementations.


A payments engineer at a Hyderabad e-commerce company gets paged on a Sunday because a customer was refunded ₹14,000 four times in ninety seconds. The agent had been asked to issue a refund. The first call succeeded, but a load balancer eviction during a deploy killed the agent before it received the response. The orchestrator restarted, replayed the conversation, and the agent re-decided to issue the refund. The tool wrapper had an idempotency key, but the key was generated inside the wrapper, freshly, on each call. So the second attempt carried a brand new key. The third attempt, after another deploy retry, carried a third brand new key. The fourth carried a fourth. Four ₹14,000 refunds. The audit log was pristine; each call was perfectly recorded as a separate, legitimate, non-duplicate request.

The root cause is the most common idempotency bug in agent systems: the key was generated by the wrong actor. Idempotency keys generated downstream of the retry boundary do not protect against retries above that boundary. The key must be generated by the outermost layer that can decide to retry.
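
The incident can be reproduced in a few lines. This is a minimal sketch, not the system from the story: `FakePayments`, both wrapper functions, and the key string are hypothetical names, and the downstream is assumed to deduplicate on whatever key it is handed.

```python
import uuid


class FakePayments:
    """Hypothetical downstream that dedups on the idempotency key it is given."""

    def __init__(self):
        self.refunds = {}  # idempotency key -> amount

    def refund(self, amount, idempotency_key):
        # setdefault makes a repeated key a no-op
        self.refunds.setdefault(idempotency_key, amount)
        return idempotency_key


def wrapper_generates_key(payments, amount):
    # Anti-pattern: a fresh key per call, so every replay looks like a new request.
    return payments.refund(amount, idempotency_key=str(uuid.uuid4()))


def wrapper_receives_key(payments, amount, idempotency_key):
    # The key comes from the layer that decides to retry and is only passed through.
    return payments.refund(amount, idempotency_key=idempotency_key)


broken, fixed = FakePayments(), FakePayments()
step_key = "conv-1:step-7:issue_refund"  # created once, above the retry boundary

for _ in range(4):  # the original attempt plus three replays
    wrapper_generates_key(broken, 14_000)
    wrapper_receives_key(fixed, 14_000, step_key)

broken_count = len(broken.refunds)
fixed_count = len(fixed.refunds)
```

The downstream behaves identically in both runs; only the place where the key is created differs.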

This chapter teaches you to place the key correctly, store it correctly, and design the dedup window so retries become no-ops instead of duplicates.


What idempotency is, and what it is not

An idempotent operation is one where calling it twice with the same key produces the same result as calling it once. The same outcome; the same return value; the same audit trail on a per-key basis.

Idempotency is not:

  • A retry policy. Retry policy decides when to call again; idempotency decides what happens when you do.
  • An exactly-once guarantee. It is "at-least-once delivery with safe replay" — the call may run more than once, but only one effect lands.
  • A property of a tool. It is a property of a tool and a key strategy combined. The same tool can be idempotent under one key strategy and non-idempotent under another.

The unit of guarantee is the key. If you can generate stable, unique-per-logical-action keys, you can make most non-idempotent operations safely retriable. If you cannot, no amount of contract design will help.


Where the key must come from

The cardinal rule:

The key is generated by the outermost layer that can decide to retry.

In an agent system, the retry decision can be made at four layers, from outermost to innermost:

  1. The orchestrator / workflow engine — decides to replay a step (this is where the Sunday incident happened)
  2. The agent runtime / planner — decides to re-call the same tool after seeing an ambiguous response
  3. The contract layer — decides to retry on a transient error (UPSTREAM_TIMEOUT)
  4. The HTTP client inside the contract layer — retries on a TCP connect failure

Each layer can independently decide to retry. The key has to be generated above the topmost retrying layer for the same logical action.

In practice, for most agent systems, this means:

The key is generated at the step boundary of the workflow / agent loop, persisted in the workflow state, and passed down through every retry at every lower layer.

If the workflow engine retries a step, it must reuse the same key. If the agent loop re-invokes the tool, it must reuse the same key. If the contract layer retries on a transient error, it must reuse the same key. If the HTTP client retries, it must reuse the same key. All four layers retry against one key, and exactly one effect lands.
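
One way to get "one key per step, reused on every replay" is to mint the key lazily and store it alongside the workflow state. The sketch below assumes the state is a JSON-serialisable dict; `key_for_step` and the `idempotency_keys` field are illustrative names, and `json.dumps` stands in for a durable write.

```python
import json
import uuid


def key_for_step(state: dict, step_id: str) -> str:
    """Return the step's key, minting it only the first time the step is seen."""
    keys = state.setdefault("idempotency_keys", {})
    if step_id not in keys:
        keys[step_id] = str(uuid.uuid4())
    return keys[step_id]


# First attempt: the key is minted and saved with the rest of the workflow state.
state = {"conversation_id": "conv-1"}
first = key_for_step(state, "step-7")
persisted = json.dumps(state)  # stand-in for a durable write

# Crash and replay: the orchestrator reloads the state and asks for the key again.
replayed_state = json.loads(persisted)
second = key_for_step(replayed_state, "step-7")
other = key_for_step(replayed_state, "step-8")
```

The ordering matters: the state holding the key has to be durably written before the tool is called, otherwise a crash between the call and the write produces a fresh key on replay.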

What this rules out:

  • Generating the key inside the tool wrapper (the Sunday incident).
  • Generating the key from the model's output (the model can produce different keys for the same logical action on a re-prompt).
  • Generating the key from uuid4() at any layer below the workflow step boundary.

What it permits:

  • Workflow engine generates a key per step, persists it, passes it down. This is the right pattern in durable-workflow systems (02_durable_agent_workflows covers this from the workflow side; this chapter covers it from the tool side).
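  • A non-durable agent (single-process, in-memory) uses a key derived from (conversation_id, step_index, tool_name). The hash is stable across crashes only if conversation_id and step_index are themselves persisted and restored on restart; an agent that keeps them purely in memory loses the key with the process.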

What the key should hash over

A good idempotency key has three properties: it is unique per logical action, stable across retries of that action, and meaningless for any other action.

For most agent tools, a key derived from these inputs covers the cases:

key = hash(
    conversation_id     // which conversation / session
  , step_id             // which logical step within the conversation
  , tool_name           // which tool was chosen at this step
  , canonical(arguments) // a stable serialisation of the call arguments
)

The arguments hash is the part that ruins naïve implementations. Two retries of the same logical action may have slightly different arguments (e.g., a timestamp field updated on retry). The canonicalisation step is where you strip out fields that the model regenerates on retry but that do not change the logical effect.

Typical strip list:

  • Free-text fields the model regenerates (e.g., a reason string)
  • Timestamps the model includes for "freshness"
  • Trace IDs and correlation IDs at lower layers

Typical preserve list (these must be in the hash):

  • Resource IDs (which payment, which customer, which lead)
  • Amounts and quantities
  • Destinations (recipient, address)
  • Enum choices (reason_code, status, channel)

Get this wrong and you have two failure modes:

  • Over-canonicalised — two genuinely different actions get the same key. Dedup eats the second one. Lost work.
  • Under-canonicalised — one logical action retries and gets different keys. No dedup. Duplicate effect.

A practical heuristic: hash over the fields the underlying system would consider distinguishing. If Salesforce would consider two lead-create calls "the same lead", the hash should produce the same key for them.
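
The key derivation above can be sketched with the standard library. The strip list here (`reason_text`, `requested_at`, `trace_id`) is an assumption for a refund-like tool, not a universal list, and only top-level fields are stripped; nested arguments would need the same treatment recursively.

```python
import hashlib
import json

# Assumed strip list for a refund-like tool; every tool needs its own.
VOLATILE_FIELDS = {"reason_text", "requested_at", "trace_id"}


def canonical(arguments: dict) -> str:
    """Stable serialisation: drop volatile fields, sort keys, fix separators."""
    kept = {k: v for k, v in arguments.items() if k not in VOLATILE_FIELDS}
    return json.dumps(kept, sort_keys=True, separators=(",", ":"))


def idempotency_key(conversation_id, step_id, tool_name, arguments) -> str:
    parts = [conversation_id, step_id, tool_name, canonical(arguments)]
    # Encoding the list as JSON keeps the field boundaries unambiguous.
    payload = json.dumps(parts, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = {
    "payment_id": "pay_123",
    "amount_minor": 1400000,
    "reason_code": "damaged",
    "reason_text": "Customer says item arrived broken",
    "requested_at": "2024-05-05T10:00:00Z",
}
# Same logical action; the model reworded the reason and refreshed the timestamp.
retry = {**first, "reason_text": "Item was damaged on arrival",
         "requested_at": "2024-05-05T10:01:30Z"}
# A genuinely different action: the amount changed.
different = {**first, "amount_minor": 700000}

k1 = idempotency_key("conv-1", "step-7", "issue_refund", first)
k2 = idempotency_key("conv-1", "step-7", "issue_refund", retry)
k3 = idempotency_key("conv-1", "step-7", "issue_refund", different)
```

Here `k1` and `k2` match while `k3` differs. Adding `amount_minor` to the strip list would be the over-canonicalised failure; removing `reason_text` from it would be the under-canonicalised one.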


Where the key lives

The contract layer maintains a dedup store. On every incoming call:

  1. Compute or accept the key.
  2. Look up the key in the dedup store.
  3. If found, return the stored result without re-executing.
  4. If not found, execute, record (key → result) atomically with the side effect, and return.

The atomicity in step 4 is the load-bearing detail. If you record the key first and crash before the side effect, you will return the recorded "success" to a future retry without ever having done the work. If you record the side effect first and crash before storing the key, the next retry will execute the side effect again. Atomicity has two reasonable implementations:

Option A — Atomic on the downstream system. Pass the key to the downstream system as its own idempotency parameter (Stripe-style). The contract layer just records what it received; the downstream system handles dedup. This is the cleanest design when the downstream system supports it.

Option B — Two-phase record in the contract layer. Reserve the key in a "pending" state with a short TTL, execute the side effect, then transition the record to "complete". A retry seeing "pending" can wait; a retry seeing "complete" returns the stored result. A pending record older than its TTL is treated as a failure and the side effect can be retried.

Avoid:

  • A separate "dedup table" updated after the side effect, with no atomicity. This is the silent-duplicate factory.
  • A dedup table with eventual consistency on lookup. Two concurrent retries both miss the key and both execute.
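
Option B is essentially a small state machine per key. The following is a single-process sketch with an injectable clock so the TTL path can be exercised; `DedupStore`, `begin`, and the three return values are hypothetical names, and a real store would need the check-and-reserve in `begin` to be one atomic operation (for example a conditional write).

```python
import time
from dataclasses import dataclass
from typing import Any


@dataclass
class Record:
    status: str          # "pending" or "complete"
    expires_at: float    # pending TTL deadline (ignored once complete)
    result: Any = None


class DedupStore:
    def __init__(self, clock=time.monotonic):
        self._records = {}
        self._clock = clock

    def begin(self, key: str, ttl: float) -> str:
        """Return 'execute', 'in_flight', or 'replay' for this key."""
        now = self._clock()
        rec = self._records.get(key)
        if rec and rec.status == "complete":
            return "replay"
        if rec and rec.status == "pending" and now < rec.expires_at:
            return "in_flight"
        # Absent, or a pending record that outlived its TTL: (re)reserve.
        self._records[key] = Record("pending", now + ttl)
        return "execute"

    def complete(self, key: str, result: Any) -> None:
        self._records[key] = Record("complete", 0.0, result)

    def result(self, key: str) -> Any:
        return self._records[key].result


now = [0.0]                                   # fake clock for the walkthrough
store = DedupStore(clock=lambda: now[0])
a = store.begin("k", ttl=300)                 # first caller reserves
b = store.begin("k", ttl=300)                 # concurrent retry sees pending
now[0] = 301.0
c = store.begin("k", ttl=300)                 # reservation expired, retry may run
store.complete("k", {"refund_id": "r_1"})
d = store.begin("k", ttl=300)                 # later retry replays the result
```

Re-executing after an expired reservation is only duplicate-free if the downstream call is itself keyed (Option A), because the crashed attempt may have landed its side effect before dying.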

The dedup window

A dedup record cannot live forever (storage cost, key collision over time). Each tool has a window during which the key is honoured. After the window, the key is forgotten.

Picking the window is a tradeoff between two costs:

  • Window too short — a slow retry (network blip, system pause) lands after the window and the side effect executes again.
  • Window too long — storage cost grows; key reuse across unrelated actions becomes possible.

Reasonable defaults by class:

| Class | Dedup window | Reasoning |
|---|---|---|
| write-idempotent | Forever (resource ID is the key) | The resource ID is the dedup; the window is bounded by the resource's lifetime |
| write-non-idempotent (low value) | 24h | Most retries land within minutes; a day covers crashes and deploy windows |
| write-non-idempotent (high value) | 7d | Refunds, charges; cover a long incident, weekend support cycle |
| irreversible | 7d–30d | Match the recovery window; longer if regulatory |

The window is documented in the contract (chapter 02 covered the slot). If your downstream system has a shorter window than your contract claims, the contract is lying. Test this in the contract-testing layer (chapter 10).
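
The defaults and the "is the contract lying" check can be written down as data plus one comparison. This is a sketch under assumptions: the class labels with `-low` and `-high` suffixes and the function name are invented for illustration, and `None` is used to mean "never expires".

```python
from datetime import timedelta

# Defaults from the table above; None means "no expiry, the resource ID is the key".
DEDUP_WINDOWS = {
    "write-idempotent": None,
    "write-non-idempotent-low": timedelta(hours=24),
    "write-non-idempotent-high": timedelta(days=7),
    "irreversible": timedelta(days=30),  # upper end of the 7d–30d range
}


def contract_is_honest(claimed, downstream) -> bool:
    """True when the downstream system honours keys at least as long as claimed."""
    if downstream is None:   # downstream never forgets a key
        return True
    if claimed is None:      # contract claims forever, downstream does not
        return False
    return downstream >= claimed


# A contract that claims 24h on top of a downstream that forgets keys after 1h.
low_value_ok = contract_is_honest(
    DEDUP_WINDOWS["write-non-idempotent-low"], timedelta(hours=1)
)
# A contract that claims 24h on top of a downstream that keeps keys for 7 days.
covered_ok = contract_is_honest(timedelta(hours=24), timedelta(days=7))
```

The downstream figure should come from observed behaviour in contract tests rather than from documentation alone.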


What the contract layer does on retry

The full sequence, including retry decisions inside the contract layer itself:

1. Receive call (key, args, scope, version).
2. Validate schema, scope, version.
3. Look up key in dedup store.
   - If complete: return stored result; emit audit "dedup_hit".
   - If pending and within TTL: wait or return "in flight" error.
   - If pending and beyond TTL: treat as failed; proceed to execute.
   - If not present: proceed to execute.
4. Reserve key as "pending" with TTL.
5. Execute side effect against downstream system.
   - Pass key to downstream if it supports idempotency.
   - On transient error: retry up to N times *with the same key*.
   - On non-retriable error: record key as "complete-failed" with the error.
6. Record (key → result) atomically. Transition to "complete-success" or
   "complete-failed".
7. Emit audit.
8. Return result.

The non-obvious bits:

  • Step 3 "pending and beyond TTL" is the recovery hatch for cases where steps 4–6 crashed midway. The pending TTL must be longer than the realistic execution time of the tool, with margin.
  • Step 5's internal retries reuse the key. This is the second place implementations go wrong: an HTTP client inside the contract layer that mints a new key on each retry. The contract layer must inject the key it received.
  • Step 6's "complete-failed" record is important. A failed call should not be re-executed while its record is inside the dedup window; if the caller wants another attempt, it starts a new logical action with a new key. Storing the failure prevents accidental retry-on-replay.
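
Step 5's "retry with the same key" rule is easy to get right if the key is a parameter of the retry helper rather than something the HTTP client creates. A minimal sketch, with no backoff and with hypothetical names throughout (`execute_with_retries`, `flaky_send`, and this local `UpstreamTimeout`):

```python
class UpstreamTimeout(Exception):
    """Transient failure: safe to retry with the same key."""


def execute_with_retries(send, key, args, max_attempts=3):
    """Call send(key, args), retrying transient errors without changing the key.

    Assumes max_attempts >= 1.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(key, args)
        except UpstreamTimeout as exc:
            last_error = exc  # the same key goes out again on the next loop
    raise last_error


# Hypothetical downstream: applies the effect, then loses the first two replies.
seen_keys, effects = [], {}


def flaky_send(key, args):
    seen_keys.append(key)
    effects.setdefault(key, args)  # downstream dedups on the key
    if len(seen_keys) < 3:
        raise UpstreamTimeout("reply lost")
    return {"status": "ok", "key": key}


result = execute_with_retries(
    flaky_send, "conv-1:step-7", {"amount_minor": 1400000}
)
```

Three requests go out, all with one key, and the downstream records one effect.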

Concrete examples

Refund (write-non-idempotent)

@tool_contract(name="issue_refund", version="2.1.0", class_="write-non-idempotent")
def issue_refund(call: ToolCall) -> ToolResult:
    args = RefundSchema.validate(call.arguments)
    key = call.idempotency_key  # generated by workflow, passed through

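    record = dedup.lookup(key)
    if record and record.status == "complete":
        # Replay the stored outcome in the same ToolResult shape the first call returned.
        # Assumes the record exposes the `result` and `error` passed to dedup.complete().
        if record.error is not None:
            return ToolResult.error(record.error.code, retriable=False)
        return ToolResult.ok(record.result)
    if record and record.status == "pending" and not record.expired:
        return ToolResult.error("RETRY_IN_FLIGHT", retriable=True)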

    dedup.reserve(key, ttl=300)  # 5 min execution window
    try:
        result = payments_client.refund(
            payment_id=args.payment_id,
            amount_minor=args.amount_minor,
            idempotency_key=key,  # passed to downstream
        )
        dedup.complete(key, status="success", result=result, retention="24h")
        return ToolResult.ok(result)
    except UpstreamTimeout:
        # let outer layer retry; do NOT release the reservation
        return ToolResult.error("UPSTREAM_TIMEOUT", retriable=True)
    except NonRetriable as e:
        dedup.complete(key, status="failed", error=e, retention="24h")
        return ToolResult.error(e.code, retriable=False)

Email send (write-non-idempotent, lower stakes)

@tool_contract(name="send_email", version="1.3.0", class_="write-non-idempotent")
def send_email(call: ToolCall) -> ToolResult:
    args = EmailSchema.validate(call.arguments)
    key = call.idempotency_key

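    record = dedup.lookup(key)  # one lookup, so the check and the read see the same record
    if record and record.status == "complete":
        return ToolResult.ok(record.result)  # idempotent replay
    if record and record.status == "pending" and not record.expired:
        return ToolResult.error("RETRY_IN_FLIGHT", retriable=True)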

    dedup.reserve(key, ttl=60)
    result = mailer.send(
        to=args.to, template=args.template, params=args.params,
        message_id=key,  # sent as the Message-ID; this dedups only if the provider honours it
    )
    dedup.complete(key, status="success", result=result, retention="24h")
    return ToolResult.ok(result)

Upsert customer (write-idempotent — resource ID is the key)

@tool_contract(name="upsert_customer", version="3.0.0", class_="write-idempotent")
def upsert_customer(call: ToolCall) -> ToolResult:
    args = CustomerSchema.validate(call.arguments)
    # no separate idempotency_key — the resource ID is the dedup
    result = customers_client.upsert(
        customer_id=args.customer_id,
        fields=args.fields,
    )
    return ToolResult.ok(result)

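For write-idempotent tools (class 2), the dedup is implicit: the downstream system's upsert semantics handle it, so replaying the same customer_id and fields converges on the same record.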


Edge cases that ruin naïve implementations

The agent re-decides on retry. Workflow replays the step. The model is re-prompted with the same context. The model decides to call a different tool this time (e.g., "actually, ask the user first"). The original tool was never called. The new tool runs with a new key. This is the correct behaviour — the workflow's idempotency was at the step level, not at the tool-call level. The contract is preserved; the agent simply made a different choice. Make sure your design distinguishes "the same step" from "the same tool call."

The model passes a stale key. If the model is exposed to the idempotency key (e.g., it appears in tool descriptions), the model can include the same key in a new call meant to be a new action. The contract layer sees the key as a dedup hit and returns the old result, surprising the agent. Fix: do not put the key in the model-visible surface. The key is filled in by the contract layer or the workflow runtime, not by the model.
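
One way to enforce this is to assemble the final call outside the model and discard anything key-like the model supplied. A small sketch; `build_tool_call` and the field name `idempotency_key` are assumptions about how calls are shaped.

```python
def build_tool_call(model_arguments: dict, workflow_key: str) -> dict:
    """Assemble the call the contract layer sees; the model never supplies the key."""
    # Drop any key-like field the model may have copied from earlier context.
    arguments = {
        k: v for k, v in model_arguments.items() if k != "idempotency_key"
    }
    return {"arguments": arguments, "idempotency_key": workflow_key}


call = build_tool_call(
    {
        "payment_id": "pay_456",
        "amount_minor": 50000,
        "idempotency_key": "stale-key-from-context",
    },
    workflow_key="conv-1:step-9",
)
```

The stale value never reaches the dedup lookup; the call carries the key the workflow assigned to this step.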

The downstream system's dedup is shorter than yours. You record the key for 24h; the downstream system honours it for 1h. After 1h, a retry with the same key creates a duplicate. The downstream truth wins. Test it (chapter 10).

Partial success. The tool call performs three writes; the second crashes. Was the call idempotent? Only if the downstream system can replay the partial state on retry. If not, the contract has to flag this in the response as PARTIAL_SUCCESS and provide a corrective action; the model cannot blindly retry.
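
A sketch of what flagging partial success could look like. The result fields (`completed`, `failed_at`, `corrective_action`) are illustrative rather than a defined schema, and the sketch assumes a write that raises has applied nothing, which a real downstream may not guarantee.

```python
def run_writes(writes):
    """Run (name, fn) pairs in order and report exactly how far the call got."""
    completed = []
    for name, fn in writes:
        try:
            fn()
        except Exception as exc:
            status = "PARTIAL_SUCCESS" if completed else "FAILED"
            return {
                "status": status,
                "retriable": False,  # assumption: the caller must not blindly retry
                "completed": list(completed),
                "failed_at": name,
                "error": str(exc),
                "corrective_action": f"resume from '{name}'",
            }
        completed.append(name)
    return {"status": "OK", "completed": completed}


log = []


def ledger_write():
    raise RuntimeError("ledger write failed")


outcome = run_writes([
    ("debit", lambda: log.append("debit")),
    ("ledger", ledger_write),
    ("notify", lambda: log.append("notify")),
])
```

The model receives enough structure to resume at the failed write instead of replaying the first one.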

Two retries arrive in parallel. Both miss the dedup lookup (because neither has completed) and both execute. The fix is the "pending" record with a TTL: the second sees pending, waits (or returns RETRY_IN_FLIGHT), and the first finishes. Without the pending record, race-on-retry produces silent duplicates.

The dedup store is the bottleneck. Every call is a read-then-write on the dedup store. At scale, this is real load. The dedup store has to be highly available; if it goes down, every write-non-idempotent and irreversible tool stops. Treat it as a tier-zero dependency. Cache completed records aggressively (within the window), but never cache a miss or a pending state, since a stale miss lets a duplicate through. Use a partition strategy that survives single-shard failure.


How to recognise idempotency broken in the wild

Sample of observable symptoms:

  • Customer support tickets with "I was charged twice"
  • Duplicate rows in CRMs around deploy windows
  • Notifications received in pairs (two of the same email, two of the same SMS)
  • Spikes in downstream-system 4xx rates after agent platform restarts
  • Audit logs that show two contract-layer calls within seconds, with different keys, same arguments

If you see any of these, the failure is almost always one of three things: key generated in the wrong layer; dedup window shorter than the retry interval; or no atomicity between side effect and dedup record.
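
The last symptom in the list can be searched for mechanically. The sketch below assumes audit rows are dicts with `tool`, `args_hash` (a hash of the canonicalised arguments), `key`, and `ts` in seconds; those field names and the 120-second threshold are assumptions, not a real log schema.

```python
from collections import defaultdict


def suspected_duplicates(audit_rows, within_seconds=120):
    """Flag consecutive calls with the same tool and arguments but different keys."""
    groups = defaultdict(list)
    for row in audit_rows:
        groups[(row["tool"], row["args_hash"])].append(row)

    flagged = []
    for (tool, _args_hash), rows in groups.items():
        rows = sorted(rows, key=lambda r: r["ts"])
        for prev, cur in zip(rows, rows[1:]):
            close = cur["ts"] - prev["ts"] <= within_seconds
            if close and cur["key"] != prev["key"]:
                flagged.append((tool, prev["key"], cur["key"]))
    return flagged


rows = [
    {"tool": "issue_refund", "args_hash": "a1", "key": "k1", "ts": 0},
    {"tool": "issue_refund", "args_hash": "a1", "key": "k2", "ts": 25},
    {"tool": "send_email", "args_hash": "b7", "key": "k9", "ts": 30},
    {"tool": "issue_refund", "args_hash": "a1", "key": "k3", "ts": 60},
    {"tool": "issue_refund", "args_hash": "a1", "key": "k4", "ts": 88},
]
flagged = suspected_duplicates(rows)
```

On these rows the four refund calls produce three flagged pairs, and the lone email call produces none.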


Interview Q&A

Q1. Why does the idempotency key need to come from the outermost layer, not from the tool wrapper?

Because every layer above the wrapper can decide to retry, and a key generated by the wrapper is fresh on each retry. If the workflow engine replays a step, the agent re-decides and re-calls the tool, and the wrapper generates a new key, then the dedup store has no way to recognise the second call as a retry of the first. The key must be generated above the topmost retrying layer for the logical action (typically at the workflow step boundary) and passed down.

Wrong-answer notes: "the wrapper is closest to the call so it should generate the key" is the intuitive but wrong answer; the question is about which layer's retries the key must protect against.

Q2. What is wrong with recording the dedup key after the side effect lands?

Crash between the side effect and the record. The side effect is done; the dedup store does not know. The next retry will re-execute the side effect, producing a duplicate. The fix is two-phase: reserve the key as "pending" with a TTL before the side effect, then transition to "complete" after; alternatively, pass the key to the downstream system as its own idempotency parameter and let it handle atomicity.

Wrong-answer notes: "use a transaction" is too vague; the discussion is about the boundary between two systems (contract layer and downstream) that don't share a transaction.

Q3. How do you choose what fields to include in the idempotency key hash?

Include fields that distinguish logical actions: resource IDs, amounts, destinations, enums, anything the downstream system would consider semantically different. Exclude fields the model regenerates on retry but that don't change the effect: free-text reasons, timestamps for freshness, trace IDs. A practical test: if Salesforce would treat two calls as the same lead, they should hash to the same key. Get this wrong and you either dedup genuinely different work (over-canonical) or fail to dedup retries (under-canonical).

Wrong-answer notes: "hash everything" is the safe-sounding wrong answer; it makes retries with slightly different free-text fields look like new actions.

Q4. The dedup store goes down. What should the contract layer do for write-non-idempotent tools?

Refuse the calls. A write-non-idempotent tool whose dedup store is unavailable is unsafe to execute, because retries cannot be deduplicated. Return a structured error indicating the contract layer cannot guarantee idempotency right now; the agent surface should treat this as "system temporarily degraded, please try again." Letting the call through "because the request looks fine" is how the next incident starts.

Wrong-answer notes: "fall through and call anyway" trades correctness for availability, which is the wrong trade for write-non-idempotent.
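
The fail-closed behaviour from Q4 can be sketched as a guard in front of execution. `DedupUnavailable`, `guarded_call`, and the error code `IDEMPOTENCY_UNAVAILABLE` are hypothetical names chosen for this example.

```python
class DedupUnavailable(Exception):
    """Raised by the (hypothetical) dedup client when the store cannot be reached."""


NEEDS_DEDUP = {"write-non-idempotent", "irreversible"}


def guarded_call(tool_class, lookup, execute, key):
    """Refuse to execute when dedup cannot be checked for a class that needs it."""
    if tool_class in NEEDS_DEDUP:
        try:
            record = lookup(key)
        except DedupUnavailable:
            # Fail closed: no side effect without a working dedup check.
            return {"ok": False, "code": "IDEMPOTENCY_UNAVAILABLE", "retriable": True}
        if record is not None:
            return record  # replay the stored outcome
    return {"ok": True, "result": execute()}


calls = []


def store_down(key):
    raise DedupUnavailable("store unreachable")


refused = guarded_call(
    "write-non-idempotent", store_down, lambda: calls.append("refund"), "k1"
)
allowed = guarded_call(
    "write-idempotent", store_down, lambda: calls.append("upsert"), "k2"
)
```

The refund is refused without touching the downstream system, while the write-idempotent upsert proceeds because its safety does not depend on the dedup store.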


What to do differently after reading this

  • Pull idempotency-key generation up to the workflow / step boundary. Audit existing tools for the Sunday-incident pattern.
  • Document the dedup window per tool in the contract. Compare to the downstream system's actual behaviour.
  • Make the dedup store a tier-zero dependency with its own monitoring.
  • For each write-non-idempotent tool, write a test that hammers it with the same key concurrently and verifies exactly one side effect lands.
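
The concurrency test in the last item might look like the sketch below. It uses an in-memory store guarded by a lock as a stand-in for an atomic check-and-reserve; `DedupStore`, `try_reserve`, `issue_refund`, and `hammer` are all hypothetical, and a real test would call the actual tool against a test downstream.

```python
import threading


class DedupStore:
    """In-memory stand-in; the lock makes check-and-reserve a single atomic step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def try_reserve(self, key) -> bool:
        with self._lock:
            if key in self._state:
                return False
            self._state[key] = "pending"
            return True


effects = []
store = DedupStore()


def issue_refund(key):
    if not store.try_reserve(key):
        return "RETRY_IN_FLIGHT"
    effects.append(key)  # stands in for the real side effect
    return "OK"


def hammer(key, n=50):
    """Release n threads at once against the same key and collect their results."""
    results = []
    barrier = threading.Barrier(n)

    def worker():
        barrier.wait()  # line all threads up so they race on the reservation
        results.append(issue_refund(key))

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = hammer("conv-1:step-7")
```

Exactly one caller should get `OK` and exactly one effect should be recorded; removing the lock from `try_reserve` is a quick way to check that the test can actually fail.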

Bridge. Idempotency is the mechanism that makes retries safe. Retries happen because errors happen. The next question is what errors look like from the model's perspective — because the model is the client deciding whether to retry. The next chapter builds the error contract: structured shapes the model can branch on, retriability as an explicit flag, and the corrective-action field that turns errors into useful information. → 05-error-contracts-the-model-can-recover.md