07. Irreversible actions and approvals

~8 min read. Credentials bound what the tool can do. Some actions, even authorised, cannot be undone — money transfers, mass deletions, external sends, infrastructure destruction. The approval layer is the human-in-the-loop gate for actions classified as irreversible.

Continues from 06-secrets-and-credentials-in-tools.md. This chapter develops the approval layer. Recurring concepts in bold: action classification, approval gate, dry-run, idempotency token, delayed-execution cancel window, two-party approval.

A reversible action can be undone in software; the apparatus and the user have a recovery path. An irreversible action lands in the world: an email sent, money transferred, a record deleted, a server destroyed. The approval layer's job is to ensure no model decision, alone, lands an irreversible action.


What classifies as irreversible

Three categories:

Category | Example | Why irreversible
External side-effect | Email to a third party, social-media post, payment to vendor | The recipient has the artefact; recall is best-effort
Mass change | Delete > N records, modify > N resources, ship a release | Recovery scales with the change; bounded undo costly
Regulated action | Healthcare record write, financial transaction, legal filing | Regulators require human approval; audit demands the trail

The classification is conservative: when in doubt, classify as irreversible. A misclassified reversible action costs an extra approval click; a misclassified irreversible action costs an incident.
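
The conservative rule can be sketched as a small classifier. This is a minimal illustration, not a prescribed implementation: the names `ProposedAction`, `classify`, and `MASS_CHANGE_THRESHOLD` are hypothetical, and the threshold value stands in for the unspecified "N" in the table. The point it shows is that an unknown attribute is treated as "in doubt" and therefore lands on the irreversible side.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionClass(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical value standing in for the "N" of the mass-change category.
MASS_CHANGE_THRESHOLD = 100


@dataclass(frozen=True)
class ProposedAction:
    name: str
    external_side_effect: Optional[bool] = None  # None means "not known"
    regulated: Optional[bool] = None
    affected_count: Optional[int] = None


def classify(action: ProposedAction) -> ActionClass:
    """Return REVERSIBLE only when every check is known and negative."""
    facts = (action.external_side_effect, action.regulated, action.affected_count)
    if any(fact is None for fact in facts):
        return ActionClass.IRREVERSIBLE  # when in doubt, classify as irreversible
    if action.external_side_effect or action.regulated:
        return ActionClass.IRREVERSIBLE
    if action.affected_count > MASS_CHANGE_THRESHOLD:
        return ActionClass.IRREVERSIBLE
    return ActionClass.REVERSIBLE
```

A tool whose attributes were never filled in is gated by default; the cost of that default is an extra approval click, which is the cheaper of the two misclassification costs.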


The approval gate

The gate is a human-in-the-loop checkpoint that the tool's execution does not cross until approval is granted.

Forms:

  • UI confirmation. The user is shown the proposed action with parameters; they click to approve.
  • Out-of-band channel. An approval request is sent to a separate channel (Slack, email, ticket); execution waits.
  • Delayed execution with cancel window. The action is scheduled with a cancel window (5 minutes, say; the right length depends on the action and the workflow); if the user does not cancel, it proceeds.
  • Two-party approval. Two different roles must approve; useful for high-stakes irreversible actions.

The gate is per-action: the tool can be called many times with reversible parameters and proceed; when the parameters classify as irreversible, the gate fires.
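
One way to picture a per-action gate is a wrapper around the tool. The sketch below is an assumption-laden illustration: `gated`, `ApprovalDenied`, `delete_tool`, and `human_declines` are hypothetical names, and the "more than 10 ids" rule is an arbitrary stand-in for a real classification. The same tool proceeds freely with reversible parameters and blocks when the parameters classify as irreversible.

```python
from typing import Any, Callable, Dict, List

Params = Dict[str, Any]


class ApprovalDenied(Exception):
    """Raised when an irreversible call does not receive approval."""


def gated(
    tool: Callable[[Params], Any],
    is_irreversible: Callable[[Params], bool],
    request_approval: Callable[[Params], bool],
) -> Callable[[Params], Any]:
    """Wrap a tool so that irreversible parameter sets wait on approval."""

    def call(params: Params) -> Any:
        # The gate fires per call, based on the parameters of this call.
        if is_irreversible(params) and not request_approval(params):
            raise ApprovalDenied(f"not approved: {params}")
        return tool(params)

    return call


# Illustrative wiring (all hypothetical).
deleted: List[int] = []


def delete_tool(params: Params) -> int:
    deleted.extend(params["ids"])
    return len(params["ids"])


def human_declines(params: Params) -> bool:
    return False  # stand-in for a reviewer who does not approve


delete_records = gated(delete_tool, lambda p: len(p["ids"]) > 10, human_declines)
```

In a real system `request_approval` would block on a UI click, an out-of-band reply, or a second approver; here it is a plain function so the control flow stays visible.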


Dry-run and preview

For complex irreversible actions (a Terraform plan, a database migration, a bulk update), the dry-run is a structural part of the approval:

  • The tool executes in dry-run mode: produces the artefact it would create, the changes it would make, the resources it would affect — without executing.
  • The user reviews the dry-run output.
  • The user approves the execution; the tool runs in actual mode.

The dry-run is read-only: it never mutates anything. The actual execution honours the dry-run's plan; if conditions change between the dry-run and the execution, the execution refuses and the user re-runs the dry-run (the conditions are part of what the user approved).
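
A minimal sketch of how "the conditions are part of what the user approved" might be enforced: the plan carries a fingerprint of the state it was computed against, and execution recomputes that fingerprint before touching anything. The names `Plan`, `dry_run`, `execute`, and `StalePlan` are hypothetical, and hashing the whole state is a simplification that only suits small examples.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple


def fingerprint(state: Dict[str, int]) -> str:
    """Stable hash of the state the plan was computed against."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


@dataclass(frozen=True)
class Plan:
    changes: Tuple[Tuple[str, str], ...]  # what would be done
    state_fingerprint: str                # conditions at dry-run time


class StalePlan(Exception):
    """The state changed after the dry-run; the approval no longer applies."""


def dry_run(state: Dict[str, int], ids_to_delete: List[str]) -> Plan:
    """Read-only: describe the changes without mutating state."""
    changes = tuple(("delete", i) for i in ids_to_delete if i in state)
    return Plan(changes=changes, state_fingerprint=fingerprint(state))


def execute(state: Dict[str, int], plan: Plan, approved: bool) -> int:
    if not approved:
        raise PermissionError("plan was not approved")
    if fingerprint(state) != plan.state_fingerprint:
        raise StalePlan("state changed since dry-run; re-plan and re-approve")
    for _, record_id in plan.changes:
        del state[record_id]
    return len(plan.changes)
```

The approval attaches to the `Plan` object, not to the request that produced it, so a changed state forces a fresh dry-run and a fresh approval.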


Idempotency tokens

A common pattern that strengthens the approval layer: idempotency tokens.

  • The tool's action is parameterised with a token (typically a UUID generated for this action).
  • The downstream system records the token; if the same action with the same token is replayed, the system returns the original result without re-executing.
  • The approval includes the token; an approved action can be executed only once.

This defends against double-execution from network retries, agent retries, or accidental re-approval. The token is the contract: same token, same effect.
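
The token contract can be illustrated with an in-memory stand-in for the downstream system. Everything here is hypothetical (`DownstreamSystem`, `transfer`, `TokenReuseError`); a real downstream would persist tokens durably. Rejecting a reused token that arrives with different parameters is an added assumption, not something stated above.

```python
import uuid
from typing import Dict, Tuple


class TokenReuseError(Exception):
    """The same token was presented with different parameters."""


class DownstreamSystem:
    def __init__(self) -> None:
        self._seen: Dict[str, Tuple[Tuple[str, int], dict]] = {}
        self.executions = 0  # counts real executions, not calls

    def transfer(self, token: str, destination: str, amount: int) -> dict:
        params = (destination, amount)
        if token in self._seen:
            recorded_params, result = self._seen[token]
            if recorded_params != params:
                raise TokenReuseError(token)
            return result  # replay: original result, no re-execution
        self.executions += 1
        result = {"token": token, "destination": destination,
                  "amount": amount, "status": "executed"}
        self._seen[token] = (params, result)
        return result


system = DownstreamSystem()
token = str(uuid.uuid4())  # generated once, when the action is proposed
first = system.transfer(token, "vendor-42", 500)
retry = system.transfer(token, "vendor-42", 500)  # e.g. a network retry
```

After the retry, `system.executions` is still 1 and `retry` is the same result as `first`.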


A worked example — the payments tool

The Hyderabad fintech's payments tool initiates money transfers. The approval design:

  • Classification. All payment initiations are irreversible.
  • Dry-run. The tool produces a payment intent: source, destination, amount, currency, timing, idempotency token. No transfer happens.
  • Approval UI. The user sees the payment intent in the agent's UI. They click "Confirm payment."
  • Two-party approval for high amounts. Payments > ₹100,000 require a second approval from a designated authoriser.
  • Cancel window. Confirmed payments have a 60-second cancel window before the actual transfer.
  • Idempotency. The token is included in the bank API call; the bank stores it; retries return the original result.

A jailbroken model attempting to transfer funds is bounded by the approval gate: the user (or the second authoriser) does not approve, and no transfer occurs. The model's "intent" is captured in audit but produces no real effect.

This is the structural protection. It costs an extra UI step per payment; it eliminates the class of incident where the model unilaterally transfers money.
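
The design above can be sketched as a payment intent that tracks its own approvals and cancel window. This is an illustration under stated assumptions, not the fintech's code: `PaymentIntent` and its methods are hypothetical, time is passed in explicitly as `now` (seconds) to keep the example deterministic, and "two distinct approvers" simplifies the rule that the second approval comes from a designated authoriser.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional, Set

TWO_PARTY_THRESHOLD_INR = 100_000  # payments above this need a second approval
CANCEL_WINDOW_SECONDS = 60


@dataclass
class PaymentIntent:
    source: str
    destination: str
    amount_inr: int
    token: str = field(default_factory=lambda: str(uuid.uuid4()))
    approvers: Set[str] = field(default_factory=set)
    confirmed_at: Optional[float] = None
    cancelled: bool = False

    def required_approvals(self) -> int:
        return 2 if self.amount_inr > TWO_PARTY_THRESHOLD_INR else 1

    def approve(self, approver: str, now: float) -> None:
        self.approvers.add(approver)  # a set: one person cannot count twice
        if self.confirmed_at is None and len(self.approvers) >= self.required_approvals():
            self.confirmed_at = now  # the cancel window starts here

    def cancel(self, now: float) -> bool:
        window_over = (self.confirmed_at is not None
                       and now - self.confirmed_at >= CANCEL_WINDOW_SECONDS)
        if window_over:
            return False  # too late to cancel
        self.cancelled = True
        return True

    def may_transfer(self, now: float) -> bool:
        return (not self.cancelled
                and self.confirmed_at is not None
                and now - self.confirmed_at >= CANCEL_WINDOW_SECONDS)
```

Creating the intent is the dry-run: no money moves until `may_transfer` returns true, and the intent's `token` is what would accompany the bank API call.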


When approval is too friction-heavy

Some workflows cannot bear UI approval per action:

  • High-frequency automated workflows (a batch process touching thousands of records).
  • Server-initiated actions where no user is present.
  • Operations where a human-in-the-loop is genuinely impractical.

The patterns for these:

  • Approval at the policy level, not the call level. A class of actions is approved in advance (e.g., "this batch may touch up to N records of type X"); individual calls within the policy proceed.
  • Approval at the limit level. Within a defined limit, the action proceeds; beyond the limit, approval fires.
  • Asynchronous review. The action proceeds but is queued for human review within a defined window; if review finds it wrong, the rollback path is structured.
  • Confidence-weighted approval. High-confidence actions proceed; low-confidence actions fire the gate.

Each pattern is a defensible alternative for friction-heavy contexts. The discipline is to be explicit about which pattern is used for which workflow.
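
Policy-level and limit-level approval share a shape: a human approves a budget once, and each call is checked against what remains. The sketch below is hypothetical (`BatchPolicy` and `authorise` are illustrative names) and omits the expiry and review date a real policy would probably carry.

```python
from dataclasses import dataclass


@dataclass
class BatchPolicy:
    """A pre-approved class of actions: up to max_records of one record type."""
    record_type: str
    max_records: int
    used: int = 0

    def authorise(self, record_type: str, count: int) -> bool:
        """True if the call fits the policy; False means the gate should fire."""
        if record_type != self.record_type:
            return False  # outside the approved class
        if self.used + count > self.max_records:
            return False  # beyond the approved limit
        self.used += count
        return True


policy = BatchPolicy(record_type="X", max_records=1000)
```

A `False` result is not a refusal of the action; it means this call falls outside what was approved in advance and needs call-level approval.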


Operational signals

Healthy. Every tool with irreversible actions has the approval layer. Approval timing is reasonable (no excessive delays). The classification is reviewed quarterly.

First degrading metric. Approval rate climbing to near 100%. Either every action genuinely needs approval (correct classification, expected workflow) or users are rubber-stamping (approval has degraded to clicks). Investigation distinguishes.

Misleading metric. Number of approvals issued. High volume is not necessarily a problem; rubber-stamped approvals are.

Expert graph. Per-tool approval rate, mean time to approval, rejection rate, after-the-fact incidents traced to approved actions.
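
The per-tool figures in the expert graph could be computed from a log of approval decisions along these lines. The record shape (`Decision`) and the function `summarise` are assumptions made for illustration; the source does not specify how decisions are logged.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Decision:
    tool: str
    approved: bool
    seconds_to_decision: float
    incident: bool = False  # an incident later traced to this action


def summarise(decisions: List[Decision], tool: str) -> Dict[str, Optional[float]]:
    rows = [d for d in decisions if d.tool == tool]
    if not rows:
        raise ValueError(f"no decisions recorded for {tool!r}")
    approved = [d for d in rows if d.approved]
    return {
        "approval_rate": len(approved) / len(rows),
        "rejection_rate": (len(rows) - len(approved)) / len(rows),
        "mean_seconds_to_approval": (
            mean(d.seconds_to_decision for d in approved) if approved else None
        ),
        "incidents_after_approval": float(sum(d.incident for d in approved)),
    }
```

Read together, a near-100% approval rate with a very short mean time to approval and a non-zero incident count is the combination that would suggest rubber-stamping rather than correct classification.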


Boundary of applicability

Strong fit. Tools that touch money, external parties, regulated data, or mass changes. The approval layer is non-negotiable.

Pathology. Approval applied to reversible actions out of caution. Every approval costs latency and user attention; over-broad approval produces approval fatigue, which then degrades the approval signal for actions that genuinely need it.

Scale limit. Very high-volume workflows benefit from policy-level approval rather than call-level. The pattern remains; the granularity shifts.


Failure-prone assumption

The seductive wrong belief: the model will not initiate destructive actions if the prompt is clear. The model will, under enough adversarial pressure, edge cases, or genuine ambiguity. The approval layer is the structural defence: regardless of why the model initiated the action, the action does not land without approval. The correct belief: approval is enforced; model behaviour is not relied on.


Where this appears in production

  • A fintech has all payment initiations through approval; jailbreaks have produced approval requests, never transfers.
  • A devops AI has Terraform plans always dry-run and reviewed; no apply happens without explicit user click.
  • A customer-support AI has refund actions approved by a senior agent for amounts above a threshold.
  • A coding assistant treats the merge as the irreversible action; the GitHub PR it opens is the approval gate (humans review and merge).
  • A healthcare AI has prescription writes approved by the prescribing clinician; the AI proposes; the clinician approves.
  • A telecom AI has bulk-message sends approved per campaign; per-message approval would be impractical.
  • A retail AI has order modifications above a value threshold approved by a CSR; below the threshold, the AI proceeds.
  • A consumer chatbot had an account-deletion tool with no approval; a single jailbreak deleted accounts; approval added post-incident.
  • A legal AI has contract amendment proposals reviewed by counsel; no amendment lands without sign-off.
  • A travel platform has flight bookings confirmed by the user; the AI prepares; the user confirms.
  • A government AI has approval flows mapped to regulatory requirements; every regulated action has a recorded approver.
  • A media AI has social-media posts in a draft queue; the social team approves before publication.
  • A B2B SaaS has API key generation as an irreversible action with two-party approval.
  • A search-ops AI has index drops with dry-run and approval; the dry-run shows what data would be lost.
  • A logistics AI has shipment rerouting approved when the new destination is in a different region.
  • A staffing AI has candidate-rejection messages reviewed; the AI drafts; a recruiter sends.
  • A document AI has document destruction with two-party approval; the second party is a different team.
  • An ad-tech AI has campaign launches approved by the campaign owner; the AI prepares; the owner launches.
  • A real-estate AI has property listings updated only with the owner's approval.
  • A medical AI has irreversible patient-record changes through clinician approval with audit trail.

Recall / checkpoint

  1. Name the three categories of irreversible actions.
  2. What are the four forms of approval gates?
  3. What is dry-run, and how does it strengthen approval?
  4. What does an idempotency token defend against?
  5. How does approval scale when call-level approval is impractical?
  6. What is approval fatigue, and how does the apparatus detect it?
  7. Why is "the model is well-prompted" not a substitute for the approval layer?

Interview Q&A

Q1. A team's payments tool has no approval gate. The lead argues the model would not transfer money incorrectly. Walk through the pushback.

The model is one jailbreak, one hallucination, one misunderstanding away from initiating an incorrect transfer. The approval layer is the structural defence: regardless of why the model initiated the action, the transfer does not land without approval. The lead's argument relies on model behaviour, which is probabilistic; the approval layer makes the system deterministic on the irreversible boundary. The cost is one extra click per payment; the avoided cost is incidents that scale to the bank account.

Common wrong answer to avoid: "we'll audit and recall if wrong." Recalls are best-effort and slow; prevention is structural.

Q2. Walk through the dry-run pattern for a Terraform plan.

The tool runs terraform plan: produces the proposed change list, the resources affected, the differences from current state. No apply happens. The user reviews the plan; if accepted, the tool runs terraform apply with the exact plan that was approved. If conditions change between plan and apply (drift), the apply refuses; the user re-runs the plan. The pattern preserves the approval's specificity: the user approves a plan, not a goal.

Common wrong answer to avoid: "user approves the goal; tool figures out the plan." Too loose; the tool's interpretation can drift from the user's intent.

Q3. Approval gate for high-volume automated workflows seems impractical. How do you scope?

Policy-level approval. The class of actions is approved in advance (e.g., "this batch processor may delete up to 10,000 records of class X per run"); calls within the policy proceed. Beyond the policy, approval fires. The discipline is to make the policy explicit, time-bounded, and reviewed; not to declare "approval is impractical" and skip the layer entirely.

Common wrong answer to avoid: "skip approval for automated workflows." That skips the defence entirely.

Q4. The approval rate is climbing toward 100%. Why might that be a problem?

Rubber-stamping. The user has stopped reading the approval; clicks are reflexive. The approval has degraded from a thoughtful check to a UI tax. The fix is to investigate: are the approvals all legitimate (in which case the gate is correctly catching everything) or are users approving wrong actions (in which case the friction is too low to engage attention)? Mitigations: variable friction (more friction for higher-stakes actions), spot-checks of approved actions, and training users on what to look for.

Common wrong answer to avoid: "high approval rate means good UX." A near-100% approval rate can equally mean rubber-stamping, and a rubber-stamped gate is a degraded gate.

Q5. What is the idempotency token's role in the approval flow?

The token binds the approved action to a single execution. The user approves an action with a specific token; the downstream system records that token on execution; replays with the same token return the original result. This defends against double-execution from retries, network duplicates, or accidental re-approval. The approval is per-token; once executed, the token is exhausted.

Common wrong answer to avoid: "tokens are for retry logic." They are, but they also serve as the structural one-time-execution guarantee.

Q6. How do you choose between UI approval and out-of-band approval?

UI approval when the user is in front of the agent's UI and the action is part of their flow. Out-of-band (Slack, email, ticket) when the action arises in a context where the user is not actively in the UI, or when the action requires approval by a different role (e.g., manager approval of an employee-initiated action). Delayed-execution with cancel window when both forms are awkward. Two-party approval when the action's stakes justify the friction.

Common wrong answer to avoid: "UI for everything." That fails for asynchronous workflows.


Design / debug exercise (10 minutes)

Modelled example. Walk through the worked example (the payments tool). Verify the action classification, the dry-run, the approval forms, the idempotency token.

Your turn. List your team's tools. For each, classify actions as reversible, recoverable, or irreversible. For the irreversible set, specify the approval form. Identify any tool where the approval is missing or under-scoped.

Reproduce from memory. Write the three categories of irreversible actions and the four approval forms from memory. The signal of internalisation is that you can classify a new tool's actions and design its approval quickly.


Operational memory

This chapter explained the approval layer: a human-in-the-loop gate for irreversible actions, with classifications, gate forms, dry-runs, idempotency tokens, and policy-level alternatives for high-volume workflows. The important idea is that no model decision should land an irreversible action; approval is enforced regardless of why the model initiated the action.

You learned to classify actions conservatively, design appropriate approval gates per category, use dry-runs for complex actions, and detect approval fatigue. That solves the opening failure because the irreversible boundary is now defended by enforcement, not by model trust.

Carry this diagnostic forward: when a tool can land an action with no human in the loop, you have found the next approval gate to install.

Remember:

  • Three irreversible categories: external side-effect, mass change, regulated.
  • Four approval forms: UI, out-of-band, delayed-execution, two-party.
  • Dry-run strengthens approval for complex actions.
  • Idempotency tokens defend against double-execution.
  • Policy-level approval is the alternative for high-volume workflows.

Bridge. Approval bounds the action before execution. Output validation bounds what flows back from the tool to the model. The next chapter is the sanitisation discipline — treating tool output as untrusted input to the next model turn. → 08-output-validation-and-sanitisation.md