08. Human in the loop: designing the pause that protects the decision

~20 min read. Some decisions are too expensive to automate fully: production deploys, large financial actions, legal commitments, security containment. Human-in-the-loop is not an admission of weakness; it is a control-plane primitive. This file covers when to pause, what to hand the human, how to wait, and how to resume without corruption.

Builds on the first-principles overview in 00-first-principles.md. Human-time asymmetry (machines act in seconds while humans decide in hours) is the central tension. The approval gate is the mechanism: a typed workflow node that pauses execution, packages evidence, waits for human judgment, and resumes cleanly regardless of how long the pause lasted.

What file 07 established and what remains

File 07 introduced the plan-execution manager and its deviation response: retry, fallback, replan, escalate, or abort. "Escalate" means transferring a decision to a human. But escalation creates a fundamentally different kind of workflow pause — one measured in minutes to days, not milliseconds. The workflow must survive that time gap: state must remain durable, context must be packaged for human consumption, and resume logic must handle a world that may have changed during the wait. This file makes that mechanism precise.

The $47,000 refund that got approved because the evidence packet was empty

A customer-support workflow handles refund requests. The automation determines eligibility, calculates amount, and routes high-value cases to a finance reviewer. A $47,000 refund triggers the human gate correctly — the workflow pauses.

The reviewer receives a notification: "Refund request #R-8291 requires approval. Amount: $47,000." No transaction history. No policy citation. No confidence explanation. No proposed action with reasoning. The reviewer, handling 30 pending approvals, clicks "Approve" based on the amount alone. The refund was fraudulent.

The gate fired correctly. The evidence packet was the failure. The human couldn't make a quality decision because the workflow didn't provide decision-grade context.

Bad approval gate:
  trigger: amount > $10,000  ✓ (correct)
  evidence packet:           ✗ (empty — just amount and ID)
  reviewer decision quality: garbage-in → garbage-out

Good approval gate:
  trigger: amount > $10,000  ✓
  evidence packet:           ✓ (transaction history, policy match,
                                 confidence score, fraud signals,
                                 proposed action with reasoning)
  reviewer decision quality: informed judgment

Teacher voice. The approval gate is not a speed bump. It's a decision node with a human executor. Like any node in the workflow graph, it needs a typed input contract — the evidence packet is the contract between the machine and the human reviewer.

The invariant: the human gate is a node with a contract, not a side chat

An approval gate follows the same rules as any workflow node from file 02: it has inputs (the evidence packet), preconditions (what state must exist before pausing), a success signal (human provides a valid decision), a timeout policy, and a resume path. The only difference is that the executor is a human, not an agent or tool — and humans operate on a different timescale.

This reframing matters. Teams that treat human review as "send a Slack message and hope someone responds" get: missed reviews, stale context, no timeout enforcement, no audit trail, and corrupt resume paths. Teams that treat human review as a typed workflow node get: structured evidence, enforced timeouts, auditable decisions, and clean resume semantics.
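One way to make the "typed node" idea concrete is to write the gate's contract down as data. The sketch below is illustrative only: the `ApprovalGate` class, its field names, and the step names `denial_handling` and `gather_more_info` are assumptions, and the values are borrowed from the loan-approval gate discussed later in this file.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Any, Callable, Dict, Tuple


@dataclass
class ApprovalGate:
    """Contract for a human-executed workflow node (illustrative shape)."""
    name: str
    precondition: Callable[[Dict[str, Any]], bool]  # state required before pausing
    required_evidence: Tuple[str, ...]              # inputs for the evidence packet
    allowed_decisions: Tuple[str, ...]              # what counts as a valid answer
    timeout: timedelta                              # how long the gate may wait
    resume_paths: Dict[str, str]                    # decision -> next step name

    def can_pause(self, state: Dict[str, Any]) -> bool:
        """True when every evidence input exists and the precondition holds."""
        has_inputs = all(key in state for key in self.required_evidence)
        return has_inputs and self.precondition(state)

    def next_step(self, decision: str) -> str:
        """Success signal: only a declared decision can resume the workflow."""
        if decision not in self.allowed_decisions:
            raise ValueError(f"invalid decision for {self.name}: {decision!r}")
        return self.resume_paths[decision]


# Step names other than issue_decision are hypothetical placeholders.
loan_gate = ApprovalGate(
    name="loan_compliance_review",
    precondition=lambda s: s.get("compliance_flag") == "review",
    required_evidence=("credit_score", "flag_reason", "policy_citation"),
    allowed_decisions=("approve", "deny", "request_more_info"),
    timeout=timedelta(hours=48),
    resume_paths={
        "approve": "issue_decision",
        "deny": "denial_handling",
        "request_more_info": "gather_more_info",
    },
)
```

With a contract like this, "send a Slack message and hope" has nowhere to hide: a gate without a timeout or a resume path simply cannot be constructed.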

Two distinct patterns: approval vs escalation

The plan-execution manager (file 07) can trigger human involvement for two different reasons:

APPROVAL:
  "The automation knows what to do. It needs permission to proceed."
  Evidence packet: proposed action + justification + risk assessment
  Human job: authorize or deny the specific proposed action
  Resume: continue with approved action, or branch to denial handling

ESCALATION:
  "The automation doesn't know what to do. It needs human judgment."
  Evidence packet: situation summary + options considered + why automation is stuck
  Human job: decide what action to take (not just approve/deny)
  Resume: continue with human-chosen action injected into workflow state

| Dimension | Approval | Escalation |
| --- | --- | --- |
| Who decides the action? | Machine proposes, human authorizes | Human decides from scratch |
| Evidence packet focus | "Here's what I want to do and why" | "Here's what I see and why I'm stuck" |
| Decision complexity | Bounded choice (approve/deny/edit) | Open-ended |
| Timeout handling | Remind → escalate to manager | Remind → degrade to simpler automation |
| Resume path | Clear (approved action or denial branch) | Depends on human's decision |

Conflating these two patterns creates confused UIs and poor reviewer experience. A reviewer who expects to approve/deny but receives an open-ended "what should we do?" question will either guess or ignore. A reviewer who expects to make a complex decision but receives only an approve button will rubber-stamp.
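A minimal sketch of keeping the two patterns apart in code, assuming two hypothetical request types (`ApprovalRequest`, `EscalationRequest`) and a hypothetical `resume_action` helper. The point is only that each request type accepts a different kind of human response.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class ApprovalRequest:
    proposed_action: str
    justification: str
    risk_assessment: str


@dataclass
class EscalationRequest:
    situation_summary: str
    options_considered: Tuple[str, ...]
    why_stuck: str


def resume_action(
    request: Union[ApprovalRequest, EscalationRequest], human_response: str
) -> Optional[str]:
    """Return the action to continue with, or None when an approval is denied."""
    if isinstance(request, ApprovalRequest):
        # Approval: the human authorizes or denies the machine's proposal.
        if human_response == "approve":
            return request.proposed_action
        if human_response == "deny":
            return None  # caller branches to denial handling
        raise ValueError("approval gates accept only 'approve' or 'deny'")
    if isinstance(request, EscalationRequest):
        # Escalation: the human supplies the action itself.
        if not human_response.strip():
            raise ValueError("escalation gates need a human-chosen action")
        return human_response
    raise TypeError(f"unsupported request type: {type(request).__name__}")
```

An "edit" response for approvals is left out here to keep the sketch short; it would be a third branch that returns the human's modified action.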

Evidence packet design: the handoff contract for human reviewers

The evidence packet is the handoff contract (file 02) between the machine and the human. It determines review quality. Every approval gate should produce:

evidence_packet:
├── decision_summary: 1-2 sentence plain-language description of what's being asked
├── proposed_action: what the automation will do if approved
├── supporting_evidence:
│   ├── relevant data (transaction, code diff, search results)
│   ├── policy citations (which rules apply)
│   └── confidence signals (model certainty, evidence completeness)
├── risk_assessment: what happens if this is wrong
├── alternative_actions: what else could be done (for escalations)
├── context_timeline: what steps already completed, what's pending
└── urgency_signal: SLA deadline, customer waiting time, escalation path

The test: could a reviewer who has never seen this workflow before make a quality decision from this packet alone, within 2 minutes? If not, the packet is incomplete.
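The field list above can be checked mechanically before the gate notifies anyone. This is a sketch under assumptions: the `EvidencePacket` class and `missing_fields` helper are hypothetical, and a presence check is only a floor. It cannot tell whether the content passes the 2-minute test.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class EvidencePacket:
    decision_summary: str = ""
    proposed_action: str = ""
    supporting_evidence: List[str] = field(default_factory=list)
    risk_assessment: str = ""
    alternative_actions: List[str] = field(default_factory=list)
    context_timeline: List[str] = field(default_factory=list)
    urgency_signal: str = ""


# Assumption: alternatives are mandatory only for escalations.
OPTIONAL_FOR_APPROVAL = {"alternative_actions"}


def missing_fields(packet: EvidencePacket, is_escalation: bool = False) -> List[str]:
    """List the required fields that are still empty, in declaration order."""
    missing = []
    for f in fields(packet):
        if not is_escalation and f.name in OPTIONAL_FOR_APPROVAL:
            continue
        if not getattr(packet, f.name):
            missing.append(f.name)
    return missing


# The refund notification from the opening story carried only a summary.
refund_packet = EvidencePacket(
    decision_summary="Refund request #R-8291 requires approval. Amount: $47,000."
)
gaps = missing_fields(refund_packet)
```

A gate could refuse to pause (or route to a fallback) whenever `gaps` is non-empty, so an empty packet never reaches a reviewer.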

Mini-FAQ. "Won't comprehensive evidence packets slow down the workflow?" Generating the packet happens in machine-time (milliseconds). The human decision happens in human-time (minutes to hours). Investing 500ms of compute to save the human 10 minutes of context-gathering is almost always the right trade.

Threaded example: loan-approval compliance review gate

Return to the loan workflow. Step 3 (compliance check) returns compliance_flag: "review". The plan-execution manager triggers the conditional human_review step. The approval gate activates:

Approval gate: loan_compliance_review

Trigger: compliance_flag == "review"

Evidence packet:
├── summary: "Applicant A-7291 flagged for manual compliance review.
│            Credit score 720 (acceptable) but employment history
│            shows recent industry change requiring manual verification."
├── proposed_action: "Approve loan application pending employment letter"
├── supporting_evidence:
│   ├── credit_score: 720 (source: Experian, pulled 2 min ago)
│   ├── identity_verified: true (government ID match)
│   ├── flag_reason: "INDUSTRY_CHANGE — moved from finance to tech 45 days ago"
│   └── policy_citation: "Policy §4.2: employment changes within 90 days
│                         require manual review for income stability"
├── risk_if_wrong: "Approve → default risk elevated 12% per actuarial model.
│                   Deny → lose legitimate applicant, appeal likely."
├── urgency: "Applicant waiting. SLA: 48h decision window. 46h remaining."
└── actions: [approve, deny, request_more_info]

Checkpoint saved: full workflow state at gate entry
Timeout policy: 48h → escalate to senior underwriter
Reminder: 24h → send reminder to assigned reviewer

The reviewer sees a structured decision interface — not a raw notification. They can approve (workflow continues to issue_decision), deny (workflow branches to denial handling), or request more info (workflow pauses again after gathering additional data).

Now the human takes 6 hours to decide. During those 6 hours:
- The workflow state is safely checkpointed (file 06's PostgresSaver)
- The dispatch loop monitors the timeout clock
- If the reviewer doesn't act within 48h, the policy engine escalates

After 6 hours, the reviewer approves. Resume logic:

Resume after approval:
1. Load checkpoint (state at gate entry)
2. Inject human decision: human_override = "approved"
3. Verify: has anything invalidated this decision?
   - credit_score still fresh? (pulled < 24h ago) → yes
   - applicant status unchanged? → yes (no new fraud signals)
4. Advance to next step: issue_decision
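The four resume steps can be sketched as a single function. Everything here is an assumption made for illustration: the `resume_after_gate` name, the `denial_handling` step name, the state layout, and the choice to send the workflow back to `human_review` when verification fails (aborting or re-pulling data are equally plausible responses).

```python
import copy
from typing import Any, Callable, Dict

# Hypothetical mapping from the injected decision to the next step.
RESUME_PATHS = {"approved": "issue_decision", "denied": "denial_handling"}


def resume_after_gate(
    checkpoint: Dict[str, Any],
    decision: str,
    still_valid: Callable[[Dict[str, Any]], bool],
) -> Dict[str, Any]:
    """Load, inject, verify, advance. Returns a new state dict."""
    if decision not in RESUME_PATHS:
        raise ValueError(f"unknown decision: {decision!r}")

    # 1. Load the checkpoint without mutating the saved copy.
    state = copy.deepcopy(checkpoint)

    # 2. Inject the human decision.
    state["human_override"] = decision

    # 3. Verify nothing invalidated the decision during the wait.
    if not still_valid(state):
        state["human_override"] = None       # discard the stale decision
        state["next_step"] = "human_review"  # assumption: re-ask the reviewer
        return state

    # 4. Advance to the step the decision selects.
    state["next_step"] = RESUME_PATHS[decision]
    return state


saved = {"applicant_id": "A-7291", "compliance_flag": "review"}
resumed = resume_after_gate(saved, "approved", still_valid=lambda s: True)
```

Copying the checkpoint before injecting the decision keeps the gate-entry state intact, so a failed resume can be retried from the same starting point.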

Timeout and abandonment: what happens when humans don't respond

Human-time asymmetry means workflows can wait indefinitely. Without timeout policy, paused workflows accumulate as stale zombies — consuming state storage, holding locks, and confusing dashboards.

Timeout policy spectrum:

immediate ────────────── gentle ────────────── aggressive

"Fail workflow     "Remind at 50%,         "Auto-approve
 after 1h"         escalate at 100%,        low-risk cases
                   fail at 150%"            after timeout"

| Timeout action | When appropriate | Risk |
| --- | --- | --- |
| Remind reviewer | Always (first action) | Low: just a nudge |
| Escalate to senior | When original reviewer is unresponsive | Medium: senior may lack context |
| Auto-deny (safe default) | When denial is the conservative action | Medium: may block legitimate work |
| Auto-approve | Only for low-risk, time-sensitive cases with strong evidence | High: defeats the purpose of the gate |
| Fail workflow | When the decision window has a hard SLA | Medium: requires restart or human triage |

The key: timeout policy is a product decision, not a backend default. Finance workflows and security workflows have fundamentally different timeout tolerances. The control plane must support per-gate timeout configuration.
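A sketch of per-gate timeout configuration, using the "remind at 50%, escalate at 100%, fail at 150%" ladder from the spectrum above. The `TimeoutPolicy` class, the `timeout_action` helper, and the `security_containment` gate with its 1h window are assumptions for illustration; the 48h loan window comes from the threaded example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TimeoutPolicy:
    window: timedelta          # the gate's nominal decision window
    remind_at: float = 0.5     # fraction of the window before a reminder
    escalate_at: float = 1.0   # fraction before escalating
    fail_at: float = 1.5       # fraction before failing the workflow


def timeout_action(policy: TimeoutPolicy, waited: timedelta) -> str:
    """Map elapsed wait time to the strongest action that is due."""
    fraction = waited / policy.window  # timedelta / timedelta gives a float
    if fraction >= policy.fail_at:
        return "fail"
    if fraction >= policy.escalate_at:
        return "escalate"
    if fraction >= policy.remind_at:
        return "remind"
    return "wait"


# Per-gate configuration: each gate carries its own tolerance.
GATE_POLICIES = {
    "loan_compliance_review": TimeoutPolicy(window=timedelta(hours=48)),
    "security_containment": TimeoutPolicy(window=timedelta(hours=1)),  # hypothetical
}

loan_policy = GATE_POLICIES["loan_compliance_review"]
```

With the loan policy, a 6h wait maps to `wait`, 24h to `remind`, and 48h to `escalate`. A real dispatch loop would also need to remember which actions have already fired so that a reminder is not re-sent on every tick.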

Resume semantics: the world may have changed

The most subtle failure in human-in-the-loop design: the workflow pauses for human review, the human approves hours later, and the workflow resumes into a world that has changed. The approval was valid at pause time — but is it still valid at resume time?

Freshness checks at resume:

1. Did the input data expire?
   - credit score older than 24h → re-pull before proceeding
   - identity verification older than 7 days → still valid

2. Did external state change?
   - applicant flagged for fraud during wait → abort, don't use approval
   - policy version updated → re-evaluate against new policy

3. Did the workflow state get modified?
   - another workflow already processed this applicant → detect conflict
   - budget allocation changed → verify decision still within budget

4. Is the human decision still semantically valid?
   - "approved with condition X" but condition X no longer applies → re-ask

Not every resume needs freshness checks. For short pauses (minutes), staleness is unlikely. For long pauses (hours to days), at least one check on the critical input data is essential. The checkpoint should record when it was written; resume logic should compare that timestamp against data freshness requirements.
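The timestamp comparison can be sketched as follows. The checkpoint layout (`pulled_at`, `policy_version`), the `freshness_problems` helper, the version strings, and the example dates are all assumptions; only the 24h credit-score limit comes from the checks listed above.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

# Maximum age per input before it must be re-pulled.
MAX_AGE = {"credit_score": timedelta(hours=24)}


def freshness_problems(
    checkpoint: Dict[str, Any],
    now: datetime,
    current_policy_version: str,
    fraud_flagged: bool,
) -> List[str]:
    """Return the reasons the stored decision context can no longer be trusted."""
    problems = []
    # 1. Did the input data expire?
    for name, pulled_at in checkpoint["pulled_at"].items():
        limit = MAX_AGE.get(name)
        if limit is not None and now - pulled_at > limit:
            problems.append(f"re-pull {name}")
    # 2. Did external state change during the wait?
    if fraud_flagged:
        problems.append("abort: fraud flag raised during wait")
    if checkpoint["policy_version"] != current_policy_version:
        problems.append("re-evaluate against new policy")
    return problems


# Illustrative values: the score was pulled 2 minutes before the gate paused.
saved_at = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
checkpoint = {
    "saved_at": saved_at,
    "pulled_at": {"credit_score": saved_at - timedelta(minutes=2)},
    "policy_version": "v1",
}
after_6h = freshness_problems(checkpoint, saved_at + timedelta(hours=6), "v1", False)
after_30h = freshness_problems(checkpoint, saved_at + timedelta(hours=30), "v2", False)
```

Here `after_6h` is empty, so the resume can proceed, while `after_30h` reports both a stale credit score and a policy change. Returning a list of reasons rather than a boolean lets the caller choose between re-pulling, re-asking, and aborting.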

Operational signals: healthy gate, degrading gate, broken gate

Healthy behaviour:
- Evidence packets generated in < 500ms
- Reviewer median response time within SLA (e.g., < 4h for standard, < 1h for urgent)
- Approval/denial ratio matches expected policy distribution (not 99% approve → rubber-stamping)
- Resume succeeds cleanly on first attempt

First degrading signal:
- Reviewer response time increasing → queue overload, unclear urgency, or poor evidence quality
- Approval rate approaching 100% → rubber-stamping (evidence packet not demanding enough attention)
- Resume failures increasing → freshness checks failing, state corruption during wait
- Timeout escalations increasing → wrong reviewer assignment, unclear ownership

Misleading metric:
- "Time to approve" alone: fast approval isn't valuable if it's uninformed. Measure decision quality alongside speed.
- "Number of gates triggered": fewer gates isn't always better. Some workflows genuinely need multiple human checkpoints.

Expert signal:
- Decision reversal rate: how often do downstream outcomes contradict the human's gate decision? High reversal → evidence packet quality problem.
- Reviewer confidence distribution: if reviewers report low confidence on most decisions, the evidence packet is insufficient.
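Two of these signals, approval rate and decision reversal rate, can be computed from a simple decision log. The sketch below assumes a hypothetical `GateDecision` record and treats the 99% figure as a default alert threshold; a sensible threshold depends on the policy distribution a given gate is expected to produce.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GateDecision:
    decision: str                                # "approve" or "deny"
    outcome_contradicted: Optional[bool] = None  # None until the outcome is known


def approval_rate(log: List[GateDecision]) -> float:
    """Share of approve/deny decisions that were approvals."""
    decided = [d for d in log if d.decision in ("approve", "deny")]
    if not decided:
        return 0.0
    return sum(d.decision == "approve" for d in decided) / len(decided)


def reversal_rate(log: List[GateDecision]) -> float:
    """Share of decisions with a known outcome that the outcome contradicted."""
    known = [d for d in log if d.outcome_contradicted is not None]
    if not known:
        return 0.0
    return sum(1 for d in known if d.outcome_contradicted) / len(known)


def looks_rubber_stamped(log: List[GateDecision], threshold: float = 0.99) -> bool:
    """Flag a gate whose approval rate reaches the alert threshold."""
    return approval_rate(log) >= threshold


log = [
    GateDecision("approve", outcome_contradicted=False),
    GateDecision("approve", outcome_contradicted=True),
    GateDecision("approve"),
    GateDecision("deny"),
]
```

For this log the approval rate is 0.75 and the reversal rate is 0.5 (one contradicted outcome out of two known). Tracking both together matters: a low approval rate with a high reversal rate still points to a packet quality problem.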

Boundary of applicability

Works unusually well:
- Compliance-heavy domains (finance, legal, healthcare) where regulatory requirements mandate human oversight
- High-consequence actions (production deploys, large transactions, security containment) where error cost justifies delay
- Escalation paths where automation genuinely lacks information to proceed

Becomes pathological:
- When gates multiply to the point where the workflow is effectively manual with extra steps
- When evidence packets are auto-generated without quality control (rubber-stamping follows)
- Real-time user-facing flows where human latency kills the product experience

Scale that invalidates naive intuition:
- At 1000+ approvals/day, individual review quality degrades unless routing ensures specialist reviewers see relevant cases
- At human response times > 24h, the freshness problem dominates, and most resume logic needs re-validation

Failure-prone assumption: "the human gate guarantees safety"

The seductive wrong idea: "We added human review, so we're safe."

The correction: A human gate guarantees only that a human saw something. Whether they made a quality decision depends entirely on: (1) the evidence packet quality, (2) the reviewer's expertise match, (3) the reviewer's workload and attention, (4) the decision interface clarity, and (5) the timeout policy. A gate with a poor evidence packet and an overloaded reviewer provides the illusion of oversight without the substance.

Real-world implementations

Intercom Fin — uncertain or policy-sensitive customer replies route to human agents with full conversation context, policy citations, and suggested responses; agents approve, edit, or replace the AI draft
Harvey (legal AI) — document drafts pause at attorney review gates with clause-by-clause annotations showing extraction confidence, source citations, and flagged ambiguities
GitHub Copilot coding agent — proposes pull requests that require human review before merge; the evidence packet is the PR description, diff, test results, and CI status
ServiceNow Now Assist — approval workflows for access requests, change tickets, and procurement present structured evidence to approvers with policy context and risk signals
Stripe Radar (fraud review) — high-risk transactions pause for analyst review with transaction history, device fingerprints, behavioural signals, and model confidence scores
Tesla Autopilot disengagement — safety-critical decisions require driver takeover; the system provides visual and audio alerts with situation context
Palantir AIP — classified analysis workflows require analyst approval at key decision points with full provenance chains and confidence intervals
Notion AI document workflows — multi-step content generation pauses for user confirmation before publishing, presenting proposed changes with tracked diffs

Recall checkpoint

What's the difference between an approval gate and an escalation gate?
What fields belong in a decision-grade evidence packet?
Why must the approval gate be treated as a typed workflow node with a contract?
What freshness checks should run at resume time after a long human pause?
Why does a 99% approval rate signal a problem?
What timeout actions exist along the spectrum from gentle to aggressive?
When does adding more human gates make the system worse?

Interview Q&A

Q: Why is the evidence packet more important than the gate trigger condition?
A: The trigger determines when to pause. The evidence packet determines whether the human can make a quality decision. A perfect trigger with a poor packet produces rubber-stamping, which is worse than no gate at all because it creates false confidence.
Common wrong answer to avoid: "Because humans are busy." Busyness is real, but the core issue is decision quality, not speed.

Q: Why treat human review as a typed node rather than a Slack message?
A: Typed nodes have inputs (evidence packet), success signals (valid decision received), timeout policies, and resume paths. Slack messages have none of these: no enforcement, no audit trail, no structured resume. The workflow can't govern what it can't model as a node.
Common wrong answer to avoid: "Because Slack is informal." Formality isn't the point; workflow governance is.

Q: Why separate approval from escalation in workflow design?
A: Approval asks "may I proceed with this specific action?" and the human authorises. Escalation asks "what should I do?" and the human decides. Different evidence packets, different UI patterns, different resume paths. Conflating them confuses reviewers and degrades decision quality.
Common wrong answer to avoid: "Because escalation is harder." Difficulty isn't the distinction; what matters is who owns the decision.

Q: Why can long human pauses corrupt workflow state?
A: The world changes during the pause: input data goes stale, policies update, external systems change state, other workflows may process the same entity. Resuming without freshness checks can execute an action whose preconditions no longer hold.
Common wrong answer to avoid: "Because checkpoints expire." Checkpoints don't expire, but the external reality they reference does.

Q: Why is 99% approval rate a signal of broken design?
A: If reviewers approve nearly everything, either the gate triggers too often (low-risk cases shouldn't be gated), or the evidence packet doesn't surface enough information for reviewers to identify problems. Either way, the gate isn't providing the governance value it should.
Common wrong answer to avoid: "Because the AI is very accurate." Even accurate systems should have a non-trivial denial rate when the gate exists for risky cases.

Q: When should a timeout auto-approve rather than fail the workflow?
A: Only when: (1) the evidence strongly supports the action, (2) the risk of delay exceeds the risk of wrong action, (3) the decision is reversible, and (4) audit explicitly records that the action was auto-approved without human confirmation. In most compliance scenarios, auto-approve is inappropriate.
Common wrong answer to avoid: "When the reviewer doesn't respond in time." Non-response alone doesn't justify bypassing the governance reason for the gate.

Design/debug exercise (10 min)

Modeled: The loan-approval compliance gate fires. Evidence packet includes: credit score, flag reason, policy citation, risk assessment, and urgency signal. Reviewer approves after 6 hours. Resume logic verifies credit score freshness (< 24h, still valid), checks for new fraud signals (none), advances to issue_decision.

Your turn: Design an escalation gate for the same loan workflow. The compliance agent returns compliance_flag: "fail" with reason "applicant has active bankruptcy filing — outside standard policy." This is beyond the automation's authority to resolve. Write: (1) the escalation evidence packet, (2) the decision options presented to the human, (3) the resume path for each option, (4) the timeout policy.

From memory: Close this file and sketch: the approval vs escalation comparison table, an evidence packet with 6 fields for a workflow you know, and the resume freshness-check sequence after a 24-hour pause.

Operational memory

Human-in-the-loop is a control-plane primitive, not an admission of weakness. It protects high-consequence decisions by introducing a human executor at typed workflow nodes. The quality of that protection depends entirely on the evidence packet — the handoff contract between machine and human reviewer. A gate that fires correctly but hands the human empty context is worse than no gate, because it creates the illusion of oversight without the substance.

The two patterns — approval (machine proposes, human authorises) and escalation (machine is stuck, human decides) — need different evidence packets, different UIs, and different resume paths. The timeout problem is real: human-time asymmetry means workflows can wait hours or days. Timeout policies must be product decisions (remind, escalate, deny, fail) configured per gate. And resume logic must verify that the world hasn't changed during the wait — freshness checks on input data and external state prevent resuming into stale assumptions.

Remember:
- Evidence packet quality determines review quality; garbage-in, garbage-out applies to human reviewers too
- Approval gates ask "may I?"; escalation gates ask "what should I?"
- Checkpoint before the pause (not after) so crashes during human-wait don't lose progress
- Timeout policies are product decisions: remind → escalate → deny/fail, configured per domain
- 99% approval rate signals rubber-stamping, not accuracy
- Resume after long pauses needs freshness checks on the data that informed the decision
- The approval gate is a workflow node with a typed contract, not a Slack notification

Bridge. The approval gate pauses for human judgment. But crashes, infrastructure failures, and process restarts also interrupt workflows — without the courtesy of a designed pause point. Next we build checkpoint-and-recovery: the mechanism that makes workflows resume cleanly after unplanned interruptions. → 09-checkpoint-recovery.md