09. Human escalation: know when to call the senior doctor

~14 min read. Good AI systems do not try to win every case alone; they know when a human should take over.

Built on the ELI5 in 00-eli5.md. The senior doctor — human review for uncertain or high-stakes cases — is not a failure of the system. It is part of the system.

1) First picture: confidence is not permission

A model sounding confident does not mean it should act. A pipeline being available does not mean it should decide. The decision to escalate depends on uncertainty, stakes, and reversibility.

request arrives
    │
    ├── low risk + high confidence ─────→ auto handle
    ├── medium risk + mixed confidence ─→ queue or review
    └── high risk or low confidence ────→ human escalation

The simple version: The senior doctor is a planned route, not panic mode. If escalation feels like defeat, your product goals are childish. Reliable systems optimize for correct outcomes, not robot ego.
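The routing picture above can be sketched as a small function. This is a minimal sketch, not a prescribed design: the `Route` enum, the `route` function, the risk labels, and both cutoffs (0.85 and 0.60) are assumptions chosen for illustration.

```python
from enum import Enum


class Route(Enum):
    AUTO = "auto handle"
    REVIEW = "queue or review"
    HUMAN = "human escalation"


def route(risk: str, confidence: float,
          high_conf: float = 0.85, low_conf: float = 0.60) -> Route:
    """Pick a path from risk level ("low", "medium", "high") and confidence.

    The cutoffs are hypothetical defaults, not values from the text.
    """
    # High risk or low confidence always goes to a human.
    if risk == "high" or confidence < low_conf:
        return Route.HUMAN
    # Only low-risk, high-confidence requests are handled automatically.
    if risk == "low" and confidence >= high_conf:
        return Route.AUTO
    # Everything in between is queued for review.
    return Route.REVIEW
```

Note that the high-risk check comes first, so a confident model cannot talk its way past it.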

2) When should AI escalate?

Common triggers are predictable.

confidence below threshold,
conflicting evidence,
missing required tool data,
policy ambiguity,
high-cost irreversible action,
vulnerable user or sensitive domain.

escalation triggers
┌────────────────────────────────────────┐
│ confidence < 0.75                      │
│ retrieved facts disagree               │
│ account identity not fully verified    │
│ action changes money, health, or legal │
└────────────────────────────────────────┘

Escalation is not only about low model confidence. A high-confidence answer with missing identity proof is still unsafe. A fast answer in a regulated workflow may still require approval. The triage desk should combine risk and uncertainty.

Worked example. A support assistant can answer shipping FAQs automatically. It can also issue store credit. For FAQs, a confidence threshold of 0.70 may be fine. For store credit above ₹5,000, confidence is irrelevant. Human approval is mandatory. That is policy-based escalation.
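One way to combine these signals is to collect every reason for escalation instead of checking confidence alone. The sketch below assumes a hypothetical `Case` record and an `escalation_reasons` helper; the ₹5,000 limit and the 0.70 threshold come from the worked example, while the field names and reason strings are illustrative.

```python
from dataclasses import dataclass

CREDIT_APPROVAL_LIMIT_INR = 5_000   # from the worked example
CONFIDENCE_THRESHOLD = 0.70         # FAQ threshold from the worked example


@dataclass
class Case:
    """Hypothetical case record; field names are assumptions."""
    action: str                     # e.g. "faq_answer" or "store_credit"
    confidence: float
    amount_inr: int = 0
    identity_verified: bool = True
    facts_disagree: bool = False


def escalation_reasons(case: Case) -> list[str]:
    """Return every reason to escalate; an empty list means auto-handle."""
    reasons = []
    # Policy rule: fires regardless of how confident the model is.
    if case.action == "store_credit" and case.amount_inr > CREDIT_APPROVAL_LIMIT_INR:
        reasons.append("policy: store credit above limit")
    if not case.identity_verified:
        reasons.append("identity not fully verified")
    if case.facts_disagree:
        reasons.append("retrieved facts disagree")
    if case.confidence < CONFIDENCE_THRESHOLD:
        reasons.append("confidence below threshold")
    return reasons
```

Returning the full list, rather than a single boolean, also gives the reviewer the "why" for free.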

3) Build escalation queues with useful context

Now what is the operational trap? Teams escalate without enough evidence. Then humans must redo the whole case. That wastes time. A good escalation packet should include:

user request,
model draft,
retrieved evidence,
tool outputs,
confidence features,
failure signals,
recommended next action.

human review packet
┌─────────────────────────────┐
│ request: refund damaged TV  │
│ draft: approve ₹4000 credit │
│ confidence: 0.62            │
│ issue: order photo unclear  │
│ evidence: 3 retrieved docs  │
│ recommendation: review now  │
└─────────────────────────────┘

The simple version: The senior doctor should see the chart immediately, not ask five follow-up questions first. For example, a lending assistant cannot verify income consistency. The escalation packet includes:

applicant ID,
conflicting document fields,
raw extracted values,
source files,
machine recommendation: manual verification.

That lets the reviewer act quickly.
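A review packet like the ones described in this section can be modeled as a plain record that the queue validates before accepting. This is a sketch under assumptions: the `ReviewPacket` class, its field names, and the `missing_fields` check are hypothetical, and the sample values are taken from the damaged-TV packet shown earlier.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReviewPacket:
    """Hypothetical packet shape; field names are assumptions."""
    request: str
    draft: str
    confidence: float
    failure_signals: list[str]
    evidence: list[str]
    tool_outputs: dict[str, str] = field(default_factory=dict)
    recommendation: str = "review now"

    def missing_fields(self) -> list[str]:
        """Names of empty fields, so thin packets can be caught early."""
        return [name for name, value in asdict(self).items()
                if value in ("", [], {}, None)]

    def to_json(self) -> str:
        """Serialize for a review queue (non-ASCII is escaped by default)."""
        return json.dumps(asdict(self))


packet = ReviewPacket(
    request="refund damaged TV",
    draft="approve ₹4000 credit",
    confidence=0.62,
    failure_signals=["order photo unclear"],
    evidence=["doc-1", "doc-2", "doc-3"],  # placeholder document IDs
)
```

Here `packet.missing_fields()` reports `tool_outputs`, which tells the pipeline the reviewer would otherwise have to re-run the tools by hand.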

4) Thresholds should differ by consequence class

One universal confidence threshold is lazy. Different harms need different caution.

use case                    escalation rule
FAQ answer                  low threshold
order-status explanation    medium threshold
refund approval             high threshold or mandatory review
medical guidance            human review for many classes

The threshold is not about math elegance. It is about harm tolerance.

Worked example. Suppose the same model score of 0.78 appears in two cases.

Case A. "Summarize this help-center article." Auto-answer is fine.

Case B. "Confirm whether this chest pain looks low risk." Auto-answer is not fine.

Same score. Different action. The senior doctor rule depends on domain consequence.
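The same idea can be written as a per-class lookup instead of one global number. In this sketch the class names, the `AUTO_THRESHOLDS` table, and the specific values are assumptions for illustration; `None` stands for "always send to a human".

```python
from typing import Optional

# Hypothetical thresholds per consequence class.
# None means mandatory human review, whatever the score.
AUTO_THRESHOLDS: dict[str, Optional[float]] = {
    "faq_answer": 0.70,
    "order_status": 0.80,
    "refund_approval": 0.95,
    "medical_guidance": None,
}


def can_auto_answer(consequence_class: str, score: float) -> bool:
    """True only if the class allows automation and the score clears its bar."""
    # Unknown classes return None from .get(), so they default to human review.
    threshold = AUTO_THRESHOLDS.get(consequence_class)
    return threshold is not None and score >= threshold
```

With a score of 0.78, `can_auto_answer("faq_answer", 0.78)` is true while `can_auto_answer("medical_guidance", 0.78)` is false: same score, different action.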

5) Escalation must preserve user experience too

Now what is the user-facing danger? A product may escalate internally, but tell the user nothing useful. Then the user feels abandoned. The practical response: Expose the handoff honestly. Give expected timing. Give a summary of what was captured. Keep the conversation state.

better handoff message
"I have sent this to a human specialist.
They will review the order details I already collected.
You should not need to repeat them."

That sentence carries trust. The senior doctor path should feel continuous, not like a dropped call.

For example, a tax assistant escalates a filing question.

Bad behavior: "Please contact support."

Good behavior: "I have flagged this for a tax specialist because your case mixes state and international income. Your documents and my draft notes are attached to the review ticket."

Huge difference.
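A handoff message can be assembled from the same facts the reviewer receives, so the user hears who is taking over, why, what was captured, and when to expect a reply. The function name, its parameters, and the example timing below are assumptions, not a prescribed template.

```python
def handoff_message(specialist: str, reason: str,
                    captured: list[str], eta: str) -> str:
    """Build an honest handoff note: who, why, what was captured, and when."""
    items = ", ".join(captured)
    return (
        f"I have sent this to a {specialist} because {reason}. "
        f"They already have: {items}. "
        f"Expected response time: {eta}. "
        "You should not need to repeat anything."
    )


message = handoff_message(
    specialist="tax specialist",
    reason="your case mixes state and international income",
    captured=["your documents", "my draft notes"],
    eta="within 1 business day",  # hypothetical timing
)
```

Because the message is built from structured fields, it cannot silently degrade into "Please contact support."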

6) Human escalation must be measurable

Do not hide escalation inside operations. Measure it like any other path. Track:

escalation rate,
time to human pickup,
time to resolution,
overturn rate,
complaint rate after escalation,
false-positive escalation rate.

escalation dashboard
┌────────────────────────────┐
│ escalation rate = 6.2%     │
│ pickup time p95 = 4 min    │
│ overturn rate = 18%        │
│ missed escalation rate = ? │
└────────────────────────────┘

Now what is the hardest metric? Missed escalations. These are cases that should have reached the senior doctor, but did not. They are often found through audits, complaints, or incident review.
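Several of these numbers can be computed from a simple case log. The sketch below makes a few assumptions worth stating: `CaseRecord` and its fields are hypothetical, overturn rate is taken to mean the share of escalated cases where the human changed the model's draft, and the missed escalation rate is estimated only from auto-handled cases that were later audited.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseRecord:
    """Hypothetical log row; field names are assumptions."""
    escalated: bool
    pickup_minutes: Optional[float] = None          # set for escalated cases
    human_overturned_draft: bool = False
    audit_says_needed_human: Optional[bool] = None  # None = not audited


def percentile(values: list[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile; None when there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def escalation_metrics(records: list[CaseRecord]) -> dict:
    escalated = [r for r in records if r.escalated]
    pickups = [r.pickup_minutes for r in escalated if r.pickup_minutes is not None]
    # Missed escalations are only visible in auto-handled cases that were audited.
    audited_auto = [r for r in records
                    if not r.escalated and r.audit_says_needed_human is not None]
    missed = [r for r in audited_auto if r.audit_says_needed_human]
    overturned = [r for r in escalated if r.human_overturned_draft]
    return {
        "escalation_rate": len(escalated) / len(records) if records else None,
        "pickup_p95_minutes": percentile(pickups, 95),
        "overturn_rate": len(overturned) / len(escalated) if escalated else None,
        "missed_escalation_rate": (len(missed) / len(audited_auto)
                                   if audited_auto else None),
    }
```

The missed escalation rate stays `None` until someone audits auto-handled cases, which mirrors the "?" on the dashboard: the metric does not exist unless you go looking for it.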

7) Escalation can also be approval, not full takeover

Sometimes humans do not need to do the whole task. They only need to approve or reject the risky step.

AI prepares recommendation
        │
        ▼
human approves action
        │
        ▼
system executes with audit trail

The simple version: This keeps speed where possible, and control where necessary.

Worked example. A procurement assistant drafts a vendor payment exception. The human finance manager only approves the exception. The AI keeps the paperwork organized. The person keeps authority. That is a beautiful senior doctor pattern.
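The approve-then-execute flow can be sketched as a small gate that refuses to run the risky step without a recorded human decision. `ApprovalGate`, its `decide` method, and the audit entry fields are hypothetical names for illustration; a production system would persist the log rather than keep it in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class ApprovalGate:
    """Runs `execute` only after a human approves; logs every decision."""
    execute: Callable[[dict], str]
    audit_log: list[dict] = field(default_factory=list)

    def decide(self, proposal: dict, approver: str, approved: bool) -> Optional[str]:
        # The risky step runs only on explicit approval.
        result = self.execute(proposal) if approved else None
        # Both approvals and rejections are recorded for later audit.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "proposal": proposal,
            "approver": approver,
            "approved": approved,
            "result": result,
        })
        return result


# Usage: the AI drafts the exception, the finance manager decides.
gate = ApprovalGate(execute=lambda p: f"paid {p['vendor']}")
outcome = gate.decide({"vendor": "Acme", "amount": 1200}, "finance-manager", True)
```

Keeping the execution callable inside the gate means there is no code path that pays the vendor without also writing an audit entry.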

Where this lives in the wild

Klarna assistant — payment risk lead: routes high-value refund and charge-dispute actions to human approval queues even when the model appears confident.
Intercom Fin — operations manager: escalates policy-edge support cases with retrieved evidence and draft replies so agents review faster instead of restarting the conversation.
GitHub Copilot workspace actions — security-conscious engineering lead: requires human approval before repository-wide code modifications or secret-affecting changes are applied automatically.
Morgan Stanley internal assistant — compliance reviewer: uses manual review for ambiguous investment-policy interpretations despite strong model confidence because regulatory consequence is high.
Healthcare symptom triage assistants — clinical safety owner: escalate ambiguous or high-risk symptom combinations to nurses or physicians rather than pretending diagnostic certainty.

Pause and recall

Why is model confidence only one of several escalation signals?
What information should a human review packet contain?
Why should escalation thresholds vary by consequence class?
How can human escalation preserve continuity for the user?

Interview Q&A

Q: Why should escalation policy depend on consequence class rather than a single global confidence threshold?
A: Equal confidence does not imply equal acceptable risk, so harm tolerance must shape the handoff rule.
Common wrong answer to avoid: "Because confidence scores are unreliable, so thresholds never matter." Scores are imperfect but still useful inputs.

Q: Why is a human approval step sometimes better than full human takeover?
A: It preserves automation speed for routine preparation while reserving human judgment for the irreversible or risky decision.
Common wrong answer to avoid: "Because approvals are easier to staff than experts." Staffing may help, but control design is the main point.

Q: Why should escalation packets include machine evidence and not just the user message?
A: Reviewers need context on what the system saw, inferred, and failed to verify so they can act quickly and auditably.
Common wrong answer to avoid: "Because humans trust machine evidence more than user evidence." The goal is completeness, not bias.

Q: Why is missed escalation a critical metric?
A: Over-escalation is costly, but under-escalation allows risky cases to slip through without the human judgment they required.
Common wrong answer to avoid: "Because missed escalation mainly affects queue length." It mainly affects correctness and safety.

Apply now (5 min)

Exercise. Define escalation rules for one AI workflow. Choose at least four triggers, separate low-risk and high-risk paths, and specify whether the human does full takeover or approval only.

Sketch from memory. Draw the auto-handle versus senior doctor split. Add the review packet box, and write one sentence the user would see during handoff.

Bridge. Human escalation handles one risky case well. But some failures spread before any human can notice. Next we study cascading failure and containment. → 10-cascading-failure.md