06. Graceful degradation — keep the patient stable first

~13 min read. A system that cannot be perfect should still avoid becoming dangerous, misleading, or useless.

Built on the ELI5 in 00-eli5.md. The stability kit — reduced but safe service — is what keeps trust alive when the backup ambulance cannot fully restore normal care.


1) First picture: degraded does not mean broken

Users do not need magic. They need clarity, predictability, and enough help for the moment.

normal mode
full answer + tools + verification + actions

degraded mode
partial answer + no risky actions + clear limits

The simple version: Graceful degradation is not hiding failure. It is controlled simplification. The stability kit says, "I cannot do everything right now, but I can still do these safe things."

That is much better than silent nonsense.

2) Common degradation patterns in AI products

Different features degrade differently. A summary tool may shorten output. A support bot may answer FAQs but pause account changes. A coding agent may suggest patches but stop auto-applying edits. A research assistant may show sources only, without synthesis.

degradation ladder
┌─────────────────────────────────────────┐
│ full mode: generate + verify + act      │
│ mode 2: generate + verify, no action    │
│ mode 3: retrieve + template only        │
│ mode 4: honest apology + human queue    │
└─────────────────────────────────────────┘

See the ladder. The stability kit is not one state. It is several safe lower gears. For example, a commerce assistant normally:

  • finds orders,
  • explains policies,
  • initiates returns.

During tool instability, degraded mode becomes:

  • explain policy,
  • show last known order snapshot,
  • tell the user return initiation is temporarily paused.

User value drops. But the answer stays honest.
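The ladder can be sketched as an ordered enum plus a selector that picks the highest rung whose dependencies are healthy. This is a minimal sketch, not a definitive implementation: the names Mode and choose_mode, and the four boolean health flags, are hypothetical and do not come from any specific product.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Rungs of the degradation ladder, highest capability first."""
    FULL = 1           # generate + verify + act
    NO_ACTION = 2      # generate + verify, no action
    TEMPLATE_ONLY = 3  # retrieve + template only
    HUMAN_QUEUE = 4    # honest apology + human queue


def choose_mode(generator_ok: bool, verifier_ok: bool,
                tools_ok: bool, retrieval_ok: bool) -> Mode:
    """Pick the highest rung whose dependencies are all healthy.

    The health flags are assumptions for illustration; a real system
    would derive them from breakers, probes, or similar signals.
    """
    if generator_ok and verifier_ok and tools_ok:
        return Mode.FULL
    if generator_ok and verifier_ok:
        return Mode.NO_ACTION
    if retrieval_ok:
        return Mode.TEMPLATE_ONLY
    return Mode.HUMAN_QUEUE
```

For the commerce assistant, unstable order tools with a healthy generator and verifier would give choose_mode(True, True, False, True), which returns Mode.NO_ACTION: policies are still explained, returns are paused.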

3) Honest messaging is part of the system design

The production problem: Many products degrade silently. They give a low-quality answer as if it were normal. That harms trust. Users hate hidden downgrade more than visible limitation. The practical response:

Say what is unavailable. Say what remains available. Say what the user should do next.

bad degraded answer
"Your issue is resolved."

better degraded answer
"I can explain the likely policy right now,
but I cannot confirm your live account status.
Please use the human support link for account-specific action."

The simple version: The stability kit needs words, not just architecture. For example, a legal document assistant loses citation verification.

Bad degraded behavior: still writing confident legal summaries.

Good degraded behavior: "I can give a rough summary, but source verification is unavailable, so do not use this for filing or compliance approval."

This is graceful.

It protects the user, and protects your team.
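One way to make the three-part message hard to skip is to build it from a small structure that refuses to render unless all three parts are present. The sketch below assumes a hypothetical DegradedNotice type; the field names and the sentence template are illustrative choices, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradedNotice:
    """The three things a degraded answer should say (hypothetical type)."""
    unavailable: str  # what the system cannot do right now
    available: str    # what it can still do
    next_step: str    # what the user should do next, as a full sentence

    def __post_init__(self) -> None:
        # Refuse to build a notice that silently omits any of the parts.
        for name in ("unavailable", "available", "next_step"):
            if not getattr(self, name).strip():
                raise ValueError(f"degraded notice is missing '{name}'")

    def render(self) -> str:
        return (f"I cannot {self.unavailable} right now. "
                f"I can still {self.available}. "
                f"{self.next_step}")


notice = DegradedNotice(
    unavailable="confirm your live account status",
    available="explain the likely policy",
    next_step="Please use the human support link for account-specific action.",
)
print(notice.render())
```

The design choice here is that an incomplete notice fails loudly at construction time, so a silent downgrade cannot ship by accident.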

4) Degradation should remove riskier capabilities first

A mature design does not degrade randomly. It sheds danger before convenience.

capability stack
      safest
        ├── show static status
        ├── summarize known data
        ├── suggest possible next steps
        ├── call write tools
        └── commit irreversible action
      riskiest

When stress comes, cut from the bottom first. Keep reading before writing. Keep explanation before execution. Keep drafts before automatic commits.

For example, a code assistant has three capabilities.

  1. Explain code.
  2. Suggest edits.
  3. Apply edits automatically.

During a validation-service outage, degraded mode should keep 1, maybe keep 2, and disable 3. The triage desk and stability kit should both respect risk order.
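The risk ordering can be sketched as a list sorted from safest to riskiest, with degradation cutting from the end. The capability identifiers and the shed_riskiest helper below are hypothetical names chosen to mirror the capability stack; how many levels to cut for a given failure is a policy decision this sketch does not make.

```python
from typing import List, Sequence

# Ordered from safest to riskiest, mirroring the capability stack.
CAPABILITY_STACK: List[str] = [
    "show_static_status",
    "summarize_known_data",
    "suggest_next_steps",
    "call_write_tools",
    "commit_irreversible_action",
]


def shed_riskiest(stack: Sequence[str], levels_to_cut: int) -> List[str]:
    """Return the capabilities that stay after cutting from the risky end."""
    # Clamp so that a negative or oversized request cannot misbehave.
    cut = max(0, min(levels_to_cut, len(stack)))
    return list(stack[:len(stack) - cut])


# The code assistant during a validation-service outage: drop auto-apply only.
code_assistant = ["explain_code", "suggest_edits", "apply_edits_automatically"]
print(shed_riskiest(code_assistant, 1))
```

Because the list order encodes risk, no degraded state can keep a write capability while dropping a read-only one.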

5) Degradation needs explicit entry and exit rules

Now what is the operational trap? Teams add degraded copy, but never define when to enter or leave degraded mode. Then users see random behavior. That is bad product design.

enter degraded mode if:
- breaker is open
- retry budget exhausted
- critical verifier unavailable
- latency budget cannot support full path

exit degraded mode if:
- dependency healthy for probe window
- verifier recovered
- backlog under threshold

See the discipline. The sealed ward often triggers the stability kit. When health returns, we go back up the ladder carefully. For example, a tutoring assistant relies on a moderation model.

Moderation becomes unavailable. Degraded policy says:

  • disable open-ended student chat,
  • allow only curriculum-aligned lesson summaries,
  • queue complex conversations for later.

Once moderation health is stable for ten minutes, restore broader chat. That is not guesswork. That is policy.
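The entry and exit rules can be written as two explicit predicates over a health snapshot. The sketch below makes several assumptions that the text does not fix: the HealthSnapshot fields are hypothetical, the probe window reuses the ten minutes from the tutoring example, the backlog threshold of 100 is an arbitrary placeholder, any single entry condition is treated as sufficient to enter, and all exit conditions are required together before leaving.

```python
from dataclasses import dataclass

PROBE_WINDOW_SECONDS = 600.0  # ten minutes, as in the tutoring example
BACKLOG_THRESHOLD = 100       # placeholder value, tune per system


@dataclass
class HealthSnapshot:
    """Hypothetical health signals sampled at decision time."""
    breaker_open: bool
    retry_budget_exhausted: bool
    critical_verifier_available: bool
    latency_budget_ok: bool
    dependency_healthy_seconds: float  # time since the last failed probe
    backlog: int


def should_enter(s: HealthSnapshot) -> bool:
    """Any single entry condition is enough to step down."""
    return (s.breaker_open
            or s.retry_budget_exhausted
            or not s.critical_verifier_available
            or not s.latency_budget_ok)


def should_exit(s: HealthSnapshot) -> bool:
    """All exit conditions must hold before stepping back up (assumption)."""
    return (not should_enter(s)
            and s.dependency_healthy_seconds >= PROBE_WINDOW_SECONDS
            and s.backlog < BACKLOG_THRESHOLD)


def next_degraded(currently_degraded: bool, s: HealthSnapshot) -> bool:
    """Return whether the system should be in degraded mode after this check."""
    if currently_degraded:
        return not should_exit(s)
    return should_enter(s)
```

Requiring the full probe window before exit means a dependency that recovers for two minutes does not flip users back into full mode and out again.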

6) Measure degraded mode as a real product path

Do not hide degraded mode from analytics. Measure it separately. Users experience it differently. You need to know:

  • how often degradation happens,
  • which path triggered it,
  • success rate in degraded mode,
  • complaint rate in degraded mode,
  • whether users recover or churn.

degraded_mode_metrics
┌─────────────────────────────┐
│ share of traffic = 8%       │
│ p95 latency = 1.9 s         │
│ safe completion = 96%       │
│ complaint rate = 3.2%       │
└─────────────────────────────┘

The simple version: The stability kit is part of the product surface. Treat it seriously.
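Keeping the numbers separate can be as simple as keying every counter by mode. The following is a minimal in-memory sketch with a hypothetical ModeMetrics class; a production system would more likely emit these as labeled metrics to its existing telemetry stack. Latency percentiles and recover-or-churn tracking are left out to keep the example short.

```python
from collections import Counter
from typing import Optional


class ModeMetrics:
    """Per-mode counters so degraded traffic is never blended into full mode."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()
        self.safe_completions: Counter = Counter()
        self.complaints: Counter = Counter()
        self.triggers: Counter = Counter()  # which path caused degradation

    def record(self, mode: str, safe: bool, complained: bool,
               trigger: Optional[str] = None) -> None:
        self.requests[mode] += 1
        self.safe_completions[mode] += int(safe)
        self.complaints[mode] += int(complained)
        if trigger is not None:
            self.triggers[trigger] += 1

    def share_of_traffic(self, mode: str) -> float:
        total = sum(self.requests.values())
        return self.requests[mode] / total if total else 0.0

    def safe_completion_rate(self, mode: str) -> float:
        n = self.requests[mode]
        return self.safe_completions[mode] / n if n else 0.0

    def complaint_rate(self, mode: str) -> float:
        n = self.requests[mode]
        return self.complaints[mode] / n if n else 0.0
```

Calling complaint_rate("degraded") and complaint_rate("full") side by side shows the gap that a single blended complaint rate would hide.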

7) Worked end-to-end example

Suppose a medical triage chatbot normally:

  • takes symptoms,
  • asks follow-up questions,
  • classifies urgency,
  • suggests next steps.

Suddenly the evidence-verification service fails. Full mode is unsafe. Graceful degradation policy says:

  • keep symptom collection,
  • provide emergency warning signs,
  • stop personalized triage classification,
  • display a direct nurse hotline option.

The user still gets immediate safety guidance. The system avoids pretending certainty. That is textbook stability kit behavior. The senior doctor path remains available for risky cases.

Where this lives in the wild

  • GitHub Copilot — product reliability manager: keeps inline suggestions alive during backend stress while disabling high-risk autonomous edit flows until validation services recover.
  • Intercom Fin — customer support platform lead: degrades from action-taking support automation to policy explanation and support-routing only when account tools are unstable.
  • Khanmigo — education safety engineer: narrows to lesson-aligned hints when moderation or grounding services are unhealthy instead of continuing broad free-form tutoring.
  • Cursor — agent experience engineer: drops from auto-apply refactors to suggestion-only mode when repository checks or patch verifiers are unavailable.
  • Perplexity — search answer owner: can show source links and known result snippets even when live synthesis or reranking quality is degraded.

Pause and recall

  • Why is graceful degradation different from pretending the system is still in full mode?
  • What kinds of capabilities should usually be removed first during degradation?
  • Why must degraded messaging be explicit to the user?
  • What entry and exit rules should degraded mode have?

Interview Q&A

Q: Why should degraded mode remove risky capabilities before convenient capabilities?
A: Under uncertainty, protecting users from harmful actions matters more than preserving every feature surface.
Common wrong answer to avoid: "Because risky features are harder to code." Difficulty is not the design principle here.

Q: Why is honest degraded messaging better than silently returning lower-quality output?
A: Honest messaging preserves trust and lets users choose safer next steps instead of over-trusting a weak answer.
Common wrong answer to avoid: "Because users prefer apologies to solutions." Users prefer useful clarity, not empty apology.

Q: Why should degraded mode have explicit entry and exit conditions?
A: Without clear rules, product behavior becomes inconsistent and teams cannot reason about when full service is safe again.
Common wrong answer to avoid: "Because automation requires fixed thresholds everywhere." Judgment can still exist; it just needs policy boundaries.

Q: Why measure degraded mode separately from normal mode?
A: It has different success, complaint, and risk characteristics, so blended metrics hide real user experience.
Common wrong answer to avoid: "Because degraded mode traffic is too small to matter." Small traffic can still carry high importance.


Apply now (5 min)

Exercise. Take one AI feature and write a degradation ladder with four levels. For each level, list what stays, what goes, and what the user should be told.

Sketch from memory. Draw the capability stack from safest to riskiest. Circle where the stability kit cuts first, and note when the senior doctor becomes the correct next step.


Bridge. Degradation works only if the system knows how long to keep trying before stepping down. Next we divide time carefully with timeout budgets. → 07-timeout-management.md