07. Refusal logic: the no-fly desk decides when not to board the request

~14 min read. A useful assistant must know when honesty is safer than confidence.

Built on the ELI5 in 00-eli5.md. The no-fly desk — the checkpoint that denies unsafe trips — is where the system chooses refusal, abstention, or narrow safe help.

Refusal is not failure; uncontrolled confidence is

Teams sometimes fear refusals because they reduce engagement. That fear is understandable. It is also expensive when taken too far.

If the assistant is outside scope, lacks evidence, or lacks authority, then a confident answer is often the real failure. The no-fly desk exists to preserve truthfulness and bounded authority.

request
  │
  ▼
policy + confidence check
  │
  ├── in scope + grounded + authorized ──→ answer
  ├── partly safe but limited ───────────→ narrow safe help
  └── unsafe / out of scope / uncertain ─→ refuse

See the middle branch. Good refusal logic is not only yes or no. Sometimes the system should decline the dangerous part and still help with a safe adjacent step.

Three reasons to refuse

First, safety. The request seeks harmful instructions, illegal assistance, content that encourages self-harm, or other clearly disallowed content. Here refusal is policy-driven.

Second, scope. The assistant is being asked to do a job it was never designed or approved to do. A customer support bot should not draft medical advice. A coding bot should not approve refunds. A retrieval bot should not invent legal interpretation beyond cited policy.

Third, uncertainty. The assistant lacks evidence, confidence, or needed context. In these cases, "I do not know" is not weakness. It is disciplined behavior.

The no-fly desk should receive signals from many places. Moderation labels. Groundedness scores. Tool availability. Confidence estimates. Business-policy rules. Role boundaries. Refusal is a systems decision, not a poetic mood.
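The three reasons can be kept apart in code so that each refusal carries a label. What follows is only a minimal sketch: the `Signals` fields, the `0.6` groundedness floor, and the check order (safety, then scope, then uncertainty) are assumptions for illustration, not something this chapter prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reason(Enum):
    SAFETY = "safety"
    SCOPE = "scope"
    UNCERTAINTY = "uncertainty"


@dataclass
class Signals:
    # Hypothetical stand-ins for upstream components.
    moderation_flagged: bool   # a moderation label marked the request as disallowed
    in_product_scope: bool     # role boundary / business-policy rule says "our job"
    groundedness: float        # 0..1, higher means better supported by evidence


def refusal_reason(s: Signals, min_groundedness: float = 0.6) -> Optional[Reason]:
    """Return the first reason to refuse, or None when no reason applies."""
    if s.moderation_flagged:
        return Reason.SAFETY
    if not s.in_product_scope:
        return Reason.SCOPE
    if s.groundedness < min_groundedness:
        return Reason.UNCERTAINTY
    return None
```

Labelling the reason at decision time matters later: safety, scope, and uncertainty refusals usually lead to different reply templates and different follow-up paths.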

Worked example: scope boundary versus helpful alternative

Suppose a bank support bot gets this request.

"Tell me which stock to buy today, and also reverse my debit card charge now."

Look carefully. One request contains two jobs. Investment advice is out of scope. Charge investigation may be in scope but probably requires secure verification.

A bad assistant answers like this.

"Buy XYZ Bank shares. I have reversed your charge already."

That is a double failure. It gave regulated advice and claimed authority it did not have.

A better no-fly desk response looks like this.

request decomposition
├── stock recommendation        → refuse: out of scope
└── charge reversal request     → limited help: ask for secure workflow

safe reply
"I cannot give investment recommendations.
I also cannot reverse charges directly here.
I can help you start the card-dispute process through the secure form."

Simple, no? Refusal did not mean silence. It meant bounded honesty plus the next safe step.
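The decomposition above can be sketched as a per-part decision followed by one composed reply. This is an illustration under assumptions: the intent names, the `SubRequest` type, and the policy table are hypothetical, and intent detection is assumed to happen upstream.

```python
from dataclasses import dataclass


@dataclass
class SubRequest:
    text: str
    decision: str  # "answer", "narrow", or "refuse"
    reply: str


def decide_part(intent: str) -> SubRequest:
    # Hypothetical policy table for a bank support bot.
    policy = {
        "stock_recommendation": SubRequest(
            "stock recommendation", "refuse",
            "I cannot give investment recommendations."),
        "charge_reversal": SubRequest(
            "charge reversal request", "narrow",
            "I also cannot reverse charges directly here. "
            "I can help you start the card-dispute process through the secure form."),
    }
    # Intents with no policy entry default to refusal rather than improvisation.
    return policy.get(
        intent, SubRequest(intent, "refuse", "I cannot help with that here."))


def compose_reply(intents: list[str]) -> str:
    # Each part is decided on its own, then the replies are joined in order.
    return "\n".join(decide_part(i).reply for i in intents)


print(compose_reply(["stock_recommendation", "charge_reversal"]))
```

Deciding each part separately is what keeps the out-of-scope half from blocking the half that can still receive limited help.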

Confidence thresholds need calibration, not vibes

Now what is the tricky part? Teams say, "Refuse when confidence is low." Fine. Confidence from where?

A language model's own verbal confidence is a weak signal, because stated confidence is often poorly calibrated against actual accuracy. We need operational signals. Retrieval coverage. Citation match. Classifier scores. Tool success. Historical accuracy on similar tasks. Whether required fields are present. These are more useful than the model saying, "I am 95% sure."

One practical pattern uses a small decision table.

signal set
├── retrieved evidence found?      yes / no
├── citations verified?            yes / no
├── tool result authoritative?     yes / no
├── policy category allowed?       yes / no
└── uncertainty score              0..1

if allowed=no              → refuse
if evidence=no             → abstain or ask clarifying question
if tool authority=no       → refuse action claim
if uncertainty > threshold → narrow answer or refuse

The threshold should depend on harm. A movie recommendation can tolerate more uncertainty than medical triage. The no-fly desk should be strict where downside is high.
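The decision table, with a harm-dependent threshold, might look like the sketch below. The tier names and threshold values are made-up placeholders that a team would calibrate on its own data, and the `claims_action` flag is an added assumption so that the tool-authority rule applies only when the reply would claim to have performed an action.

```python
from dataclasses import dataclass

# Hypothetical uncertainty ceilings per harm tier: stricter where downside is high.
THRESHOLDS = {"low": 0.6, "medium": 0.4, "high": 0.15}


@dataclass
class SignalSet:
    evidence_found: bool
    citations_verified: bool   # in the signal set; the table has no rule for it, so unused here
    tool_authoritative: bool
    policy_allowed: bool
    uncertainty: float         # 0..1


def decide(s: SignalSet, harm: str, claims_action: bool = False) -> str:
    """Apply the rules top to bottom; the first matching rule wins."""
    if not s.policy_allowed:
        return "refuse"
    if not s.evidence_found:
        return "abstain_or_clarify"
    if claims_action and not s.tool_authoritative:
        return "refuse_action_claim"
    if s.uncertainty > THRESHOLDS[harm]:
        return "narrow_or_refuse"
    return "answer"


same_signals = SignalSet(True, True, True, True, uncertainty=0.3)
print(decide(same_signals, "low"))    # movie-recommendation tier
print(decide(same_signals, "high"))   # medical-triage tier
```

With identical signals, the low-harm tier answers while the high-harm tier narrows or refuses. That difference comes from the table lookup, not from the model's mood.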

Out-of-scope handling should be explicit in product design

Users do not know your internal boundaries. So make them visible. Define scope in policy text, evaluation rubrics, and refusal templates.

For example, a tax assistant may answer filing deadlines from cited documents but refuse personalized legal advice. A code assistant may explain security concepts but refuse exploit payloads. A tutoring bot may explain chemistry but refuse unsafe lab procedures. These are design decisions.

Look. The worst refusal systems are inconsistent. They refuse similar questions differently on different days. That usually means the boundary lives only in prompts and nowhere else. Mature teams encode scope rules in evaluators, classifiers, tool permissioning, and review guides.

The no-fly desk should be explainable. Why did we refuse? Which rule fired? Which evidence was missing? Operators need answers. So do auditors.
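One way to make those questions answerable is to emit a structured record with every refusal. The sketch below is an assumption about shape, not a standard: the field names and the `scope.investment_advice` rule ID are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RefusalRecord:
    request_id: str
    outcome: str                 # e.g. "refuse", "narrow", "abstain"
    rule_id: str                 # which rule fired
    missing_evidence: list[str] = field(default_factory=list)  # what was not found


def explain(record: RefusalRecord) -> str:
    """Serialize the record as one JSON line for logs and audit trails."""
    return json.dumps(asdict(record), sort_keys=True)


print(explain(RefusalRecord("req-1", "refuse", "scope.investment_advice")))
```

A stable rule ID also helps with the consistency problem: if two similar questions are refused under different rule IDs, the record shows it immediately.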

Refusal should stay honest, brief, and useful

A refusal is not a lecture. It should state what cannot be done, give a short policy-level reason where appropriate, and offer a safe next step if one exists.

Bad refusal: "As an AI language model, I must emphasize that human civilization depends on responsible communication..."

Better refusal: "I cannot provide instructions for bypassing apartment locks. I can suggest legal steps for lockout recovery or emergency contact options."

See the structure. Boundary. Short reason. Safe alternative. No fake empathy overload. No moral performance.
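That three-part structure fits a small template function. This is a minimal sketch; `render_refusal` and its parameters are hypothetical names, and the fixed "I cannot" and "I can" phrasing is just one possible house style.

```python
from typing import Optional


def render_refusal(boundary: str,
                   reason: Optional[str] = None,
                   alternative: Optional[str] = None) -> str:
    """Build a refusal as: boundary, optional short reason, optional safe alternative."""
    parts = [f"I cannot {boundary}."]
    if reason:
        parts.append(f"{reason}.")
    if alternative:
        parts.append(f"I can {alternative}.")
    return " ".join(parts)


print(render_refusal(
    "provide instructions for bypassing apartment locks",
    alternative="suggest legal steps for lockout recovery or emergency contact options",
))
```

Because the template has no slot for anything beyond these three parts, there is nowhere for a lecture to go.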

The arrival customs layer should still check refusal text for leaks or unsafe leftovers. Even refusal content can accidentally include disallowed detail if poorly templated.

Where this lives in the wild

GitHub Copilot Chat — product safety engineer: refuses exploit-building requests while still helping with defensive code review and patching.
Khanmigo — classroom safety designer: declines to give direct exam answers but can guide the student through reasoning steps.
Enterprise HR assistants — people operations architect: refuse legal interpretation or private compensation speculation while pointing users to approved channels.
Banking chatbots — compliance lead: decline regulated financial advice and route users into authenticated transaction workflows.
Clinical assistant copilots — healthcare product manager: abstain when evidence is missing and escalate to human review for high-risk medical decisions.

Pause and recall

What are the three main reasons a system should refuse?
Why is narrow safe help often better than a full hard block?
Why is model-stated confidence alone weak for refusal decisions?
What makes a refusal explainable to operators?

Interview Q&A

Q: Why use explicit refusal policies instead of letting the model improvise caution case by case?
A: Because improvised caution is inconsistent, hard to audit, and vulnerable to pressure from phrasing, while explicit policies create stable boundaries.
Common wrong answer to avoid: "Because refusal quality does not affect user trust anyway."

Q: Why should uncertainty trigger abstention in high-risk domains but not always in low-risk domains?
A: Because the cost of a wrong answer is domain-dependent, so refusal thresholds should reflect harm asymmetry rather than one global rule.
Common wrong answer to avoid: "Because low-risk domains never contain harmful mistakes."

Q: Why separate out-of-scope refusal from harmful-content refusal?
A: Because one is about product authority and contract, while the other is about safety policy, and the remediation path differs for each.
Common wrong answer to avoid: "Because only harmful-content refusals need documentation."

Q: Why should refusals offer a next safe step when possible?
A: Because bounded help preserves usefulness without pretending capability, which improves trust and task completion together.
Common wrong answer to avoid: "Because every refusal must end with a human handoff."

Apply now (5 min)

Exercise. Take one assistant you know and list three asks it should answer, three it should answer only partially, and three it should refuse. Label each with safety, scope, or uncertainty. Then write one short refusal template for the no-fly desk.

Sketch from memory. Draw the decision split: answer, narrow help, refuse. Under each branch, add one trigger signal like policy disallow, low evidence, or missing authority.

Bridge. Refusal handles unsafe or uncertain asks. But what about answers that look acceptable on the surface and are still factually ungrounded? Next we inspect hallucination detection. → 08-hallucination-detection.md