10. Error Handling and Contracts: a good receipt explains the mistake properly
~12 min read. Bad failures waste time because clients cannot see the next safe step.
Built on the ELI5 in 00-eli5.md. The receipt — the response — should say what failed, why it failed, and what to do now.
1) Vague failures make every client guess
A bad restaurant receipt says only, "error happened."
Nobody knows whether the card failed, the soup spilled, or the kitchen closed.
APIs create the same mess when they return only 500 and a vague sentence.
Good errors are part of the product, not leftover plumbing. They help callers recover, help operators debug, and help dashboards stay honest. One failure contract should guide humans and machines together.
bad receipt
client ──→ API ──→ 500 "error"
└─→ retry, guess, support ticket
good receipt
client ──→ API ──→ status + code + detail + next step
└─→ fix input, retry later, or stop
A useful error response answers four questions quickly.
- What category failed?
- Which field or action caused it?
- Can the client fix it now?
- Should the client retry later?
Worked example.
Suppose POST /orders receives quantity = -2.
A weak response, such as 500 with the body "error", is useless because it names neither the field nor the fix.
A stronger response teaches action.
{
"status": 422,
"title": "Validation failed",
"code": "INVALID_QUANTITY",
"detail": "quantity must be greater than zero",
"field": "quantity"
}
Now the caller can highlight the broken field on the order slip. Support can search the stable code. Logs can count the category cleanly. Simple, no?
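On the client side, that response can be consumed without parsing any English. The following is a minimal sketch, assuming a hypothetical `next_step` helper and the field names from the response above; it is not taken from any particular SDK.

```python
from typing import Any


def next_step(error_body: dict[str, Any]) -> dict[str, Any]:
    """Translate an error body into what the UI, support, and logs need.

    Hypothetical client helper; the keys it reads follow the example response.
    """
    return {
        # The form can highlight this input; None means no single field is at fault.
        "highlight_field": error_body.get("field"),
        # Stable identifier that support staff and log queries can search for.
        "support_code": error_body.get("code", "UNKNOWN"),
        # Human-readable sentence to show next to the field.
        "message": error_body.get("detail", "Something went wrong."),
    }
```

The helper never inspects the wording of `detail`, so the sentence can change without breaking the client.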
2) RFC 7807 gives every bad receipt the same shape
RFC 7807 Problem Details (since obsoleted by RFC 9457, which keeps the same core fields) says, in effect, "Use one predictable envelope for failures." That standard reduces needless reinvention across services and teams.
Core fields are small, but they do real work.
- type names the broad problem family.
- title gives a short human label.
- status repeats the HTTP status code.
- detail explains this exact occurrence.
- instance identifies the specific failing request.
Think of it as a printed receipt layout. Every response looks familiar even when the mistake changes.
┌──────────────────────────────────────────────┐
│ type = problem family │
│ title = short label │
│ status = HTTP code │
│ detail = this exact incident │
│ instance = traceable occurrence │
└──────────────────────────────────────────────┘
You can still add extension fields for domain needs.
Common ones are code, field, violations, docs_url, or retryable.
The standard gives the frame. Your product details sit inside that frame.
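One way to keep the frame and the product details separate is a small builder. This is only a sketch: `problem_details` and its parameter names are assumptions, not part of the RFC or of any library.

```python
from typing import Any, Optional


def problem_details(
    type_uri: str,
    title: str,
    status: int,
    detail: Optional[str] = None,
    instance: Optional[str] = None,
    **extensions: Any,
) -> dict[str, Any]:
    """Build a Problem Details body: core members first, extensions after.

    Hypothetical helper; extension names such as code or violations are
    whatever the product chooses to publish.
    """
    # "type" is the only core member an extension keyword could shadow,
    # because the parameter is named type_uri.
    if "type" in extensions:
        raise ValueError("pass the problem type via type_uri")
    body: dict[str, Any] = {"type": type_uri, "title": title, "status": status}
    # Optional core members are left out rather than sent as null.
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    body.update(extensions)
    return body
```

A call such as `problem_details("https://api.example.com/problems/validation-error", "Validation failed", 422, code="VALIDATION_ERROR")` yields the core members plus one extension. The RFC registers `application/problem+json` as the media type for such a body.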
Worked example with multiple validation issues.
{
"type": "https://api.example.com/problems/validation-error",
"title": "Validation failed",
"status": 422,
"detail": "Two fields need correction",
"code": "VALIDATION_ERROR",
"violations": [
{ "field": "email", "message": "must be a valid email" },
{ "field": "age", "message": "must be at least 18" }
]
}
This helps clients loop through violations without scraping English sentences.
It also keeps every service response familiar to the same SDK or UI.
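Looping through violations on the client could look like the sketch below. The `violations_by_field` helper and the `_general` fallback key are illustrative assumptions.

```python
from typing import Any


def violations_by_field(problem: dict[str, Any]) -> dict[str, list[str]]:
    """Group violation messages by field so a form can render them inline.

    Hypothetical helper; it reads the violations extension shown above.
    """
    grouped: dict[str, list[str]] = {}
    for violation in problem.get("violations", []):
        # Violations without a field are collected under a catch-all key.
        field = violation.get("field", "_general")
        grouped.setdefault(field, []).append(violation.get("message", ""))
    return grouped
```

Because the helper reads structured fields only, the `detail` sentence is free to change.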
One caution matters here.
Do not dump raw stack traces into detail. Clarity is required. Leakage is not.
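A common way to honor that caution is to send the stack trace to the server log and return only a reference to it. The sketch below assumes a hypothetical `safe_server_problem` function, a logger named `api.errors`, and a UUID-based incident id; none of these names come from the text.

```python
import logging
import uuid
from typing import Any

logger = logging.getLogger("api.errors")  # assumed logger name


def safe_server_problem(exc: BaseException) -> dict[str, Any]:
    """Log the full exception server-side and return a leak-free body."""
    incident_id = str(uuid.uuid4())
    # exc_info attaches the traceback to the log record, not to the response.
    logger.error("unhandled failure incident=%s", incident_id, exc_info=exc)
    return {
        "type": "about:blank",
        "title": "Internal Server Error",
        "status": 500,
        # Nothing from the exception message is copied into detail.
        "detail": "An unexpected error occurred. Quote the incident id when contacting support.",
        "instance": f"urn:uuid:{incident_id}",
    }
```

The caller still gets something traceable through `instance`, while the cause stays where operators can read it.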
3) Error codes, messages, and categories each solve different problems
Teams sometimes ask, "Why not just return a message?" Because messages change. Writers improve wording. Products localize text. Machines should not break because a sentence improved.
Use stable codes for software behavior. Use helpful messages for humans. Keep both together.
Worked example.
A payment API may return CARD_EXPIRED with a sentence explaining the month.
Next month, product can rewrite the sentence. The app still branches on the code.
That is why code and message are partners, not substitutes.
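In client code, that partnership might look like the following sketch. Only `CARD_EXPIRED` comes from the example above; the second code, the action names, and `action_for` are made up for illustration.

```python
from typing import Any

# Maps stable error codes to client behavior.
ACTIONS = {
    "CARD_EXPIRED": "ask_for_new_card",
    "INSUFFICIENT_FUNDS": "suggest_other_payment_method",  # hypothetical code
}


def action_for(error: dict[str, Any]) -> str:
    """Choose the client action from the code; the message is display-only."""
    return ACTIONS.get(error.get("code"), "show_generic_failure")
```

Two responses that carry the same code but different wording lead to the same action, which is the point of keeping the code stable.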
Categories matter too because recovery differs.
validation/client issue ──→ 400 or 422
auth issue ──→ 401 or 403
missing resource ──→ 404
conflict/business rule ──→ 409
rate limited ──→ 429
dependency/server issue ──→ 500 / 502 / 503 / 504
Worked example.
A duplicate email on POST /users is not a random crash.
It is a conflict, so 409 plus EMAIL_ALREADY_EXISTS is sensible.
A database timeout is a server or dependency issue. That deserves a 5xx status and a retry hint, not a message that blames the caller.
A malformed field is validation. Never hide that behind 500.
These categories improve operations too.
Validation spikes may mean a client rollout broke.
429 spikes may mean abusive traffic or tiny quotas.
503 spikes may point at a sick dependency.
A thoughtful receipt teaches both callers and on-call engineers.
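The category list can also drive dashboards. Here is a rough sketch, assuming hypothetical category labels and a stream of error status codes pulled from logs:

```python
from collections import Counter
from typing import Iterable

# Mirrors the category list above; the label strings are assumptions.
CATEGORY_BY_STATUS = {
    400: "validation", 422: "validation",
    401: "auth", 403: "auth",
    404: "missing_resource",
    409: "conflict",
    429: "rate_limited",
    500: "server_or_dependency", 502: "server_or_dependency",
    503: "server_or_dependency", 504: "server_or_dependency",
}


def category_for(status: int) -> str:
    """Return the failure category for an error status code."""
    return CATEGORY_BY_STATUS.get(status, "uncategorized")


def count_by_category(statuses: Iterable[int]) -> Counter:
    """Count error responses per category, for example over one time window."""
    return Counter(category_for(status) for status in statuses)
```

Comparing these counts between time windows is one way to notice a validation spike after a client rollout or a jump in dependency failures.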
4) Error contracts must be published, tested, and kept stable
Many teams document success responses carefully and improvise failures later. That is incomplete API design. Failures are part of the interface.
Document at least these parts for each endpoint family.
- expected status codes
- shared Problem Details shape
- stable error codes
- field-level validation format
- retry guidance when relevant
A tiny contract table is often enough.
endpoint       status  code                   retry?
POST /orders   422     INVALID_QUANTITY       no
POST /orders   409     OUT_OF_STOCK           maybe after change
POST /orders   503     INVENTORY_UNAVAILABLE  yes
Then test those contracts like normal API behavior.
Consumer tests should verify status, code, and required fields.
If the violations field disappears silently, the contract is already broken.
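A consumer-side check can be small. The sketch below assumes a hypothetical `contract_violations` helper and an assumed set of required fields; a real suite would call it on the body returned by the API under test.

```python
from typing import Any, Sequence

# Assumed minimum set of fields every error body must carry.
REQUIRED_FIELDS = ("type", "title", "status", "code")


def contract_violations(
    body: dict[str, Any],
    expected_status: int,
    expected_code: str,
    required_fields: Sequence[str] = REQUIRED_FIELDS,
) -> list[str]:
    """Return a list of contract problems; an empty list means the body conforms."""
    problems = [f"missing field: {name}" for name in required_fields if name not in body]
    if body.get("status") != expected_status:
        problems.append(f"status is {body.get('status')!r}, expected {expected_status}")
    if body.get("code") != expected_code:
        problems.append(f"code is {body.get('code')!r}, expected {expected_code!r}")
    return problems
```

In a test, `assert contract_violations(body, 422, "INVALID_QUANTITY") == []` fails with a readable list of what drifted.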
The wait staff idea helps here. Middleware often formats errors centrally. Good. But central formatting must preserve category and detail, not flatten every problem into the same blob.
handler raises typed failure
│
▼
error middleware maps category
│
▼
Problem Details response
│
├── client sees fix or retry hint
└── logs keep trace id and cause
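The flow above could be sketched framework-free as follows. The exception classes, the `to_problem` function, the `trace_id` extension field, and the example.com type URLs are all assumptions made for illustration.

```python
import logging
from typing import Any

logger = logging.getLogger("api.middleware")  # assumed logger name


class ApiError(Exception):
    """Base class for typed failures raised by handlers (hypothetical)."""
    status, category, code = 500, "server", "INTERNAL_ERROR"


class OutOfStock(ApiError):
    status, category, code = 409, "conflict", "OUT_OF_STOCK"


class InventoryUnavailable(ApiError):
    status, category, code = 503, "dependency", "INVENTORY_UNAVAILABLE"


def to_problem(exc: Exception, trace_id: str) -> dict[str, Any]:
    """Map an exception to a Problem Details body without flattening typed ones."""
    if isinstance(exc, ApiError):
        status, category, code, detail = exc.status, exc.category, exc.code, str(exc)
    else:
        # Unknown failures get a generic body; the cause stays in the logs.
        status, category, code = 500, "server", "INTERNAL_ERROR"
        detail = "An unexpected error occurred."
    logger.error("trace_id=%s category=%s code=%s cause=%r", trace_id, category, code, exc)
    return {
        "type": f"https://api.example.com/problems/{category}",
        "title": category.capitalize(),
        "status": status,
        "detail": detail,
        "code": code,
        "trace_id": trace_id,
    }
```

Typed failures keep their own status and code, so central formatting does not erase the category the handler chose.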
One last worked example.
Mobile app version 6.2 expects code. Backend 6.3 renames it to error_code.
HTTP status stays 422, so humans barely notice. The app logic breaks instantly.
That is contract drift. Treat error fields like any other shipped API field.
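Drift of this kind is cheap to detect by comparing a recorded error body from the old release with one from the new release. A minimal sketch, assuming a hypothetical `removed_fields` helper that only looks at top-level keys:

```python
from typing import Any


def removed_fields(old_body: dict[str, Any], new_body: dict[str, Any]) -> set[str]:
    """Top-level fields the old response had that the new response lacks.

    Any non-empty result is a potential breaking change for shipped clients.
    """
    return set(old_body) - set(new_body)
```

For the rename described above, the result would contain `code`, even though the status is unchanged.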
Where this lives in the wild
- Stripe API platform engineer designs payment failures with stable codes and retry hints clients can automate against.
- GitHub developer platform engineer shapes REST errors so integrations know whether to fix scopes, payloads, or retry timing.
- Twilio messaging API engineer separates validation mistakes from carrier or upstream delivery failures for predictable recovery.
- Razorpay backend engineer maps bank outages, payment conflicts, and idempotency issues into contract-safe responses.
- Swiggy partner API engineer returns field-level onboarding errors so merchant dashboards can guide corrections directly.
Pause and recall
- Why are stable error codes different from helpful error messages?
- Which RFC 7807 field identifies the broad problem family?
- Why should validation failures not hide behind 500?
- What made the mobile app break in the contract-drift example?
Interview Q&A
Q: Why use RFC 7807 instead of a custom error envelope everywhere?
A: It gives a standard base shape, reduces inconsistency, and still allows extensions.
Common wrong answer to avoid: "Because standards are always mandatory." The real value is predictable integration and less invented chaos.
Q: Why keep both error codes and human messages?
A: Codes stay stable for software behavior, while messages teach people what happened.
Common wrong answer to avoid: "Because messages are only for non-technical users." Engineers need messages too; the point is that the two do different jobs.
Q: Why separate validation, client, and server categories?
A: Each category implies different status codes, retry rules, and operational responses.
Common wrong answer to avoid: "Because HTTP has many status codes." Category design is about recovery logic, not memorization.
Q: Why test error contracts like success contracts?
A: Clients automate against error fields too, so silent shape drift becomes a breaking change.
Common wrong answer to avoid: "Because errors are rare." Rare paths still fail real customers and often matter most during stress.
Apply now (5 min)
Exercise: Design the error contract for POST /bookings.
Include one validation error, one conflict error, and one dependency error.
Write status, code, and one helpful detail sentence for each.
Sketch from memory: Draw a request flowing through handler, wait staff middleware, and a final receipt. Mark where category, code, and trace id appear.
Bridge. Clear failures help clients recover. Next we control request volume before failures multiply. → 11-rate-limiting-and-throttling.md