09. Exception Hierarchies — neat pipes for messy failures

~14 min read. Real systems do not fail once; they leak in patterns.

Built on the ELI5 in 00-eli5.md. The plumbing — data and failure flow inside the service — needs tidy paths so one broken room does not flood every hallway.


1) Give failures families before you start catching them

Many teams begin with throw new Exception() everywhere. That is like labelling every pipe simply as “liquid.” Drinking water and sewage now look identical, so the caller can no longer react correctly.

A small hierarchy gives each failure a handling meaning. It is not ceremony. It is routing information for code and operators.

┌───────────────────────┐
│ AppException          │
├───────────┬───────────┤
│Validation │Dependency │
│DomainRule │Concurrency│
│Bug        │Auth       │
└───────────┴───────────┘

Useful categories in a service usually look like this.

  • Validation error: shape, format, range, or missing field is wrong.
  • Domain rule error: data is valid, but the business says no.
  • Dependency error: database, cache, queue, or partner is unhappy.
  • Concurrency error: stale version, deadlock victim, or lock timeout.
  • Internal bug: invariant broke inside your own room.

Java-style sketch:

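// Java 17+: only the permitted families may extend AppException.
// Each subclass must itself be declared final, sealed, or non-sealed.
abstract sealed class AppException extends RuntimeException permits
    ValidationException, DomainRuleException,
    DependencyException, ConcurrencyException,
    InternalBugException {
    protected AppException(String message, Throwable cause) { super(message, cause); }
}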

Now the service can map errors with intent. Validation may become HTTP 400. Domain rule may become 409 or 422. Dependency issue may become 503 with retry hints. Internal bug should page engineers, not the customer. Simple, no?

Worked example with numbers. Suppose 10,000 checkout requests arrive this hour. 120 fail due to bad coupon format. 35 fail because Redis timed out. 4 fail due to version mismatch on inventory row. 2 fail because a null invariant slipped through. One counter for all “exceptions” hides the real story. Separate families make dashboards actionable.
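The counts above can be tallied per family instead of in one bucket. This is a minimal Java sketch, assuming a hypothetical FailureFamily enum and FailureTally class (neither is from the original); the numbers are the ones from the worked example.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical family names mirroring the categories in this section.
enum FailureFamily { VALIDATION, DEPENDENCY, CONCURRENCY, INTERNAL_BUG }

class FailureTally {
    private final Map<FailureFamily, Integer> counts = new EnumMap<>(FailureFamily.class);

    void record(FailureFamily family, int n) {
        counts.merge(family, n, Integer::sum);
    }

    int count(FailureFamily family) {
        return counts.getOrDefault(family, 0);
    }

    int total() {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    // Failure rate for one family as a percentage of all requests.
    double ratePercent(FailureFamily family, int totalRequests) {
        return 100.0 * count(family) / totalRequests;
    }

    // The worked example: 10,000 checkout requests in one hour.
    static FailureTally checkoutHour() {
        FailureTally t = new FailureTally();
        t.record(FailureFamily.VALIDATION, 120);   // bad coupon format
        t.record(FailureFamily.DEPENDENCY, 35);    // Redis timeouts
        t.record(FailureFamily.CONCURRENCY, 4);    // inventory version mismatch
        t.record(FailureFamily.INTERNAL_BUG, 2);   // null invariant slipped through
        return t;
    }
}
```

One counter would report 161 failures out of 10,000 (1.61%). Split by family, 120 of them are caller mistakes, and only 2 are bugs that should page someone.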

So what to do? Write the hierarchy first. Then map each class to status code, log level, metric name, and alert rule. That one page cleans the plumbing for months.
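That one-page mapping can also live in code. The sketch below is illustrative only: the Policy record, the metric names, the log levels, and the 409 and 500 choices for concurrency and internal bugs are assumptions, while 400, 422, and 503 follow the mapping described earlier. The subclasses are nested in one holder class purely to keep the example self-contained.

```java
class ErrorMapping {
    // Java 17+ sealed root; the permitted subclasses are the nested classes below.
    static abstract sealed class AppException extends RuntimeException {
        AppException(String message) { super(message); }
    }
    static final class ValidationException extends AppException {
        ValidationException(String m) { super(m); }
    }
    static final class DomainRuleException extends AppException {
        DomainRuleException(String m) { super(m); }
    }
    static final class DependencyException extends AppException {
        DependencyException(String m) { super(m); }
    }
    static final class ConcurrencyException extends AppException {
        ConcurrencyException(String m) { super(m); }
    }
    static final class InternalBugException extends AppException {
        InternalBugException(String m) { super(m); }
    }

    // Hypothetical policy row: status code, log level, metric name, page on-call or not.
    record Policy(int httpStatus, String logLevel, String metricName, boolean pageOnCall) {}

    static Policy policyFor(AppException e) {
        if (e instanceof ValidationException) {
            return new Policy(400, "INFO", "errors.validation", false);
        }
        if (e instanceof DomainRuleException) {
            return new Policy(422, "INFO", "errors.domain_rule", false);
        }
        if (e instanceof DependencyException) {
            return new Policy(503, "WARN", "errors.dependency", false);
        }
        if (e instanceof ConcurrencyException) {
            return new Policy(409, "WARN", "errors.concurrency", false);
        }
        // Remaining family: internal bug. Engineers get paged, the customer sees a plain 500.
        return new Policy(500, "ERROR", "errors.internal_bug", true);
    }
}
```

Keeping the mapping in one method means a new family forces one visible decision, instead of scattered catch blocks each guessing a status code.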

2) Checked, unchecked, and result types solve different pains

Students ask which one is correct. The better question is, “Correct for what kind of failure?” Each tool changes call-site behaviour.

Caller ──▶ Method
          ├─▶ checked exception
          ├─▶ unchecked exception
          └─▶ result type

Checked exceptions are useful when recovery is expected nearby. The caller must notice and decide. That friction is sometimes healthy.

Invoice loadInvoice(String id) throws InvoiceNotFoundException {
    ...
}

If the screen can honestly show “invoice not found,” this is fine. The failure is normal enough to deserve forced handling.

Unchecked exceptions suit broken contracts and programmer mistakes. They say, “Do not decorate this; fix the code.”

void applyDiscount(Money price) {
    if (price.isNegative()) throw new IllegalArgumentException("price must not be negative");
}

A negative price here is not a business rejection. It is a coding bug or a violated precondition. Making every caller catch this adds noise, not safety.
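The two styles can sit side by side in one small flow. This is a hedged sketch: the in-memory AMOUNTS map, the loadInvoiceAmount and render methods, and the integer amounts are stand-ins invented for illustration, not part of the original service.

```java
import java.util.Map;

class InvoiceDemo {
    // Checked: the compiler makes every caller decide what "not found" means.
    static class InvoiceNotFoundException extends Exception {
        InvoiceNotFoundException(String id) { super("invoice " + id + " not found"); }
    }

    // Hypothetical store with a single invoice.
    private static final Map<String, Long> AMOUNTS = Map.of("INV-1", 4_500L);

    static long loadInvoiceAmount(String id) throws InvoiceNotFoundException {
        Long amount = AMOUNTS.get(id);
        if (amount == null) throw new InvoiceNotFoundException(id);
        return amount;
    }

    // Unchecked: a negative price is a caller bug, so no throws clause is imposed.
    static long applyDiscount(long price, int percentOff) {
        if (price < 0) throw new IllegalArgumentException("price must be >= 0, got " + price);
        return price - price * percentOff / 100;
    }

    // The screen-level caller recovers from the checked case with an honest message.
    static String render(String id) {
        try {
            return "amount: " + applyDiscount(loadInvoiceAmount(id), 10);
        } catch (InvoiceNotFoundException e) {
            return "invoice not found";
        }
    }
}
```

Note that render has no catch for IllegalArgumentException. If that ever fires, the right response is a code fix, not a friendlier screen.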

Result types shine when failure is expected in normal flow. A payment decline is a common business reality, not a cosmic disaster.

Kotlin-style sketch:

sealed interface PaymentResult {
    data class Success(val paymentId: String) : PaymentResult
    data class Declined(val code: String, val reason: String) : PaymentResult
}

Worked example. Imagine 500 payment attempts this minute. 450 succeed. 38 decline for bank reasons. 9 fail validation before the bank call. 3 time out at the gateway. Using one exception style for all four outcomes confuses readers. Result types fit the 38 common declines. The 9 validation failures fit a structured error at the boundary. Exceptions fit the 3 genuine technical disruptions.
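For readers following the Java sketches, the same result type can be written with a sealed interface and records (Java 17+). This is only a sketch: the charge method, the card number that triggers a decline, and the describe helper are hypothetical stand-ins for a real bank call and a real caller.

```java
// Expected outcomes modeled as values; names mirror the PaymentResult sketch.
sealed interface PaymentResult {
    record Success(String paymentId) implements PaymentResult {}
    record Declined(String code, String reason) implements PaymentResult {}
}

class PaymentFlow {
    // Hypothetical stand-in for the bank call: declines one card number for the demo.
    static PaymentResult charge(String card, long amount) {
        if (card.equals("4000-0000-0000-0002")) {
            return new PaymentResult.Declined("insufficient_funds", "bank declined");
        }
        return new PaymentResult.Success("pay_" + amount);
    }

    // The caller handles both outcomes as ordinary values, with no try/catch.
    static String describe(PaymentResult result) {
        if (result instanceof PaymentResult.Success s) {
            return "paid: " + s.paymentId();
        }
        if (result instanceof PaymentResult.Declined d) {
            return "declined (" + d.code() + "): " + d.reason();
        }
        throw new IllegalStateException("unhandled result: " + result);
    }
}
```

A gateway timeout would still surface as an exception from charge in a fuller version. Only the outcomes the business expects every minute become values.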

A practical rule works well.

  • Boundary validation: structured result or error object.
  • Expected domain rejection: result type reads clearly.
  • Transient dependency trouble: exception to a retry seam.
  • Invariant break: unchecked exception, fail fast, fix root cause.

See. The choice is about failure shape, not language fashion. If one API throws, returns null, and also returns status enums, your hallway is already slippery.

3) Retry with backoff inside one clear service seam

Retry is medicine. Correct dose helps. Overdose creates new disease. When a dependency is already coughing, blind retries pile traffic on it.

Good retry logic answers four things.

  • Is the error transient?
  • Is the operation idempotent?
  • How many attempts are allowed?
  • How much time passes between attempts?

A simple backoff picture:

Attempt 1  ──fail── wait 100 ms
Attempt 2  ──fail── wait 250 ms
Attempt 3  ──fail── wait 600 ms
Attempt 4  ──give up and wrap

The waits grow because pressure should reduce, not amplify. Jitter matters too. If 2,000 requests retry at the exact same millisecond, you create a drumbeat attack on the dependency, often called a thundering herd. Small randomness spreads the load.
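The schedule and the jitter can be sketched as two small helpers. The names backoffMs and withJitter are hypothetical helpers written for this illustration, and the "keep half, randomize half" jitter rule is one common choice assumed here, not something the original prescribes.

```java
import java.util.concurrent.ThreadLocalRandom;

class Backoff {
    // Waits from the picture above: 100 ms, 250 ms, 600 ms.
    private static final long[] WAITS_MS = {100, 250, 600};

    // Base wait after the given failed attempt (1-based); later attempts reuse the last value.
    static long backoffMs(int attempt) {
        int index = Math.min(Math.max(attempt, 1), WAITS_MS.length) - 1;
        return WAITS_MS[index];
    }

    // Keep half the wait fixed and randomize the other half,
    // so the result always lies between base/2 and base.
    static long withJitter(long baseMs) {
        long half = baseMs / 2;
        return half + ThreadLocalRandom.current().nextLong(half + 1);
    }
}
```

With this rule, 2,000 callers that all failed on attempt 3 spread their retries over a 300 ms window instead of landing on the same millisecond.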

Service-level sketch:

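Quote fetchQuote(String sku) {
    final int maxAttempts = 4;
    TimeoutException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return pricingClient.fetch(sku);
        } catch (TimeoutException e) {
            last = e;
            // Wait only if another attempt follows; attempt 4 gives up at once.
            if (attempt < maxAttempts) {
                sleep(withJitter(backoffMs(attempt)));
            }
        }
    }
    // Keep the last timeout as the cause so the chain survives wrapping.
    throw new DependencyException("pricing unavailable", last);
}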
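A reusable version of that seam might look like the sketch below. Everything named here (RetrySeam, callWithRetry, the Sleeper hook, the local DependencyException) is hypothetical and exists only to make the example self-contained and testable without real sleeping. Treating every non-timeout exception as non-retryable is an assumption of this sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

class RetrySeam {
    // Thrown when the call cannot be completed; keeps the underlying failure as the cause.
    static class DependencyException extends RuntimeException {
        DependencyException(String message, Throwable cause) { super(message, cause); }
    }

    // Hypothetical hook so tests can record waits instead of sleeping.
    interface Sleeper { void sleep(long ms) throws InterruptedException; }

    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long[] waitsMs, Sleeper sleeper) {
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (TimeoutException e) {
                last = e;                               // transient: eligible for retry
                if (attempt == maxAttempts) break;      // no wait after the final try
                try {
                    sleeper.sleep(waitsMs[Math.min(attempt, waitsMs.length) - 1]);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status and stop
                    break;
                }
            } catch (Exception e) {
                // Anything else is treated as non-transient and is not retried.
                throw new DependencyException("non-retryable failure", e);
            }
        }
        throw new DependencyException("dependency unavailable after retries", last);
    }
}
```

Because the attempt limit, the waits, and the wrapping live in one method, a caller such as fetchQuote can shrink to a single callWithRetry call, and no other layer needs its own loop.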

Worked numbers make the danger obvious. Suppose 8,000 quote calls happen in ten minutes and 5% time out on the first attempt. That means 400 first failures. With two retries maximum, you may add up to 800 extra calls. If the downstream was already saturated, those 800 matter a lot.

So what to do? Retry only clearly transient failures. Retry only idempotent reads, or writes with idempotency keys. Keep the loop in one place, not three nested clients. Otherwise, with three attempts at each of three layers, 3 x 3 x 3 becomes 27 calls from one original request. That is not resilience. That is traffic multiplication in fancy English.
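The arithmetic is simple enough to check in a few lines. RetryMath and its two methods are hypothetical names, and both methods compute worst cases: every first failure uses all of its retries, and every nested layer exhausts its attempts.

```java
class RetryMath {
    // Extra calls if every first failure uses all of its allowed retries.
    static long worstCaseExtraCalls(long totalCalls, double firstFailureRate, int maxRetries) {
        long firstFailures = Math.round(totalCalls * firstFailureRate);
        return firstFailures * maxRetries;
    }

    // Calls reaching the bottom dependency when each nested layer makes attemptsPerLayer tries.
    static long nestedCalls(int attemptsPerLayer, int layers) {
        long calls = 1;
        for (int i = 0; i < layers; i++) {
            calls *= attemptsPerLayer;
        }
        return calls;
    }
}
```

Here worstCaseExtraCalls(8_000, 0.05, 2) gives the 800 extra calls, and nestedCalls(3, 3) gives the 27 calls from one original request.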

4) Propagate meaning upward, then degrade gracefully at the edge

Good propagation preserves category and context. Bad propagation either leaks raw internals or erases the cause completely. The higher layer needs meaning, not database gossip.

DB timeout
Repository wraps as DependencyException
Service adds orderId and retry policy context
API maps to 503 or fallback response

Repository sketch:

try {
    dao.save(order);
} catch (SQLException e) {
    throw new DependencyException("save order failed", e);
}

Service sketch:

try {
    repository.save(order);
} catch (DependencyException e) {
    throw new DependencyException("order " + order.id() + " not persisted", e);
}
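Wrapping with a cause keeps the whole story available to logs while the message stays meaningful to the layer above. This is a self-contained sketch of the two wraps: the Propagation class, the simulated SQLException, and the rootCause helper are assumptions made for illustration.

```java
import java.sql.SQLException;

class Propagation {
    static class DependencyException extends RuntimeException {
        DependencyException(String message, Throwable cause) { super(message, cause); }
    }

    // Repository layer: translate the driver exception and keep it as the cause.
    static void repositorySave() {
        try {
            throw new SQLException("connection timed out"); // stand-in for dao.save(order)
        } catch (SQLException e) {
            throw new DependencyException("save order failed", e);
        }
    }

    // Service layer: add business context without dropping the chain.
    static void serviceSave(String orderId) {
        try {
            repositorySave();
        } catch (DependencyException e) {
            throw new DependencyException("order " + orderId + " not persisted", e);
        }
    }

    // Walk getCause() to the innermost exception, for example for structured logging.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }
}
```

The API layer sees only "order A-17 not persisted" and the DependencyException category. The SQL detail stays reachable through the cause chain for operators, and never needs to appear in the response body.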

Now graceful degradation. This does not mean swallowing every issue and smiling. It means returning a reduced but honest experience. See. The customer still gets value. Operators still get a signal.

Example. Product page wants price, stock, and reviews. Price and stock are core. Reviews are nice, but not checkout-critical. If the review service times out after 700 ms, return the page without reviews rather than failing it. The core part, served from cached metadata, may need only 80 ms. That is graceful degradation at code level.

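fun loadProductPage(id: String): ProductPage {
    val core = productService.loadCore(id)
    // Track the fallback explicitly: a product with zero reviews is not "degraded".
    var reviewStatus = "fresh"
    val reviews = try {
        reviewClient.fetch(id)
    } catch (e: TimeoutException) {
        // Log and count the fallback here so operators still get a signal.
        reviewStatus = "degraded"
        emptyList()
    }
    return ProductPage(core = core, reviews = reviews, reviewStatus = reviewStatus)
}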

Worked example with numbers. Without degradation, 3% review timeouts may make 3% whole-page failures. With degradation, maybe only review freshness drops for those requests. Conversion impact stays small. Support tickets stay smaller. But honesty matters. Do not show stale data as real-time truth. Expose status, log the fallback, and count it. Simple, no?
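"Expose status and count it" can be sketched in Java with a time budget on the review call. The ReviewFallback class, the Reviews record, the degradedCount counter, and the choice to rethrow non-timeout failures are all assumptions for illustration; the budget value is whatever the page can afford.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class ReviewFallback {
    // The status travels with the data so the page can say what it is showing.
    record Reviews(List<String> items, String status) {}

    // Counts fallbacks so the degradation stays visible on a dashboard.
    static final AtomicLong degradedCount = new AtomicLong();

    static Reviews fetchWithBudget(Supplier<List<String>> reviewCall, long budgetMs) {
        CompletableFuture<List<String>> future = CompletableFuture.supplyAsync(reviewCall);
        try {
            return new Reviews(future.get(budgetMs, TimeUnit.MILLISECONDS), "fresh");
        } catch (TimeoutException e) {
            future.cancel(true);              // mark the future cancelled; the page goes out without reviews
            degradedCount.incrementAndGet();  // count the fallback
            return new Reviews(List.of(), "degraded");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading reviews", e);
        } catch (ExecutionException e) {
            // Non-timeout failures are not silently hidden in this sketch.
            throw new IllegalStateException("review call failed", e.getCause());
        }
    }
}
```

An empty list with status "fresh" means the product truly has no reviews. An empty list with status "degraded" means the page could not ask in time, and the counter records that it happened.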


Where this lives in the wild

At Razorpay, a payments backend engineer separates bank declines from gateway outages inside one error taxonomy. At Swiggy, a reliability engineer adds bounded retries with jitter around restaurant menu fetches. At Amazon, a marketplace engineer wraps raw storage exceptions before they leave the repository layer. At PhonePe, a service engineer returns a degraded balance insights widget when analytics dependencies lag. At Stripe, a staff backend engineer uses result types for expected payment outcomes and exceptions for transport failures.


Pause and recall

  1. Which failures in your current service deserve separate classes, and why?
  2. When is a result type clearer than a checked exception?
  3. Why is one central retry seam safer than nested retry loops?
  4. What can degrade gracefully on a page without lying to the user?

Interview Q&A

Q1) Why use an exception hierarchy and not one generic AppException?

Because handling policy travels with category. One giant bucket forces text matching and vague dashboards. Common wrong answer to avoid: “One type is cleaner because all failures are basically the same.”

Q2) Why an unchecked exception and not a checked exception for invariant bugs?

Because the caller cannot truly recover from your broken assumption. Forced catch blocks would only spread defensive noise. Common wrong answer to avoid: “Checked exceptions are always safer, so use them for everything.”

Q3) Why retry with backoff inside one service seam and not in every layer?

Because distributed retries multiply traffic and hide ownership. One seam can see idempotency, limits, metrics, and fallback policy together. Common wrong answer to avoid: “More retry loops mean higher success probability.”

Q4) Why graceful degradation and not silent swallowing?

Because degraded output should stay honest and observable. Silent swallowing hides incidents and returns misleading data. Common wrong answer to avoid: “Users are happier if we hide every failure completely.”


Apply now (5 min)

Exercise: Take one endpoint you know well. List four error families, one retryable dependency, one non-retryable failure, and one honest fallback. Write the mapping from error family to status code and log level.

Sketch from memory: Draw the layer flow from repository to service to API. Mark where wrapping happens, where retry happens, and where degradation is allowed.


Bridge. Once failures are classified, the next headache is overlapping work. See how shared pipes behave when many workers touch them together. → 10-concurrency-patterns.md