03. Design Notification System

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: On this stage, you are proposing a city alert network. Start with the blueprint, then let the choreography explain channels, retries, and user choice.

Step 1: Requirements & Constraints

See. The first trap is solving the wrong question. Ask crisp questions, then freeze scope.

Functional requirements (keep the scope explicit):
- Send notifications over push, email, and SMS.
- Respect user preferences, quiet hours, and channel opt-outs.
- Support templates, localization, and variable substitution.
- Provide delivery status, retries, and dead-letter handling.
- Deduplicate repeated events so users are not spammed.

Non-functional requirements (each one changes the architecture):
- Producer APIs should stay fast even when providers are slow.
- Delivery should be durable, with at-least-once semantics internally.
- Channel workers must isolate failures from each other.
- Preference checks should be correct and easy to audit.
- Operators need strong visibility into drop and retry reasons.

Constraints and assumptions (these keep the estimate grounded):
- Assume 50 million daily active users.
- Assume 10 notifications per active user per day on average.
- That gives roughly 500 million notifications per day.
- Assume peak bursts during campaigns can reach 20 times the average.

What to explicitly de-scope (interview time is limited):
- A full marketing campaign authoring UI is out for now.
- Machine-learning send-time optimization can come later.
- Inbox product surfaces are separate from outbound delivery.
- Voice-call escalation is a later enterprise feature.

On the stage, say what is in and out. That makes the choreography visible and saves time.

Step 2: Scale Estimation

Now watch. Use round numbers, not thesis-level math. One minute of math can remove ten minutes of confusion.

Assumptions (these keep the back-of-envelope math honest):
- 500 million per day is about 5,800 notifications per second on average (500,000,000 / 86,400 seconds).
- With a 20x burst factor, peak traffic can exceed 100,000 per second.
- Provider APIs will have lower quotas than internal queues.
- Payload metadata is small; retries dominate operational cost.
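
The arithmetic behind these assumptions can be checked in a few lines. This is only a sketch of the estimate; the variable names are illustrative, and the inputs are the figures already stated in this section.

```python
# Back-of-envelope traffic estimate using the figures stated in this section.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_active_users = 50_000_000
notifications_per_user_per_day = 10
burst_factor = 20

notifications_per_day = daily_active_users * notifications_per_user_per_day
average_per_second = notifications_per_day / SECONDS_PER_DAY
peak_per_second = average_per_second * burst_factor

print(f"per day:   {notifications_per_day:,}")    # 500,000,000
print(f"average/s: {average_per_second:,.0f}")    # 5,787
print(f"peak/s:    {peak_per_second:,.0f}")       # 115,741
```

On the stage, round these to "about 5,800 per second average, a bit over 100,000 at peak" and move on.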

Quick math (each point directly changes component choices):
- If 5 percent need SMS fallback, that is about 25 million SMS messages per day, which is still a huge cost.
- If the retry rate is 2 percent, queue traffic rises by roughly 10 million delivery attempts per day, assuming one retry each.
- Preference reads happen for every candidate notification.
- Template rendering can often be cached by campaign version.
- Provider callbacks arrive later and can be processed asynchronously.
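
To put numbers on the first two points, here is a small sketch. The 5 percent and 2 percent rates come from the text; the per-message SMS price is a hypothetical placeholder, not a real provider quote, and the retry figure assumes each retried message is retried exactly once.

```python
# Daily volume implied by the percentages above, using integer math.
notifications_per_day = 500_000_000
sms_fallback_percent = 5
retry_percent = 2
assumed_sms_cost_cents = 1  # hypothetical placeholder, not a real provider price

sms_per_day = notifications_per_day * sms_fallback_percent // 100
# Assumes one retry per retried message; real retry chains would add more.
extra_attempts_per_day = notifications_per_day * retry_percent // 100
sms_cost_usd_per_day = sms_per_day * assumed_sms_cost_cents / 100

print(f"SMS per day: {sms_per_day:,}")                        # 25,000,000
print(f"extra attempts per day: {extra_attempts_per_day:,}")  # 10,000,000
print(f"SMS cost per day at the assumed price: ${sms_cost_usd_per_day:,.0f}")  # $250,000
```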

Capacity implications (keep the design proportional to the load):
- Use durable queues between producers and channel workers.
- Batch when providers allow it, especially for email.
- Cache preference snapshots close to the routing service.
- Separate critical product alerts from bulk marketing traffic.

Latency budget (perceived responsiveness matters early):
- The producer API should return after enqueue, not after delivery.
- Preference lookup and routing should stay in the low tens of milliseconds.
- Push delivery can be near-real-time; email less so.
- Status dashboards can lag slightly behind actual delivery.

These numbers shape the first blueprint. Simple, no? Design follows load.

Step 3: High-Level Design

See. Keep the top-level flow boring and understandable. The interviewer rewards a clean blueprint before clever tricks.

┌──────────┐   ┌──────────────┐   ┌──────────────┐
│ producers│──→│ notify API   │──→│ routing svc  │
└──────────┘   └──────────────┘   └──────┬───────┘
                           ┌─────────────┼─────────────┐
                           │             │             │
                     ┌─────▼────┐ ┌──────▼─────┐ ┌─────▼─────┐
                     │ prefs    │ │ queue bus  │ │ template  │
                     │ service  │ │ per channel│ │ renderer  │
                     └──────────┘ └──────┬─────┘ └───────────┘
                           ┌──────────────┼──────────────┐
                           │              │              │
                     ┌─────▼────┐  ┌──────▼────┐  ┌──────▼────┐
                     │ push wkrs│  │ email wkrs│  │ SMS wkrs  │
                     └──────────┘  └───────────┘  └───────────┘

Main flow (keep the read and write paths clear):
- The producer sends an event with user, template, and priority.
- The notification API validates the schema and the idempotency key.
- The routing service loads preferences and chooses eligible channels.
- A channel-specific message is enqueued for workers.
- Workers call providers and persist delivery attempts.
- Callbacks update status tables and retry schedules asynchronously.
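
The synchronous half of this flow (validate, deduplicate, route, enqueue) can be sketched as below. Everything here is an in-memory stand-in chosen for illustration: `Event`, `PREFERENCES`, `CHANNEL_QUEUES`, `SEEN_KEYS`, and `submit` are hypothetical names, and a real system would back them with a durable queue and a preference store.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    idempotency_key: str
    user_id: str
    template_id: str
    priority: str


# Hypothetical in-memory stand-ins for real infrastructure.
PREFERENCES = {"u1": {"push", "email"}}  # user_id -> opted-in channels
CHANNEL_QUEUES = {"push": deque(), "email": deque(), "sms": deque()}
SEEN_KEYS = set()  # idempotency keys already accepted


def submit(event: Event) -> str:
    """Validate, deduplicate, route, and enqueue. Returns before any delivery."""
    if not event.user_id or not event.template_id:
        return "rejected"
    if event.idempotency_key in SEEN_KEYS:
        return "duplicate"
    SEEN_KEYS.add(event.idempotency_key)

    eligible = PREFERENCES.get(event.user_id, set())
    if not eligible:
        return "suppressed"  # no channel the user has opted into
    for channel in sorted(eligible):
        CHANNEL_QUEUES[channel].append(event)  # workers drain these later
    return "accepted"
```

Note that `submit` never calls a provider. That is the point to say aloud: the producer's latency ends at enqueue.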

Data model sketch (keep keys and queries obvious):
- The preference record stores channel opt-ins, quiet hours, and locale.
- The notification event keeps the idempotency key, priority, and template version.
- The delivery attempt row tracks provider response, retry count, and state.
- Dead-letter records preserve the final failure reason for debugging.
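
One possible shape for these four records, written as dataclasses. The field names and types are assumptions made for illustration (for example, quiet hours as a pair of local hours); only the kinds of information stored come from the sketch above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeliveryState(Enum):
    PENDING = "pending"
    DELIVERED = "delivered"
    FAILED = "failed"
    ABANDONED = "abandoned"


@dataclass
class Preference:
    user_id: str
    channel_opt_ins: dict[str, bool]         # e.g. {"push": True, "sms": False}
    quiet_hours: Optional[tuple[int, int]]   # assumed (start_hour, end_hour), local time
    locale: str


@dataclass(frozen=True)
class NotificationEvent:
    idempotency_key: str
    user_id: str
    priority: str
    template_id: str
    template_version: int


@dataclass
class DeliveryAttempt:
    idempotency_key: str
    channel: str
    provider_response: Optional[str] = None
    retry_count: int = 0
    state: DeliveryState = DeliveryState.PENDING


@dataclass(frozen=True)
class DeadLetter:
    idempotency_key: str
    channel: str
    final_failure_reason: str
```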

What to say aloud (so the interviewer hears your structure):
- Start by separating producer latency from provider latency.
- Reason aloud about why queues sit between routing and providers.
- Mention that preference evaluation happens before expensive channel work.
- State that channel isolation prevents one provider outage from blocking others.

Step 4: Deep Dive

So what to do? Pick two hotspots and go deeper. Do not deep dive everywhere.

Component 1: Preference evaluation and deduplication

Goal (give the deep dive a target):
- Decide quickly whether a user should receive this notification.
- Prevent duplicate sends from retried upstream events.

Design notes (details must still map to scale):
- Use a user preference store with a cache for hot users.
- Apply quiet-hour logic and channel fallbacks in the routing layer.
- Derive idempotency keys from the event source and recipient.
- Store recent send fingerprints to avoid campaign duplicates.
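
Two of these notes are easy to get subtly wrong, so here is a minimal sketch of each: a quiet-hours check that handles windows wrapping past midnight, and a recent-send fingerprint store with a time window. The names `in_quiet_hours` and `RecentSends` are hypothetical, and the in-memory dictionary stands in for whatever shared cache a real routing layer would use.

```python
from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), including windows that wrap midnight."""
    if start == end:
        return False  # assumption: equal bounds means no quiet window is set
    if start < end:
        return start <= now < end
    return now >= start or now < end  # e.g. 22:00 to 07:00


class RecentSends:
    """Remembers (source event, recipient) fingerprints for a fixed window."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self._seen = {}  # (source_event_id, recipient_id) -> time of accepted send

    def first_time(self, source_event_id: str, recipient_id: str, now: float) -> bool:
        key = (source_event_id, recipient_id)
        last = self._seen.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: suppress
        self._seen[key] = now
        return True
```

For example, with a 22:00 to 07:00 window, 23:30 is quiet and 12:00 is not; with a 60-second dedupe window, a repeat at 30 seconds is suppressed and a repeat at 61 seconds is allowed.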

Component 2: Retry orchestration and delivery guarantees

Goal (give the deep dive a target):
- Retry transient provider failures without spamming users.
- Expose useful final states like delivered, failed, or abandoned.

Design notes (details must still map to scale):
- Use exponential backoff with channel-specific retry policies.
- Send exhausted messages to a dead-letter queue for inspection.
- Differentiate permanent errors like invalid tokens from transient ones.
- Persist attempt history so operators can explain delivery gaps.
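
A minimal sketch of the retry decision, under stated assumptions: the policy numbers and the error codes in `PERMANENT_ERRORS` are hypothetical, and the jitter strategy (a random delay between zero and the exponential ceiling) is one common choice rather than something this design prescribes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    base_delay_s: float
    max_delay_s: float
    max_attempts: int


# Hypothetical per-channel policies; tune against real provider behavior.
POLICIES = {
    "push": RetryPolicy(base_delay_s=1.0, max_delay_s=60.0, max_attempts=5),
    "email": RetryPolicy(base_delay_s=30.0, max_delay_s=3600.0, max_attempts=8),
    "sms": RetryPolicy(base_delay_s=5.0, max_delay_s=300.0, max_attempts=3),
}

# Hypothetical error codes that retrying cannot fix.
PERMANENT_ERRORS = {"invalid_token", "unsubscribed", "malformed_address"}


def next_action(channel, attempts_made, error_code, rng=random.random):
    """Decide what to do after a failed attempt.

    Returns ("fail_permanent", None), ("dead_letter", None),
    or ("retry", delay_in_seconds).
    """
    if error_code in PERMANENT_ERRORS:
        return ("fail_permanent", None)
    policy = POLICIES[channel]
    if attempts_made >= policy.max_attempts:
        return ("dead_letter", None)  # exhausted: park for inspection
    # Exponential ceiling: base, 2x base, 4x base, ... capped at max_delay_s.
    ceiling = min(policy.max_delay_s, policy.base_delay_s * 2 ** (attempts_made - 1))
    return ("retry", rng() * ceiling)  # jitter spreads retries out
```

The jitter matters during provider incidents: without it, every failed message retries on the same schedule and the retries arrive as synchronized waves.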

Reason aloud as you compare one easy option with one scalable option. Admit an honest gap if exact thresholds are unknown.

Interviewer follow-ups to prepare:
- How do you stop a buggy producer from spamming everyone?
- How do you support priority lanes for security alerts?
- What changes if providers have strict per-second quotas?
- How would you collapse many similar events into one digest?
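
For the first and third follow-ups, one common answer is a rate limiter per producer (or per provider). The sketch below uses a token bucket, which is a swapped-in technique for illustration and not something the design above mandates; the `TokenBucket` class and its parameters are hypothetical.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `rate_per_s`."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity  # start full so normal producers are not penalized
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over quota: reject, delay, or queue at lower priority
```

Keeping one bucket per producer caps a buggy sender at the API; keeping one per provider in front of the channel workers keeps sends under a per-second quota.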

Why not the simpler alternative? (Keep the tradeoffs visible.)
- Calling providers inline is simple, but terrible for producer latency.
- One big shared queue is easy, but noisy neighbors become painful.
- Strict exactly-once delivery sounds nice, but is rarely worth the cost.
- Putting preference logic inside each worker duplicates business rules.

Step 5: Tradeoffs & Failure Modes

Now watch. Senior answers end with tradeoffs and breakage paths. That is where judgment shows up.

Tradeoffs (let the interviewer hear the cost clearly):
- At-least-once delivery is robust, but requires deduplication logic.
- Per-channel queues isolate failures, but add more objects to operate.
- Caching preferences cuts latency, but risks short-lived stale reads.
- Aggressive retries improve delivery, but can annoy users and providers.
- Digesting notifications saves cost, but reduces immediacy.

Failure modes (real systems always break somewhere):
- A provider outage can back up one channel queue rapidly.
- Bad preference cache invalidation can violate user settings.
- Template bugs can fail whole campaigns across channels.
- Callback loss can leave status stuck in an unknown state.
- Retry storms can amplify provider incidents.

Recovery levers (end the failure discussion with action):
- Pause non-critical traffic when provider health drops.
- Use separate priority queues for security and transactional alerts.
- Replay callback events from provider exports if necessary.
- Expose kill-switches per template and per producer.
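
Three of these levers (pausing non-critical traffic, priority lanes, and kill-switches) fit in one small dispatcher sketch. The lane names, the `Dispatcher` class, and treating "marketing" as the only non-critical lane are all assumptions made for illustration.

```python
from collections import deque

LANES = ("security", "transactional", "marketing")  # highest priority first
NON_CRITICAL = {"marketing"}  # assumption: only bulk marketing is pausable


class Dispatcher:
    def __init__(self):
        self.queues = {lane: deque() for lane in LANES}
        self.killed_templates = set()
        self.killed_producers = set()
        self.pause_non_critical = False

    def enqueue(self, lane, producer, template, payload):
        self.queues[lane].append((producer, template, payload))

    def next_sendable(self):
        """Return the next payload to send, or None if nothing is sendable."""
        for lane in LANES:
            if self.pause_non_critical and lane in NON_CRITICAL:
                continue  # paused messages stay queued for later
            queue = self.queues[lane]
            while queue:
                producer, template, payload = queue.popleft()
                if producer in self.killed_producers or template in self.killed_templates:
                    continue  # dropped by a kill-switch
                return payload
        return None
```

Paused marketing messages stay in their queue and resume when the flag clears, while kill-switched messages are dropped. In a real system the dropped ones would also be recorded so operators can explain the gap.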

Close with an honest gap on one metric you would validate live. That sounds calm, not weak.

Interview Q&A

Q1. Why not send directly from the producer service?
A: Because producer services should not block on flaky third-party providers. A notification platform isolates that variability.
Common wrong answer to avoid: Each producer should manage its own provider integrations.

Q2. Why are queues central here?
A: Because queues absorb bursts, isolate channels, and support retries cleanly. They turn spiky demand into manageable worker flow.
Common wrong answer to avoid: A single queue for everything is always simpler and good enough.

Q3. How do you avoid duplicate notifications?
A: Use idempotency keys on incoming events and dedupe windows around downstream sends. Retries then become safe.
Common wrong answer to avoid: Exactly-once delivery is easy if you just retry carefully.

Q4. When would SMS be chosen over push?
A: When the message is urgent and the user allows it, or when push delivery is impossible and fallback rules permit extra cost.
Common wrong answer to avoid: SMS should be the default because it is reliable.
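
The rule in the Q4 answer can be written as a small decision function. This is a sketch: `choose_channel` and its parameters are hypothetical names, and falling back to email as the last resort is an added assumption, not part of the answer above.

```python
def choose_channel(urgent, opted_in, push_available, sms_fallback_allowed):
    """Pick one channel, or None if the user has allowed nothing usable.

    `opted_in` is the set of channels the user has allowed.
    """
    # Urgent messages go to SMS only when the user allows SMS.
    if urgent and "sms" in opted_in:
        return "sms"
    # Push is the cheap default when it can actually be delivered.
    if "push" in opted_in and push_available:
        return "push"
    # Push is impossible: use SMS only if fallback rules permit the extra cost.
    if "sms" in opted_in and sms_fallback_allowed:
        return "sms"
    # Assumption: email is the last resort when the user has opted in.
    if "email" in opted_in:
        return "email"
    return None
```

Notice that SMS is never the default: it appears only behind an urgency check or an explicit fallback permission, and always behind the user's opt-in.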

Apply now (5 min) — practice exercise

Take five minutes. Do this without notes.

Practice checklist (keep the rehearsal focused):
- Pick one event and map its preferred channel order.
- Estimate peak queue depth during a campaign burst.
- Draw the routing and worker split in under thirty seconds.
- Explain your retry policy for transient failures.
- Name one operator kill-switch you would add.

Self-check:
- Did you separate producer speed from delivery speed?
- Did you mention idempotency explicitly?
- Did you isolate channels operationally?
- Did you show how preferences affect routing?

Say this opening (so your first minute sounds controlled):
- Open by listing the channels and user controls.
- Then separate synchronous enqueue from asynchronous delivery.
- End with retries, DLQ, and provider failure handling.

Run the choreography once in short form, then once with details. Stay aware of the stage and pause for questions.

Bridge. Notifications sent. Now the content itself — a news feed. → 04