01. Design URL Shortener¶

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: On this stage, you are proposing a city shortcut plan. Show the blueprint first, then let the choreography guide the council through latency and scale.

Step 1: Requirements & Constraints¶

See. First trap is solving the wrong question. Ask crisp questions, then freeze scope.

Functional requirements - Shorten a long URL into a unique short code, because scope must stay explicit. - Redirect a short code to the original URL fast, because scope must stay explicit. - Support custom aliases for premium users, because scope must stay explicit. - Track click counts, referrers, and geography for analytics, because scope must stay explicit. - Allow expiration or deletion for abuse and compliance, because scope must stay explicit.

Non-functional requirements - Redirect latency should stay low, ideally under 100 ms, because that constraint changes architecture. - Reads dominate writes, so read path must scale cheaply, because that constraint changes architecture. - Short codes should avoid collisions and bad-word patterns, because that constraint changes architecture. - System should survive node loss without losing mappings, because that constraint changes architecture. - Analytics can be slightly delayed, but redirects cannot, because that constraint changes architecture.

Constraints and assumptions - Assume 100 million new short URLs per month, so your estimate stays grounded. - Assume 20 billion redirects per month at peak global traffic, so your estimate stays grounded. - Assume most URLs are public and not access-controlled, so your estimate stays grounded. - Assume alias conflicts are rare and resolved synchronously, so your estimate stays grounded.

What to explicitly de-scope - Real-time fraud detection can be a later extension, because interview time is limited. - Preview pages and malware scanning can be async add-ons, because interview time is limited. - Enterprise folder management is out for this round, because interview time is limited. - QR code generation is useful, but not core today, because interview time is limited.

On the stage, say what is in and out. That makes the choreography visible and saves time.

Step 2: Scale Estimation¶

Now watch. Use round numbers, not thesis-level math. One minute of math can remove ten minutes of confusion.

Assumptions - 100 million new URLs per month means about 3.3 million per day, so the back-of-envelope math stays honest. - That is roughly 38 writes per second on average, so the back-of-envelope math stays honest. - Use a 10x peak factor and call it 400 writes per second, so the back-of-envelope math stays honest. - 20 billion redirects per month is about 7,700 reads per second average, so the back-of-envelope math stays honest.

Quick math - Peak redirects ≈ 7,700 × 5 = 38,500 reads per second, which directly changes component choices. - Average metadata row size ≈ 500 bytes with indexes, which directly changes component choices. - Monthly metadata growth ≈ 100,000,000 × 500 B = 50 GB, which directly changes component choices. - Five-year metadata storage stays manageable with sharding, which directly changes component choices. - Analytics events are far larger than the URL mapping table, which directly changes component choices.

Capacity implications - Primary pain is redirect latency, not code generation throughput, so the design stays proportional. - Cache hot mappings close to users and origin apps, so the design stays proportional. - Keep write path simple because writes are modest, so the design stays proportional. - Move analytics off the critical redirect path, so the design stays proportional.

Latency budget - DNS and TLS already consume part of the budget, because user feel matters early. - Application lookup should ideally stay within 10 to 20 ms, because user feel matters early. - Cache miss plus database read should stay within 50 ms, because user feel matters early. - Analytics enqueue must be fire-and-forget, because user feel matters early.

These numbers shape the first blueprint. Simple, no? Design follows load.

Step 3: High-Level Design¶

See. Keep the top-level flow boring and understandable. The interviewer rewards a clean blueprint before clever tricks.

┌─────────┐   ┌────────────┐   ┌─────────────┐
│ client  │──→│ edge/CDN   │──→│ app servers │
└─────────┘   └────────────┘   └──────┬──────┘
                                      │
                          ┌───────────┼───────────┐
                          │           │           │
                    ┌─────▼────┐ ┌────▼─────┐ ┌────▼──────┐
                    │ Redis    │ │ URL DB   │ │ Event Bus │
                    │ hot keys │ │ mappings │ │ analytics │
                    └──────────┘ └──────────┘ └────┬──────┘
                                                    │
                                              ┌─────▼─────┐
                                              │ Analytics │
                                              │ pipeline  │
                                              └───────────┘

Main flow - Create request hits app servers through a load balancer, so the read and write path stays clear. - App generates or validates the short code, so the read and write path stays clear. - Mapping is stored in the URL database and cached, so the read and write path stays clear. - Redirect request first checks Redis for the short code, so the read and write path stays clear. - Cache miss falls back to the database and then backfills cache, so the read and write path stays clear. - Analytics event is pushed asynchronously after redirect decision, so the read and write path stays clear.

Data model sketch - short_code → long_url is the hottest lookup key, so keys and queries stay obvious. - Store created_at, owner_id, expiry_at, and status fields, so keys and queries stay obvious. - Maintain alias ownership index for premium custom names, so keys and queries stay obvious. - Send click events to a separate append-only analytics store, so keys and queries stay obvious.

What to say aloud - Start with redirects because that is the hottest user journey, so the interviewer hears your structure. - Use reasoning aloud to explain why cache sits before the database, so the interviewer hears your structure. - Mention that analytics stays async because user latency matters more, so the interviewer hears your structure. - State that custom aliases need a uniqueness check on write, so the interviewer hears your structure.

Step 4: Deep Dive¶

So what to do? Pick two hotspots and go deeper. Do not deep dive everywhere.

Component 1: Code generation and uniqueness¶

Goal - Generate short codes without collisions under concurrent writes, so the deep dive has a target. - Keep codes compact enough for user-friendly sharing, so the deep dive has a target.

Design notes - A global counter plus base62 encoding is simple and predictable, because details must still map to scale. - Hashing the long URL helps dedupe, but collisions still need handling, because details must still map to scale. - Pre-generating code blocks per app shard removes a central bottleneck, because details must still map to scale. - Reserve blocked words and confusing combinations during generation, because details must still map to scale.

Component 2: Redirect path and cache behavior¶

Goal - Serve hot redirects in a single network round trip, so the deep dive has a target. - Prevent cache stampede when a viral code suddenly spikes, so the deep dive has a target.

Design notes - Use Redis with TTL plus lazy refresh for hot keys, because details must still map to scale. - Negative-cache unknown codes briefly to stop repeated misses, because details must still map to scale. - Add request coalescing or mutex locking for miss storms, because details must still map to scale. - Return cached permanent redirect metadata with versioning if needed, because details must still map to scale.

Use reasoning aloud to compare one easy option and one scalable option. Add an honest gap if exact thresholds are unknown.

Interviewer follow-ups to prepare - How would you handle vanity alias abuse and squatting? - What changes if redirects must be geo-aware for experiments? - How would you isolate analytics traffic from redirect traffic? - When would you shard by code prefix versus by range?

Why not the simpler alternative? - A pure hash-only scheme feels elegant, but collision handling gets messy, so tradeoffs stay visible. - A single SQL instance works early, but hotspot growth arrives fast, so tradeoffs stay visible. - Putting analytics inline gives accuracy, but hurts redirect latency, so tradeoffs stay visible. - Overusing strong consistency here costs more than it helps, so tradeoffs stay visible.

Step 5: Tradeoffs & Failure Modes¶

Now watch. Senior answers end with tradeoffs and breakage paths. That is where judgment shows up.

Tradeoffs - Counter-based IDs are simple, but reveal growth rate roughly, so the interviewer hears the cost clearly. - Hash-based IDs hide sequence, but need collision handling logic, so the interviewer hears the cost clearly. - Long cache TTL helps latency, but slows propagation of deletes, so the interviewer hears the cost clearly. - Async analytics protects user experience, but delays dashboards, so the interviewer hears the cost clearly. - Global uniqueness checks are safe, but can bottleneck vanity writes, so the interviewer hears the cost clearly.

Failure modes - Redis outage increases database load and redirect latency, because real systems always break somewhere. - Database shard outage makes a subset of links unavailable, because real systems always break somewhere. - Event bus lag delays analytics and abuse signals, because real systems always break somewhere. - Buggy cache invalidation can keep deleted links alive briefly, because real systems always break somewhere. - Hot-key attacks can overload one partition without rate controls, because real systems always break somewhere.

Recovery levers - Serve stale cache for a short grace period during database trouble, so failure discussion ends with action. - Throttle suspicious hot keys at the edge, so failure discussion ends with action. - Replay analytics events from the queue after consumer recovery, so failure discussion ends with action. - Keep a fast admin kill-switch for malicious links, so failure discussion ends with action.

Close with an honest gap on one metric you would validate live. That sounds calm, not weak.

Interview Q&A¶

Q1. Why not generate codes by hashing the full URL only? A: Because identical URLs may need different owners, expiries, or analytics identities. Also, collisions and canonicalization rules complicate the write path. Common wrong answer to avoid: Hashing automatically solves uniqueness and scale for free.

Q2. Why is Redis worth it if the database can already read fast? A: Because redirects are extremely read-heavy and repetitive. A cache absorbs hot traffic and protects the database from viral spikes. Common wrong answer to avoid: A cache is optional because modern databases are already fast.

Q3. Would you store analytics in the same database table? A: No. Redirect lookups need low-latency key access, while analytics wants append-heavy writes and flexible aggregations. Common wrong answer to avoid: One table is enough for redirects and analytics forever.

Q4. When would you move from one database to shards? A: Shard when storage, throughput, or operational blast radius exceeds one node. The code space is a natural partition key. Common wrong answer to avoid: Sharding should happen immediately on day one.

Apply now (5 min) — practice exercise¶

Take five minutes. Do this without notes.

Practice checklist - State the hottest request path in one sentence, so your rehearsal stays focused. - Estimate peak read QPS and cache hit ratio target, so your rehearsal stays focused. - Draw the redirect path in under thirty seconds, so your rehearsal stays focused. - Explain one code-generation option and one downside, so your rehearsal stays focused. - Name one abuse-control mechanism for malicious URLs, so your rehearsal stays focused.

Self-check - Did you separate redirect latency from analytics freshness? - Did you mention collision handling explicitly? - Did you say what happens during cache failure? - Did you keep the write path simpler than the read path?

Say this opening - Open with requirements, then ask one abuse-related question, so your first minute sounds controlled. - Do the math aloud, even if the numbers are rough, so your first minute sounds controlled. - Sketch the read path before the write path, so your first minute sounds controlled.

Run the choreography once in short form, then once with details. Stay aware of the stage and pause for questions.

Bridge. URL shortener done. Now something that protects APIs — a rate limiter. → 02