07. Design YouTube
⏱️ Estimated time: 20 min | Level: advanced
ELI5 callback: On this stage, you are pitching a city media network. Show the blueprint first, then let the choreography explain uploads, playback, and global reach.
Step 1: Requirements & Constraints
See. First trap is solving the wrong question. Ask crisp questions, then freeze scope.
Functional requirements (state each one so scope stays explicit)
- Allow creators to upload large video files reliably.
- Transcode videos into multiple formats and bitrates.
- Serve videos globally with fast startup and smooth playback.
- Store metadata, comment counts, and recommendation hooks.
- Support resumable uploads and creator-facing processing states.

Non-functional requirements (each of these constraints changes the architecture)
- The upload pipeline must survive retries and large file sizes.
- Playback availability should stay high across regions.
- CDN delivery should absorb the majority of read traffic.
- Transcoding should scale asynchronously with clear backpressure.
- Recommendation integration should not block core playback.

Constraints and assumptions (round numbers keep the estimate grounded)
- Assume 5 million new video uploads per day.
- Assume the average uploaded source file is 500 MB.
- Assume 300 million video plays per day globally.
- Assume peak watch traffic is much higher than upload traffic.

What to explicitly de-scope (interview time is limited)
- Live streaming is a different pipeline and is out for now.
- Copyright fingerprinting details are simplified here.
- Comment serving and search indexing are secondary systems.
- Advanced recommendation model training is outside this answer.
On the stage, say what is in and out. That makes the choreography visible and saves time.
Step 2: Scale Estimation
Now watch. Use round numbers, not thesis-level math. One minute of math can remove ten minutes of confusion.
Assumptions (keep the back-of-envelope math honest)
- 5 million uploads per day is about 58 uploads per second on average.
- At 500 MB each, daily ingest is roughly 2.5 petabytes.
- 300 million plays per day is about 3,500 plays per second on average.
- Peak playback traffic can be 10 times the average or more.
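The arithmetic behind these assumptions can be checked in a few lines. This is a minimal sketch using only the numbers stated above; the variable names are illustrative, decimal units are assumed (1 PB = 1,000,000,000 MB), and the peak multiplier is taken at its floor of 10.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Inputs taken directly from the stated assumptions.
uploads_per_day = 5_000_000
avg_source_mb = 500
plays_per_day = 300_000_000
peak_multiplier = 10  # "10 times average or more", taken at its floor

uploads_per_sec = uploads_per_day / SECONDS_PER_DAY
plays_per_sec = plays_per_day / SECONDS_PER_DAY
daily_ingest_pb = uploads_per_day * avg_source_mb / 1_000_000_000  # MB -> PB
peak_plays_per_sec = plays_per_sec * peak_multiplier

print(f"uploads/sec (avg): {uploads_per_sec:.0f}")
print(f"plays/sec (avg):   {plays_per_sec:.0f}")
print(f"daily ingest (PB): {daily_ingest_pb:.1f}")
print(f"plays/sec (peak):  {peak_plays_per_sec:.0f}")
```

On the stage, you would round these to 58, 3,500, 2.5, and roughly 35,000. The rounding is the point: the interviewer wants orders of magnitude, not decimals.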
Quick math (each point directly changes component choices)
- Playback bytes dominate storage reads because viewers outnumber uploaders massively: 300 million plays against 5 million uploads is 60 plays per upload.
- One source video produces many renditions and thumbnails.
- Transcoding compute cost scales with duration and the number of target formats.
- CDN cache hit rate is critical for origin cost.
- Adaptive bitrate manifests are tiny compared with video chunks.
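To see why cache hit rate matters so much for origin cost, here is a small sketch. It treats the 3,500 plays per second average as the edge request rate, which is a simplification (a real play pulls many chunks), and the hit rates are hypothetical values chosen for illustration, not measured figures.

```python
def origin_requests_per_sec(edge_requests_per_sec: float, cache_hit_rate: float) -> float:
    """Requests that miss the CDN cache and fall through to origin storage."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return edge_requests_per_sec * (1.0 - cache_hit_rate)


# Hypothetical hit rates; the edge rate reuses the average plays/sec estimate.
for hit_rate in (0.90, 0.95, 0.99):
    origin = origin_requests_per_sec(3_500, hit_rate)
    print(f"hit rate {hit_rate:.0%}: origin sees about {origin:.0f} requests/sec")
```

Moving from a 90% to a 99% hit rate cuts origin traffic by a factor of ten. That is why the CDN is a core component and not an optimization.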
Capacity implications (keep the design proportional to the load)
- Keep upload ingest and playback delivery as separate concerns.
- Store source files durably before heavy processing starts.
- Use queues to smooth transcoding spikes.
- Push final chunks to CDN-friendly object storage.
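A queue only smooths spikes if something reads its depth and reacts. The sketch below sizes a worker pool from the backlog. The function name, the throughput figure, and the drain target are all assumptions made for illustration, and the calculation ignores new jobs arriving while the backlog drains.

```python
import math


def workers_needed(backlog_video_minutes: float,
                   video_minutes_per_worker_minute: float,
                   target_drain_minutes: float,
                   min_workers: int = 1,
                   max_workers: int = 1000) -> int:
    """Workers required to drain the current backlog within the target window."""
    capacity_per_worker = video_minutes_per_worker_minute * target_drain_minutes
    needed = math.ceil(backlog_video_minutes / capacity_per_worker)
    # Clamp so the pool never scales to zero or beyond a cost ceiling.
    return max(min_workers, min(max_workers, needed))


# Hypothetical: 12,000 queued video minutes, each worker transcodes
# 2 video minutes per wall-clock minute, and we want to drain in 30 minutes.
print(workers_needed(12_000, 2.0, 30))  # 12,000 / (2.0 * 30) = 200
```

Measuring the backlog in video minutes, not job count, reflects the earlier point that transcoding cost scales with duration.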
Latency budget (how the product feels matters early)
- Upload acknowledgment should happen before transcoding completes.
- Playback startup time should stay low for viewer retention.
- Manifest and first chunk retrieval matter more than total file size.
- Recommendation fetch can run in parallel with playback page rendering.
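The last point can be sketched with asyncio: start the manifest fetch and the recommendation fetch together, and give up on recommendations if they run over a small budget. The function names, delays, and the 50 ms timeout are hypothetical stand-ins, not a real service API.

```python
import asyncio
from typing import Dict, List


async def fetch_manifest(video_id: str) -> Dict[str, str]:
    await asyncio.sleep(0.01)  # stand-in for a fast CDN or metadata call
    return {"video_id": video_id, "manifest_path": f"manifests/{video_id}"}


async def fetch_recommendations(video_id: str, delay: float) -> List[str]:
    await asyncio.sleep(delay)  # stand-in for a recommendation service call
    return ["rec-1", "rec-2"]


async def load_watch_page(video_id: str, rec_delay: float,
                          rec_timeout: float = 0.05) -> dict:
    # Start both calls at once so recommendations never delay the manifest.
    manifest_task = asyncio.create_task(fetch_manifest(video_id))
    rec_task = asyncio.create_task(fetch_recommendations(video_id, rec_delay))

    manifest = await manifest_task  # playback can begin as soon as this returns
    try:
        # The timeout starts counting once the manifest is in hand.
        recs = await asyncio.wait_for(rec_task, timeout=rec_timeout)
    except asyncio.TimeoutError:
        recs = []  # degrade gracefully: show the video with no recommendations
    return {"manifest": manifest, "recommendations": recs}


fast = asyncio.run(load_watch_page("v1", rec_delay=0.0))
slow = asyncio.run(load_watch_page("v1", rec_delay=1.0))
print(fast["recommendations"], slow["recommendations"])
```

The slow case still returns a playable page. That is the behavior to describe aloud: recommendations are optional, the manifest is not.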
These numbers shape the first blueprint. Simple, no? Design follows load.
Step 3: High-Level Design
See. Keep the top-level flow boring and understandable. The interviewer rewards a clean blueprint before clever tricks.
┌──────────┐   ┌────────────┐   ┌──────────────┐
│ creators │──→│ upload API │──→│ object store │
└──────────┘   └──────┬─────┘   └──────┬───────┘
                      │                │
                ┌─────▼─────┐    ┌─────▼──────┐
                │ metadata  │    │ transcode  │
                │ service   │    │ queue      │
                └─────┬─────┘    └─────┬──────┘
                      │                │
                ┌─────▼─────┐    ┌─────▼──────┐
                │ video DB  │    │ workers    │
                │ + status  │    │ renditions │
                └─────┬─────┘    └─────┬──────┘
                      │                │
                ┌─────▼────────────────▼──────┐
                │ CDN + manifest/chunk store  │
                └─────────────────────────────┘
Main flow (keep the read path and the write path distinct)
- The creator uploads the file, preferably in resumable parts.
- The upload API stores parts in object storage and records metadata.
- Once the upload is committed, a transcode job is enqueued.
- Workers generate renditions, thumbnails, and manifest files.
- Playback clients fetch manifests and chunks through the CDN.
- The recommendation service is queried separately for the watch page.

Data model sketch (keys and queries should stay obvious)
- Video metadata stores owner, title, duration, and processing status.
- Chunk manifests map rendition names to object-store paths.
- Transcode job state tracks progress, retries, and errors.
- Playback analytics events capture startup delay and watch behavior.

What to say aloud (let the interviewer hear your structure)
- Start by splitting upload, processing, and playback into separate paths.
- Reason aloud about why transcoding is asynchronous.
- Mention that the CDN should serve most bytes, not origin servers.
- State that recommendation hooks should not block core streaming.
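The first three records in the data model sketch could look like this in code. This is a minimal sketch, not a schema: the class and field names are assumptions, and the status values reuse the uploading, processing, ready, and failed states named later in the upload deep dive.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ProcessingStatus(str, Enum):
    UPLOADING = "uploading"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


@dataclass
class VideoMetadata:
    video_id: str  # primary key
    owner_id: str
    title: str
    duration_seconds: float
    status: ProcessingStatus = ProcessingStatus.UPLOADING


@dataclass
class Manifest:
    video_id: str
    # Rendition name -> object-store path, for example "720p" -> "videos/v1/720p/".
    renditions: Dict[str, str] = field(default_factory=dict)


@dataclass
class TranscodeJob:
    job_id: str
    video_id: str
    progress_percent: int = 0
    retries: int = 0
    last_error: Optional[str] = None


video = VideoMetadata("v1", "creator-42", "My first upload", 93.5)
manifest = Manifest("v1", {"720p": "videos/v1/720p/"})
print(video.status.value, manifest.renditions["720p"])
```

Every record is keyed by video_id, so the common lookups (status for a creator, manifest for a player) are single-key reads.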
Step 4: Deep Dive
So what to do? Pick two hotspots and go deeper. Do not deep dive everywhere.
Component 1: Upload ingest and resumable storage
Goal (gives the deep dive a target)
- Accept very large files without forcing restarts on failure.
- Record processing state so creators know what is happening.

Design notes (each detail still has to map to scale)
- Use multipart or chunked uploads with client-side retries.
- Store source files in durable object storage before processing.
- Commit metadata only after all required parts arrive.
- Expose statuses like uploading, processing, ready, and failed.
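The commit rule above can be sketched as a small tracker: parts may arrive in any order and may be retried, and the upload moves to processing only when none are missing. The class and method names are hypothetical, and an in-memory dict stands in for whatever durable store holds upload state.

```python
from typing import Dict, List


class ResumableUpload:
    """Tracks which parts of one upload have arrived (in-memory sketch)."""

    def __init__(self, upload_id: str, total_parts: int) -> None:
        self.upload_id = upload_id
        self.total_parts = total_parts
        self.received: Dict[int, str] = {}  # part number -> checksum
        self.status = "uploading"

    def put_part(self, part_number: int, checksum: str) -> None:
        if self.status != "uploading":
            raise RuntimeError("upload is already committed")
        if not 1 <= part_number <= self.total_parts:
            raise ValueError("part number out of range")
        # A retried part overwrites the same slot, so retries are harmless.
        self.received[part_number] = checksum

    def missing_parts(self) -> List[int]:
        return [n for n in range(1, self.total_parts + 1) if n not in self.received]

    def commit(self) -> bool:
        """Move to 'processing' only when every part has arrived."""
        if self.missing_parts():
            return False
        self.status = "processing"
        return True


upload = ResumableUpload("u1", total_parts=3)
upload.put_part(1, "aaa")
upload.put_part(3, "ccc")
print(upload.commit(), upload.missing_parts())  # not ready: part 2 is missing
upload.put_part(2, "bbb")
print(upload.commit(), upload.status)
```

After a network failure, the client asks for missing_parts and resends only those. That is what makes the upload resumable.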
Component 2: Transcoding and adaptive bitrate delivery
Goal (gives the deep dive a target)
- Produce multiple renditions for different devices and networks.
- Serve the right chunk quality during changing bandwidth conditions.

Design notes (each detail still has to map to scale)
- Workers pull jobs from a queue and write output manifests.
- Generate common bitrate ladders instead of arbitrary per-video formats.
- Adaptive players choose renditions chunk by chunk.
- CDN caches popular chunks close to viewers globally.
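Here is a minimal sketch of how a player might pick a rendition for the next chunk. The ladder bitrates and the 0.8 safety factor are hypothetical values for illustration. Real players also weigh buffer level and recent switches, which this sketch leaves out.

```python
from typing import List, Tuple

# Hypothetical bitrate ladder: (rendition name, bitrate in kbps), lowest first.
LADDER: List[Tuple[str, int]] = [
    ("240p", 400),
    ("360p", 800),
    ("480p", 1400),
    ("720p", 2800),
    ("1080p", 5000),
]


def pick_rendition(measured_kbps: float, safety_factor: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits within the bandwidth budget."""
    budget = measured_kbps * safety_factor  # leave headroom for throughput dips
    chosen = LADDER[0][0]  # always fall back to the lowest rung
    for name, bitrate_kbps in LADDER:
        if bitrate_kbps <= budget:
            chosen = name
    return chosen


# The player would call this before each chunk with a fresh throughput estimate.
for kbps in (6500, 2000, 100):
    print(kbps, "kbps ->", pick_rendition(kbps))
```

Because every video shares the same ladder, this selection logic and the CDN cache keys stay uniform across the catalog.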
Use reasoning aloud to compare one easy option and one scalable option. Add an honest gap if exact thresholds are unknown.
Interviewer follow-ups to prepare
- How do you avoid duplicate transcoding after a retry?
- What changes when a video suddenly goes viral?
- How would you support private or unlisted videos?
- Where do captions and thumbnails fit in the pipeline?

Why not the simpler alternative? (keep the tradeoffs visible)
- Streaming directly from upload servers is simple, but it cannot scale.
- Synchronous transcoding gives quick certainty, but terrible upload latency.
- Serving from origin without a CDN is easy, but far too expensive.
- Over-personalizing the playback path adds risk to a core experience.
Step 5: Tradeoffs & Failure Modes
Now watch. Senior answers end with tradeoffs and breakage paths. That is where judgment shows up.
Tradeoffs (state the cost of each choice clearly)
- More renditions improve playback quality, but increase storage and compute.
- Aggressive CDN caching cuts cost, but invalidation becomes harder.
- Resumable uploads help creators, but require more state management.
- Asynchronous processing scales well, but creators wait before publishing.
- Recommendation integration boosts engagement, but should stay decoupled.

Failure modes (real systems always break somewhere)
- Partial upload commit bugs can create corrupt source files.
- Transcode queue backlog can delay publishing for creators.
- Worker crashes can leave job state stuck mid-processing.
- CDN cache miss storms can overload origin storage.
- Manifest bugs can break playback across many devices.

Recovery levers (end the failure discussion with action)
- Use idempotent job keys so retries do not duplicate work.
- Keep previous manifests until new renditions are fully validated.
- Autoscale workers based on queue depth and video duration mix.
- Expose creator retry controls for failed processing states.
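The idempotent job key lever can be sketched as follows. The key is derived from what the job would produce, so a retry of the same work maps to the same key and is rejected. The names are hypothetical, and the in-memory dict stands in for a store that supports an atomic insert-if-absent.

```python
import hashlib
from typing import Dict


def job_key(video_id: str, source_checksum: str, rendition: str) -> str:
    """Deterministic key: identical inputs always produce the same key."""
    raw = f"{video_id}:{source_checksum}:{rendition}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class JobRegistry:
    """In-memory stand-in for a durable store with insert-if-absent."""

    def __init__(self) -> None:
        self._jobs: Dict[str, str] = {}

    def enqueue(self, video_id: str, source_checksum: str, rendition: str) -> bool:
        key = job_key(video_id, source_checksum, rendition)
        if key in self._jobs:
            return False  # duplicate request: this work is already queued or done
        self._jobs[key] = "queued"
        return True


registry = JobRegistry()
print(registry.enqueue("v1", "abc123", "720p"))   # first request is accepted
print(registry.enqueue("v1", "abc123", "720p"))   # retry of the same work is rejected
print(registry.enqueue("v1", "abc123", "1080p"))  # different rendition is new work
```

Including the source checksum in the key means a re-uploaded source gets fresh jobs, while a retried request for the same source does not.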
Close with an honest gap on one metric you would validate live. That sounds calm, not weak.
Interview Q&A
Q1. Why not transcode during upload synchronously?
A: Because uploads are large and unreliable. Synchronous transcoding would make the user experience terrible and waste compute on failed uploads.
Common wrong answer to avoid: You should transcode everything inline before acknowledging upload.

Q2. Why is object storage central here?
A: Because video files are huge, must be stored durably, and are read by many downstream systems. Object storage handles that better than databases.
Common wrong answer to avoid: Upload servers can also be the best playback servers.

Q3. What is adaptive bitrate in simple terms?
A: The player switches among renditions based on current network conditions. That keeps playback smoother across devices and bandwidth changes.
Common wrong answer to avoid: A CDN is optional if origin storage is strong enough.

Q4. Where should recommendations connect?
A: At the watch-page metadata layer, not the video chunk path. Recommendation latency should not block basic playback.
Common wrong answer to avoid: Recommendation logic belongs directly inside chunk serving.
Apply now (5 min): practice exercise
Take five minutes. Do this without notes.
Practice checklist (keep the rehearsal focused)
- Estimate ingest bytes and playback bytes separately.
- Draw upload, transcode, and playback as three lanes.
- Explain why object storage sits early in the pipeline.
- Pick one playback metric you would watch closely.
- Name one queue-based backpressure signal.

Self-check
- Did you separate creator flow from viewer flow?
- Did you mention resumable uploads?
- Did you explain adaptive bitrate clearly?
- Did you keep recommendation integration off the hot chunk path?

Say this opening (so your first minute sounds controlled)
- Open with the three paths: upload, process, and serve.
- Then place object storage, queues, and CDN.
- Finish with adaptive bitrate and failure handling.
Run the choreography once in short form, then once with details. Stay aware of the stage and pause for questions.
Bridge. Video served. Now the layer that makes reads fast — distributed cache. → 08