11. Design Recommendation Engine¶

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: You are on stage, pitching the city council with a blueprint. Follow the choreography, reason aloud, and admit every honest gap.

Step 1: Requirements & Constraints¶

Start wide. Then narrow. See. Scope first, technology later.

Functional requirements¶

Generate personalised item recommendations for home, detail, and notification surfaces.
Combine collaborative filtering, content signals, and learned embeddings.
Retrieve a broad candidate set, then rank and re-rank in real time.
Support exploration, freshness, diversity, and business rules.
Update signals quickly when a user clicks, skips, or purchases.
Expose explanations and offline metrics for model iteration.

Non-functional requirements¶

Keep online latency low enough for interactive product surfaces.
Handle very large user and item catalogs efficiently.
Avoid stale recommendations after strong user intent changes.
Balance relevance against diversity and fairness goals.
Permit cold-start behaviour for new users and new items.
Make feature and embedding pipelines observable end to end.

Clarifying questions to ask¶

What surface are we optimising: home feed, ads, search, or email?
Is the primary metric CTR, watch time, conversion, or retention?
How fresh must recommendations be after a new interaction?
Do we only rank precomputed candidate sets, or also generate candidates online?
How important are diversity, fairness, or catalog coverage constraints?
Do we have explicit feedback, implicit feedback, or both?

What to say on the whiteboard¶

State the user action, core data, and critical latency target.
Split must-have features from nice-to-have features immediately.
Name one honest gap before locking assumptions. Simple, no?
Ask what failure hurts most: money, freshness, or user trust.
Confirm whether single-region launch is acceptable for round one.
Summarise the scope before you move to numbers. Now watch.

Step 2: Scale Estimation¶

Do rough math. Clean math beats fancy math. So what to do? Pick clear assumptions and keep them verbal.

Assumptions¶

Assume 100 million monthly active users.
Assume 10 million active items in the main catalog.
Assume peak recommendation requests are 300 thousand per second.
Assume each request needs 500 initial candidates before ranking.
Assume user embeddings are 256 floats and item embeddings are 256 floats.
Assume 20 percent of traffic needs behavioural signals that are less than five minutes old.

Back-of-envelope math¶

300k RPS with 500 candidates means 150 million candidate scores per second before pruning.
A 256-float embedding at 4 bytes per float is about 1 KB.
Ten million item embeddings need roughly 10 GB raw, before indexes and replicas.
User embeddings for all 100 million monthly actives add roughly 100 GB raw, which is large but manageable once sharded.
If final ranker scores 200 items per request, that is 60 million model scores per second.
Real-time feature joins can dominate latency more than pure vector search.
Interaction logging is a heavy write stream: served impressions plus clicks, skips, and dwell events can match or exceed serving reads, so keep ingestion pipelines separate from the serving path.
Reserve headroom because recommendation spikes follow traffic spikes everywhere else.
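The arithmetic above can be checked in a few lines. This is a minimal sketch that only restates the section's own assumptions; the constant names are illustrative, and the user-store figure assumes every one of the 100 million monthly actives has a stored embedding.

```python
# Back-of-envelope numbers derived from the stated assumptions.
# These are planning estimates, not measurements.
PEAK_RPS = 300_000
CANDIDATES_PER_REQUEST = 500
RANKED_PER_REQUEST = 200
EMBEDDING_DIM = 256
BYTES_PER_FLOAT = 4
ITEM_COUNT = 10_000_000
USER_COUNT = 100_000_000  # assumption: one embedding per monthly active user

candidate_scores_per_sec = PEAK_RPS * CANDIDATES_PER_REQUEST
ranker_scores_per_sec = PEAK_RPS * RANKED_PER_REQUEST
embedding_bytes = EMBEDDING_DIM * BYTES_PER_FLOAT
item_store_gb = ITEM_COUNT * embedding_bytes / 1e9
user_store_gb = USER_COUNT * embedding_bytes / 1e9

print(f"candidate scores/s: {candidate_scores_per_sec:,}")
print(f"ranker scores/s:    {ranker_scores_per_sec:,}")
print(f"bytes per embedding: {embedding_bytes}")
print(f"item embeddings:    ~{item_store_gb:.0f} GB raw")
print(f"user embeddings:    ~{user_store_gb:.0f} GB raw")
```

The two numbers that tend to drive the architecture here are the candidate scores per second and the ranker scores per second; the embedding stores fit comfortably in memory across a small cluster.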

Interview cue¶

Say the biggest number first, then derive storage and bandwidth.
Round aggressively. Nobody wants calculator theatre on the board.
Mention peak-to-average ratio and why it changes capacity planning.
Keep one reserve factor for retries, bursts, and replays.
Remember the stage is interactive, so sanity-check assumptions aloud.
End with the two numbers that drive architecture choice.
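The peak-to-average and reserve-factor cues can be made concrete with a tiny calculation. The ratio of 3 and the reserve factor of 1.5 below are hypothetical values chosen for illustration; the section does not state them, so say your own numbers aloud on the board.

```python
def average_rps(peak_rps: int, peak_to_avg: int) -> int:
    """Average load implied by a peak figure and a peak-to-average ratio."""
    return peak_rps // peak_to_avg


def provisioned_rps(peak_rps: int, reserve_factor: float) -> float:
    """Capacity to provision: peak plus one reserve factor for retries, bursts, and replays."""
    return peak_rps * reserve_factor


PEAK_RPS = 300_000          # from the assumptions in this step
PEAK_TO_AVG = 3             # hypothetical ratio
RESERVE_FACTOR = 1.5        # hypothetical headroom

print(average_rps(PEAK_RPS, PEAK_TO_AVG))         # average load to expect off-peak
print(provisioned_rps(PEAK_RPS, RESERVE_FACTOR))  # capacity target including headroom
```

Capacity is sized from the peak, not the average; the ratio mainly tells you how much idle capacity you pay for off-peak and whether autoscaling is worth the complexity.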

Step 3: High-Level Design¶

Now place the big boxes. Your blueprint should fit in one glance.

+--------+   +-------------+   +----------------+   +--------------+
| Client |-->| Rec Gateway |-->| Candidate Gen  |-->| Ranker       |
+--------+   +-------------+   +----------------+   +--------------+
                               |         |                 |
                               v         v                 v
                        +-----------+ +-----------+  +---------------+
                        | Two-Tower | | Rules /   |  | Re-ranker     |
                        | Retrieval | | Filters   |  | Fresh Signals |
                        +-----------+ +-----------+  +---------------+
                               |         |                 |
                               v         v                 v
                        +---------------------------------------------+
                        | Feature Store + Embedding Store + Event Log |
                        +---------------------------------------------+

Main components¶

Recommendation gateway authenticates requests and chooses the product surface.
Candidate generation mixes collaborative, content-based, and popularity sources.
Two-tower retrieval fetches nearest items using user and item embeddings.
Rules engine applies eligibility, policy, and business constraints.
Ranker scores candidates with richer features and context.
Re-ranker injects fresh actions, diversity, and session intent.
Feature store serves user, item, and contextual features online.
Event log captures clicks, skips, dwell time, and purchases for retraining.

Request path¶

Client asks for recommendations for a given surface.
Gateway fetches the latest user context and request metadata.
Candidate generators contribute overlapping item pools.
Two-tower retrieval adds semantically similar items from embedding search.
Rules engine removes blocked, out-of-stock, or repeated items.
Ranker scores the merged pool using rich features.
Re-ranker boosts freshness, diversity, and last-minute intent signals.
Top results return while logs stream back for learning.
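The request path above can be sketched as a small pipeline. This is an illustration under stated assumptions, not a production design: the function name `recommend` and its callable parameters (`sources`, `is_eligible`, `score`, `rerank`) are hypothetical stand-ins for the candidate generators, rules engine, ranker, and re-ranker.

```python
from typing import Callable, Iterable, List


def recommend(
    user_id: str,
    sources: Iterable[Callable[[str], List[str]]],
    is_eligible: Callable[[str], bool],
    score: Callable[[str, str], float],
    rerank: Callable[[List[str]], List[str]],
    k: int,
) -> List[str]:
    # 1. Merge overlapping candidate pools, de-duplicating while keeping first-seen order.
    merged = list(dict.fromkeys(
        item for source in sources for item in source(user_id)
    ))
    # 2. Rules engine: drop blocked, out-of-stock, or repeated items.
    eligible = [item for item in merged if is_eligible(item)]
    # 3. Ranker: score the merged pool, highest score first.
    ranked = sorted(eligible, key=lambda item: score(user_id, item), reverse=True)
    # 4. Re-ranker: last-minute adjustments, then cut to the page size.
    return rerank(ranked)[:k]


# Toy usage with hand-written data (all values are made up for illustration).
collaborative = lambda user: ["a", "b", "c"]
popular = lambda user: ["c", "d", "e"]
blocked = {"b"}
scores = {"a": 0.2, "c": 0.9, "d": 0.5, "e": 0.1}

page = recommend(
    "user-1",
    sources=[collaborative, popular],
    is_eligible=lambda item: item not in blocked,
    score=lambda user, item: scores[item],
    rerank=lambda items: items,  # identity re-ranker for the sketch
    k=3,
)
print(page)
```

Logging is left out of the sketch; in the design it happens asynchronously after the response is returned.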

Design narration¶

Start with ingress, then routing, then state, then async work.
Separate control plane decisions from data plane traffic early.
Show where metadata lives and where heavy payloads travel.
Mark caches, queues, and databases with their exact job.
Point out one synchronous dependency you may later relax.
Pause and let the interviewer choose the next zoom-in area.

Step 4: Deep Dive¶

Pick two parts that actually matter. Depth without structure becomes noise. See.

Component A — Embeddings and two-tower retrieval¶

Train separate user and item towers that project both sides into one space.
Use recent behaviour, profile features, and context as user tower inputs.
Use metadata, text, media, and historical performance as item tower inputs.
Precompute item embeddings offline and update user embeddings more often.
Store item embeddings in an ANN index for fast top-K retrieval against the user vector.
Keep model versioning explicit so index and scorer stay compatible.
Cold-start items can lean on content features before interaction history exists.
Cold-start users can borrow cohort signals and trending items.
Monitor embedding drift because stale vectors quietly reduce relevance.
Refresh indexes incrementally so freshness does not require full rebuilds daily.
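The retrieval step and the versioning point can be sketched as follows. Two loud assumptions: the vectors are hand-written rather than produced by trained towers, and exact brute-force dot-product search stands in for a real ANN index. The class name `ItemIndex` and its fields are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ItemIndex:
    """Item embeddings tagged with the model version that produced them."""
    model_version: str
    vectors: Dict[str, List[float]]

    def top_k(self, user_vector: List[float], user_model_version: str, k: int) -> List[str]:
        # Refuse to compare vectors from different model versions:
        # they do not live in the same embedding space.
        if user_model_version != self.model_version:
            raise ValueError("user tower and item index come from different model versions")

        def dot(a: List[float], b: List[float]) -> float:
            return sum(x * y for x, y in zip(a, b))

        # Brute-force scoring; an ANN index would approximate this at scale.
        scored = ((dot(user_vector, vec), item_id) for item_id, vec in self.vectors.items())
        return [item_id for _, item_id in heapq.nlargest(k, scored)]


index = ItemIndex(
    model_version="v1",
    vectors={"shoes": [1.0, 0.0], "boots": [0.9, 0.1], "laptop": [0.0, 1.0]},
)
print(index.top_k([1.0, 0.2], user_model_version="v1", k=2))
```

Making the version check a hard failure is one option; a gentler alternative is to fall back to a popularity source while the index and user tower are out of sync.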

Component B — Real-time re-ranking and feedback loop¶

Use session clicks, skips, and dwell time to detect immediate intent changes.
Apply diversity rules so the page is not ten near-identical items.
Add business constraints like sponsored slots or safety rules late in the stack.
Keep re-ranking lightweight because it lives inside tight latency budgets.
Feed online interactions back to event streams within seconds.
Materialise short-lived session features in fast online storage.
Do not let one recent click blindly erase long-term preferences.
Measure freshness lift against stability loss when tuning re-ranker weight.
Separate exploration traffic so learning improves without harming everyone.
Expose explanation tags for debugging and stakeholder trust.
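A lightweight re-ranker along these lines might look like the sketch below. It assumes a bounded additive boost for categories seen in the current session and a simple per-category cap as the diversity rule; both choices, the function name `rerank`, and the parameter defaults are illustrative, not the only way to do it.

```python
from typing import Dict, List, Set, Tuple


def rerank(
    ranked: List[Tuple[str, str, float]],  # (item_id, category, base_score)
    session_categories: Set[str],
    boost: float = 0.25,
    max_per_category: int = 2,
    k: int = 10,
) -> List[str]:
    # Bounded boost: session intent nudges the order but cannot swamp
    # a large gap in the long-term base score.
    adjusted = sorted(
        (
            (item_id, category, score + (boost if category in session_categories else 0.0))
            for item_id, category, score in ranked
        ),
        key=lambda entry: entry[2],
        reverse=True,
    )
    picked: List[str] = []
    per_category: Dict[str, int] = {}
    for item_id, category, _ in adjusted:
        # Diversity rule: skip items once their category has filled its quota.
        if per_category.get(category, 0) >= max_per_category:
            continue
        picked.append(item_id)
        per_category[category] = per_category.get(category, 0) + 1
        if len(picked) == k:
            break
    return picked


ranked = [
    ("s1", "shoes", 0.9),
    ("s2", "shoes", 0.8),
    ("s3", "shoes", 0.7),
    ("b1", "books", 0.6),
    ("t1", "tents", 0.4),
]
print(rerank(ranked, session_categories={"tents"}, k=4))
```

In the toy data the third shoe item is dropped by the cap, and the tent moves ahead of the book because the user just showed interest in tents. The work is one sort and one pass, which is what keeps it inside a tight latency budget.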

Deep-dive cue¶

Keep reasoning aloud clean while you zoom in.
Explain data model, hot path, and one ugly edge case.
Tie each deep dive back to a requirement you already named.
If numbers change the design, say that directly.
If one choice is uncertain, park it as research, not panic.
Return to the overall system before you get lost in detail.

Step 5: Tradeoffs & Failure Modes¶

Now show judgment. Interviewers hire the tradeoff thinker, not the diagram artist.

Key tradeoffs¶

Collaborative filtering captures behaviour patterns, but struggles with cold start.
Content-based methods handle new items, but may overfit to obvious similarity.
Two-tower retrieval scales well, but exact ranking quality depends on later models.
Heavy re-ranking improves relevance, but burns latency budget quickly.
More diversity helps discovery, but can reduce short-term CTR.
Fast feedback loops improve freshness, but increase online system complexity.
One global model is simpler, but different surfaces often need distinct objectives.
Strict business rules protect policy, but can hide genuinely relevant items.

Failure modes to discuss¶

ANN index staleness can make new items effectively invisible.
Feature freshness lag can make the ranker score on yesterday's behaviour after the user's intent changed today.
Feedback loops can over-amplify already popular items and reduce coverage.
Broken filters can recommend blocked or unavailable items.
Cold-start users can receive bland lists if no fallback strategy exists.
Online and offline feature skew can silently corrupt model expectations.
Re-ranker bugs can destroy diversity or flood one merchant unfairly.
Bad experiment design can mistake short-term CTR movement for long-term quality, or dismiss a real long-term effect as noise.
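Two of these failure modes are cheap to watch for. The sketch below is an assumption-laden illustration: `catalog_coverage` tracks how much of the catalog actually gets served (a falling value hints at a popularity feedback loop), and `feature_skew_rate` compares online and offline values of one feature for the same keys. Both function names and the tolerance are hypothetical.

```python
from typing import Dict, Iterable, List


def catalog_coverage(served_lists: Iterable[List[str]], catalog_size: int) -> float:
    """Fraction of the catalog that appeared in at least one served list."""
    distinct = {item for served in served_lists for item in served}
    return len(distinct) / catalog_size


def feature_skew_rate(
    online: Dict[str, float],
    offline: Dict[str, float],
    tolerance: float = 1e-6,
) -> float:
    """Share of keys present in both stores whose values differ beyond the tolerance."""
    shared = online.keys() & offline.keys()
    if not shared:
        return 0.0
    mismatched = sum(1 for key in shared if abs(online[key] - offline[key]) > tolerance)
    return mismatched / len(shared)


print(catalog_coverage([["a", "b"], ["a", "c"]], catalog_size=10))
print(feature_skew_rate({"u1": 0.5, "u2": 1.0}, {"u1": 0.5, "u2": 2.0}))
```

Both are operational metrics to alert on, which is the kind of signal worth naming when you close the answer.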

Close the answer strongly¶

Say what breaks first under sudden load and how you contain it.
Compare the chosen design against one simpler alternative.
Mention operational metrics, not only code-level correctness.
Admit where future scale may require redesign. Honest and sharp.
Offer a phased rollout plan if the company is early-stage.
Finish with latency, reliability, and cost in one sentence.

Interview Q&A¶

Q1. Why not just use collaborative filtering and finish?¶

A. Because cold start, content understanding, and real-time context still matter. Modern systems blend signals instead of betting on one technique only.
Common wrong answer to avoid: Collaborative filtering solves recommendations by itself.

Q2. What is two-tower retrieval buying us?¶

A. It gives fast semantic candidate generation over huge catalogs. Then a richer ranker can spend compute on a much smaller set.
Common wrong answer to avoid: Two-tower models replace the need for ranking entirely.

Q3. Why add real-time re-ranking after the main ranker?¶

A. Because session intent changes faster than most heavy models can update. The re-ranker injects freshness, diversity, and rules late but cheaply.
Common wrong answer to avoid: Real-time re-ranking is only for ads, not recommendations.

Q4. How do you handle new items with no interactions?¶

A. Use content features, supplier metadata, and exploration allocation. Then promote carefully until feedback arrives and the item earns placement.
Common wrong answer to avoid: Hide new items until enough interaction data appears.
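The exploration allocation in the Q4 answer can be sketched as reserving a fixed number of page slots for new items. This is one simple, deterministic option under stated assumptions: new items arrive already ordered by a content-based score, and the function name `fill_slots` and the slot count are hypothetical.

```python
from typing import List


def fill_slots(
    ranked_established: List[str],
    new_items_by_content_score: List[str],
    k: int,
    explore_slots: int,
) -> List[str]:
    # Reserve a small, fixed share of the page for items with no interaction history.
    explore = new_items_by_content_score[:explore_slots]
    # Fill the remaining slots from the normal ranking, avoiding duplicates.
    exploit = [item for item in ranked_established if item not in explore]
    return exploit[: k - len(explore)] + explore


print(fill_slots(["a", "b", "c", "d"], ["n1", "n2"], k=4, explore_slots=1))
```

Placing exploration items at the end of the page limits the relevance cost; as feedback arrives, a new item either earns its way into the main ranking or loses its reserved slot.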

Apply now (5 min)¶

Run the full choreography with a two-minute timer per step.
Redesign this engine for a catalog that changes every hour.
List three features for the two-tower model and three for the ranker.
Explain one diversity rule that helps users without tanking quality.
Say how you would detect embedding drift before business metrics crash.
Name one metric for retrieval and one for final ranking.
Choose one simplification for a startup with only one recommendation surface.
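For the metric exercise above, one common pairing is recall@K for retrieval and NDCG@K for final ranking. The sketch below assumes binary relevance (an item is either relevant or not); the function names are illustrative, and other pairings are equally defensible.

```python
import math
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of relevant items that appear in the top-k retrieved candidates."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: rewards placing relevant items near the top."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


print(recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=3))
print(ndcg_at_k(["x", "a"], {"a"}, k=2))
```

Recall@K asks whether retrieval surfaced the right items at all; NDCG@K asks whether the ranker put them in the right order. Keeping them separate tells you which stage to fix.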

Bridge. Recommendations live. One more — real-time analytics. → 12