12. Design Real-Time Analytics¶

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: You are on the stage, pitching city council with a blueprint. Follow the choreography, show reasoning aloud, and admit every honest gap.

Step 1: Requirements & Constraints¶

Start wide. Then narrow. See. Scope first, technology later.

Functional requirements¶

Ingest product and business events continuously from many producers.
Aggregate metrics in near real time for dashboards and alerts.
Support slice-and-dice queries by dimensions, filters, and time windows.
Provide approximate distinct counts for massive cardinality metrics.
Retain raw or rolled-up data for historical trend analysis.
Serve time-series visualisation with predictable latency.

Non-functional requirements¶

End-to-end freshness should stay within seconds for most dashboards.
Query latency should remain low even under heavy concurrency.
The system must handle out-of-order and late-arriving events.
Storage cost must stay controlled through rollups and tiering.
Accuracy should be explicit: exact where needed, approximate where acceptable.
Backfills must not break current online dashboards.

Clarifying questions to ask¶

What event schema governance exists across producers today?
Which metrics require exact counts versus approximate counts?
How late can events arrive and still need correction?
What dashboard freshness is acceptable for executives versus operators?
Do analysts need ad hoc SQL or only predefined dashboards?
How long must raw events remain queryable before rollup?

What to say on the whiteboard¶

State the user action, core data, and critical latency target.
Split must-have features from nice-to-have features immediately.
Name one honest gap before locking assumptions. Simple, no?
Ask what failure hurts most: money, freshness, or user trust.
Confirm whether single-region launch is acceptable for round one.
Summarise the scope before you move to numbers. Now watch.

Step 2: Scale Estimation¶

Do rough math. Clean math beats fancy math. So what to do? Pick clear assumptions and keep them verbal.

Assumptions¶

Assume 5 million events per second at peak.
Assume average event size is 500 bytes compressed on the wire.
Assume each event carries 10 dimensions and 5 numeric measures.
Assume dashboards refresh every 5 seconds for active users.
Assume raw event retention is 30 days in cheap storage.
Assume aggregate tables keep 1-minute, 1-hour, and 1-day rollups.

Back-of-envelope math¶

5M events per second at 500 bytes is about 2.5 GB per second ingest.
That becomes roughly 216 TB raw per day before replication.
One-minute rollups shrink query volume dramatically for dashboard reads.
Distinct-user metrics with exact sets can explode memory at this scale.
HyperLogLog sketches cut cardinality storage to kilobytes per group.
If 100 thousand dashboards refresh every 5 seconds, query fan-out is intense.
Pre-aggregated cubes reduce scan cost, but dimension explosion must be controlled.
Late-event correction requires mutable aggregates or compensating updates.

Interview cue¶

Say the biggest number first, then derive storage and bandwidth.
Round aggressively. Nobody wants calculator theatre on the board.
Mention peak-to-average ratio and why it changes capacity planning.
Keep one reserve factor for retries, bursts, and replays.
Remember the stage is interactive, so sanity-check assumptions aloud.
End with the two numbers that drive architecture choice.

Step 3: High-Level Design¶

Now place the big boxes. Your blueprint should fit in one glance.

+---------+   +-----------+   +----------------+   +--------------+
| Sources |-->| Ingestion |-->| Stream Process |-->| OLAP Store   |
+---------+   +-----------+   +----------------+   +--------------+
                              |         |                 |
                              v         v                 v
                        +-----------+ +-----------+  +--------------+
                        | HLL /     | | Time      |  | Query Layer  |
                        | Sketches  | | Series DB |  | + Cache      |
                        +-----------+ +-----------+  +--------------+
                              |         |                 |
                              v         v                 v
                        +---------------------------------------------+
                        | Object Storage for Raw Events and Backfills |
                        +---------------------------------------------+

Main components¶

Ingestion layer validates schema and partitions events by key and time.
Stream processor computes rolling aggregates and windowed metrics.
Sketch service maintains approximate cardinality structures like HyperLogLog.
Time-series store serves metrics optimised for chronological queries.
OLAP store holds columnar aggregates for slice-and-dice exploration.
Query layer routes requests to the right store and merges results.
Cache absorbs repeated dashboard reads for the hottest panels.
Object storage keeps raw history for replay, backfill, and audits.

Request path¶

Producers send events continuously into the ingestion layer.
Validated events are partitioned and appended to the streaming backbone.
Stream jobs compute sums, counts, rates, and grouped windows.
Distinct counts update HLL sketches instead of raw sets at scale.
Recent metrics land in time-series storage for fast trend graphs.
Broader dimensional aggregates land in the OLAP store.
Query layer serves dashboards by selecting rollup level automatically.
Backfill jobs replay raw events when logic or schema changes.

Design narration¶

Start with ingress, then routing, then state, then async work.
Separate control plane decisions from data plane traffic early.
Show where metadata lives and where heavy payloads travel.
Mark caches, queues, and databases with their exact job.
Point out one synchronous dependency you may later relax.
Pause and let the interviewer choose the next zoom-in area.

Step 4: Deep Dive¶

Pick two parts that actually matter. Depth without structure becomes noise. See.

Component A — Streaming aggregation and late data handling¶

Use event time, not processing time, for business-correct windows.
Define watermarks so the system knows when to close windows approximately.
Allow late events within a bounded delay and update aggregates accordingly.
Emit correction records when very late data changes past windows.
Separate exactly-once claims from what storage and sinks truly support.
Keep partition keys balanced so one hot tenant does not dominate.
Checkpoint processor state frequently enough for fast recovery.
Use dead-letter handling for malformed events instead of silent drops.
Track freshness, lag, and drop rate as first-class product metrics.
Backfills should reuse the same aggregation logic where possible.

Component B — OLAP cubes, time-series, and approximate counting¶

Build rollups at multiple granularities to cap scan cost.
Keep high-cardinality dimensions out of every precomputed cube.
Use columnar storage and compression for analytical reads.
Route dense trend charts to the time-series store, not the OLAP engine.
Use HyperLogLog for approximate unique users when exactness is unnecessary.
Document error bounds so product teams know what the sketch means.
Precompute common dashboard panels, but leave room for ad hoc drilldowns.
Cache final query results for very hot dashboards with short TTLs.
Tier old data to cheaper storage and query it less interactively.
Align retention policy with finance, security, and compliance needs.

Deep-dive cue¶

Keep reasoning aloud clean while you zoom in.
Explain data model, hot path, and one ugly edge case.
Tie each deep dive back to a requirement you already named.
If numbers change the design, say that directly.
If one choice is uncertain, park it as research, not panic.
Return to the overall system before you get lost in detail.

Step 5: Tradeoffs & Failure Modes¶

Now show judgment. Interviewers hire the tradeoff thinker, not the diagram artist.

Key tradeoffs¶

Precomputing many cubes improves latency, but explodes storage for high dimensions.
Approximate counting is cheap, but some finance metrics still need exact counts.
Long lateness windows improve correctness, but delay stable dashboards.
One universal store sounds tidy, but specialised stores serve different access patterns better.
Aggressive caching helps dashboards, but can hide freshness regressions.
Raw retention helps backfills, but storage bills rise sharply.
Complex query flexibility empowers analysts, but can threaten cluster stability.
Strict schema governance improves trust, but slows producer teams initially.

Failure modes to discuss¶

Schema drift can silently corrupt aggregates and mislead executives.
Hot partitions can delay windows even when total cluster load seems fine.
Late data beyond watermark can create confusing metric corrections.
A bad rollup job can poison many dashboards at once.
HLL misuse can make exact-looking numbers that are only approximate.
Backfill traffic can starve live processing if resource pools are shared badly.
Query cache staleness can mask streaming lag during incidents.
Dimension explosion can make OLAP storage and indexing unaffordable.

Close the answer strongly¶

Say what breaks first under sudden load and how you contain it.
Compare the chosen design against one simpler alternative.
Mention operational metrics, not only code-level correctness.
Admit where future scale may require redesign. Honest and sharp.
Offer a phased rollout plan if the company is early-stage.
Finish with latency, reliability, and cost in one sentence.

Interview Q&A¶

Q1. Why not store raw events and compute everything at query time?¶

A. Because dashboard concurrency and latency targets would become painful quickly. A. Pre-aggregation turns repeated heavy scans into small predictable reads. Common wrong answer to avoid: Modern hardware makes pre-aggregation unnecessary.

Q2. When is HyperLogLog the right answer?¶

A. Use it for large distinct counts where tiny error is acceptable. A. Do not use it when finance or billing requires exact uniqueness. Common wrong answer to avoid: Use approximate counting for all distinct metrics.

Q3. Why separate OLAP and time-series stores?¶

A. Because trend graphs and dimensional drilldowns stress storage differently. A. Specialisation keeps both latency and cost under control. Common wrong answer to avoid: One generic database always handles both equally well.

Q4. How do you explain late data to an interviewer?¶

A. Talk about event time, watermarks, bounded lateness, and corrections. A. Then connect those mechanics to dashboard trust and user expectations. Common wrong answer to avoid: Ignore late events because most users will not notice.

Apply now (5 min)¶

Run the full choreography with a two-minute timer per step.
Redesign this system for billing where exact counts matter more.
State one metric that should stay approximate and one that must stay exact.
Explain why a 1-minute rollup and a 1-hour rollup both matter.
List three alerts for freshness and three for query health.
Say how you would isolate backfills from live dashboard traffic.
Choose one simplification for a small startup with ten dashboards only.

Bridge. All case studies done. Now — how are you actually scored? → 13