01. Relational Data Modeling — Put every fact on the right labeled shelf¶
~14 min read. Good schemas remove confusion before scale exposes it brutally.
Built on the ELI5 in 00-eli5.md. The shelf — organized shelves with labeled sections — now becomes tables, rows, and relationships.
1) Tables, rows, and columns are the first tidy shelves¶
See.
A relational database starts by grouping similar facts on one shelf.
Each table stores one kind of thing, not ten unrelated things.
Rows represent individual records.
Columns represent attributes that every row in that table should share.
That sounds simple.
Bad design starts exactly when teams ignore this simplicity.
Suppose we are modeling a library system.
We have 50,000 books, 80,000 members, and 4,000 daily loans.
Three clean tables already appear.
books stores book facts.
members stores member facts.
loans stores borrowing events.
┌───────────┐ ┌───────────┐ ┌──────────┐
│ books │ │ members │ │ loans │
├───────────┤ ├───────────┤ ├──────────┤
│ book_id │ │ member_id │ │ loan_id │
│ title │ │ name │ │ book_id │
│ author_id │ │ city │ │ member_id│
│ section │ │ joined_on │ │ due_date │
└───────────┘ └───────────┘ └──────────┘
library_records table, updates become painful.
You repeat member names on every loan.
You repeat book titles on every loan.
You invite inconsistency on every correction.
Worked example.
Assume one popular member borrows 60 books in one year.
If the member changes city once, a bad combined table needs 60 updates.
A normalized members table needs exactly one update.
That is the first practical win.
A row should answer one business question cleanly.
A column should mean the same thing for every row.
Null-heavy mystery columns are warning signs.
So what to do?
Start by naming entities, then actions between entities.
Entities become tables.
Actions often become separate tables too.
Simple, no?
2) Normalization removes repeated facts before they start arguing¶
Normalization sounds academic. Actually, it is just disciplined shelf organization. First normal form says one cell, one value. No comma-separated lists hiding inside a column. Second normal form says every non-key field should depend on the full key. Third normal form says non-key fields should not depend on other non-key fields. Let us make that real. Bad table:
Looks convenient today. Looks expensive tomorrow. If one author wrote 120 books,author_country repeats 120 times.
If history section moves from floor 2 to floor 3, every matching row changes.
Now split it.
authors(author_id, author_name, author_country)
sections(section_id, section_name, section_floor)
books(book_id, title, author_id, section_id)
books knows books.
authors knows authors.
sections knows section placement.
No table pretends to know everything.
One more trap.
Do not normalize blindly beyond business sense.
If a value never changes and never deserves separate lifecycle, keep it nearby.
Normalization is about removing harmful repetition, not performing ritual mathematics.
3) Foreign keys and ER diagrams make relationships explicit¶
Once tables exist, relationships must become enforceable.
Otherwise, your tidy shelves still lose books.
A foreign key says one table references a valid row elsewhere.
loans.book_id must reference books.book_id.
loans.member_id must reference members.member_id.
Without that, orphan rows slip in quietly.
book_authors(book_id, author_id).
That junction table is not extra ceremony.
It is the truthful model.
Primary keys identify rows.
Foreign keys connect trustworthy rows.
ER diagrams communicate the map before anyone writes queries.
Good modeling reduces later arguments between backend, analytics, and product teams.
4) Denormalize only when a repeated read is worth the repeated write¶
Now the common interview twist arrives.
If normalization is good, why denormalize at all?
Because reads have latency budgets.
Joins cost work.
Hot screens cannot always assemble five tables repeatedly.
Suppose the home dashboard shows currently borrowed books per member.
Every page view needs member name, active loan count, and latest due date.
Traffic is 2,000 dashboard reads per second.
Writes are only 30 loan updates per second.
You may choose a summary table.
member_loan_summary(member_id, active_count, latest_due_date).
That repeats derived data deliberately.
Worked example.
Assume the join-based query touches 6,000 pages per second at peak.
Average latency becomes 180 milliseconds.
After precomputed summary storage, latency drops to 18 milliseconds.
Write cost rises from 30 to 60 updates per second.
That trade can be excellent.
But denormalization has a price.
Now two places must stay in sync.
If update logic misses one path, dashboards lie.
So what to do?
Denormalize for stable, high-frequency read shapes.
Do not denormalize because joins feel emotionally scary.
Typical reasons are dashboards, search documents, read models, and exports.
Bad reasons are laziness and unclear schema thinking.
Here is the decision sketch.
normalized truth ──write once──→ reliable updates
normalized truth ──join often──→ slower hot reads
summary table ──read fast──→ extra write coordination
Where this lives in the wild¶
- Stripe billing engineer models invoices, customers, subscriptions, and payments with strict relational links because money facts must reconcile cleanly.
- Swiggy marketplace engineer keeps restaurants, menus, orders, and deliveries in related tables where foreign keys protect operational correctness.
- Amazon catalog data engineer normalizes sellers, products, and category mappings so updates do not spray duplicate metadata everywhere.
- Salesforce platform engineer relies on ER modeling because accounts, contacts, opportunities, and activities have dense business relationships.
- Uber payments engineer denormalizes read models only after measuring repeated support and dashboard queries against transactional truth.
Pause and recall¶
- Why should one table usually represent one kind of business fact?
- What practical problem does 3NF prevent during routine updates?
- When does a junction table appear, and why is it honest?
- What metric should justify denormalization: fashion or measured read pressure?
Interview Q&A¶
Q: Why normalize first instead of starting with one wide convenience table? A: Because repeated facts create inconsistent updates, unclear ownership, and painful query semantics. Normalization gives each business fact one reliable home before optimization begins. Common wrong answer to avoid: “Because normalization always makes queries faster.” Q: Why use foreign keys if application code already checks IDs? A: Because database constraints protect truth even during bugs, scripts, retries, and partial deployments. They stop orphan records at the storage boundary. Common wrong answer to avoid: “Foreign keys are only for documentation.” Q: When would you denormalize a relational model? A: When a repeated read path is hot, stable, and measurably expensive to reconstruct through joins. Then you keep normalized truth and add a deliberate projection. Common wrong answer to avoid: “Denormalize whenever joins exist.” Q: How do ER diagrams help beyond interviews? A: They reveal cardinality, ownership, and lifecycle boundaries before code spreads assumptions everywhere. Teams debug faster when the relationship map is visible. Common wrong answer to avoid: “ER diagrams are just for database administrators.”
Apply now (5 min)¶
Exercise:
Model a simple system with users, projects, and tasks.
Identify one one-to-many relationship and one many-to-many relationship.
Then decide which foreign keys belong where.
Sketch from memory:
Draw four boxes and one junction table.
Label each primary key and each foreign key arrow clearly.
Then circle one field you would denormalize only after measuring reads.
Bridge. Relational shelves are disciplined, but not every record looks identical. Some workloads want flexible boxes on the same shelf. → 02-nosql-document-keyvalue.md