06. Transactions and Isolation — Keeping Data Consistent Under Pressure¶

~14 min read. Two users book the last ticket simultaneously. Who wins? Transactions decide.

Built on the ELI5 in 00-eli5.md. The overdue list — items that must update atomically — shows exactly why isolation matters when multiple readers and writers collide.

1. ACID: The Four Promises a Database Makes¶

Every serious database makes four guarantees per transaction. Miss one and chaos follows.

┌─────────────────────────────────────────────────────────┐
│  A — Atomicity    All steps commit or all steps roll back│
│  C — Consistency  Data moves from one valid state to     │
│                   another; no constraint is ever broken  │
│  I — Isolation    Concurrent txns behave as if serial    │
│  D — Durability   Committed data survives crashes        │
└─────────────────────────────────────────────────────────┘

Atomicity is the simplest to reason about. Debit account A, credit account B. Both happen or neither happens. No partial money creation.

Consistency is partially the database's job (foreign keys, CHECK constraints) and partially the application's job (business rules).

Isolation is where most interview questions live. We cover it deeply in the next section.

Durability means the database wrote to stable storage before confirming your COMMIT. WAL (write-ahead log) is the usual mechanism.

2. Isolation Levels — The Dial Between Safety and Speed¶

Stronger isolation costs more performance. Every database lets you tune the dial.

Weakest ──────────────────────────────────────── Strongest
   │                                                  │
   ▼                                                  ▼
READ           READ          REPEATABLE        SERIALIZABLE
UNCOMMITTED  COMMITTED        READ
   │              │              │                    │
   │  dirty read  │   non-repeat │  phantom read      │
   │  possible    │   read poss. │  possible          │  none

Read Uncommitted¶

You can read rows that another transaction has modified but not yet committed. That unconfirmed data is called a dirty read. Imagine the overdue list showing a book as returned before the librarian finalises the return transaction. Dangerous.

Read Committed¶

You only see committed rows. Dirty reads are gone. But between two reads in your transaction, another transaction can commit a change. Your second read sees different data — a non-repeatable read.

Repeatable Read¶

Your transaction sees the same row values from start to finish. Non-repeatable reads are prevented. But new rows inserted by another transaction can appear in a range scan — a phantom read. MySQL InnoDB's default level; it uses gap locks to block phantoms too.

Serializable¶

Full isolation. Transactions behave as if run one at a time. Slowest, but safest. Use for financial totals, inventory counts, the overdue list audit report.

3. Anomalies in Detail — Dirty, Phantom, Write Skew¶

Dirty Read Example¶

Txn A: UPDATE books SET status='returned' WHERE id=42;
Txn B: SELECT status FROM books WHERE id=42;  -- sees 'returned'
Txn A: ROLLBACK;
-- Txn B acted on data that never existed

Phantom Read Example¶

Txn A: SELECT COUNT(*) FROM overdue WHERE patron_id=7;  -- returns 3
Txn B: INSERT INTO overdue (patron_id, book_id) VALUES (7, 99);
Txn B: COMMIT;
Txn A: SELECT COUNT(*) FROM overdue WHERE patron_id=7;  -- returns 4 !!
-- Same query, different result set — the phantom row appeared

Write Skew — The Sneaky One¶

Write skew happens when two transactions each read the same data, make decisions based on it, then write to different rows. Neither write alone violates a constraint, but together they do.

Constraint: at least one doctor must be on call.
Doctor 1 txn: SELECT count(*) FROM on_call;  -- 2
Doctor 2 txn: SELECT count(*) FROM on_call;  -- 2
Doctor 1: DELETE FROM on_call WHERE id=1;  COMMIT;
Doctor 2: DELETE FROM on_call WHERE id=2;  COMMIT;
-- Zero doctors on call. Constraint violated. Write skew.

Prevention: SELECT FOR UPDATE or Serializable isolation.

4. MVCC — How Databases Avoid Locking Readers¶

Most modern databases (PostgreSQL, MySQL InnoDB, Oracle) use Multi-Version Concurrency Control. Instead of locking a row when you read it, the database keeps multiple versions of each row.

Row id=42 history:
┌───────────────────────────────────────────────────────┐
│  txn_id=100  status='borrowed'   (visible to old txns)│
│  txn_id=105  status='returned'   (visible to new txns)│
└───────────────────────────────────────────────────────┘

Each transaction sees a snapshot taken at its start time. Readers never block writers. Writers never block readers. Only writer-writer conflicts need locks.

Garbage collection periodically removes old row versions no transaction can still see. PostgreSQL calls this VACUUM. Neglecting it causes table bloat.

5. Choosing the Right Isolation Level in Practice¶

Use case                         Recommended level
─────────────────────────────────────────────────
Reporting / analytics reads      Read Committed (fast, acceptable)
Most OLTP (orders, payments)     Repeatable Read
Audit log, inventory deduction   Serializable
High-throughput counters         Optimistic locking + retry loop

Rule of thumb: start with Read Committed, move up only when you have a reproducible anomaly. Serializable on every query will hurt throughput at scale.

The overdue list batch job runs nightly. Use Serializable so the count is exact even while librarians are checking books in concurrently.

Where this lives in the wild¶

Stripe (Backend Engineer) — Payment transactions use Serializable isolation for ledger entries. An interview question asked a candidate to spot the write-skew risk in a double-spend scenario.

Airbnb (DB Reliability) — Booking confirmation uses SELECT FOR UPDATE on the calendar slot row, preventing double-booking under Repeatable Read.

Netflix (Data Platform) — Nightly billing aggregation queries use Read Committed snapshots via MVCC. Speed matters more than cross-row consistency here.

Amazon DynamoDB (Distributed Systems) — Offers transactions across items using a 2-phase commit protocol; isolation is Serializable but limited to 25 items per transaction.

GitHub (Storage Engineering) — MySQL Repeatable Read is the default. Engineers explicitly bump to Serializable for repository rename operations that touch multiple tables atomically.

Pause and recall¶

What is the difference between a dirty read and a phantom read? Give a one-sentence example of each.
Write skew is not prevented by Repeatable Read. Why not? Draw a timeline showing two transactions.
MVCC allows readers to not block writers. What is the cost of this approach over time?
You are building the nightly overdue list report. Which isolation level do you pick and why?

Interview Q&A¶

Q: What does ACID stand for, and which property is hardest to implement in a distributed system?

Atomicity, Consistency, Isolation, Durability. Durability is hard distributed because you need all replicas to confirm the write before acknowledging commit — that conflicts with low latency goals.

Common wrong answer to avoid: Saying "Consistency is hardest" without distinguishing database-level consistency (constraints) from CAP-theorem consistency (linearizability). They are different concepts entirely.

Q: Explain the difference between Repeatable Read and Serializable isolation.

Repeatable Read guarantees that re-reading a row returns the same value, but range queries can still return new rows inserted by concurrent transactions (phantom reads). Serializable prevents phantom reads too by logically serializing all transactions.

Common wrong answer to avoid: Claiming Repeatable Read prevents phantoms universally. MySQL InnoDB uses gap locks to block phantoms even at Repeatable Read, but that is an implementation detail, not the SQL standard definition. Mention the distinction.

Q: What is MVCC and why do databases use it?

MVCC stores multiple timestamped versions of each row. Readers see a snapshot from their transaction start time. This eliminates reader-writer lock contention, improving throughput significantly on read-heavy workloads.

Common wrong answer to avoid: Saying MVCC eliminates all locking. Writer-writer conflicts still require locks or optimistic conflict detection. MVCC removes reader-writer contention only.

Q: How would you prevent write skew in a "one doctor on call" system?

Either use Serializable isolation (database ensures serial ordering) or use SELECT FOR UPDATE on the on-call table row at the start of each transaction to acquire an explicit lock before the decision.

Common wrong answer to avoid: Saying "use a unique constraint." Unique constraints prevent duplicate inserts but do not protect against the specific pattern where two deletions together violate a count-based invariant.

Apply now (5 min)¶

Exercise: You have a library database. Two patrons try to borrow the last copy of the same book simultaneously. Write the transaction (pseudocode or SQL) using SELECT FOR UPDATE to safely check availability and decrement the copy count. Then identify which anomaly you are preventing and at which isolation level it would occur without the lock.

Sketch from memory: Draw the four isolation levels on a spectrum. For each level, write the one anomaly it still allows. Then place MVCC on a separate axis showing "snapshot age" vs "lock overhead."

Bridge. You now know how a single database keeps data consistent. Next: what happens when you run multiple copies of that database? → 07-replication-strategies.md