01. MVCC, WAL, indexes — what Postgres actually does on every row¶

~18 min read. One innocent UPDATE orders SET status='paid' WHERE id=1234 looks like one disk write. It is not. By the end of this chapter you will see the five physical things Postgres really does, and why every senior interview eventually circles back to MVCC, WAL, and the index pick.

Builds on: 00-eli5.md.

The clerk-with-a-journal mental model is enough to believe Postgres works. It is not enough to debug it, or to answer the staff interview question "what happens when this UPDATE hits a row that already has 100 dead versions?" For that you need to look inside the page. So today we open one row, one WAL record, and one index entry, and we follow them all the way through.

One worked example threads this whole chapter. The query is:

UPDATE orders SET status='paid' WHERE id=1234;

A single statement. We will trace it through tuple creation, WAL flush, B-tree update, planner choice, and vacuum aftermath. Everything else hangs off this hook.

1) Tuples, pages, and the heap layout¶

A row in Postgres is called a tuple. Tuples live inside fixed-size pages, almost always 8 KB. Pages live inside files called heap segments, one segment per 1 GB chunk of a table. So the physical layout, from the inside out, is: column values → tuple → page → segment file → table.

Every tuple carries a small header. The fields that matter for this chapter are xmin (the transaction id that inserted this version) and xmax (the transaction id that deleted or superseded it, if any). There is also a tuple id, written ctid, which is just (page_number, offset_in_page) — the physical address of this version on disk.

ONE 8KB PAGE OF THE orders TABLE

┌────────────────────────────── page header (24 bytes) ──────────────────────────────┐
│ pd_lsn │ pd_checksum │ pd_lower │ pd_upper │ pd_special │ ...                       │
├──────────────────────────── item pointers (line ptrs) ─────────────────────────────┤
│ → off=8000  │ → off=7900 │ → off=7800 │ → off=free │ → off=free │ ...               │
├──────────────────────────────────── free space ────────────────────────────────────┤
│                                                                                    │
│                            (writes grow downward from here)                        │
│                                                                                    │
├────────────────────────── tuples (grow upward from page end) ──────────────────────┤
│ tuple at off=7800: xmin=421, xmax=0,  ctid=(0,3), id=1235, status='pending'        │
│ tuple at off=7900: xmin=420, xmax=0,  ctid=(0,2), id=1234, status='shipped'        │
│ tuple at off=8000: xmin=419, xmax=0,  ctid=(0,1), id=1233, status='paid'           │
└────────────────────────────────────────────────────────────────────────────────────┘

Notice three things. Item pointers grow downward from the top. Tuples grow upward from the bottom. In the middle is free space — the slot future inserts and updates will use. This layout is why a Postgres page can hold rows of varying sizes without rearrangement on every change.

Teacher voice. Postgres never rewrites a row in place when its size changes. It writes a new tuple and lets the old one die. This single design choice ripples into MVCC, vacuum, bloat, HOT updates — almost every operational concern you will ever debug.

For our running UPDATE, the row we care about is (0,2): order id 1234, currently status='shipped', inserted by transaction 420.

2) MVCC — multiple row versions, xmin/xmax, snapshot visibility¶

Now run our UPDATE. Transaction 500 issues UPDATE orders SET status='paid' WHERE id=1234. What happens physically?

Postgres does not edit tuple (0,2) in place. Instead it:

Writes a new tuple — call it (0,4) — with xmin=500, xmax=0, status='paid'.
Marks the old tuple (0,2) as superseded by setting its xmax=500.
Links the two with a pointer (t_ctid on the old version points to the new one), forming an update chain.

BEFORE UPDATE (tx 500 not yet committed)        AFTER UPDATE (still inside tx 500)
─────────────────────────────────────           ───────────────────────────────────

(0,2) xmin=420  xmax=0    status='shipped'      (0,2) xmin=420  xmax=500 status='shipped' ──┐
                                                                                            │
                                                (0,4) xmin=500  xmax=0   status='paid'   ◄──┘

Both versions exist on disk simultaneously. Every transaction asking "what is order 1234's status?" computes its snapshot — essentially "which transaction ids count as committed, from my point of view?" — and walks the update chain. A reader who started before tx 500 committed sees (0,2) and reads shipped. A reader who started after tx 500 commits sees (0,4) and reads paid. Neither blocks the other. Neither blocks the writer. This is MVCC.

Mini-FAQ. "If readers never block writers, what does Postgres lock?" It still takes a row-level lock on the tuple being updated to stop two concurrent UPDATEs from racing on the same row. It does not lock readers out. Two SELECTs against order 1234 during the UPDATE return immediately — they just see different versions depending on their snapshot.

The price of this design shows up later. Once tx 500 commits and no snapshot can still need (0,2), that tuple is dead — wasted space sitting in the page. VACUUM is the background process that walks pages, finds dead tuples whose xmax is below the oldest active snapshot, and marks their slots free. We will revisit vacuum properly in chapter 3, but the seed sits here: every UPDATE in Postgres is logically a delete + insert, and the delete must be cleaned up later. Write-heavy tables generate dead tuples faster than vacuum can clear them; this is bloat, and it is the most-cited Postgres pain point at every shop running serious load — Notion famously hit txid-wraparound-via-bloat on their block table before sharding, and Sentry built a vacuum-tuning playbook around their nodestore.

See. Updates are not edits. They are append-with-redirect. Hold this picture — every other section of this chapter feeds off it.

3) WAL — write-ahead log, fsync, redo on crash¶

Now ask the durability question. The UPDATE produced two changed tuples in memory (shared_buffers). What if the server crashes right now, before those pages are flushed to the heap files on disk?

Answer: Postgres survives because it wrote the change to the WAL before touching the data pages. WAL stands for write-ahead log — the rule is literally "log the intent first, modify the data later, and never let those two go out of order."

The flow for our UPDATE looks like this:

TIME ─────────────────────────────────────────────────────────────────────────►

  client                backend                shared_buffers          pg_wal/
   │                       │                        │                     │
   │ UPDATE orders ...  ─► │                        │                     │
   │                       │ build WAL record       │                     │
   │                       │ (insert+delete on heap)│                     │
   │                       ├──── append to ─────────┼────────────────────►│  (WAL buffer)
   │                       │ modify page in buffer ►│                     │
   │ COMMIT ──────────────►│                        │                     │
   │                       │ flush WAL up to LSN  ──┼────fsync()─────────►│  (disk durable)
   │                ◄──────│ "COMMIT OK"            │                     │
   │                       │                        │                     │
   │       (later, async)  │ checkpoint writes      │                     │
   │                       │ dirty pages back to ◄──┤                     │
   │                       │ heap on disk           │                     │

A WAL record is a compact description of "at LSN X, this is what changed on this page." LSN is the log sequence number — a monotonic 64-bit offset into the WAL stream. Every page on disk carries the LSN of the latest WAL record that affected it; on crash recovery, Postgres replays WAL forward from the last checkpoint, applying any record whose LSN is greater than the page's current LSN. The data files catch up to the log.

The fsync on COMMIT is the part that costs you. Until that fsync returns, the OS may still have your WAL bytes in the kernel page cache, and a power cut would lose them. Heroku's Postgres ships WAL to S3 every 60 seconds so customer databases can be restored to any second in the last week. GitLab streams WAL continuously from its primary to replicas and to cloud storage for point-in-time recovery on their 12 TiB cluster.

Teacher voice. Durability in Postgres is not "the row reached the table file." It is "the WAL record describing the row change reached durable storage." The table file lags behind, sometimes by minutes. The log is the truth.

WAL also feeds physical streaming replication. Replicas open a TCP connection to the primary, stream WAL bytes as they are generated, and replay them locally. A replica is, definitionally, a server that is some number of LSN bytes behind the primary. That lag is what you monitor in production.

4) B-tree indexes — when and how Postgres uses them¶

Our UPDATE has a WHERE id=1234 clause. How did Postgres find that one row out of, say, ten million?

Almost certainly via the primary key's B-tree index. A B-tree in Postgres is a balanced, sorted tree of pages. Internal pages hold separator keys and child pointers; leaf pages hold the actual indexed values paired with the ctid of the corresponding heap tuple. Lookups are O(log n) page reads — for 10M rows on a typical 8 KB page size, that is roughly 3–4 page reads to reach a leaf.

B-TREE INDEX ON orders(id) — simplified

                  ┌──────────────────┐
                  │ root: [500, 1500]│
                  └──┬──────┬───────┬┘
                     │      │       │
            ┌────────▼──┐ ┌─▼─────┐ ┌▼──────────┐
            │ [100,300] │ │ [800] │ │ [1800]    │     internal pages
            └─┬──┬──┬───┘ └─┬─┬───┘ └─┬─┬───────┘
              │  │  │       │ │       │ │
            ┌─▼┐┌▼┐┌▼┐    ┌─▼┐┌▼┐   ┌─▼┐┌▼┐
            │..││.││.│    │..││.│   │..││.│           leaf pages: (id, ctid)
            └──┘└─┘└─┘    └──┘└─┘   └──┘└─┘
            ... 1234 → (0,2) ...

For our UPDATE, Postgres descends to the leaf, finds 1234 → (0,2), fetches heap page 0, locates tuple (0,2), and locks it. After the new tuple (0,4) is written, the index has a choice. If the page containing (0,2) still has free space and no indexed column changed, Postgres performs a HOT update (Heap-Only Tuple) — the new tuple inherits the index entry of the old one via a chain pointer, and the B-tree is not touched. If id had changed, or the page was full, a new index entry 1234 → (0,4) would be inserted into the leaf. In our case status is not indexed, and HOT applies. One physical write to the heap, zero new index entries.

When the planner decides whether to use the index at all, the call is between an Index Scan (descend the tree, fetch matching heap tuples), an Index-Only Scan (answer from the index alone when the visibility map says the page is all-visible), a Bitmap Index Scan (build a bitmap of matching ctids, then visit heap pages in physical order), or a plain Seq Scan (read every page of the table). For our WHERE id=1234 predicate the cost model picks Index Scan trivially — one matching row out of ten million is a 0.00001% selectivity, the textbook case for B-tree.

Mini-FAQ. "Why would Postgres ever pick Seq Scan over an index?" Because for low-selectivity predicates — say, WHERE status='shipped' matching 40% of rows — visiting the heap once per match costs more random I/O than just reading the whole table sequentially. The crossover point is typically around 5–15% of the table. The planner makes this call using stats from ANALYZE.

One often-missed B-tree feature is INCLUDE. You can build CREATE INDEX ON orders(id) INCLUDE (status, amount) to fold extra columns into the leaf without making them part of the sort key. Index-Only Scans can then answer SELECT status, amount FROM orders WHERE id=1234 without touching the heap at all. Robinhood's brokerage team has written about exactly this kind of covering-index work as part of their pre-sharding optimisation pass.

5) GIN, GiST, BRIN, hash — when each wins¶

B-tree is the default and the right answer 80% of the time. The other 20% is where staff-level questions come from. Five index types live in mainline Postgres, each tuned for a different query shape.

B-tree — sorted, balanced, supports =, <, >, BETWEEN, ORDER BY. The default. Update-friendly. Roughly the size of the indexed column × number of rows.
Hash — equality only, no range queries. Slightly smaller than B-tree, slightly faster on =, but rarely worth the loss of range support. Crash-safe since Postgres 10.
GIN (Generalised Inverted Index) — maps each token inside a composite value to the rows containing it. The workhorse for jsonb containment queries, full-text search with tsvector, and arrays. Lookup speed scales with matches, not table size. Pays a 3× build-time and significant write-amplification penalty vs B-tree. Sentry and Algolia-replacement workloads on Postgres lean on GIN; pganalyze documents the bad-side cases too.
GiST (Generalised Search Tree) — a framework for indexable structures where ordering does not exist. PostGIS uses GiST to index R-tree-style spatial geometries; OpenStreetMap data loaded via osm2pgsql is unusable without USING gist (way) on every geometry column. Range types and exclusion constraints also use GiST.
BRIN (Block Range Index) — stores only min/max summaries for groups of pages (default 128 pages ≈ 1 MB per range). A BRIN index on a 50M-row time-series table can be a few MB, where the equivalent B-tree would be multiple GB. Brilliant for append-only data correlated with physical order — timestamps, sensor IDs, log sequences. Useless for randomly distributed data. TimescaleDB hypertables and Datadog-style host-metadata tables use BRIN on the time column for exactly this reason.

A more recent addition, HNSW via the pgvector extension, gives sub-20ms approximate nearest-neighbour search over high-dimensional embeddings. Production RAG stacks built on OpenAI's 1536-dimensional embeddings typically use m=16, ef_construction=128 as the starting point and rely on CREATE INDEX CONCURRENTLY to avoid blocking writes during build.

Teacher voice. Picking the index type is half art, half profile. Always start with "what does my predicate actually look like?" — equality, range, containment, similarity, distance. The predicate dictates the index, not the column name.

6) The planner and statistics — why ANALYZE matters¶

Postgres is a cost-based optimiser. For every query it generates candidate plans — different join orders, different index choices, different scan types — and estimates the cost of each using a formula combining page reads, CPU operations, and statistical row counts. The plan with the lowest estimated cost wins.

That estimate is only as good as the statistics. ANALYZE samples the table (default 30,000 rows for tables up to ~1B rows) and populates pg_statistic with: number of distinct values per column, most common values and their frequencies, a histogram of bucket boundaries, and a correlation between physical and logical ordering. The autovacuum daemon runs ANALYZE automatically when about 10% of a table's rows have changed since the last sample.

For our WHERE id=1234 predicate, the planner pulls these stats: n_distinct for id is roughly equal to the row count (it is a primary key), so estimated selectivity is 1 / n_distinct. For a 10M-row table that's 1e-7 — basically one row. Index Scan dominates. No ambiguity, ANALYZE barely matters.

The moment the predicate is WHERE status = 'shipped' it matters a lot. If status has only five distinct values and 'shipped' is 40% of them, the planner correctly chooses Seq Scan. If ANALYZE is stale and pg_statistic claims 'shipped' is 0.5% of rows, the planner picks Index Scan, fetches 4M heap pages randomly, and the query that should take 200 ms takes 30 s. This is the most common production planner regression: stale stats after a big bulk load, or after a long-running transaction blocked autovacuum-analyze for hours.

Mini-FAQ. "What is the difference between VACUUM and ANALYZE?" VACUUM reclaims dead tuples and updates the visibility map. ANALYZE re-samples rows to update planner statistics. VACUUM ANALYZE does both in one pass. Autovacuum schedules both independently — vacuum on dead-tuple thresholds, analyze on modified-row thresholds.

A second statistical gotcha is column correlation. The planner assumes column values are independent unless you tell it otherwise. If your orders table has 99% correlation between country and currency (Indian orders → INR, US orders → USD), the planner will multiply selectivities as if they were independent and badly underestimate. CREATE STATISTICS lets you declare multi-column dependencies so the planner gets the math right.

7) Comparison table¶

All numbers below are for Postgres 16 on AWS RDS db.r6g.large (2 vCPU, 16 GB RAM, gp3 SSD), a 50M-row orders table, single tuple ~200 bytes. Treat as order-of-magnitude reference, not benchmarks.

Index type	Best predicate	Read latency (single row)	Write amplification per row	Index size	Update behaviour
B-tree	`=`, `<`, `>`, `BETWEEN`, `ORDER BY`	0.3 – 0.8 ms	1× (HOT) to 2× (non-HOT)	~1.0 GB on 50M bigint keys	Updated unless HOT applies
Hash	`=` only	0.2 – 0.6 ms	~1.5×	~0.6 GB	Updated on every key change
GIN	`@>`, `?`, `to_tsquery`, array ops	1 – 5 ms (depends on match count)	3 – 5×	~3.0 GB on jsonb column	Pending list batches writes
GiST	spatial `&&`, range overlap, kNN	1 – 4 ms	~2×	~1.5 GB on geometry column	Updated per row
BRIN	range scans on physically ordered cols	5 – 50 ms (scans block ranges)	~0.05×	~3 MB on 50M timestamp rows	Almost free to update
HNSW (pgvector)	vector nearest neighbour	5 – 20 ms (1536-dim, ef_search=40)	high — full graph relinking	~6 GB on 1M 1536-dim vectors	Slow updates; rebuild common

For our running UPDATE — single row lookup on integer PK — the B-tree row is the relevant one. HOT applies (no indexed column changed), so write amplification is 1× and the index itself is untouched. Total physical work: one heap page modification, one WAL record (~80 bytes), one fsync on commit. The B-tree was used for the read, not the write.

Where this lives in the wild¶

These are sourced from live engineering blogs, vendor docs, and conference talks. Two natural categories: shops that fight MVCC/vacuum at write-heavy scale, and shops that lean on the non-B-tree index family for analytical or specialised workloads.

MVCC, WAL, and vacuum tuning at write-heavy scale:

Notion — VACUUM stalled on the central block table and threatened transaction-id wraparound; the fix was 480 logical shards across 32 physical Postgres instances keyed by workspace id.
GitLab — runs a 12 TiB Postgres cluster streaming WAL continuously to replicas with replication slots and shipping completed WAL files to cloud storage for point-in-time recovery.
Heroku Postgres — pushes base backups and WAL segments to S3 every 60 seconds so any customer database can be restored to any second in the retention window.
Sentry — vacuum tuning on the nodestore_node blob table is the famous operational pain point; their docs publish autovacuum overrides for write-heavy event ingestion.
Instagram — runs sharded Postgres on thousands of logical schemas with PL/pgSQL-based Snowflake-style id generation, and tunes autovacuum aggressively for write-heavy feed and likes tables.
Robinhood — sharded brokerage data per-user after Postgres redlined; spent significant time on covering indexes and unused-index pruning before sharding, using pganalyze and AWS RDS Performance Insights.
Twitch — runs ~94% of its 125 OLTP database hosts on Postgres, with the oldest cluster pushing 300k+ TPS and a vacuum/WAL pipeline tuned for that throughput.
Stripe — uses Serializable isolation for ledger transactions where double-entry invariants must hold, accepting the MVCC retry cost as the price of correctness.
GitLab.com Geo — uses physical replication slots so a secondary site that disconnects briefly does not lose its WAL position; inactive slots are a known footgun because they hold WAL forever.
Notion (post-shard ops) — uses pganalyze to monitor per-shard autovacuum lag and bloat ratio across all 480 logical shards.
TripAdvisor — splits content types (members, media, reviews) across separate Postgres instances and tunes vacuum thresholds per-instance based on write profile.
Reddit (media metadata) — uses Aurora Postgres with pg_partman time-based partitioning and pg_cron to keep individual partitions small enough that VACUUM finishes inside its window.
Skype (legacy core) — used Skytools (PgBouncer's older cousin) and PL/Proxy to spread write load across hash-partitioned Postgres while keeping snapshot semantics inside each shard.

Non-B-tree index families in production:

OpenStreetMap (osm2pgsql) — every spatial query against the planet-scale geometry table requires USING gist (way); without GiST the queries fall back to a sequential scan over hundreds of GB.
TimescaleDB hypertables — auto-create a B-tree on the time column but recommend an additional BRIN index for range scans across compressed chunks, cutting index size from GB to MB.
Datadog host metadata — handles ~25,000 upserts per second on a Postgres table keyed by host id; planner stats and the right composite B-tree determine whether the upsert hits the index or scans.
PostgreSQL full-text search at Sentry-scale issue search — GIN indexes on tsvector columns keep lookup latency dependent on match count, not table size, even at hundreds of millions of rows.
Supabase — exposes pgvector with HNSW indexes as a first-class feature so every Supabase project can do sub-20ms vector similarity over OpenAI-class 1536-dim embeddings.
Crunchy Data customer deployments — publish HNSW production guides recommending m=16, ef_construction=128, and CREATE INDEX CONCURRENTLY to avoid blocking writes during multi-hour builds.
Cloudflare (Stream, Images, Dashboard) — runs Postgres internally for these products and uses GIN indexes on metadata jsonb columns to support flexible filter UIs without schema migrations.
Disqus — uses recursive CTEs over B-tree-indexed parent-child pointers to fetch threaded comment trees in a single query instead of N+1 lookups.
OpenAI-RAG-style apps on pgvector — store document chunks and 1536-dim embeddings side-by-side with HNSW indexes; halfvec quantisation halves the ~6 GB index size with negligible recall loss.
TigerData / Timescale columnstore — adds B-tree indexes on top of compressed columnstore chunks for 1,185× faster lookup queries; index size and compression trade off explicitly.
OneSignal — runs hundreds of Postgres clusters and publishes vacuum-tuning case studies on multi-billion-row notification tables that depend on BRIN-style block-range thinking.
pgrouting / OSM routing — uses GiST on geometry plus B-tree on graph-edge endpoints to serve Dijkstra-style routing queries over the OpenStreetMap planet.

Pause and recall¶

What two header fields on a tuple determine which transactions can see it?
When our UPDATE runs, how many physical tuples exist on disk for order 1234 right after the new value is written and before vacuum runs?
What rule does "write-ahead" in WAL actually enforce, in one sentence?
Under what conditions does an UPDATE qualify as a HOT update, skipping the B-tree write?
Which index type would you reach for first to support WHERE tags @> '["sale"]' on a jsonb column, and which one for WHERE ts BETWEEN ... AND ... on an append-only events table?
Why does stale ANALYZE output sometimes turn a 200 ms query into a 30 s query?
What does an Index-Only Scan need from Postgres beyond the index itself in order to skip the heap visit?
What is the relationship between LSN on a heap page and the WAL stream, in terms of crash recovery?

Interview Q&A¶

Q1. Walk me through what Postgres does physically when I run UPDATE orders SET status='paid' WHERE id=1234. A. The backend looks up id 1234 via the B-tree primary-key index, lands on the heap tuple, takes a row-level lock, and writes a new tuple with xmin=current_txid. The old tuple gets xmax=current_txid and a forward pointer to the new one. A WAL record describing both page changes is appended to the WAL buffer. On COMMIT, fsync flushes WAL through the new LSN; only after that does the client see "OK". The dirty heap page is written back to disk asynchronously at the next checkpoint. If status were indexed and HOT did not apply, a new B-tree leaf entry would also be inserted. Common wrong answer to avoid: "Postgres edits the row in place and writes the change to the log." Postgres never edits in place — that is the whole point of MVCC.

Q2. Explain MVCC to a senior engineer who knows MySQL InnoDB but not Postgres. A. Both implement MVCC. InnoDB keeps the latest version in place and writes older versions to an undo log; readers reconstruct old versions from undo. Postgres keeps all versions in the heap directly, each tagged with xmin/xmax, and readers walk the update chain inside the table itself. Postgres has no undo log; the cost shows up as dead tuples that VACUUM must reclaim. Snapshot semantics — readers never block writers, writers never block readers — are the same; the storage strategy is the difference. Common wrong answer to avoid: "MVCC means readers and writers don't lock each other." True but shallow; the storage-layout consequence is the staff-level part of the answer.

Q3. Why is WAL needed if shared_buffers eventually flush to disk anyway? A. Because the flush is asynchronous and unordered. A crash between COMMIT and the next checkpoint would lose any dirty pages still in memory. The WAL enforces a strict order: the log record describing a page change is durable before the page change can become visible to a committed transaction. On recovery, Postgres replays WAL forward from the last checkpoint and brings every page up to the LSN of the latest record that touched it. Common wrong answer to avoid: "WAL is for replication." Replication is a downstream consumer; the primary purpose of WAL is crash recovery and commit durability.

Q4. When would Postgres pick a Seq Scan over a B-tree Index Scan, and how would you confirm? A. When the predicate matches a large enough fraction of the table — typically more than 5–15% — that the random I/O of visiting many heap pages costs more than reading the whole table sequentially. The planner makes this call using pg_statistic histograms and n_distinct. Confirm with EXPLAIN (ANALYZE, BUFFERS) — look at estimated vs actual row counts. If estimates are wildly off, the cause is almost always stale or insufficient statistics; fix with ANALYZE or CREATE STATISTICS for correlated columns. Common wrong answer to avoid: "Postgres prefers Seq Scan because indexes are slow." Indexes are not slow; they are wrong for high-selectivity predicates.

Q5. When would you reach for GIN, GiST, or BRIN over a default B-tree? A. GIN for containment-style predicates on composite values — jsonb @>, full-text tsvector, array ops — where the indexed value is a set of tokens. GiST for non-orderable data with overlap semantics — spatial geometries via PostGIS, range types, exclusion constraints. BRIN for very large append-only tables where the indexed column correlates with physical row order — timestamps in observability tables, monotonic ids in event logs — because BRIN stores only per-block min/max and is two to three orders of magnitude smaller than the equivalent B-tree. Common wrong answer to avoid: "GIN is for JSON, GiST is for geo, BRIN is for big tables." Each phrase is half right; the actual selector is the predicate shape and the data distribution, not the column type.

Q6. What is a HOT update, and when does it fail to apply? A. A Heap-Only Tuple update writes the new tuple version into the same heap page as the old one and links them via t_ctid, leaving the B-tree index entry pointing at the old slot. The index automatically follows the chain to find the live version. HOT applies only when (a) no indexed column changed and (b) the same page has free space for the new tuple. It fails when an indexed column is updated, or the page is full — both force a new index entry, and on a heavily updated table that causes B-tree bloat. Fillfactor (defaulting to 100 for tables, 90 for indexes) is the tuning knob; lowering it leaves headroom for HOT. Common wrong answer to avoid: "HOT means the row stays hot in memory." Nothing to do with caching; HOT is about avoiding index updates.

Q7. A query that ran in 50 ms yesterday now takes 30 s. Same SQL, same data shape, mostly. Walk through how you would diagnose this in Postgres. A. Start with EXPLAIN (ANALYZE, BUFFERS) and compare estimated vs actual row counts. A 100× mismatch points at stale statistics; run ANALYZE on the relevant tables. Check pg_stat_user_tables for n_dead_tup — heavy bloat after a long transaction can push the planner toward Seq Scan unnecessarily. Look for plan flips around the selectivity crossover; the fix is often a covering index or CREATE STATISTICS for correlated columns. Finally, check pg_stat_activity for blocking long transactions that held back autovacuum. Common wrong answer to avoid: "Add an index." Adding indexes without confirming the plan change wastes write throughput and may not help; diagnosis precedes prescription.

Q8. Why is "exactly-once durability across replicas" hard even with synchronous replication? A. Synchronous replication waits for the WAL bytes to be acknowledged by a chosen replica before returning COMMIT. That gives durable persistence on the replica's disk, but it does not guarantee the replica has replayed the WAL into its data pages — there is replay lag distinct from receive lag. So a read against the replica immediately after commit can still miss the row. Use synchronous_commit = remote_apply to wait for replay too; the trade is higher commit latency. Common wrong answer to avoid: "Synchronous replication is exactly-once." It is "WAL durably received," not "change visible everywhere."

Apply now (10 min)¶

Step 1 — model the trace. Here is what one row of an "update trace audit" looks like for our running statement. Copy the shape.

Stage	Physical artefact	Where it lives
Find row	B-tree leaf entry `1234 → (0,2)`	index page in shared_buffers
Write new tuple	new tuple `(0,4)` with `xmin=500`	heap page in shared_buffers
Mark old tuple	old tuple `(0,2)` gets `xmax=500`	same heap page
Append WAL	one WAL record covering both edits	WAL buffer → pg_wal/ on fsync
Commit	fsync WAL up to commit LSN	durable on disk
Aftermath	dead tuple `(0,2)` waits for vacuum	heap

Step 2 — your turn. Pick one frequently-run UPDATE in your own service. For each stage in the table above, write down (a) which index is consulted, (b) whether HOT would apply, (c) what the WAL record roughly contains, (d) whether any replica is following. If you cannot answer (b), run EXPLAIN (ANALYZE, BUFFERS) on the matching SELECT and inspect pg_stat_user_tables for n_tup_hot_upd on the table.

Step 3 — sketch from memory. Close this page. On paper, draw the tuple-with-xmin/xmax diagram from section 2 next to the WAL/fsync flow from section 3. Label which arrows are synchronous (block COMMIT) and which are asynchronous (catch up at checkpoint). If you can produce this diagram cold, you can hold the MVCC + WAL story together end-to-end in an interview.

Bridge. Now you know what one UPDATE costs in physical terms. Next we move up the stack: how you actually use Postgres day-to-day — psql, schema design, queries, and reading EXPLAIN output without panic. → 02-psql-schema-queries-explain.md