Skip to content

03. Vacuum, replication, pooling — Postgres in production

~20 min read. Yesterday you sat at psql, designed a schema, and read an EXPLAIN plan. Today the same orders table has grown to 5 billion rows, autovacuum is falling behind, a Tuesday-evening connection storm is spawning 4,000 Postgres backends, and your CEO wants a read replica in Frankfurt by Friday. This is Postgres at scale — where the database doesn't break, but the operational shape around it does.

Builds on: 00-eli5.md, 01-mvcc-wal-indexes.md, and 02-psql-schema-queries-explain.md.

The earlier chapters told you what Postgres does on each row. This one tells you what goes wrong when you do that ten thousand times a second for two years. Three forces compound. MVCC leaves dead tuples that vacuum must clean. Single-primary architecture forces you to choose between streaming and logical replication for read scale and DR. Postgres' process-per-connection model collides with the cloud reality of 3,000 stateless app pods each wanting a connection. Master these three and you can keep a Postgres cluster healthy at 100x the load that breaks an unprepared team.

We continue the same threaded example. The orders table from chapters 1 and 2 has been in production for two years. It now holds 5 billion rows, 2 TB on disk. Every hour, 200 million row versions get marked dead by status updates. Autovacuum can't keep up. The status page goes red every Tuesday. This chapter is how a senior engineer fixes that — by partitioning orders by placed_at, tuning autovacuum per partition, putting PgBouncer in transaction mode in front of 3,000 Django pods, and adding a streaming replica in Frankfurt for the EU read path.


1) Autovacuum mechanics — xmin/xmax, dead tuples, visibility map

Recall from chapter 1 that every UPDATE in Postgres creates a new tuple and marks the old one with xmax = current_xid. The old tuple is not freed; it sits on the page until something proves no live transaction can still see it. That something is vacuum. Vacuum walks pages, finds tuples whose xmax is older than the oldest running transaction's snapshot, and frees the line pointer so the space can be reused for a future insert. Without it, the table would grow forever even with no net row growth.

Autovacuum is the background daemon that runs VACUUM automatically. It wakes up every autovacuum_naptime (default 1 minute), looks at each table's n_dead_tup counter in pg_stat_user_tables, and triggers a vacuum when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples — out-of-the-box that means 20% of the table must turn over before vacuum runs. On a 5-billion-row table, 20% is 1 billion dead tuples; long before that the table is unusable.

The visibility map is the unsung hero. It is a bitmap with one bit per heap page, set when every tuple on the page is visible to all current transactions. Vacuum uses it to skip clean pages — without the visibility map, every vacuum would read every page. Index-Only Scans use the same bitmap to skip heap fetches. Keep the visibility map fresh and your hot tables stay cheap; let it go stale and reads and vacuum both get more expensive.

AUTOVACUUM WAKE → SCAN → CLEAN cycle

   ┌─────────────────┐
   │ autovacuum nap  │  every 1 min
   └────────┬────────┘
   ┌─────────────────────────────────────┐
   │ pg_stat_user_tables.n_dead_tup      │
   │ > threshold + scale * reltuples?    │
   └────────┬────────────────────────────┘
       yes  ▼                       no  → back to nap
   ┌─────────────────────────────────────┐
   │ launch worker (max autovacuum_max_workers)
   └────────┬────────────────────────────┘
   ┌─────────────────────────────────────┐
   │ for each heap page:                 │
   │   if visibility_map[p] = all-visible → SKIP
   │   else: read page, mark dead tuples free,
   │         update visibility_map,
   │         update free space map        │
   └────────┬────────────────────────────┘
   ┌─────────────────────────────────────┐
   │ also vacuum each index of the table │
   └─────────────────────────────────────┘

Why autovacuum falls behind on big write-heavy tables comes down to three throttles. autovacuum_vacuum_cost_limit caps the I/O work it can do per cycle — too low and it can't keep up; the GitLab team raised theirs to 3000 then tuned cost-delay to 5ms. autovacuum_max_workers (default 3) limits parallelism — three workers cannot simultaneously vacuum your hundred largest tables. And the killer: a single long-running transaction freezes the cleanup horizon. A 30-minute analytics query holds an old snapshot; vacuum cannot reclaim any tuple newer than that snapshot, across the entire database, until the query finishes. One bad BI dashboard query has caused more vacuum incidents than every misconfigured worker count combined.

Teacher voice. Vacuum is not optional maintenance. It is the second half of every UPDATE Postgres does — the half you cannot see in the query but that determines whether your database is alive in six months. Treat autovacuum settings as part of your schema, not part of your DBA's secret.

For our orders table: with 200M dead tuples per hour and the default 20% scale factor, autovacuum waits until 1 billion dead tuples accumulate — five hours of garbage. By then the next vacuum is so expensive it runs for 90 minutes and blocks the next one. We override per table: ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_limit = 3000). Now vacuum fires at 50M dead tuples, finishes in 5 minutes, and the table never goes more than 15 minutes without cleanup.


2) Bloat — index vs table bloat, detecting it, the cost of doing nothing

Even with vacuum running on schedule, bloat accumulates. Bloat is dead space inside pages — line pointers freed but page not compacted, or index entries left pointing at dead heap tuples. Vacuum reclaims space for reuse inside the file but never shrinks the file. A table that was once 200 GB at 100% live-tuple density can stay 200 GB at 30% live density forever, with 140 GB of wasted disk and cache.

Table bloat and index bloat are different beasts. Table bloat happens when updates leave tombstones scattered across pages and new inserts can't reuse the holes (the fillfactor parameter exists for exactly this). Index bloat is worse: autovacuum cannot reclaim space inside a B-tree index after deletes — it only marks space available for re-insertion at the same key range. Indexes on monotonic keys (bigserial, timestamps) bloat the worst because deletes happen at the low end while inserts happen at the high end; the freed space is never reused. Datadog's Postgres integration explicitly tracks postgresql.index_bloat as a first-class metric for this reason.

Detecting bloat starts with pgstattuple:

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('orders');
-- table_len 2.1 TB, tuple_count 5.0B, dead_tuple_count 1.2B,
-- free_space 480 GB, dead_tuple_percent 24%
SELECT * FROM pgstatindex('orders_pkey');
-- avg_leaf_density 51% — should be ~90% on a healthy fresh B-tree

Twenty-four percent dead and 51% leaf density is "fix this Tuesday" territory. The fix is either VACUUM FULL (exclusive lock, rewrites the table, never acceptable on a live primary above ~10 GB) or pg_repack (online — Crunchy Data and AWS RDS both publish guides; only holds an ACCESS EXCLUSIVE lock briefly at the final swap, the rest is ACCESS SHARE so inserts and updates proceed). For indexes alone, REINDEX INDEX CONCURRENTLY orders_pkey (PG12+) rebuilds without blocking writes. Data Egret's pg_index_watch automates exactly this loop on indexes whose bloat trends past a threshold.

The cost of doing nothing compounds in three ways. Cache pressure: 140 GB of dead pages pushes hot pages out of shared_buffers, so the working set falls to disk. Query cost: every Seq Scan and every Index Scan must walk those dead pages or entries even if vacuum has marked them. WAL volume: every write to a bloated table writes more WAL than it should, slowing replication. One Postgres team I followed measured a 3x latency p99 regression on their checkout API caused purely by index bloat that nobody had noticed for six months.

Mini-FAQ. "Won't autovacuum eventually fix bloat?" No. Autovacuum reclaims space inside pages for reuse by future inserts. It does not return the space to the OS, does not defragment indexes, and cannot help a bloated index whose keys are monotonic. For real shrinkage you need pg_repack or a REINDEX.


3) Transaction ID wraparound — the 2-billion XID horizon

This one is a singularity, not a gradient. Every Postgres transaction gets a 32-bit XID. Comparisons are modular — at any moment, half the XID space is "past" and half is "future" relative to a given transaction. When a tuple's xmin falls more than 2 billion XIDs behind the current XID counter, Postgres can no longer tell whether that tuple is older or newer than the current transaction, so it must guarantee that doesn't happen.

The guarantee is freezing. Before a tuple's xmin ages past autovacuum_freeze_max_age (default 200 million XIDs), autovacuum rewrites it with a special FrozenTransactionId that always compares as "in the past." Autovacuum launches anti-wraparound vacuum jobs specifically for tables whose oldest tuple is approaching the horizon — these jobs cannot be cancelled and run even on tables that haven't accumulated enough dead tuples to trigger normal vacuum.

If you ignore wraparound long enough, Postgres protects itself by going into emergency read-only mode 1 million XIDs before the limit. At ~10 million XIDs out, the database refuses new writes; at the limit, it refuses everything. Mailchimp's Mandrill outage in February 2019 was exactly this: one of five Postgres instances hit wraparound after a write spike, the database forced itself read-only, and recovery required truncations and manual VACUUM FREEZE — total downtime around 40 hours. Sentry's 2024 incident followed the same pattern when autovacuum couldn't keep up with freezing on a heavily-updated table. Both teams published detailed post-mortems, and both root causes were "autovacuum was disabled or too slow on the wrong table."

XID SPACE — 32 BITS, MOD 2^32

  current XID ─┐
  ←─────────────────────────────────────────────────────→
   2 billion "past"            2 billion "future"

  freeze every tuple older than ~200M XIDs:
   ─ ── ── ── ─FROZEN─ ── ── ── ──◄── live xids ──┤
                                                  current

  if you let oldest unfrozen xid drift past 2B from current:
  → cannot distinguish past from future → DB goes read-only

The monitoring query every senior should know:

SELECT datname, age(datfrozenxid) AS xid_age,
       (2^31 - 1)::bigint - age(datfrozenxid) AS xids_left
FROM pg_database ORDER BY xid_age DESC;

Below 200M is healthy; 500M is "investigate today"; 1B is "page someone"; 1.5B is "all hands." AWS RDS exposes this via the postgres_get_av_diag() function and CloudWatch metric MaximumUsedTransactionIDs.

For our orders table at 5B rows and 200M updates per hour: at the default freeze settings, autovacuum freezes ~200M XIDs every hour or so. The math is comfortable if anti-wraparound jobs are not being killed by lock contention with ALTER TABLE or by long-running transactions. Watch pg_stat_progress_vacuum to see them running.


4) Streaming replication vs logical replication

A single Postgres primary cannot serve all your reads, and certainly cannot survive losing its AZ. You need replicas. Postgres ships two fundamentally different replication modes, and choosing wrong is a year-long migration to fix.

Streaming (physical) replication ships the raw WAL byte-for-byte to a standby. The standby applies WAL records exactly as the primary wrote them, so its data files are bit-identical to the primary's. Every table, every index, every system catalog, every page. Setup: pg_basebackup clones the primary, then the standby connects and consumes WAL via the walsender process. With hot_standby = on, the standby answers read-only queries while it's still applying WAL. Failover is a pg_ctl promote. This is what AWS RDS Multi-AZ, Heroku Postgres followers, and Crunchy Bridge HA all use under the hood.

Logical replication ships row-level changes — INSERT (row data), UPDATE (old key + new row), DELETE (row key) — decoded from the WAL by an output plugin (pgoutput). The subscriber applies the changes as regular SQL operations against its own tables. The subscriber can be a different major version, a subset of tables, or even MySQL or BigQuery via Debezium. Setup: CREATE PUBLICATION ... FOR TABLE orders, line_items; CREATE SUBSCRIPTION ... CONNECTION ... PUBLICATION ....

Streaming (physical) Logical
Granularity Whole cluster, byte-identical Per-table, per-row
Cross-version Same major version only Different versions OK
Read scaling Yes, hot standby Yes, but lag is per-table
Cascading filter (subset) No Yes — pick tables
DDL replicated Yes (it's bytes) No (manual on subscriber, PG15+ has limited support)
Failover (HA) Designed for it Not really — convergence semantics fuzzy
Typical use DR, read replicas Sharding, version upgrade, CDC

The two case studies that define this choice in 2026 are Notion and Figma. Notion runs 480 logical shards on 32 physical instances, and used logical replication to copy historical data when expanding from 32 to 96 physical instances — they reduced migration time from 3 days to 12 hours by delaying index creation until after data transfer. Figma similarly used logical replication to vertically partition their monolith — splitting "Files" tables onto their own database — and to set up a reverse logical replication stream from the new database back to the old one, giving them safe rollback if the cutover went sideways. Figma's published architecture explicitly chose logical over physical because they needed table subsets, different major versions during migration, and rollback capability — all things physical replication cannot give you.

Teacher voice. Streaming replication is for DR and read scale on the same data. Logical replication is for changing the shape — fewer tables, different version, different shard, different system. Pick by what you need to be different between primary and replica.

For our orders table: we add two streaming replicas in us-east-1 (one for analytics, one for HA failover) plus one cascaded streaming replica in eu-central-1 for EU reads. For the eventual move to schema-sharded orders (chapter 4), we'll use logical replication to migrate workspace by workspace.


5) PgBouncer — pooling modes and the prepared-statement trap

Postgres uses one OS process per connection. Each backend costs ~5-10 MB of RAM and one process-table slot. A db.r6g.large RDS instance with max_connections = 200 can comfortably serve 200 backends; ask it for 3,000 and it OOMs the host. Meanwhile, a Django fleet of 3,000 pods with 2 workers each wants 6,000 connections at peak. The arithmetic does not work.

PgBouncer solves this. It is a single-binary connection multiplexer that maintains a small pool of long-lived backend connections and multiplexes thousands of short-lived client connections onto them. The trick is in when PgBouncer returns a backend connection to the pool, and that is the pooling-mode choice.

THREE POOLING MODES, ONE CONFIG KNOB

SESSION POOLING                    
  client connects ─────────────────► backend held until client disconnects
  ── concurrency = max_client_conn, but only up to server_pool_size active
  ── safe for everything, but barely better than no pooling

TRANSACTION POOLING                
  client connects ──► BEGIN ──► COMMIT/ROLLBACK ──► backend returned to pool
                      ▲                                       │
                      └───── next BEGIN → maybe a different backend
  ── concurrency = unlimited clients, server_pool_size active txns
  ── breaks: session GUCs (SET LOCAL only), advisory locks, LISTEN/NOTIFY,
            prepared statements pre-PgBouncer 1.21

STATEMENT POOLING                  
  client connects ──► one statement ──► backend returned
  ── no multi-statement transactions allowed at all
  ── almost no one uses this; transaction pooling is the sweet spot

Transaction mode is the default on Crunchy Bridge, the recommended mode on Heroku Postgres connection pooling, and what every high-scale Postgres-on-Postgres shop runs. The cost is a list of footguns. SET search_path = ... outside BEGIN no longer survives the next statement — use SET LOCAL inside the transaction. Advisory locks (pg_advisory_lock) silently leak. LISTEN/NOTIFY does not work across pool returns. Server-side cursors break. And the famous one: prepared statements prepared on one backend cannot be executed on a different backend.

This last one bit every Django, JDBC, and psycopg user for a decade. Before PgBouncer 1.21 (released early 2024), the only fix was to disable prepared statements at the driver — prepareThreshold=0 in JDBC, prepare_threshold=None in psycopg3, DISABLE_SERVER_SIDE_CURSORS=True in Django. PgBouncer 1.21 added protocol-level prepared statement tracking that prepares the statement on whichever backend the client lands on; set max_prepared_statements = 200 in pgbouncer.ini to enable. Production reports from the OpenSourceDB and JP Camara writeups suggest this works well but adds memory per pool. Many shops are still on PgBouncer 1.18-1.20 and have not turned it on; check your version.

The size knobs that matter:

Knob What it caps Sensible value (PgBouncer 1.21, db.r6g.xlarge primary)
max_client_conn Total client connections PgBouncer accepts 10,000-50,000
default_pool_size Backend connections per (user, database) pair 20-50
reserve_pool_size Extra backends during traffic spikes 5-10
server_idle_timeout When idle backends get closed 600 s
max_prepared_statements Per-pool prepared stmt cache size 200-500 (PG 1.21+)
Backend max_connections Postgres' hard limit 200-400

The Crunchy Data "running multiple PgBouncers" pattern layers them: one PgBouncer near the app for client-to-PgBouncer multiplexing, a second near Postgres for PgBouncer-to-server final funneling. Helpful when geography matters. Newer alternatives — pgcat (Rust rewrite with query routing, multi-tenant sharding, read-replica routing) and Supavisor (Supabase's Elixir-based pooler, scales to millions of connections per node) — are PgBouncer-compatible and gaining ground; Tembo's 2024 benchmarks show PgCat overtaking PgBouncer past ~50 concurrent clients.

For our orders table: PgBouncer in transaction mode, default_pool_size = 30, in front of 3,000 Django pods. Effective concurrency: 90,000 client connections multiplexed onto 30 backends. Prepared statements enabled via PgBouncer 1.21; Django's CONN_MAX_AGE set to 0 because PgBouncer holds the persistent connections, not the app.


6) Partitioning — range, list, hash, and partition pruning

Our 5B-row orders table cannot stay one table. Every vacuum reads 2 TB. Every index is 200 GB. Backup takes 4 hours. The fix is partitioning — split one logical table into many smaller physical tables, transparent to the application via the parent table.

Postgres supports three partition strategies. Range partitioning is the workhorse: split by a continuous key, almost always time (placed_at). List partitioning splits by a categorical key (country_code for residency, tenant_id for multi-tenancy). Hash partitioning spreads writes evenly when there's no natural locality — handy for sharding write traffic but useless for partition pruning on range scans.

CREATE TABLE orders (
  id bigserial,
  user_id bigint,
  status order_status,
  total_cents bigint,
  placed_at timestamptz NOT NULL,
  PRIMARY KEY (id, placed_at)            -- partition key must be in PK
) PARTITION BY RANGE (placed_at);

CREATE TABLE orders_2026_05 PARTITION OF orders
  FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');
CREATE TABLE orders_2026_06 PARTITION OF orders
  FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
-- ... pg_partman + pg_cron automates creation

Pre-PG10 you simulated this with table inheritance and CHECK constraints — Reddit, OpenStreetMap, and older Postgres deployments still carry inheritance-based partitioning in their history. Declarative partitioning (PG10+) replaces all that with native syntax, and PG12+ made the planner fast enough to handle thousands of partitions without overhead.

The win is partition pruning — when a query has WHERE placed_at >= '2026-05-01' AND placed_at < '2026-05-15', the planner skips every partition outside that range. EXPLAIN shows it explicitly:

Append  (cost=0.43..2451.20 rows=187 width=84)
  ->  Index Scan using orders_2026_05_pkey on orders_2026_05
        Index Cond: (placed_at >= '2026-05-01' AND placed_at < '2026-05-15')
  (other partitions pruned — only 1 of 60 partitions scanned)

One 30-GB partition scanned instead of 2 TB. Vacuum runs per partition, so a 30-GB vacuum completes in 90 seconds instead of 90 minutes. Indexes are 200 MB per partition instead of 200 GB. Dropping last year's data is DROP TABLE orders_2025_05 — an instant DDL — instead of a 12-hour DELETE.

The constraints that bite you in production. Foreign keys referencing the partition parent work (PG12+), but cascades across many partitions can be slow — Sentry's troubleshooting docs explicitly call this out. Unique constraints must include the partition key, which often means dropping UNIQUE(id) in favor of UNIQUE(id, placed_at) — surprising for application code that expects id alone to be unique. Cross-partition updates that move a row from one partition to another (UPDATE ... SET placed_at = ...) work but are expensive — they delete from one partition and insert into another. The TigerData (Timescale) hypertable approach automates much of this for time-series; pg_partman + pg_cron is the more general alternative, used by Reddit, AWS Aurora customers, and many Citus-on-Postgres deployments.

For our orders table: partition by placed_at monthly. Sixty monthly partitions for the live window, archive older partitions to cheaper storage. Per-partition autovacuum settings: hot current month gets vacuum_scale_factor = 0.01; older read-mostly months get the default 0.2. The Tuesday-evening status updates now bloat only the current month's partition.

Teacher voice. Partition before you must, not after. Once a table is past 100 GB, online partitioning is its own engineering project — pg_partman, double-write phase, backfill, swap. Partition the table when you create it if you can see growth coming.


7) Comparison table — pooling modes at scale

PgBouncer 1.21 on Postgres 16, AWS db.r6g.xlarge primary (4 vCPU, 32 GB), 3,000 Django pods, p99 latency for a simple SELECT user_id, status FROM orders WHERE id = $1.

Pooling mode Max effective clients Active backends Prepared-stmt safe? Adv. lock / LISTEN safe? p99 latency overhead Memory per 1k clients
None (direct) ~200 (max_connections) 200 Yes Yes 0 ms ~5,000 MB (on backend)
Session pooling max_client_conn (10k+) server_pool_size Yes Yes ~0.1 ms ~3 MB (PgBouncer)
Transaction pooling max_client_conn (50k+) default_pool_size Yes (PG 1.21+ only) No ~0.3-0.5 ms ~2 MB (PgBouncer)
Statement pooling max_client_conn default_pool_size No No ~0.5 ms ~2 MB (PgBouncer)
pgcat transaction max_client_conn (100k+) per-shard pool Yes No ~0.4 ms ~3 MB (PgCat)
Supavisor (Elixir) millions per node configurable Yes No ~1-2 ms ~4 MB

The 62% cost reduction and 81% latency improvement from one DZone PgBouncer-at-scale case study come from collapsing 10,000 client connections onto 30 backends in transaction mode. The absolute latency floor for any pooler is dominated by the extra TCP hop (~0.3 ms intra-AZ on AWS); the variance comes from prepared-statement handling and lock semantics, not the pooler binary itself.


Where this lives in the wild

These are sourced from engineering blogs, vendor docs, and published post-mortems. Two categories — vacuum/bloat/wraparound stories, and replication/pooling/partitioning at scale.

Vacuum, bloat, and wraparound:

  • GitLab Infrastructure (production runbooks) — tuned autovacuum_vacuum_scale_factor to 0.010, autovacuum_vacuum_cost_limit to 3000, and 10 autovacuum workers across their 12 TiB primary; documents that "if cleaning your house hurts, do it more often."
  • GitLab Database Team — supplements autovacuum with scheduled vacuumdb cron jobs on hot tables (ci_stages, ci_builds_metadata, notes) during low-load windows.
  • Sentry Postgres outage (2024) — autovacuum couldn't keep up with freezing on a heavily-updated table; database went read-only at the wraparound limit, recovery required manual VACUUM FREEZE.
  • Mailchimp Mandrill outage (Feb 2019) — one of five Postgres instances hit XID wraparound after a write spike; ~40 hours of downtime, public post-mortem.
  • AWS RDS — exposes wraparound risk via MaximumUsedTransactionIDs CloudWatch metric and the postgres_get_av_diag() helper function.
  • Crunchy Data — publishes the canonical "managing transaction ID exhaustion" guide and the pg_repack bloat-removal walkthrough.
  • Data Egret pg_index_watch — open-source utility that automates REINDEX CONCURRENTLY when index bloat trends past a threshold; created specifically because autovacuum cannot fix index bloat.
  • Datadog Postgres integration — exposes postgresql.index_bloat and postgresql.table_bloat as first-class metrics via collect_bloat_metrics.
  • Cybertec Postgres — publishes the canonical "transaction ID wraparound: a walk on the wild side" deep dive used by half the field.
  • EnterpriseDB (EDB) "PG Phriday" — long-form writeup on wraparound symptoms and the emergency single-user-mode recovery path.
  • Stormatics — publishes the practical pg_repack removing-bloat guide referenced by many DBA teams.

Replication, pooling, and partitioning at scale:

  • Notion — 480 logical shards across 32 physical Postgres instances; used logical replication to expand to 96 instances, cutting migration time from 3 days to 12 hours by deferring index creation.
  • Figma — uses logical replication for vertical partitioning of their monolith and for reverse-replication rollback safety; shipped first horizontally sharded table in Sep 2023 with 10 seconds of partial availability.
  • Figma DBProxy — custom service intercepting SQL and routing to vertically-partitioned Postgres instances based on table-to-database mapping.
  • Instagram — early Postgres schema-per-shard model with 4,096 logical shards on a small number of physical machines, scaled by moving schemas across machines.
  • Heroku Postgres connection pooling — generally-available managed PgBouncer in transaction mode; supports PGBOUNCER_URLS for leader+follower fan-out to multiple databases.
  • Crunchy Bridge — PgBouncer enabled by default in transaction-pooling mode; install crunchy_pooler extension to opt in per database.
  • Supabase Supavisor — Elixir-based PgBouncer-compatible pooler designed for millions of connections per node; replaces PgBouncer on Supabase managed Postgres.
  • pgcat — Rust PgBouncer alternative with query routing, read-replica awareness, multi-tenant sharding; outperforms PgBouncer past ~50 clients in Tembo benchmarks.
  • AWS RDS Postgres Multi-AZ — synchronous streaming replication to a standby in another AZ for automatic failover.
  • AWS Aurora PostgreSQL — replaces WAL streaming with shared-storage replication; six copies across three AZs at the storage layer.
  • Citus (now Microsoft Azure Cosmos DB for PostgreSQL) — hash-distributes Postgres tables across nodes; pairs with pg_partman for time-series sharding-plus-partitioning.
  • TimescaleDB / TigerData hypertables — automatic time-based partitioning with chunk management, used widely for IoT and metrics workloads.
  • pg_partman + pg_cron — the canonical declarative-partitioning automation stack on AWS RDS, Azure, and Crunchy Bridge; documented by AWS for time-series ingestion.
  • Reddit — partitions media-metadata tables via pg_partman with CHECK-constraint-driven partition pruning.
  • Stripe Sync Engine — uses logical replication via the pgoutput plugin to keep customer-side Postgres tables synced with the Stripe API.
  • Debezium — open-source CDC layer that uses Postgres logical replication slots (pgoutput or wal2json) to stream changes into Kafka.
  • DZone PgBouncer-at-scale case study — 10,000+ concurrent connections across 500+ tenants, 62% cost reduction, 81% latency improvement after switching to transaction-mode PgBouncer.
  • JP Camara "PgBouncer is useful, important, and fraught with peril" — widely-cited writeup of the prepared-statement and advisory-lock footguns in transaction mode.
  • OpenSourceDB blog — walkthrough of solving prepared-statement issues by upgrading to PgBouncer 1.21 with max_prepared_statements.

Pause and recall

  1. Why does autovacuum need the visibility map, and what does Index-Only Scan share with it?
  2. A 30-minute analytics query is running on your primary. Which two production problems get worse during that 30 minutes, and why?
  3. Autovacuum reclaims dead-tuple space. What does it not do that pg_repack does?
  4. At what point in the 32-bit XID horizon does Postgres go read-only, and how do you monitor distance to it?
  5. Streaming replication or logical replication: which one would you pick for a major-version upgrade with minimal downtime, and why?
  6. In PgBouncer transaction-mode, name three Postgres features that silently break, and one that was fixed in version 1.21.
  7. Your orders table is partitioned by placed_at monthly. The query SELECT * FROM orders WHERE user_id = 9001 — does partition pruning help? Why or why not?
  8. What's the difference between table bloat and index bloat, and why does autovacuum fix one and not the other?

Interview Q&A

Q1. Walk me through what happens if autovacuum is disabled on a busy orders table for two weeks. A. Dead tuples accumulate without bound; table and indexes bloat (often 2-5x); cache pressure increases as dead pages displace hot pages; query plans degrade as the planner sees stale statistics; the visibility map goes stale, so Index-Only Scans degrade to Index Scans; eventually the oldest unfrozen XID drifts toward the 2-billion horizon and the database enters emergency read-only mode. Recovery is a VACUUM FULL or pg_repack on every affected table plus emergency VACUUM FREEZE if wraparound is imminent. This is the Mailchimp Mandrill scenario. Common wrong answer to avoid: "Postgres will just slow down a bit." It does not — it goes read-only at the wraparound limit, hard stop on writes.

Q2. Streaming vs logical replication — when do you reach for each? A. Streaming for DR and read scale on identical-shape data — Multi-AZ failover, read replicas in the same region, hot standby. Logical for changing the shape between primary and subscriber — major-version upgrades with reverse-replication rollback (Figma), shard migrations (Notion), CDC into Kafka (Debezium), or selective table replication. Streaming is cheaper, simpler, and guaranteed identical; logical is more flexible but lags more under heavy write load and does not replicate DDL by default. Common wrong answer to avoid: "Logical is just better because it's row-level." Logical has higher overhead per row, no DDL replication, and weaker convergence semantics for HA.

Q3. A teammate suggests transaction-mode PgBouncer for everything. What footguns do you raise? A. Session-scoped features break: SET outside BEGIN (must be SET LOCAL), advisory locks, LISTEN/NOTIFY, server-side cursors, and (pre-PgBouncer 1.21) server-side prepared statements. Application frameworks that assume connection affinity — Django's CONN_MAX_AGE interacting with pool returns, JDBC's prepared-statement cache — need driver-level changes. Modern PgBouncer 1.21 fixes the prepared-statement issue via protocol-level tracking with max_prepared_statements, but advisory locks and LISTEN/NOTIFY still break. Common wrong answer to avoid: "Just turn it on, applications won't notice." They will notice — usually at 2 a.m. when a celery task assumes its advisory lock is held across statements.

Q4. Your orders table is at 5B rows and 2 TB. How would you partition it, and what breaks? A. Range partition by placed_at monthly using declarative partitioning (PG10+); automate creation with pg_partman + pg_cron. Per-partition autovacuum tuning: aggressive on the hot current month, default on archive months. What breaks: unique constraints must include the partition key (so UNIQUE(id) becomes UNIQUE(id, placed_at)); cross-partition updates that move rows are expensive; foreign keys to the partitioned parent work but cascades scan all partitions. Backfilling existing data needs pg_partman's online-migration helper or a custom double-write phase. Common wrong answer to avoid: "Hash partition by id for even spread." Hash-by-id defeats partition pruning for the most common queries (date range, user history) and gives you all the partition complexity with none of the operational wins.

Q5. Explain transaction ID wraparound to a junior engineer in 60 seconds. A. Postgres tags every row with the 32-bit ID of the transaction that created it (xmin) and deleted it (xmax). Comparing two XIDs is modular — Postgres can only tell which is "older" if they're within 2 billion of each other. To stay safe, autovacuum periodically rewrites old xmin values to a special "frozen" marker that always counts as past. If autovacuum can't keep up — disabled, throttled too aggressively, or blocked by a long-running transaction — old XIDs drift toward the 2-billion horizon. At ~10 million XIDs left, Postgres refuses writes to protect itself. The fix is autovacuum running aggressively enough that freeze never falls behind. Common wrong answer to avoid: "Just restart Postgres." Restart does nothing — the unfrozen tuples are still on disk; only VACUUM FREEZE makes them safe.

Q6. PgBouncer says "no more connections in pool" under load. Walk me through diagnosis. A. First check SHOW POOLS in PgBouncer admin console — look at cl_active, cl_waiting, sv_active, sv_idle. If cl_waiting > 0 and sv_idle = 0, the pool is fully busy on legitimate long transactions — increase default_pool_size if the backend can handle more connections, or find the slow queries holding backends (pg_stat_activity with state = 'active' and long xact_start). If sv_idle > 0 but clients still wait, you're hitting max_client_conn. If sv_active is climbing without bound, a transaction is leaking — usually an app forgot to commit or an advisory lock never released. Common wrong answer to avoid: "Increase max_connections on Postgres." That just moves the OOM to the database host; the pool exists precisely to keep Postgres connection count bounded.

Q7. When would you use logical replication instead of streaming for a multi-region setup? A. When the regions need to differ in shape — different tables (EU subscriber gets a subset for GDPR residency), different versions during a rolling upgrade, or different schemas during a migration. Also when the secondary needs to accept writes (logical replicas can; streaming hot standbys cannot). Trade-off is replication lag is per-table and can grow under heavy write load; DDL is not replicated; and there's no built-in HA failover story. Common wrong answer to avoid: "Use logical because the network is cheaper." It is not — logical decodes WAL into row events, often larger than the raw WAL byte stream for narrow tables.

Q8. How do you avoid the "long-running transaction blocks vacuum" problem? A. Three layers. First, set statement_timeout and idle_in_transaction_session_timeout aggressively (e.g., 30s and 60s) so runaway queries get killed. Second, route long analytics queries to a read replica with a different snapshot horizon — even better, a logical-replica analytics database. Third, monitor pg_stat_activity for xact_start older than ~10 minutes and page the on-call; this is the GitLab autovacuum runbook pattern. The structural fix is partitioning so vacuum operates on small partitions, not the whole table, and one stuck transaction blocks only the snapshot horizon — not the per-partition vacuum work. Common wrong answer to avoid: "Just increase autovacuum workers." Workers don't help if the oldest snapshot horizon is held by a 30-minute query — vacuum cannot reclaim any tuple newer than that snapshot, no matter how many workers run.


Apply now (10 min)

Step 1 — model the diagnosis. Here is what one row of an "autovacuum health audit" looks like for our threaded orders example. Copy the shape.

Field Value
Table orders (5B rows, 2 TB, 60 monthly partitions)
pg_stat_user_tables.n_dead_tup (hot partition) 84M (vs threshold 50M)
Last autovacuum 12 minutes ago, ran for 4m 20s
pgstattuple.dead_tuple_percent 4.2% (healthy: <10%)
age(datfrozenxid) 320M XIDs (healthy: <500M)
Long-running txns One BI query, 8 minutes — investigate
PgBouncer cl_waiting 0
Replica lag (pg_stat_replication) 220 ms (healthy: <1s)
Diagnosis Healthy; kill the BI query if it grows past 20m
Action None now; alert if n_dead_tup > 100M on hot partition

Step 2 — your turn. Pick your busiest Postgres table in production. Fill the row above with real numbers from pg_stat_user_tables, pgstattuple, pg_database.datfrozenxid, pg_stat_activity, and PgBouncer SHOW POOLS. For each row, write whether it is in the green band, the yellow band, or red. If anything is yellow, write one action; if anything is red, write a runbook step.

Step 3 — sketch from memory. Close this page. On paper, draw the autovacuum wake-scan-clean cycle diagram from §1 — the loop from naptime through visibility-map skip through index vacuum. Then on the same page, draw the three PgBouncer pooling modes from §5 with their concurrency math. If you can produce both cold, you can debug any vacuum or pool incident an interviewer hands you.


Bridge. You can now keep a Postgres cluster healthy under continuous write load — vacuum tuned, replication sized, pooling correct, partitioning automated. The last question is the comparative one every staff interview asks: when should this be Postgres at all, when MySQL, when DynamoDB, when sharded? That is the next chapter. → 04-vs-mysql-nosql-sharding-interview.md