Skip to content

02. psql, schema, queries, EXPLAIN — Postgres in daily use

~18 min read. Yesterday you traced one UPDATE through MVCC and WAL. Today you sit in front of the terminal where you actually issue that statement, design the table it lands on, write the query that fetches it back, and read the plan when it goes slow. This is Postgres at the keyboard.

Builds on: 00-eli5.md and 01-mvcc-wal-indexes.md.

The previous chapter ended with one tuple, one WAL record, one B-tree leaf. That was Postgres from the inside out. This chapter flips the camera. You are now a senior engineer with a deadline, an orders table that ballooned to 50M rows, a support ticket that says "checkout is slow on Tuesdays," and a psql prompt blinking at you. Same database, same MVCC, same WAL — but the day-to-day questions are different. What does \d orders actually tell you? Is jsonb the right type for the cart payload? Why does this query do a Seq Scan when there is clearly an index? And how do you read EXPLAIN (ANALYZE, BUFFERS) output without zoning out at the third Nested Loop?

We continue the same threaded example. Three tables:

users(id bigserial primary key, email citext unique, created_at timestamptz);
orders(id bigserial primary key, user_id bigint references users(id),
       status order_status, total_cents bigint, placed_at timestamptz,
       shipping_address jsonb);
line_items(id bigserial primary key, order_id bigint references orders(id)
           on delete cascade, sku text, qty int, unit_cents bigint);

And one statement that keeps showing up: UPDATE orders SET status='paid' WHERE id=1234. Now we ask the day-to-day questions around it.


1) The psql survival kit

psql is the CLI you will live in. Not pgAdmin, not DBeaver — those are fine for browsing, but every senior engineer I have worked with debugs production from psql because the meta-commands tell you exactly what the database thinks the schema is, with zero translation layer. Learn six of them and you will outpace half the team.

\d orders shows the table shape: columns, types, nullability, defaults, indexes, foreign-key constraints, check constraints, triggers, replication identity. It is the single most-used command. \d+ orders adds storage parameters (fillfactor, toast options) and per-column statistics targets. \dt lists all tables in the current schema; \dt *.* lists across all schemas. \df your_function_name describes a function. \di+ lists indexes with sizes — surprisingly useful when you suspect bloat. \sf+ some_function prints the source of a function with line numbers.

The settings that change the feel of psql come from ~/.psqlrc:

\set ON_ERROR_STOP on
\timing on
\set HISTFILE ~/.psql_history- :DBNAME
\pset null '∅'
\pset pager off

\timing on prints query latency after every statement — non-negotiable. \set ON_ERROR_STOP on makes scripts halt on the first error instead of marching past 47 broken statements into a corrupt state. \pset null '∅' makes NULLs visible. The HISTFILE line gives you per-database command history, so the migration you ran on staging last week is still there when you reach for it.

Two power features deserve flagging. \watch 2 re-runs the previous query every two seconds — perfect for watching SELECT count(*) FROM pg_stat_activity WHERE state='active' during an incident. And \copy orders TO 'orders.csv' WITH CSV HEADER is the client-side cousin of server-side COPY: it streams the file through your psql session rather than requiring filesystem access on the database server. On managed services like RDS, Heroku Postgres, or Supabase, you cannot reach the server filesystem at all, so \copy is the only game in town.

Teacher voice. The single biggest psql habit that separates juniors from seniors: always start a destructive session with \set ON_ERROR_STOP on and always wrap multi-statement work in BEGIN; ... COMMIT;. A misplaced semicolon should not be allowed to commit half a migration.

For our running example, day one in psql looks like:

prod=> \timing on
prod=> \d orders
                                          Table "public.orders"
    Column      |            Type             | Nullable |              Default
----------------+-----------------------------+----------+----------------------------------
 id             | bigint                      | not null | nextval('orders_id_seq'::regclass)
 user_id        | bigint                      | not null |
 status         | order_status                | not null | 'pending'::order_status
 total_cents    | bigint                      | not null |
 placed_at      | timestamp with time zone    | not null | now()
 shipping_addr  | jsonb                       |          |
Indexes:
    "orders_pkey" PRIMARY KEY, btree (id)
    "orders_user_id_idx" btree (user_id)
    "orders_placed_at_brin" brin (placed_at)
Foreign-key constraints:
    "orders_user_fk" FOREIGN KEY (user_id) REFERENCES users(id)

You now know everything about this table without opening a migration file. That is the value of \d.


2) Schema design patterns — normalisation, types, constraints

The single biggest determinant of how your Postgres database behaves under load is the schema, and the single biggest determinant of schema quality is whether you used the right types. Postgres ships with a richer type system than almost any other open-source database, and using it well prevents whole categories of bugs that no amount of query tuning fixes later.

Start from third normal form. One fact in one place. The orders row carries user_id as a foreign key into users rather than embedding the email, because the user can change their email and you do not want to update a million order rows when they do. line_items lives in its own table rather than as a jsonb array on orders, because line items participate in aggregations (SUM(qty * unit_cents)), reports (per-SKU revenue), and audits, and a jsonb array would force every such query into a function call.

But denormalise deliberately when the read-write ratio justifies it. Many e-commerce schemas store a materialised total_cents on orders even though it is technically derivable from line_items. The reason: every checkout, order list, and revenue dashboard reads this number; recomputing it from line_items would be 10× more I/O. The trade is that you must keep it consistent — typically via a trigger or an application-layer invariant — and accept that drift will eventually happen and need reconciliation.

The Postgres-specific types that pay off the most:

  • jsonb for semi-structured payloads with unstable shape. Our shipping_address is jsonb because the shape varies by country (India needs PIN code and state, US needs ZIP, Japan needs prefecture). Index with GIN on the whole column for containment queries (@>), or use generated columns to extract specific fields into typed columns with their own B-tree indexes. The Stripe Sync Engine, contributed to by Stripe engineers, uses exactly this pattern: store the full Stripe object as jsonb, expose hot fields as generated columns, index those.
  • Arrays (text[], int[]) for small bounded sets. Tags on a product, role names on a user. Crosses the line into anti-pattern when the array grows unbounded — at that point it should be a child table.
  • Enums (order_status) for fixed-vocabulary columns with strong type safety. CREATE TYPE order_status AS ENUM ('pending','paid','shipped','cancelled'). Enums are faster to compare than text and prevent typos at insert time. They have one operational sharp edge: adding a value is cheap, but removing or reordering values rewrites every table that uses them, which is why many shops prefer a lookup table for vocabularies expected to churn.
  • citext for case-insensitive uniqueness. Our users.email uses citext so Alice@Foo.com and alice@foo.com collide on the unique index without LOWER() shenanigans in every query.
  • timestamptz always over timestamp. The "tz" version stores UTC and converts on the way out based on SET timezone. The plain timestamp type silently drops timezone information and is the source of every "off-by-five-hours" production bug.
  • Generated columns for derived facts: total_cents bigint GENERATED ALWAYS AS (subtotal_cents + tax_cents) STORED. The database guarantees the invariant; the application cannot accidentally write a wrong value.

On constraints, the rule is: encode every invariant the database can enforce, in the database. CHECK (total_cents >= 0) catches the negative-total bug at insert. FOREIGN KEY (user_id) REFERENCES users(id) prevents orphan orders. EXCLUDE USING gist (room_id WITH =, during WITH &&) on a bookings table prevents double-booking the same conference room — a constraint impossible to express in MySQL without serialising every booking through application code.

Mini-FAQ. "Should I use jsonb for everything to keep the schema flexible?" No, and the people who advocate for it have not paid the cost yet. jsonb defeats foreign-key checks, partial indexes, column-level statistics, and column-level access control. Use it where the shape genuinely varies and the data is read as a blob. Use real columns for everything that gets filtered, joined, or aggregated.


3) Common query antipatterns

Most slow queries are not exotic. They are one of five shapes, and once you recognise them in a code review you will catch them before they reach production.

The N+1. The Django ORM fetches 100 orders, then iterates them and accesses order.user, triggering 100 follow-up SELECT * FROM users WHERE id = ?. Total: 101 queries. Discovered in production via pg_stat_statements — look for queries with extremely high calls and low rows. The fix is select_related('user') (Django), Includes(:user) (Rails), or in raw SQL a JOIN or WHERE user_id = ANY($1). On a high-latency replica connection, switching from N+1 to a single CTE has been measured at ~100× improvement (12 ms vs 45 ms over 50,000 comments on PG16).

LIKE '%foo%' without trigram. A B-tree index on email cannot help a leading-wildcard search; the planner falls back to Seq Scan. The fix is pg_trgm: CREATE INDEX users_email_trgm ON users USING gin (email gin_trgm_ops). Now email LIKE '%@gmail.com' uses the index. The trade: the GIN index is 3× the build cost and amplifies writes, so add it only on columns that genuinely need substring search.

OR across unrelated columns. WHERE email = $1 OR phone = $2 cannot use both indexes simultaneously; the planner picks one or falls back to Seq Scan. The fix is UNION ALL of two separately-indexed subqueries.

IN with thousands of values. WHERE id IN ($1, $2, ... $5000) slows linearly. Past ~100 values, switch to WHERE id = ANY($1::bigint[]) and pass an array; past ~10,000 values, drop the list into a temp table or a VALUES clause and join.

Counting with SELECT COUNT(*) on a large table. Postgres has no metadata count; every count(*) reads every visible tuple. For dashboards, use pg_class.reltuples as an estimate, or maintain a materialised count via trigger.

For our threaded example: imagine "show me this user's last 20 orders with line item counts." The N+1 version issues 21 queries. The right version is one statement with a LEFT JOIN LATERAL or a CTE. We will write it in §5.


4) Reading EXPLAIN (ANALYZE, BUFFERS) without panic

Open a slow query with EXPLAIN (ANALYZE, BUFFERS) <query>. ANALYZE actually executes the query and reports real timings; BUFFERS reports cache hits and disk reads. The output looks intimidating; it is not.

Read it inside-out. The most-indented node runs first. Each node prints:

Node Type (cost=A..B rows=R width=W) (actual time=X..Y rows=AR loops=L)
  Buffers: shared hit=H read=R dirtied=D written=Wr

The two numbers that matter most are actual rows vs estimated rows and shared hit vs shared read. If estimated rows is 1 and actual rows is 4 million, the planner has bad stats and probably picked the wrong join strategy — ANALYZE the relevant tables. If shared read is large and shared hit is small, you are doing disk I/O the cache could be servicing — either the working set is bigger than shared_buffers, or the query is touching cold data.

The node types you must recognise:

EXPLAIN (ANALYZE, BUFFERS) for the user-orders join
─────────────────────────────────────────────────────

Hash Join  (cost=12.5..3450.7 rows=200 width=84) (actual time=0.4..18.2 rows=187 loops=1)
  Hash Cond: (o.user_id = u.id)
  Buffers: shared hit=412 read=18
  ->  Index Scan using orders_user_id_idx on orders o
        (cost=0.4..3420.0 rows=200 width=60) (actual time=0.03..17.4 rows=187 loops=1)
        Index Cond: (user_id = 9001)
        Buffers: shared hit=395 read=18
  ->  Hash  (cost=8.0..8.0 rows=1 width=24) (actual time=0.05..0.05 rows=1 loops=1)
        Buffers: shared hit=2
        ->  Index Scan using users_pkey on users u
              (cost=0.3..8.0 rows=1 width=24) (actual time=0.04..0.04 rows=1 loops=1)
              Index Cond: (id = 9001)
              Buffers: shared hit=2
Planning Time: 0.3 ms
Execution Time: 18.5 ms
  • Seq Scan — full table read. Fine for tables under ~10k rows or queries that touch >15% of rows; suspicious otherwise.
  • Index Scan — descend a B-tree, fetch matching heap tuples. The bread and butter.
  • Index-Only Scan — answer from the index alone, no heap visit. Requires the visibility map to say the page is all-visible (so recent vacuum). Watch for Heap Fetches: N — if N is non-zero, the visibility map is stale and you are paying heap I/O anyway.
  • Bitmap Heap Scan + Bitmap Index Scan — for moderate-selectivity predicates: build a sorted bitmap of matching ctids, then visit heap pages in physical order. Two to ten times faster than naive Index Scan when matches are scattered.
  • Nested Loop — for each row on the outer side, probe the inner side. Wins when the outer side is small; catastrophic when the outer side is 4M rows.
  • Hash Join — build a hash of the smaller side in memory, probe with the larger side. Wins on equi-joins with mid-size inputs.
  • Merge Join — both sides sorted, walk in lockstep. Wins when inputs are already sorted (e.g., both have an index on the join key).

Shailesh Mishra's published case study walks through a checkout query that went from 50 ms to 1.2 s; the diagnosis came from BUFFERS showing 240,000 shared read instead of shared hit after a partition flip changed which pages were hot. One word in EXPLAIN diagnosed what three days of network debugging had missed. pganalyze's nested-loop write-up shows the inverse pattern: an inner-side index miss inflating Buffers: read per loop, multiplied by 10,000 outer rows.

Teacher voice. Read EXPLAIN with two columns of attention: estimated-vs-actual to catch planner errors, and hit-vs-read to catch I/O. Everything else is decoration.


5) Subqueries, CTEs, window functions

For the user-with-line-item-counts query, three legitimate Postgres tools exist:

Subquery in SELECT (correlated):

SELECT o.id, o.status,
       (SELECT count(*) FROM line_items li WHERE li.order_id = o.id) AS items
FROM orders o WHERE o.user_id = 9001 ORDER BY o.placed_at DESC LIMIT 20;
Each outer row runs an inner query. Twenty orders, twenty subqueries. Fine at this scale; bad at 10,000.

CTE with join:

WITH user_orders AS (
  SELECT id, status, placed_at FROM orders
  WHERE user_id = 9001 ORDER BY placed_at DESC LIMIT 20
)
SELECT uo.id, uo.status, count(li.id) AS items
FROM user_orders uo LEFT JOIN line_items li ON li.order_id = uo.id
GROUP BY uo.id, uo.status, uo.placed_at ORDER BY uo.placed_at DESC;
Single round trip. The CTE materialises the 20 orders, then one indexed lookup per order on line_items.order_id. This is the Disqus-style threaded-comment pattern.

Window function:

SELECT id, status, placed_at,
       row_number() OVER (PARTITION BY user_id ORDER BY placed_at DESC) AS rn
FROM orders WHERE user_id = 9001;
Window functions compute over a sliding frame without collapsing rows. They shine for ranking ("top 3 orders per user"), running totals, lead/lag comparisons. Citus Data and Crunchy Data both publish production analytics patterns built almost entirely on window functions. PG15 made rank(), row_number(), and count() window evaluation materially faster.

The materialisation gotcha: pre-PG12, every CTE was an optimisation fence — Postgres materialised it as a temp result even when inlining would have been cheaper. PG12+ inlines simple non-recursive CTEs by default; force the old behaviour with WITH x AS MATERIALIZED (...) when you want to compute once and reuse, or with NOT MATERIALIZED to force inlining. AWS PG team's PGConf 2022 talk goes deep on when the inline-or-not decision flips the plan.

For our threaded UPDATE example, the surrounding analytics query is: "for user 9001, show the last 20 orders with item counts and a running total of spend." That is a JOIN + window function + CTE all at once, and writing it cleanly is the staff-level skill.

Mini-FAQ. "CTE or subquery?" CTE wins on readability and on multi-reference cases. Subquery wins on simple single-use cases where the planner does a better job inlining. Window function wins anywhere you would otherwise self-join the table to itself.


6) Bulk operations — COPY, batched upsert, UPDATE FROM

Row-by-row INSERT from an application is the slowest way to load data into Postgres. The Postgres docs benchmark COPY at roughly 4× faster than batched INSERT and ~50× faster than single-row INSERT in autocommit mode. Real numbers: one production case loaded 35M rows in 260 s via COPY versus 1,100 s via batched INSERT.

For your own bulk loads, the choices are:

  • Server-side COPY orders FROM '/tmp/data.csv' CSV HEADER — fastest, but requires filesystem access on the DB host. Not available on RDS, Heroku Postgres, Supabase, or any other managed service.
  • Client-side \copy orders FROM 'data.csv' CSV HEADER in psql — streams through the client connection. Slightly slower than server-side COPY but works on every managed service. The AWS RDS docs recommend this as the canonical bulk-load path.
  • Batched INSERT ... VALUES (...), (...), (...) ON CONFLICT DO UPDATE — use this for upserts where the row may already exist. Batch sizes of 500–5,000 rows per statement balance memory and parser overhead.
  • UPDATE orders SET status='cancelled' FROM unnest($1::bigint[]) AS t(id) WHERE orders.id = t.id — bulk update from an array; one statement updates thousands of rows.
  • DELETE FROM orders USING expired_orders e WHERE orders.id = e.id — bulk delete via a join.

Two production pitfalls. First, large transactions block autovacuum. A 10M-row COPY inside one transaction holds a snapshot for 20 minutes, during which dead tuples accumulate everywhere. The fix is to chunk: COPY 100k rows per transaction. Second, INSERT ... ON CONFLICT DO UPDATE takes a row-level lock per attempted insert; on a hot key, concurrent upserts serialise. Watch pg_stat_activity for Lock | tuple waits.

For our example, suppose accounting needs to mark 50,000 stale "pending" orders as "cancelled" overnight. The shape is:

WITH stale AS (
  SELECT id FROM orders
  WHERE status='pending' AND placed_at < now() - interval '30 days'
)
UPDATE orders SET status='cancelled'
WHERE id IN (SELECT id FROM stale);

One statement. One snapshot. Vacuum cleans up the dead tuples afterward.


7) Comparison table — same query, different schema/index choices

All numbers below are Postgres 16 on AWS RDS db.r6g.large (2 vCPU, 16 GB RAM, gp3 SSD), 50M-row orders, 200M-row line_items, warm cache unless stated. The query is "show user 9001's last 20 orders with item counts."

Schema/index shape Plan Rows fetched Buffers (hit/read) Latency (warm) Latency (cold)
No index on orders.user_id, no index on line_items.order_id Seq Scan + Hash Join 50M + 200M 850k hit / 30k read 4.2 s 38 s
B-tree on orders.user_id only Index Scan + Nested Loop on line_items 187 + 200M 320k hit / 12k read 1.8 s 12 s
B-trees on both join columns Index Scan + Index Scan in Nested Loop 187 + 612 412 hit / 18 read 18 ms 95 ms
Above + INCLUDE (status, placed_at) covering index Index-Only Scan + Index Scan 187 + 612 240 hit / 6 read 11 ms 60 ms
Above + BRIN on placed_at (partition pruning) Index-Only Scan + Bitmap Heap 187 + 612 195 hit / 4 read 9 ms 45 ms

The jump from row 1 to row 3 is six orders of magnitude on warm cache. The jump from row 3 to row 5 is small in absolute terms but visible on the p99 of a checkout API. The lesson: most query speedups come from the first correct index; further speedups come from covering indexes and avoiding the heap; the last 30% comes from BRIN-style locality on physically-ordered columns.


Where this lives in the wild

These come from live engineering blogs, vendor docs, and case studies. Two categories — schema/type choices at production scale, and EXPLAIN/query-plan diagnosis stories.

Schema, types, and constraints in production:

  • Stripe Sync Engine — uses jsonb storage with generated columns to balance Stripe API shape flexibility with indexed query performance on the synced Postgres tables.
  • Supabase — exposes citext, jsonb, pgvector, and dozens of other extensions as one-click enables; every project starts with a strict-typed Postgres rather than a generic blob store.
  • Notion — models every block (text, image, page, database row) as a row in a single block table; sharded as schema001.block, schema002.block across 480 logical shards on 32 physical instances.
  • Heroku Postgres — runs hundreds of thousands of customer databases with PG9.4+ as the floor, specifically to guarantee jsonb availability for every customer schema.
  • GitLab — uses partial unique indexes and CHECK constraints extensively across its 12 TiB monolith schema; query optimisation bot reviews schema changes before they merge.
  • PostGIS / OpenStreetMap — uses GiST exclusion constraints and geometry types to prevent overlapping land-parcel records at the database level rather than the application.
  • Crunchy Data customer guides — recommend jsonb @> with GIN indexes for flexible filter UIs and EXCLUDE USING gist for booking-system non-overlap invariants.
  • Datadog — stores host metadata in Postgres with composite B-tree indexes on (host_id, last_seen_at) to support 25,000 upserts/sec via INSERT ... ON CONFLICT DO UPDATE.
  • Cloudflare — runs internal Postgres for Stream and Images products with jsonb metadata columns indexed by GIN to support flexible product-side filter UIs.
  • Sentry — uses enum types for issue status (unresolved, resolved, ignored) and citext for user email uniqueness; ships autovacuum overrides per major write table.
  • Reddit (media metadata) — uses pg_partman for time-based partitioning of media rows, with check constraints encoding per-partition date ranges so the planner can prune partitions.
  • Supabase Stripe collaboration — pattern of storing the full Stripe object as jsonb and exposing hot fields (customer_id, status) as STORED generated columns with their own B-tree indexes.
  • Robinhood — uses covering indexes with INCLUDE clauses on brokerage tables so common dashboard queries answer from the index without heap visits.

EXPLAIN, query plans, and bulk operations:

  • GitLab Database Lab (Joe Bot) — runs EXPLAIN (ANALYZE, BUFFERS) against production-cloned data for every database-touching MR, surfacing plan regressions before merge.
  • GitLab production — enables auto_explain at 1% sampling on replicas to capture slow-query plans without paying the overhead on every query.
  • Shailesh Mishra (e-commerce checkout case study) — diagnosed a 50 ms → 1.2 s checkout regression purely from BUFFERS shared-read counts; three days of network debugging had missed it.
  • pganalyze blog — published the canonical EXPLAIN-BUFFERS nested-loop write-up showing how inner-side cache misses multiply by outer-side loops.
  • postgres.ai (Database Lab) — productised the "EXPLAIN against a thin clone of production data" workflow that GitLab made famous.
  • Mattermost — published a "1000× faster Postgres query" case study via EXPLAIN-driven index addition on a posts filter query.
  • Crunchy Data analytics blog — production patterns for window-function-driven dashboards on Postgres, including PG15 rank/row_number speedups.
  • Citus Data blog — recursive CTE patterns for hierarchical data (trees, threaded comments), benchmarked at ~100× faster than N+1 over high-latency connections.
  • Disqus — uses recursive CTEs over a parent-pointer column to fetch threaded comment trees in one query rather than N+1.
  • Heroku Postgres bulk-load docs — recommend client-side \copy for managed-database imports since the server filesystem is inaccessible.
  • AWS RDS Postgres documentation — explicitly recommends \copy (psql client-side) over COPY ... FROM '/path' because customers cannot reach RDS filesystem paths.
  • One-off production backfill case study (denhox.com) — chunked a 9M-row \copy to avoid statement-timeout on the production server, with UNLOGGED-staging table for atomic swap.
  • pgMustard — productised EXPLAIN-plan visualisation that highlights the buffer-read hotspots a human reader would miss in the raw text output.
  • TigerData (Timescale) blog — covering-index and BRIN patterns documented for time-series workloads; same query plans transfer to any append-mostly Postgres table.
  • OneUptime engineering blog — published COPY-vs-batched-INSERT benchmarks at 35M rows: 260 s vs 1,100 s on identical hardware.

Pause and recall

  1. Which two ~/.psqlrc settings would you call non-negotiable for production work, and why?
  2. When does jsonb plus a STORED generated column beat a plain typed column, and when does it lose?
  3. Name two ways to detect an N+1 pattern from inside psql without instrumenting the application.
  4. In EXPLAIN (ANALYZE, BUFFERS), which pair of numbers tells you the planner has bad stats, and which pair tells you you are doing cold-cache I/O?
  5. Why is a Nested Loop a great choice in some plans and a catastrophe in others?
  6. What does WITH x AS MATERIALIZED (...) do that plain WITH x AS (...) does not, on Postgres 12+?
  7. On AWS RDS, why is \copy the only practical bulk-load path?
  8. For our threaded UPDATE orders SET status='paid' WHERE id=1234, which EXPLAIN plan would you expect, and which two Buffers numbers tell you the index actually helped?

Interview Q&A

Q1. Walk me through how you would diagnose "this query was fast yesterday, slow today" without a profiler. A. Reproduce the query under EXPLAIN (ANALYZE, BUFFERS). Compare estimated vs actual rows on each node — a 100× mismatch points at stale statistics; run ANALYZE on relevant tables. Compare shared hit vs shared read — high reads suggest the working set fell out of cache, often because a long transaction blocked autovacuum and bloat pushed pages out. Then check pg_stat_user_tables.n_dead_tup and pg_stat_activity for long-running transactions. Plan flips around the selectivity crossover are the common root cause; CREATE STATISTICS for correlated columns is the fix. Common wrong answer to avoid: "Add an index." Adding indexes without first confirming the plan change wastes write throughput and often doesn't help.

Q2. When would you reach for jsonb versus a normalised child table? A. jsonb wins when the shape genuinely varies per row, the data is read mostly as a blob, and only a few fields need indexing (which you handle with generated columns or GIN). A child table wins when the data has stable shape, participates in aggregations or joins, or needs foreign-key integrity. The Stripe Sync Engine pattern — store the API object as jsonb, expose hot fields as STORED generated columns with B-tree indexes — is the production hybrid. Common wrong answer to avoid: "Always use jsonb for schema flexibility." jsonb defeats foreign keys, column statistics, and column-level access control; flexibility has a real cost.

Q3. Explain what EXPLAIN (ANALYZE, BUFFERS) adds over plain EXPLAIN. A. Plain EXPLAIN shows the planner's estimated plan. ANALYZE actually executes the query and reports real timings and row counts per node, exposing planner misjudgements. BUFFERS reports shared-buffer hits, disk reads, dirty pages, and writes per node, exposing I/O bottlenecks the timing alone might attribute to "the query is just slow." A query with 18 ms execution but 250k shared reads is a cold-cache problem, not a plan problem; you can't tell that without BUFFERS. Common wrong answer to avoid: "ANALYZE gives row counts." It gives much more — actual timings, loops, and the shape mismatch with estimates that is the most useful planner-debugging signal.

Q4. A teammate says CTEs are always slower than subqueries in Postgres. Right or wrong? A. Wrong on PG12+. Before PG12, every CTE was an optimisation fence — materialised even when inlining would be cheaper. PG12 inlines simple non-recursive CTEs by default, so most CTEs now plan identically to the equivalent subquery. You can force the old behaviour with MATERIALIZED (useful when a CTE is referenced multiple times and recomputation is expensive) or with NOT MATERIALIZED to override the default for recursive cases. Common wrong answer to avoid: "CTEs are always materialised." That was true through PG11 and the muscle memory persists, but PG12+ changed the rule.

Q5. How would you bulk-update 50,000 rows that match a complex predicate without holding a long transaction? A. Compute the target ids first into a staging table or temp table, then update in chunks of, say, 5,000 ids per transaction, sleeping briefly between batches to let autovacuum keep up. Use UPDATE orders SET ... WHERE id = ANY($1::bigint[]) with an array parameter; this gives the planner a single executable plan rather than 50,000 prepared statements. Never wrap the whole thing in one transaction on a busy table — the snapshot would block autovacuum across the entire database for the duration. Common wrong answer to avoid: "Just run one big UPDATE in a transaction." It works on a quiet system; on a busy one it causes bloat and replica lag.

Q6. What is wrong with WHERE id IN (1, 2, ..., 50000) and how do you fix it? A. The parser builds an in-list of 50,000 nodes; planning becomes the bottleneck, prepared-statement caches blow up, and the EXPLAIN output is unreadable. Fix one of two ways: pass an array — WHERE id = ANY($1::bigint[]) — which Postgres treats as a single parameter and plans once; or, for very large sets, insert the ids into a temporary table and JOIN against it, letting the planner pick a hash or merge join based on cardinality. Common wrong answer to avoid: "It's the same as a small IN list, just slower." It is qualitatively different — planning cost is non-linear, and statement caches behave badly.

Q7. Walk me through a \copy vs COPY decision for a 10M-row import into RDS Postgres. A. On RDS the database server filesystem is not accessible to the customer, so server-side COPY FROM '/path' is not an option. Use client-side \copy from a psql session on an application host with network access to RDS. Chunk the file at the application level (e.g., 100k rows per statement) so each batch fits inside the default statement_timeout and doesn't hold a snapshot that blocks autovacuum. Consider staging into an UNLOGGED table and swapping with ALTER TABLE ... SET LOGGED after — saves on WAL volume during load. Common wrong answer to avoid: "Use COPY for speed." It is faster but unavailable on managed services; the realistic comparison on RDS is \copy vs batched INSERT, and \copy wins by ~4×.

Q8. When would you add a covering (INCLUDE) index, and what are the costs? A. Add INCLUDE columns when a frequently-run query selects a small fixed set of columns alongside the indexed predicate, and the table is heavily-vacuumed enough that the visibility map keeps Index-Only Scans cheap. The benefit is the heap visit is skipped entirely. The costs: the index gets larger (so cache pressure rises), every UPDATE that touches an INCLUDE column must update the index entry (so writes amplify), and stale visibility map turns Index-Only Scans back into Index Scans with Heap Fetches: N. Robinhood and TigerData both publish production guides on choosing which columns to include. Common wrong answer to avoid: "Include all the columns the query selects." Big indexes hurt write throughput and cache pressure; include the minimum set that the hot query actually needs.


Apply now (10 min)

Step 1 — model the diagnosis. Here is what one row of a "slow query triage log" looks like for our running example. Copy the shape.

Field Value
Query SELECT o.* FROM orders o WHERE o.user_id = 9001 ORDER BY placed_at DESC LIMIT 20
Reported latency 1.8 s p95
EXPLAIN plan Index Scan on orders_user_id_idx → Sort → Limit
Estimated vs actual rows est 200, actual 1,847 (9× off)
Buffers: shared hit=18,400 read=2,100
Diagnosis Stats stale after bulk load; planner picked Index Scan but post-sort is expensive
Fix ANALYZE orders; add covering index (user_id, placed_at DESC) INCLUDE (status, total_cents)
Re-measured 11 ms p95, hit=240 read=6

Step 2 — your turn. Pick a query in your own service that runs more than once per second. Run EXPLAIN (ANALYZE, BUFFERS) on it. Fill the row above with your numbers. For at least one node, write down (a) the estimated-vs-actual row mismatch, (b) hit-vs-read ratio, (c) whether the join strategy looks right for the cardinalities. If anything looks off, propose one change and re-measure.

Step 3 — sketch from memory. Close this page. On paper, draw the Hash Join plan diagram from §4 — the inside-out tree of nodes with their cost, actual time, and Buffers annotations. Label which two number-pairs you would look at first to decide "planner problem" versus "I/O problem." If you can produce this cold, you can read any EXPLAIN output an interviewer hands you.


Bridge. Now you can sit at psql, design a schema that holds, write queries that don't N+1, and read plans without panic. What you cannot yet do is keep a busy cluster healthy under continuous write load — that is the autovacuum, replication, and connection-pooling story. → 03-vacuum-replication-connection-pooling.md