12. Connection pooling — too many people, too few counters¶

~14 min read. Databases slow down before queries do, because connection setup is expensive.

Built on the ELI5 in 00-eli5.md. The reservation desk — a limited number of desks serving a large crowd — helps many requests share a small, disciplined connection set.

1. A new connection does more work than most people imagine¶

See. Opening a database connection is not a tiny function call. It usually means a TCP handshake, maybe a TLS handshake, and authentication. Then the server allocates memory, process state, and bookkeeping. All of that happens before your actual query begins.

app thread
   │
   ├─→ TCP connect
   ├─→ TLS setup
   ├─→ auth check
   └─→ query starts

Now imagine this happening for every single request. Latency rises immediately. CPU burns on connection setup instead of useful queries. Database memory fills with idle sessions. Soon the system looks busy without real business work finishing. That is why pooling exists. Worked example. Suppose TCP plus TLS plus auth takes 18 milliseconds. The actual query takes only 6 milliseconds. Without pooling, total request time becomes 24 milliseconds. Connection setup is therefore 18 / 24 = 75% of the path. Most of the delay is ceremony, not data work. Simple, no? Databases also dislike huge connection counts. PostgreSQL gives each backend process real memory overhead. MySQL also spends per-session buffers and thread resources. Ten thousand mostly idle connections are not free insurance. They are pressure on RAM, scheduler, and failure recovery. The reservation desk keeps that crowd organized.

2. A pool reuses established sessions instead of reopening them¶

Look at the mental model. Your app has many request threads. The database should have fewer active connections than request threads. A connection pool sits between them. Threads borrow a ready connection, use it briefly, then return it. No fresh handshake happens on every request.

100 app requests ─→ [ pool of 20 live connections ] ─→ database

This does two useful things together. First, it cuts setup overhead sharply. Second, it limits concurrency at the database boundary. That second point matters a lot. Unlimited concurrency usually hurts databases more than it helps. The pool becomes a deliberate backpressure point. Suppose 200 web requests arrive together. Your pool size is 25. At most 25 can hit the database immediately. The other 175 must wait, fail fast, or use cached data. That sounds restrictive. Actually, it protects the database from collapse. A stable queue is better than a connection storm. One caution here. A pool is not a speed booster by itself. If queries are slow, a bigger pool may worsen things. You will simply run more slow queries at once. The reservation desk is for controlled reuse, not for hiding an unhealthy database.

3. Pooling tools solve different parts of the same problem¶

In Java applications, HikariCP is a common client-side pool. It lives inside the application process. Threads borrow JDBC connections from HikariCP directly. That gives low overhead and good tuning knobs. It is excellent when one service owns its own pool behavior.

PgBouncer is a PostgreSQL proxy pooler. It sits between many app instances and the database. Clients connect to PgBouncer. PgBouncer multiplexes them onto fewer PostgreSQL server connections. This is powerful when many app replicas would otherwise create huge session counts.

ProxySQL plays a similar role for MySQL ecosystems. It can pool, route reads and writes, and apply query rules. That makes it useful in large MySQL fleets. It is not only a pooler. It is also a traffic-control layer.

apps ─→ HikariCP or local pool ─→ PgBouncer / ProxySQL ─→ database

Now what is the subtle part? Not every pooling mode is identical. PgBouncer offers session pooling, transaction pooling, and statement pooling. Transaction pooling reuses server connections between transactions. That gives better fan-in. But session-level features may break unexpectedly. Temporary tables, prepared statements, and session variables need careful handling. So choose mode by compatibility, not only efficiency.

A simple rule helps. Use HikariCP for disciplined per-service borrowing. Use PgBouncer or ProxySQL when many app instances need global connection control. Sometimes you use both. That is common in large deployments.

4. Pool sizing is math first, guesswork second¶

Students ask for a magic number. There is no universal number. But there is a practical starting formula. Pool size should follow database capacity and query time, not only application thread count.

A useful first estimate is this.

pool size ≈ target concurrent DB work per instance
instance pool total ≤ safe DB connection budget

Let us make it concrete. Suppose the database safely handles 120 active connections. You run 6 application instances. Reserve 24 connections for admin jobs, migrations, and background workers. Usable budget becomes 120 - 24 = 96. Per-instance starting pool becomes 96 / 6 = 16. That is a sane first number.

Now add throughput thinking. Suppose each instance handles 320 requests per second. Only 25% of requests need the database. DB-bound request rate per instance is 320 × 0.25 = 80 per second. Average connection hold time is 40 milliseconds, or 0.04 seconds. Little's Law style concurrency estimate becomes 80 × 0.04 = 3.2. Round up for bursts, maybe to 8 or 10. That tells you a pool of 50 is unnecessary.

So what do we learn? One formula comes from global budget. Another comes from local concurrency demand. Choose the smaller sensible value, then load test. That is better than copying defaults blindly.

A practical checklist: - estimate active DB work per instance, - cap total across all instances, - leave headroom for ops, - and verify with real latency percentiles. Simple, no?

5. Connection limits, timeouts, and failure modes need discipline¶

Now what breaks in production? Pool exhaustion is the first classic problem. All connections are busy. New requests wait too long. Threads pile up. Latency stretches across the whole service. Then retries create even more pressure. Very dangerous spiral.

requests ↑ → pool wait ↑ → timeouts ↑ → retries ↑ → requests ↑

This is why timeout settings matter. Set a borrow timeout. Set a query timeout. Set an idle timeout. Set a max lifetime shorter than server-side connection killing. Otherwise you keep zombie or half-dead sessions around.

Suppose your pool has 20 connections. Average query time is 50 milliseconds. If all 20 are busy, approximate steady throughput ceiling is: 20 / 0.05 = 400 DB operations per second. If the app sends 700 DB operations per second, then backlog starts growing immediately. The fix is not automatically a pool of 35. Maybe the query needs indexing. Maybe the cache miss rate is too high. Maybe the service should batch writes.

Connection limits also force architecture conversations. If 40 microservices each want 30 direct connections, that is 1,200 potential sessions. Many databases will suffer there. A shared proxy layer becomes almost mandatory. The reservation desk exists so the crowd behaves. The reservation desk also tells some people to wait. That is a feature, not a bug.

Where this lives in the wild¶

Shopify Rails infrastructure engineer uses connection pools carefully because many web workers share one PostgreSQL fleet.
Stripe backend engineer uses PgBouncer-style pooling to avoid exploding PostgreSQL session counts during traffic bursts.
Uber Java service engineer tunes HikariCP pool sizes from query latency and instance concurrency, not thread counts alone.
GitLab database reliability engineer watches pool exhaustion and idle session buildup across many application nodes.
Pinterest MySQL platform engineer can use ProxySQL to pool, route, and protect database backends from client storms.

Pause and recall¶

Why can connection setup dominate total latency for short queries?
Why is a pool also a backpressure mechanism, not only a reuse trick?
When is PgBouncer or ProxySQL more useful than only HikariCP?
Why can a larger pool make a slow database feel worse?

Interview Q&A¶

Q: Why use a connection pool and not open one connection per request? A: New connections pay TCP, TLS, auth, and server allocation cost repeatedly. Reusing warm sessions cuts latency and protects the database from connection storms.

Common wrong answer to avoid: "Because opening connections is slightly slower" — the issue is not slight overhead. It can dominate latency and resource usage.

Q: Why choose PgBouncer and not only app-side pooling for PostgreSQL? A: PgBouncer helps many application instances share a smaller server-side connection set. That becomes critical when fleet-wide session counts would otherwise exceed PostgreSQL comfort limits.

Common wrong answer to avoid: "Because PostgreSQL has no native pooling" — the deeper reason is centralized control over aggregate connection fan-in.

Q: Why size a pool from database budget and query time, not from CPU cores alone? A: Pool demand depends on how many requests actually touch the database and how long each holds a connection. CPU count alone misses the database bottleneck entirely.

Common wrong answer to avoid: "Because more cores always mean more connections" — concurrency at the database boundary follows workload behavior, not only hardware shape.

Q: Why is pool exhaustion dangerous even when the database is still up? A: Waiting requests consume threads, trigger retries, and spread latency upstream. The system can enter a timeout storm before the database itself fully crashes.

Common wrong answer to avoid: "Because users see errors" — errors matter, but the broader risk is cascading failure through queues and retries.

Apply now (5 min)¶

Exercise: Your PostgreSQL cluster safely supports 90 application connections. You run 5 app instances and reserve 15 for background work. Compute the starting per-instance pool budget. Then assume each instance sends 60 DB-bound requests per second with 30-millisecond hold time. Estimate concurrency demand and compare both numbers.

Sketch from memory: draw request threads, one reservation desk, a small pool, and the database behind it. Then annotate where borrow timeout and query timeout act.

Bridge. Once systems share connections across many nodes, the next question is harder: when nodes disagree, who should the application trust? → 13-cap-theorem-in-practice.md