03. TLS, caching, rate limiting — the production surface

~12 min read. nginx in development is forgiving. nginx in production reveals TLS subtleties, cache invalidation pain, rate-limit tuning, and a long tail of "why is this slow" diagnostics. This chapter is the operational catalogue.

Builds on: 02-configs-locations-day-to-day.md.

The previous chapter showed the config surface. This chapter is what production teaches — TLS termination at scale, caching with correct invalidation, rate limiting against bursts, and the gotchas that surface only under load.

1) TLS termination — the right defaults

server {
    listen 443 ssl http2;
    server_name example.com;

    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    # Protocols and ciphers
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers off;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305;

    # Session caching — avoid handshake on reconnect
    ssl_session_cache shared:SSL:50m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;

    # OCSP stapling — clients don't have to ask the CA
    ssl_stapling on;
    ssl_stapling_verify on;

    # HSTS
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
}

Three details that matter:

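Drop TLS 1.0 and 1.1. Current browsers no longer support them; keeping them enabled adds risk without benefit.
ssl_session_cache shared:SSL:50m. A full handshake costs 1-2 RTT plus asymmetric crypto. With session caching, a returning client resumes its session with an abbreviated handshake instead of a full one. The cache holds about 4000 sessions per MB, so 50 MB covers ~200K sessions.
OCSP stapling. Without it, a client that checks revocation has to query the CA's OCSP responder itself, which can add 100-500ms to the first connection. With stapling, nginx fetches OCSP responses periodically and includes them in the handshake. Stapling needs a resolver directive so nginx can look up the responder, and ssl_stapling_verify needs the issuer chain configured via ssl_trusted_certificate. It also only works while the CA operates an OCSP responder; Let's Encrypt has retired its OCSP service, so check your CA before relying on it.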

For Let's Encrypt, certbot --nginx obtains and installs the certificate, and the certbot package typically sets up a cron job or systemd timer that renews it. For wildcard certs or non-standard CAs, manage renewal explicitly.
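Two pieces usually accompany the TLS server block above: a port-80 redirect, so plain-HTTP links still land on TLS, and the resolver and trust chain that OCSP stapling depends on. A minimal sketch, assuming the same example.com certificate directory; the resolver address and the chain.pem path are assumptions to adapt.

```nginx
# Redirect all plain-HTTP traffic to HTTPS.
server {
    listen 80;
    server_name example.com;
    return 301 https://$host$request_uri;
}

# In the http context or inside the TLS server block:
# resolver lets nginx look up the CA's OCSP responder hostname;
# ssl_trusted_certificate supplies the issuer chain for ssl_stapling_verify.
resolver 1.1.1.1 valid=300s;   # assumption: any resolver reachable from the host
resolver_timeout 5s;
ssl_trusted_certificate /etc/letsencrypt/live/example.com/chain.pem;
```

After a reload, openssl s_client with the -status flag shows whether a stapled OCSP response is actually being sent.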

2) HTTP/2 and HTTP/3

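listen 443 ssl http2; enables HTTP/2. (From nginx 1.25.1 the http2 listen parameter is deprecated in favour of a separate http2 on; directive.) Benefits: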

Multiplexed streams over one TCP connection — no head-of-line blocking at the HTTP layer.
Server push (deprecated in browsers; ignore).
Header compression (HPACK).

For HTTP/3 (QUIC over UDP):

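listen 443 quic reuseport;   # HTTP/3 over UDP
listen 443 ssl;              # HTTP/1.1 and HTTP/2 over TCP
http2 on;                    # replaces the deprecated "listen ... http2" parameter (nginx 1.25.1+)
add_header Alt-Svc 'h3=":443"; ma=86400';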

HTTP/3 requires nginx 1.25+ built with the HTTP/3 module (--with-http_v3_module) and TLS 1.3 enabled; UDP port 443 must also be open in the firewall. The Alt-Svc header tells clients HTTP/3 is available, and browsers typically switch on a later connection. Useful for lossy mobile networks; less critical for low-loss wired connections.

3) Caching — the proxy_cache layer

nginx can cache responses from upstream:

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:10m
                 max_size=10g inactive=60m use_temp_path=off;

server {
    location / {
        proxy_pass http://app;
        proxy_cache app_cache;
        proxy_cache_key "$scheme$host$request_uri";
        proxy_cache_valid 200 301 10m;
        proxy_cache_valid 404 1m;
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
        proxy_cache_background_update on;
        proxy_cache_lock on;

        add_header X-Cache-Status $upstream_cache_status;
    }
}

Key directives:

proxy_cache_key — the cache key. The default is $scheme$proxy_host$request_uri; append a variable such as $cookie_user (the value of a cookie named user) for per-user entries if needed.
proxy_cache_valid — TTL per response status.
proxy_cache_use_stale — serve stale content from cache when upstream fails. Resilience pattern.
proxy_cache_background_update — refresh an expired entry in the background while serving the stale copy (this needs updating in proxy_cache_use_stale); the user gets fresh or stale content but never waits on the refresh.
proxy_cache_lock — when an entry is missing, only one request per key goes to upstream; the others wait for its response (up to proxy_cache_lock_timeout). Prevents a thundering herd on a cold cache.

The X-Cache-Status response header reveals what nginx did: HIT, MISS, EXPIRED, STALE, UPDATING, REVALIDATED, or BYPASS. Essential for debugging cache behaviour.
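For pages that must not be shared between users, one common pattern is to skip the cache whenever a session cookie is present. A sketch, assuming the session cookie is named user (hence $cookie_user) and reusing the app_cache zone from the example above. Note that proxy_cache_bypass only skips the cache lookup; proxy_no_cache is what stops the response from being stored.

```nginx
location / {
    proxy_pass http://app;
    proxy_cache app_cache;
    proxy_cache_key "$scheme$host$request_uri";
    proxy_cache_valid 200 301 10m;

    # A non-empty, non-"0" value marks the request as per-user.
    proxy_cache_bypass $cookie_user;   # do not answer from the cache
    proxy_no_cache     $cookie_user;   # do not save the response either

    add_header X-Cache-Status $upstream_cache_status;
}
```

Requests carrying the cookie should then report X-Cache-Status: BYPASS, while anonymous requests report HIT after the first MISS.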

Invalidation. Open-source nginx has no built-in purge. Patterns:

Short TTLs and tolerance for staleness.
Cache key versioning — bump a version in the key when content changes.
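External purge — the third-party ngx_cache_purge module, or the proxy_cache_purge directive in the commercial NGINX Plus. Some teams use Varnish for the cache layer when invalidation is critical.
Age-based purge — find /var/cache/nginx -type f -mmin +60 -delete from cron. This deletes by file age, not by content, so treat it as a blunt fallback; the inactive= parameter of proxy_cache_path already evicts entries that have not been requested recently.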
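Cache key versioning can be sketched with a single variable. The variable name $cache_version and its value are hypothetical; bumping the value and reloading makes every old entry unreachable, and nginx evicts the orphaned files once they pass the inactive window.

```nginx
server {
    # Hypothetical version tag: change "v2" to "v3" and reload
    # to start from a cold cache.
    set $cache_version "v2";

    location / {
        proxy_pass http://app;
        proxy_cache app_cache;
        proxy_cache_key "$cache_version$scheme$host$request_uri";
        proxy_cache_valid 200 301 10m;
    }
}
```

This is coarse: it invalidates the whole zone at once, so it suits deploy-time content changes better than per-object updates.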

4) Rate limiting

nginx supports per-IP (or per-anything) rate limits:

limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $http_authorization zone=auth_api:10m rate=100r/s;

server {
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        # ...
    }

    location /api/auth/ {
        limit_req zone=auth_api burst=5 nodelay;
        # ...
    }
}

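rate=10r/s — 10 requests per second sustained, enforced as one request per 100 ms.
burst=20 — allow up to 20 requests beyond the sustained rate to queue.
nodelay — forward queued burst requests immediately instead of pacing them. Without it, nginx queues the burst and releases it at the sustained rate. In both cases, requests beyond the burst are rejected (503 by default, configurable with limit_req_status).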
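Between queueing everything and nodelay there is a middle setting, the delay= parameter (nginx 1.15.7+), which serves the first part of a burst immediately and paces the rest. A sketch using the api zone defined above; the numbers are illustrative, and returning 429 is a choice rather than the default.

```nginx
location /api/ {
    # Up to 8 excess requests pass without delay, the next 12 are paced
    # at the zone rate, and anything beyond burst=20 is rejected.
    limit_req zone=api burst=20 delay=8;

    limit_req_status 429;        # default is 503
    limit_req_log_level warn;    # rejections at warn; delays one level lower

    proxy_pass http://app;
}
```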

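Zone keys can be any nginx variable: $binary_remote_addr (per IP), $http_authorization (per token), $arg_api_key (per API key in URL), $request_uri (per URI, query string included). Requests whose key evaluates to an empty string are not counted, so a zone keyed on $http_authorization does not limit requests that omit the header.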

Connection limits (limit_conn_zone) cap concurrent connections per key — useful against slowloris-class attacks where one IP holds many slow connections.

limit_conn_zone $binary_remote_addr zone=perip:10m;
limit_conn perip 10;

The trade-off: limits that are too strict punish legitimate bursts (mobile apps that batch requests on resume); limits that are too loose let abuse through. Tune per endpoint.
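One way to tune without punishing trusted traffic is to give it an empty key, since nginx does not count requests whose key is empty. A sketch, assuming 10.0.0.0/8 is the internal network; the variable names $limited and $limit_key and the zone name api_ext are hypothetical.

```nginx
# 0 for trusted internal addresses, 1 for everyone else.
geo $limited {
    default     1;
    10.0.0.0/8  0;
}

# An empty key means the request is not rate limited.
map $limited $limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=api_ext:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api_ext burst=20 nodelay;
        proxy_pass http://app;
    }
}
```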

5) Logging at scale

Production logs grow fast. A 1000-RPS app produces 86M lines per day; access logs alone can be 30 GB/day. Patterns:

Log rotation. logrotate (or journald, if nginx logs to the journal) handles size-based and time-based rotation. After rotating files, have nginx reopen its logs with nginx -s reopen, which sends SIGUSR1 to the master process.
JSON logs. Easier to parse, ship, and query in aggregators (Elastic, Loki, ClickHouse).
Conditional logging. Skip routine health checks (access_log off; inside location = /healthz).
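Sampling. For very high RPS, log 1-in-N requests by deriving a flag from $request_id in a map (for example, matching a trailing 0 keeps about 1 in 16) and passing that flag to access_log through its if= parameter. Sample successful requests but capture all errors.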
Centralised shipping. journald, Fluent Bit, Vector, or Filebeat ship logs to the aggregator.
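The JSON, conditional, and sampling patterns above can be combined in one place. A sketch, assuming a log path of /var/log/nginx/access.json; the format name json_access and the variables $is_error, $sampled, and $loggable are hypothetical. $request_id is 32 hex characters, so matching a trailing 0 keeps roughly 1 request in 16.

```nginx
# upstream_response_time is quoted because it can be "-" or a
# comma-separated list when several upstreams were tried.
log_format json_access escape=json
    '{"time":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"method":"$request_method",'
    '"uri":"$request_uri",'
    '"status":$status,'
    '"request_time":$request_time,'
    '"upstream_response_time":"$upstream_response_time"}';

# 1 for any 4xx/5xx response.
map $status $is_error {
    ~^[45]  1;
    default 0;
}

# 1 for roughly one request in 16.
map $request_id $sampled {
    ~0$     1;
    default 0;
}

# Log when the request is an error or falls in the sample.
map "$is_error$sampled" $loggable {
    "00"    0;
    default 1;
}

server {
    access_log /var/log/nginx/access.json json_access if=$loggable;

    location = /healthz {
        access_log off;          # never log routine health checks
        return 200 "ok\n";
    }
}
```

Because every error is kept, error counts in the aggregator are exact, while success counts need to be multiplied by the sampling factor.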

6) The 502, 504, and 499 — distinguishing failure modes

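Three status codes that nginx records when a proxied request fails:

502 Bad Gateway — the connection to upstream failed (refused, reset) or upstream sent an invalid response.
504 Gateway Timeout — upstream did not accept the connection within proxy_connect_timeout, or accepted it but did not respond within proxy_read_timeout.
499 Client Closed Request — the client gave up before nginx could respond. nginx-specific: it appears only in the access log, is never sent to a client, and is not in the HTTP RFCs.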

The distinguishing pattern:

502 spike → upstream is down or unreachable; check upstream health.
504 spike → upstream is up but slow; check upstream's per-request time.
499 spike → clients are timing out; could be client-side (mobile networks) or server-side (responses too slow).

The 499 is sneaky: it doesn't mean nginx or upstream failed; it means the request was wasted because the client left. High 499 rates often signal that upstream latency has crossed clients' patience threshold. Fixing upstream latency drops 499s.
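Telling the three apart is easier when the access log records upstream timings next to the status. A sketch, assuming the app upstream used earlier; the format name upstream_diag, the log path, and the timeout values are illustrative.

```nginx
log_format upstream_diag
    '$remote_addr [$time_local] "$request" status=$status '
    'rt=$request_time uct=$upstream_connect_time '
    'urt=$upstream_response_time us=$upstream_status ua=$upstream_addr';

server {
    access_log /var/log/nginx/upstream_diag.log upstream_diag;

    location / {
        proxy_pass http://app;
        proxy_connect_timeout 5s;    # no connection within 5s => 504
        proxy_read_timeout    30s;   # no data from upstream for 30s => 504
    }
}
```

With this format, a 504 line typically shows urt close to the read timeout, a 502 from a refused connection shows a very small urt, and a 499 shows rt at whatever point the client hung up.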

7) Slowloris and connection-level defences

A slowloris attack opens many connections, sends headers slowly, and never completes the request. Connection slots fill up; legitimate users can't connect.

Defences:

client_header_timeout 10s;     # max time to receive the full request header
client_body_timeout 10s;       # max gap between two successive reads of the request body
send_timeout 10s;              # max gap between two successive writes to the client
limit_conn perip 10;           # max concurrent connections per IP (needs the perip zone from section 4)

Combined with the buffering layer (chapter 01) and the event-loop model that doesn't waste a thread per slow connection, nginx is relatively resistant to slowloris by default. The above timeouts shore up the worst cases.

8) Geo-blocking and the geo module

geo $allowed_country {
    default 0;
    1.0.0.0/24 1;      # specific IP range
    103.0.0.0/8 1;     # illustrative only: a /8 spans many countries, not just India
    49.0.0.0/8 1;      # illustrative only
}

server {
    location /admin/ {
        if ($allowed_country = 0) {
            return 403;
        }
        # ...
    }
}

The geo directive maps client IP ranges to a variable using a static table loaded with the config. For lookups against a maintained GeoIP database, use the third-party geoip2 module (newer) or the geoip module (legacy). Common uses: admin endpoint access restriction, content variation by region, regulatory compliance.
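For country-level decisions backed by a real database, a sketch using the third-party ngx_http_geoip2_module might look like the following. It follows that module's README syntax; the module must be compiled in or loaded, the database path is an assumption, and the variable name $geoip2_country_code is hypothetical. Verify the directive details against the module version you install.

```nginx
# Requires the third-party ngx_http_geoip2_module and a MaxMind country database.
geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
    $geoip2_country_code country iso_code;
}

# Allow only requests that resolve to India (ISO code IN).
map $geoip2_country_code $allowed_country {
    default 0;
    IN      1;
}

server {
    location /admin/ {
        if ($allowed_country = 0) {
            return 403;
        }
        proxy_pass http://app;
    }
}
```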

9) Health checks and graceful shutdown

location = /healthz {
    access_log off;
    default_type text/plain;   # sets Content-Type; add_header would send a second Content-Type header
    return 200 "ok\n";
}

location = /ready {
    access_log off;
    # Check upstream actually responds — proxy to a real path
    proxy_pass http://app/health;
    proxy_read_timeout 5s;
}

/healthz is nginx-side: always-up, returns 200 if nginx itself is alive. Used by load balancer health checks.

/ready (or /readyz) is upstream-side: returns 200 only if upstream is reachable. Used by orchestrators (Kubernetes) to decide whether to route traffic. Distinguishing these two health endpoints is foundational for safe rolling deploys.

Graceful shutdown for nginx:

nginx -s quit

vs. the brutal:

nginx -s stop

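quit lets workers finish in-flight requests before exiting. stop terminates immediately. Always use quit in production. The orchestrator should send SIGQUIT on shutdown, because nginx treats SIGTERM, the usual default stop signal, as a fast shutdown. In Kubernetes, make sure the container's stop signal is SIGQUIT or run nginx -s quit from a preStop hook, and set terminationGracePeriodSeconds to allow time for the drain.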
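A graceful quit waits for open connections, and long-lived ones (WebSockets, slow downloads) can hold a worker for a long time. The worker_shutdown_timeout directive caps that wait. A sketch; the 30s value is an assumption and should sit below the orchestrator's grace period.

```nginx
# Main (top-level) context of nginx.conf.
# Once a graceful shutdown or reload begins, close any connections
# still open after 30s so the old workers can exit.
worker_shutdown_timeout 30s;
```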

10) Observability — the metrics that matter

Per-request:

Status code distribution.
Per-endpoint latency (p50, p95, p99).
$upstream_response_time vs. $request_time (the difference is time spent inside nginx and on the client connection).
Cache hit rate per cache zone.

System-level:

Active connections (stub_status module).
worker_connections utilisation.
TLS handshake rate.
5xx rate per upstream.
Connection-pool utilisation per upstream.

The nginx stub_status module exposes:

location = /nginx_status {
    stub_status;
    access_log off;
    allow 10.0.0.0/8;     # private network
    deny all;
}

This returns the active connection count, the accepts/handled/requests counters, and the reading/writing/waiting states. Pair it with a Prometheus exporter (nginx-prometheus-exporter) for time series.

For deeper inspection, OpenTelemetry's nginx module emits spans per request, integrating with downstream tracing.

Operational signals

Healthy. TLS handshake rate matches new-connection rate; cache hit rate steady; 5xx rate < 0.1%; worker_connections utilisation < 50%; reload succeeds with nginx -t passing.

First degrading metric. 504 rate climbing → upstream is slow. 502 rate climbing → upstream is unreachable. 499 rate climbing → clients are giving up.

Misleading metric. Aggregate latency without endpoint breakdown — a slow endpoint can hide in the average for weeks.

Expert graph. Per-endpoint status × latency heatmap; the cell that lights up is the next investigation.

Where this appears in production

Cloudflare — historically ran heavily customised nginx at the edge (much of it since replaced by its in-house Pingora proxy); OCSP stapling and session caching as defaults.
Netflix — extensive use of proxy_cache_use_stale for graceful degradation.
Discord — rate limiting per-token via limit_req_zone $http_authorization.
GitHub — nginx + custom modules for the Git smart-HTTP layer; large buffer tuning.
Stripe API — nginx as the TLS edge with strict cipher selection.
A Mumbai e-commerce site — proxy_cache for product listing pages with 5-minute TTL; per-user bypass via cookie.
A Bengaluru fintech — limit_req per API key, separate zones for read and write endpoints.
A Pune SaaS — proxy_cache_use_stale enabled during upstream maintenance; user experience preserved.

Recall / checkpoint

What is ssl_session_cache and what does it save?
What is OCSP stapling and what does it remove from the handshake?
What is proxy_cache_use_stale and why is it a resilience pattern?
How does limit_req differ from limit_conn?
What distinguishes 502 from 504 from 499?
What is the difference between /healthz and /ready?
Why is nginx -s quit preferred over nginx -s stop?

Interview Q&A

Q1. The team is seeing a 499 rate spike. Walk through the diagnosis. 499 means clients closed the connection before nginx finished responding. Two common causes: upstream is slow (clients have a timeout; nginx is waiting on upstream; client gives up) or client networks are flaky (mobile clients on a poor connection). Diagnosis: correlate 499 spikes with $upstream_response_time. If upstream latency is up, fix upstream. If upstream is fine, look at client geography or app version — could be a client bug, a CDN issue, or a network event. Common wrong answer to avoid: "499 means nginx failed" — it means the client gave up; nginx is reporting the fact.

Q2. The cache is invalidating too aggressively; hit rate is 30%. What is the structural fix? Walk through the cache key. Likely proxy_cache_key includes a variable that changes per request (e.g., a tracking cookie, a session ID). The fix is to use a stable key: $scheme$host$request_uri for anonymous content, with $cookie_user appended only for per-user views. Validate by hitting the same URL twice and checking X-Cache-Status: HIT. Also verify that proxy_cache_valid is appropriate for the content (10m vs. 1m) and that upstream is not defeating the cache: by default nginx honours Cache-Control and Expires from upstream and does not cache responses that carry Set-Cookie. Common wrong answer to avoid: "raise the cache size" — won't help if the key is unique per request.

Q3. After a deploy, all clients are seeing a TLS handshake error. Walk through what could have changed. A handful of likely causes: a config change disabled a protocol or cipher the client uses (e.g., dropped TLS 1.0 still used by old API clients); the cert file is missing or unreadable; the cert chain is incomplete (fullchain.pem not used); the new server name was added without a matching certificate (SNI mismatch). Verification: openssl s_client -connect example.com:443 -servername example.com shows the exact failure. Common wrong answer to avoid: "TLS errors are always cert renewal" — often config changes, not certs.

Q4. The team wants to enable proxy_cache but is worried about stale data. Walk through the patterns. Cache responses that are tolerant of staleness (product catalog page, news article body) for short TTLs (10-300 seconds). Use proxy_cache_use_stale to serve stale content if upstream fails (the small staleness is better than an error). For per-user content, either skip the cache (proxy_cache_bypass $cookie_user together with proxy_no_cache $cookie_user, so the response is neither served from nor saved to the cache) or include the user in the key. For invalidation, prefer short TTLs over external purge: purge is operationally complex and easy to get wrong. Common wrong answer to avoid: "cache everything for an hour" — invalidation accuracy matters more than TTL length.

Q5. Walk through the trade-off between nginx Plus, OpenResty, and stock nginx. Stock nginx: free, well-supported, sufficient for most workloads. OpenResty: nginx + Lua + extensive modules; useful when you need scriptable behaviour at the edge (auth, A/B testing, rate-limit logic in Lua). nginx Plus: commercial; adds active health checks, dynamic upstream reconfig via API, JWT auth modules, dashboards. Most production teams run stock nginx + custom modules where needed. OpenResty for high-customisation edges. nginx Plus for the management features when budget permits. Common wrong answer to avoid: "Plus is always better" — depends on what features you need that stock doesn't.

Q6. The team's worker_connections is at 80% utilisation. Walk through the response plan. First, diagnose: is it traffic growth or upstream slowness? Check $upstream_response_time distribution — if elevated, upstream is the constraint and raising worker_connections masks the symptom. If upstream is healthy and traffic genuinely grew, raise worker_connections (and worker_rlimit_nofile to at least 2× that), and ensure the underlying OS allows the FD count (ulimit -n for the process, systemd LimitNOFILE). Validate with stub_status showing the new ceiling. Add more nginx pods if vertical capacity is exhausted. Common wrong answer to avoid: "raise the limit and move on" — without diagnosis, you've masked an upstream issue.
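The capacity change described in Q6 touches only a few directives. A sketch with illustrative numbers; the OS-level limit (systemd LimitNOFILE or ulimit -n) must be at least as high as worker_rlimit_nofile for the setting to take effect.

```nginx
# Main (top-level) context of nginx.conf.
worker_processes auto;          # one worker per CPU core
worker_rlimit_nofile 65536;     # per-worker FD limit; at least 2x worker_connections

events {
    # Per worker. A proxied request uses two connections:
    # one to the client and one to upstream.
    worker_connections 16384;
}
```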

Operational memory

This chapter explained the production surface of nginx: TLS termination patterns, caching with proxy_cache, rate limiting, error code interpretation (502/504/499), slowloris defences, health checks, and observability. The important idea is that nginx in production is largely about protecting the upstream apps behind it, defending against bad clients (slowloris, abuse), and providing the visibility to debug what's actually happening.

You learned to terminate TLS with the right defaults, cache responses with correct invalidation strategies, rate-limit per-key, distinguish upstream failure modes, and structure health checks for safe deploys. That completes the operational surface; nginx is now a production tool, not just a config to copy-paste.

Carry this diagnostic forward: when nginx is suspected in a production issue, ask which production surface is involved — TLS, cache, rate limit, upstream timeout, or worker saturation. Each has a known diagnostic path.

Remember:

TLS: drop 1.0/1.1; enable session cache; enable OCSP stapling.
proxy_cache: short TTLs + stale-on-error beat aggressive long-cache.
502 = upstream unreachable; 504 = upstream slow; 499 = client gave up.
/healthz is nginx; /ready is upstream; both belong on a deploy.
nginx -s quit drains; stop brutally terminates.
Cache key correctness > cache size > cache TTL.

Bridge. Django and nginx cover the request path. The next module — 06_celery — covers the background job path: tasks that don't fit in a request, retries, monitoring, the failure modes that surface only off the request thread. → ../06_celery/00-eli5.md