03. Networking, volumes, prod gotchas — Docker under load

~20 min read. The first two chapters built you an image and a compose stack that runs cleanly on a developer laptop. This one is what happens when the same stack meets a real network, a real disk, and a real OOM killer at 2 a.m. We keep the same ML inference service threaded through every section — now with a GPU, a 2 GB model volume, a healthcheck that flaps under memory pressure, and a Postgres that mysteriously drops connections after a routine deploy. By the end you will know which of the four network drivers to pick and why, which volume type to mount where, which five Docker bugs have actually paged on-call engineers at real companies, and how to harden a container so a single CVE does not become a host compromise.

Builds on: 00-eli5.md, 01-images-layers-oci-internals.md, and 02-dockerfile-compose-day-to-day.md.


1) The four network drivers — bridge, host, none, overlay

Every container Docker creates lives inside a network namespace. The driver is the policy that decides what that namespace looks like — which interfaces it owns, how packets reach the wider world, whether other containers can talk to it by name. Docker ships four built-in drivers. You should be able to draw each of them on a whiteboard and explain when you would pick one over the others.

The default bridge driver gives every container its own network namespace, drops a veth pair between that namespace and a Linux bridge called docker0 (or a user-defined bridge), and routes outbound traffic through host iptables NAT. On user-defined bridge networks — what compose creates per project — Docker also runs an embedded DNS resolver at 127.0.0.11 so containers can find each other by service name. This is the right default for almost every workload that is not chasing microseconds.

DRIVER 1: BRIDGE (default, user-defined)
────────────────────────────────────────
  host (kernel)                            container netns
  ┌──────────────────────┐                 ┌─────────────────────┐
  │  eth0  198.51.100.5  │                 │  eth0  172.18.0.4   │
  │   │                  │                 │   │                 │
  │   │ iptables NAT     │                 │   │ default route   │
  │   ▼                  │   veth pair     │   ▼                 │
  │  docker0 (bridge)────┼────────────────►│  to docker0         │
  │  172.18.0.1/16       │                 │                     │
  │   │                  │                 │  /etc/resolv.conf   │
  │   │ embedded DNS     │                 │  nameserver         │
  │   ▼ 127.0.0.11       │                 │     127.0.0.11      │
  │   └─────► upstream   │                 └─────────────────────┘
  └──────────────────────┘

The host driver skips the network namespace entirely. The container's processes bind directly to the host's interfaces: no veth pair, no NAT, no iptables hop, no embedded DNS. This saves roughly 50-150 microseconds per packet on Docker 24.x (the bridge figure in the section 6 table), which is invisible for HTTP APIs and very visible for tight RPC loops, cache backends, and anything where p99 latency lives below a millisecond. The cost is that containers share the host's port space (two containers cannot both bind 8080) and you lose the network isolation that justifies containers in the first place. A reported case from an HFT-adjacent system documented p99 dropping from 12 ms to 9 ms purely by switching bridge to host; for that team, three milliseconds was worth tens of thousands of dollars.

The none driver gives the container a network namespace with only a loopback interface. No external connectivity, no DNS, nothing. This sounds useless until you remember that running untrusted code is a real workload — a sandbox for user-submitted scripts, a one-shot batch job that should never call home, a forensics image you want to poke at without it phoning a C2 server.

The overlay driver stitches together container namespaces across multiple Docker hosts using a VXLAN tunnel. This is what swarm and (historically) some Kubernetes CNI plugins use to give cross-host containers the illusion of a flat L2 network. Overlay adds encapsulation overhead — every packet is wrapped in a VXLAN header before egress — and if you also enable encryption (IPsec) the penalty is significant. Empirical measurements have placed plain overlay roughly 8% slower than host on throughput but with 50% better latency consistency under load, which is exactly the trade-off video conferencing and streaming systems care about.

DRIVER 4: OVERLAY (multi-host)
──────────────────────────────
   host A                                    host B
  ┌────────────────────────┐                ┌────────────────────────┐
  │  container X (10.0.0.5)│                │  container Y (10.0.0.6)│
  │       │                │                │       │                │
  │  overlay netns ────────┼── VXLAN tunnel ┼────── overlay netns    │
  │       │                │   over UDP 4789│       │                │
  │  eth0 (real IP)        │                │  eth0 (real IP)        │
  └────────────────────────┘                └────────────────────────┘
        ▲                                          ▲
        └────────── physical L3 network ───────────┘
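
The encapsulation cost also shows up as a smaller usable MTU, which is the root of the MTU-mismatch bug class. The arithmetic below is a sketch that assumes VXLAN over IPv4 with no extra options and a 1500-byte underlay MTU; the function name is illustrative.

```python
# Bytes VXLAN-over-IPv4 consumes inside the underlay's IP MTU.
OUTER_IPV4 = 20      # outer IP header
OUTER_UDP = 8        # outer UDP header (destination port 4789)
VXLAN_HEADER = 8     # VXLAN header carrying the network identifier
INNER_ETHERNET = 14  # the encapsulated frame's own Ethernet header


def overlay_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits in one underlay packet."""
    overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


print(overlay_mtu(1500))  # 1450
```

If the containers' interfaces are left at a larger MTU than this, large packets get fragmented or dropped while small ones pass, which is why these bugs tend to look intermittent.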

Teacher voice. Pick bridge until a measurement says you cannot. Reach for host only after you have a profiler showing the bridge and NAT hop matters. Use none deliberately, for sandboxing. Use overlay when you have multiple hosts and you are not yet on Kubernetes; once you are on Kubernetes, the CNI plugin (Calico, Cilium) replaces overlay anyway.


2) Port publishing, DNS resolution, inter-container traffic

The compose stack from chapter 2 wires four services — app, postgres, redis, nginx — onto a single user-defined bridge network. Three mechanics make that wiring work, and each one is the cause of a recurring production bug class.

Port publishing is the iptables rule that maps a host port to a container port. ports: ["8443:8443"] in compose, or -p 8443:8443 on the CLI, tells Docker to insert a DNAT rule on the host: any TCP packet arriving on 198.51.100.5:8443 gets rewritten to the container's bridge IP and forwarded. The asymmetry to remember is that only published ports cross the host boundary. Inter-container traffic on the same bridge skips this whole machinery and goes container-to-container directly.

DNS resolution inside a container on a user-defined bridge points at 127.0.0.11, Docker's embedded resolver. That resolver does two things: resolves other container names on the same network to their bridge IPs, and forwards unknown queries to whatever upstream DNS the host had configured at Docker daemon start time. That second clause is the trap. On Ubuntu and Debian hosts where systemd-resolved listens on 127.0.0.53, Docker cannot use it directly (loopback is per-namespace) and instead snapshots whatever resolvers systemd-resolved knew about when dockerd started. When the host's VPN comes up later and updates the per-link resolvers, Docker keeps forwarding to the old ones — and your containers start failing to resolve auth.internal.corp while dig on the host succeeds. The fix is either to set explicit dns: in daemon.json or restart Docker after network changes, neither of which the compose docs put on the first page.
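
One way to spot a host that is exposed to this trap is to check whether its /etc/resolv.conf lists only loopback resolvers. The helper below is a minimal sketch that works on the file's text; the function names are hypothetical and it assumes the standard "nameserver <ip>" line format.

```python
import ipaddress


def nameservers(resolv_conf_text: str) -> list[str]:
    """Extract the addresses from 'nameserver <ip>' lines."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def loopback_only(servers: list[str]) -> bool:
    """True when every listed resolver is a loopback address."""
    return bool(servers) and all(
        ipaddress.ip_address(s).is_loopback for s in servers
    )


host_conf = "nameserver 127.0.0.53\noptions edns0 trust-ad\n"
print(loopback_only(nameservers(host_conf)))  # True
```

A True result means the host relies on a stub resolver that containers cannot reach through their own loopback, so Docker has to fall back on whatever upstream list it captured.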

Inter-container traffic on the default bridge (the unnamed bridge network you get from docker run with no --network) does not support name resolution between containers — only the user-defined bridges compose creates do. This is the single most confusing fact about Docker DNS for new users. Containers on default-bridge reach each other only by IP, and IPs change every restart. The whole reason docker compose creates a project-scoped network is to opt every service into name-based discovery.

Mini-FAQ. "Why do my containers see each other but cannot reach the internet?" You have a user-defined bridge with no default route, or your host's iptables FORWARD chain rejects packets from docker0 to eth0. Run iptables -L FORWARD and look for DROP rules: security tooling like firewalld or ufw inserts them, and Docker's MASQUERADE rule never gets a chance to apply.


3) Volumes, bind mounts, tmpfs — and the ownership trap

Three ways to give a container persistent or shared storage, three completely different semantics. Senior engineers get this wrong all the time because the syntax — -v src:dst — looks the same for two of them.

A named volume (-v models:/models or volumes: [models:/models] in compose) is a directory Docker manages under /var/lib/docker/volumes/<name>/_data. Docker creates it, Docker initialises its permissions to match the image's directory at the mount point, Docker manages its lifecycle. For our ML service, the 2 GB embedding model lives in a named volume so a compose down && up does not re-download from S3.

A bind mount (-v ./app:/app or volumes: [./app:/app]) maps an arbitrary host path into the container. Docker does not manage it; ownership and permissions are whatever the host already has. This is great for dev hot-reload (your editor saves a file, the container sees it instantly) and terrible for production secrets (the file in the container is the same inode as on the host).

A tmpfs mount (--tmpfs /tmp or tmpfs: [/tmp] in compose) is an in-memory filesystem the kernel allocates on container start and reclaims on stop. Nothing persists. This is what you mount on /tmp and /run when the rest of your root filesystem is --read-only, and it is also what Kubernetes uses to back Secret volumes so secrets never touch disk.

Now the trap that has cost more weekend hours than any other Docker behaviour. A non-root user inside the container (USER app, uid 10001) tries to write to a bind-mounted ./logs directory the host created with mkdir logs (uid 1000, your dev user). The container process sees uid 10001 attempting to write a directory owned by uid 1000 with mode 0755 — permission denied. Linux only knows uids, not usernames, and uids are not namespaced unless you opt into user-namespace remapping. The three solutions are: (a) match the container's uid to the host's by passing user: "${UID}:${GID}" in compose, (b) chown the host directory to 10001 (works but bakes uid choice into ops), or (c) use a named volume instead, where Docker initialises ownership from the image's mount point. For our ML service we pick (c) for pgdata and models and (a) for the dev-only bind-mount of ./app.
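
The permission check behind the trap can be modelled in a few lines. This is a simplified sketch of the classic owner/group/other test only; it deliberately ignores root, capabilities such as DAC_OVERRIDE, and ACLs, and the function name is illustrative.

```python
import stat


def can_write_dir(proc_uid, proc_gids, owner_uid, owner_gid, mode):
    """Simplified Unix write check: owner bits, then group bits, then other bits."""
    if proc_uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in proc_gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Container user 10001 writing to ./logs owned by host uid 1000, mode 0755.
print(can_write_dir(10001, {10001}, 1000, 1000, 0o755))    # False
# Fix (a): run the container as the host uid instead.
print(can_write_dir(1000, {1000}, 1000, 1000, 0o755))      # True
# Fix (b): chown the host directory to 10001.
print(can_write_dir(10001, {10001}, 10001, 10001, 0o755))  # True
```

Only numeric ids appear anywhere in the check, which is the whole point: the username "app" never enters into it.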

WHEN TO MOUNT WHICH
───────────────────
  named volume   ──── stateful prod data: postgres pgdata, model files
                       Docker-managed lifecycle, ownership-friendly

  bind mount     ──── dev source for hot reload; reading host config files
                       (nginx.conf, certs); host-controlled lifecycle

  tmpfs          ──── /tmp, /run, /var/log/nginx on a read-only rootfs;
                       secrets that must never touch disk

4) Five production Docker bugs that actually bit teams

Sit with me for a minute. These five are not from a textbook; each one has a paper trail of postmortems and engineering blog posts that documented exactly how a team got paged.

Bug 1: logs ate the disk. Docker's default json-file logging driver writes container stdout/stderr to /var/lib/docker/containers/<id>/<id>-json.log with no size cap and no rotation. A documented production incident showed a microservice writing debug logs at 5 MB per minute; over six days the log file accumulated 42 GB and filled the host disk. When /var fills, everything breaks: containers cannot write, Postgres crashes, SSH sessions freeze, and you cannot even log in to clean up. The fix lives in /etc/docker/daemon.json:

{ "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" } }

The gotcha that bites again here: this config only applies to newly created containers. Existing containers keep their old (unbounded) settings until they are recreated. So after the fix you still have to recreate every long-running service (docker compose down && docker compose up -d).
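
The numbers in the incident are easy to reproduce. The sketch below assumes binary units (MiB and GiB), which is an assumption on our part since the write-up only says "MB" and "GB"; the function names are illustrative.

```python
MINUTES_PER_DAY = 60 * 24


def accumulated_gib(rate_mib_per_min: float, days: float) -> float:
    """Log volume written at a steady rate, in GiB."""
    return rate_mib_per_min * MINUTES_PER_DAY * days / 1024


def rotated_cap_mib(max_size_mib: int, max_file: int) -> int:
    """Approximate upper bound per container once rotation is configured."""
    return max_size_mib * max_file


print(round(accumulated_gib(5, 6), 1))  # 42.2
print(rotated_cap_mib(10, 3))           # 30
```

With rotation in place the same service tops out at roughly 30 MiB per container instead of growing by about 7 GiB a day.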

Bug 2: no graceful shutdown. Chapter 2 unpacked the PID 1 signal trap; here is what it looks like in production. A team's Node.js service ignored SIGTERM entirely; on every deploy Docker sent SIGKILL ten seconds later and dropped every in-flight request. The on-call signal was a quiet spike in 5xx during deploys that nobody attributed to Docker because the app code "had no errors." Fixes: install tini as PID 1, register process.on('SIGTERM', ...) (or your language's equivalent), and bump stop_grace_period: 30s so connection-draining has time. Setting a long grace period is free — Docker only waits as long as the app actually needs.
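
For a Python service like our embedding API, the equivalent of process.on('SIGTERM', ...) can be sketched as below. This is a minimal illustration, not the service's actual code: serve_until_stopped and handle_one_request are hypothetical names, and a real server framework would normally own this loop.

```python
import signal
import threading

# Set by the signal handler, polled by the serving loop.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    # Keep the handler tiny: flip a flag and let the main loop drain.
    shutting_down.set()


# Must run in the main thread; Docker delivers SIGTERM to PID 1 on stop.
signal.signal(signal.SIGTERM, handle_sigterm)


def serve_until_stopped(handle_one_request):
    """Keep serving until SIGTERM arrives, then return so cleanup can run."""
    while not shutting_down.is_set():
        handle_one_request()
    # In-flight work is finished at this point; close connections here.
```

The design choice is to stop accepting new work on SIGTERM and let the current request finish, which only helps if stop_grace_period is longer than the slowest request.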

Bug 3: broken DNS after VPN. The systemd-resolved snapshot trap from section 2. Engineers debug "the app cannot reach the internal API" for hours because dig on the host works fine. Resolution: pin DNS in daemon.json to a stable resolver ("dns": ["1.1.1.1", "8.8.8.8"] for non-internal traffic, or your internal resolver for corporate networks), and restart Docker after any host network change that updates /etc/resolv.conf.

Bug 4: host clock drift. Containers share the host's kernel clock — there is no separate container time. If the host's chronyd/ntpd drifts (or you run on a VM that gets paused and resumed across a clock jump), every container's logs, TLS cert validation, and Postgres replication slot get the same drifted view. The fix is on the host (run chrony, monitor chronyc tracking), not in the container. Production bug shape: TLS handshakes intermittently fail with "certificate not yet valid" after a host resumes from suspend.
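
The "certificate not yet valid" failure is just a comparison against the host clock. The sketch below uses made-up timestamps and a hypothetical cert_status helper to show how a clock that runs ten minutes behind rejects a certificate issued five minutes ago.

```python
from datetime import datetime, timedelta, timezone


def cert_status(now, not_before, not_after):
    """Classify a certificate's validity window against a given clock reading."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"


issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
true_now = issued + timedelta(minutes=5)
drifted_now = true_now - timedelta(minutes=10)  # host clock lagging behind

print(cert_status(true_now, issued, expires))     # valid
print(cert_status(drifted_now, issued, expires))  # not yet valid
```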

Bug 5: healthcheck reports unhealthy, nothing acts on it. Chapter 2 already named this; the production version is uglier than the spec sentence. Our ML service's /healthz flapped under memory pressure — the embedding job spiked memory, the healthcheck call could not allocate a 4 KB urllib buffer, the probe failed, the container went unhealthy. Plain Docker does not restart unhealthy containers. Plain compose does not either. The load balancer kept routing traffic to the unhealthy container because nobody was reading inspect .State.Health. The orchestrator is what closes this loop in production — Kubernetes liveness probes, ECS task health, Nomad checks. On plain compose, the workaround is an external watchdog like willfarrell/autoheal running alongside.
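
The decision an external watchdog makes can be sketched as a pure function over docker inspect output. This is an illustration of the idea, not autoheal's implementation; the container names in the sample are hypothetical, and the restart call itself is left out.

```python
import json


def unhealthy_names(inspect_json: str) -> list[str]:
    """Names of containers whose .State.Health.Status is 'unhealthy'."""
    names = []
    for container in json.loads(inspect_json):
        # Health is absent when the image defines no HEALTHCHECK.
        health = container.get("State", {}).get("Health")
        if health and health.get("Status") == "unhealthy":
            names.append(container.get("Name", "").lstrip("/"))
    return names


sample = json.dumps([
    {"Name": "/embed-svc", "State": {"Health": {"Status": "unhealthy"}}},
    {"Name": "/postgres", "State": {"Health": {"Status": "healthy"}}},
    {"Name": "/redis", "State": {}},
])
print(unhealthy_names(sample))  # ['embed-svc']
```

A watchdog loop would feed this the output of docker inspect for the running containers and restart whatever it returns.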

Mini-FAQ. "Are these really the top five?" They are five with documented public postmortems and engineering-blog write-ups. The honourable mentions — image layer cache poisoning, secret leakage via ARG, swap thrashing on cgroup v1 — are real but rarer. These five reliably show up in any team that has run Docker in production for more than a year.


5) Resource limits and OOM — what the kernel actually does

Container resource limits are not just guard rails; they change the kernel's behaviour when memory runs short. The mental model most engineers carry — "the container gets killed when it exceeds the limit" — is half right and the half it gets wrong is the half that causes production incidents.

A --memory=4g flag translates to a memory.max write in cgroup v2 (memory.limit_in_bytes in v1). The kernel tracks RSS plus page cache plus a few other counters against that limit. When the cgroup hits the limit and the kernel cannot reclaim pages fast enough, the kernel OOM killer wakes up and picks a process inside the cgroup to kill, scoring by an oom_score heuristic. Notice the verb: it kills a process, not the container. In a multi-process container — say, a Gunicorn parent with eight workers — the OOM killer might kill one worker. Gunicorn restarts it. Memory pressure stays. Another worker dies. The container is technically still up, the healthcheck still passes for a while, and your latency degrades silently.
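
On cgroup v2 the silent worker kills are countable: the cgroup's memory.events file carries an oom_kill counter. The parser below is a minimal sketch; the assumption is that the file is readable inside the container at /sys/fs/cgroup/memory.events, and the function names are illustrative.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the flat 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            events[key] = int(value)
    return events


def oom_kills_since(previous: dict, current: dict) -> int:
    """How many processes the OOM killer took between two readings."""
    return current.get("oom_kill", 0) - previous.get("oom_kill", 0)


before = parse_memory_events("low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n")
after = parse_memory_events("low 0\nhigh 0\nmax 40\noom 3\noom_kill 3\n")
print(oom_kills_since(before, after))  # 2
```

Exporting that delta as a metric turns "latency degrades silently" into an alert.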

This is where cgroup v1 vs v2 matters. Cgroup v2, which Docker has supported since 20.10 and which most modern distros now boot by default, exposes memory.oom.group (added in Linux 4.19). Writing 1 to it tells the kernel "if any process in this group is OOM-killed, kill them all." Kubernetes since v1.28 sets this to 1 automatically when running on cgroup v2, so a memory-pressured pod dies clean and gets replaced. Plain Docker still defaults to not setting it; if you want the whole-container-dies semantic, you have to ensure your runtime configures it, or arrange for your image's PID 1 to exit when a worker is killed. An init like tini does not do this on its own: it forwards signals and reaps zombies, and exits only when its direct child exits.

The second cgroup v2 improvement is memory.high — a soft limit. The kernel starts aggressively reclaiming pages when usage crosses memory.high, throttling the cgroup before the hard memory.max is hit. The effect in production is a slow degradation instead of a sudden kill, which is usually what you want.

The third clause that surprises people is swap. With --memory=4g and no --memory-swap, Docker lets the container use as much swap as its memory limit when the host has swap enabled: 4 GB of RAM plus up to 4 GB of swap (only --memory-swap=-1 makes swap unlimited). For a latency-sensitive workload, swap is poison; the OOM kill is preferable. The fix is --memory=4g --memory-swap=4g: --memory-swap is the combined RAM-plus-swap total, so setting it equal to --memory leaves zero swap. Or disable swap on the host (swapoff -a), which most production hosts do.
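
When both flags are given explicitly, the swap allowance is a subtraction, because --memory-swap is the combined memory-plus-swap total. The helper below is a sketch of that rule only; swap_allowance is a hypothetical name and it does not model the case where --memory-swap is omitted.

```python
GIB = 1024 ** 3


def swap_allowance(memory: int, memory_swap: int):
    """Swap bytes allowed for explicit flag values; None means unlimited."""
    if memory_swap == -1:
        return None  # -1 asks for unlimited swap
    if memory_swap < memory:
        raise ValueError("--memory-swap must not be smaller than --memory")
    return memory_swap - memory


print(swap_allowance(4 * GIB, 4 * GIB))         # 0
print(swap_allowance(4 * GIB, 6 * GIB) // GIB)  # 2
print(swap_allowance(4 * GIB, -1))              # None
```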

Threading the ML service example: our embed-svc container has mem_limit: 4g. Under load, the 2 GB model plus a burst of concurrent embeddings can push RSS over 4 GB. On cgroup v2 with memory.oom.group=0 (Docker default), the kernel kills the largest worker. Uvicorn restarts it. The model gets re-loaded from the named volume — 18 seconds of start_period worth of cold start — and during those 18 seconds the healthcheck fails. If memory.oom.group=1 were set, the whole container would die and the orchestrator would replace it cleanly. Either choice is defensible; the wrong choice is not knowing which one you picked.

Teacher voice. A memory limit without thinking about oom.group is a half-configured limit. The default behaviour is "kill one of my workers and let me limp on," which is rarely what you want.


6) Comparison table — bridge vs host vs overlay, version-qualified

Numbers measured against Docker 24.x with the default iptables-nft backend and BuildKit 0.12+. Treat the latencies as order-of-magnitude; your hardware and kernel will move them.

Property                     │ bridge (default)                      │ host                                │ overlay
─────────────────────────────┼───────────────────────────────────────┼─────────────────────────────────────┼──────────────────────────────────────────────────
Added latency per packet     │ ~50-150 µs                            │ ~0 µs (no namespace hop)            │ ~200-500 µs (VXLAN encap)
Throughput vs host baseline  │ ~85-95%                               │ 100%                                │ ~70-90% (5-10% drop unencrypted; more with IPsec)
Port-conflict risk           │ Low (per-container port space)        │ High (shares host's ports)          │ Low (overlay IP space)
Multi-host capability        │ No                                    │ No                                  │ Yes (swarm; replaced by CNI on k8s)
Container name DNS           │ Yes (user-defined bridges only)       │ No (uses host's resolver)           │ Yes (service-name resolution across hosts)
iptables NAT involvement     │ DNAT for published ports + MASQUERADE │ None                                │ None on ingress; on host-to-host VXLAN
Best for                     │ Default web/API workloads, dev        │ Latency-sensitive RPC, HFT-adjacent │ Multi-host swarm; mostly legacy in 2026
Typical bug class            │ DNS forwarding, NAT rules dropping    │ Port collisions, isolation gone     │ MTU mismatches, encap overhead surprises

The 50-150 µs bridge tax is invisible for HTTP APIs (where median latency is in the milliseconds) and visible for in-memory cache hits (where you might be aiming at 200 µs end-to-end). The 5-10% throughput gap on overlay is the encapsulation overhead per packet; the 50% better latency consistency observed in IEEE-published measurements is what makes overlay still attractive for jitter-sensitive workloads even when raw throughput is lower.
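
To make "invisible" and "visible" concrete, compare the worst-case bridge tax against two latency budgets. The 5 ms HTTP figure below is an assumed illustrative median, not a measurement; the 200 µs cache target comes from the paragraph above.

```python
def tax_share(tax_us: float, total_us: float) -> float:
    """Fraction of an end-to-end latency budget eaten by a fixed per-hop tax."""
    return tax_us / total_us


print(f"{tax_share(150, 5_000):.0%}")  # 3%
print(f"{tax_share(150, 200):.0%}")    # 75%
```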

Mini-FAQ. "Why is overlay slower than bridge if they both add a hop?" Bridge adds a single namespace traversal and an iptables NAT lookup, both in-kernel and fast. Overlay adds the same plus a VXLAN encapsulation, a UDP wrap, and a trip across the physical network to another host. The fixed cost is bigger because more work happens per packet.


7) Image and runtime security basics — the minimum-viable hardening

A container is a process running on the host kernel. Every Linux capability that process holds, every writable path it can touch, every device node it can open, is an attack surface. The minimum-viable hardening — what every production image and every production runtime invocation should have — comes down to four moves.

Move 1: non-root user. The Dockerfile from chapter 2 already added USER app (uid 10001). The reason this matters: if an attacker achieves RCE inside the container as root and finds any container-escape vulnerability (a kernel CVE, a misconfigured volume, a leaked socket), they land on the host as root. As uid 10001 they land on the host as uid 10001 — which can do almost nothing. Cloudflare's container platform enforces this: Cloudflare Containers run without root privileges and require the docker:dind-rootless image even for Docker-in-Docker workloads.

Move 2: read-only root filesystem. --read-only mounts the rootfs read-only; writes to anything outside an explicit volume fail with EROFS. This neutralises a huge class of "RCE writes a webshell" attacks. The catch is that almost every app needs some writable paths — /tmp, /var/run, /var/log/nginx — so the pairing is --read-only plus --tmpfs /tmp --tmpfs /run. Datadog's security default rules list "container's root filesystem should be set to read-only" as a baseline check.

Move 3: drop all capabilities, add back the few you need. Docker's default capability set is fourteen capabilities — chown, dac_override, fowner, kill, setgid, setuid, net_bind_service, net_raw, and others. Most apps use zero of them. Snyk and the Datadog Security Labs container series both recommend the pattern --cap-drop=ALL --cap-add=NET_BIND_SERVICE if you need to bind ports below 1024, otherwise just --cap-drop=ALL. In compose:

services:
  app:
    cap_drop: [ALL]
    cap_add: [NET_BIND_SERVICE]   # only if you bind <1024
    read_only: true
    tmpfs: [/tmp, /run]
    security_opt:
      - no-new-privileges:true
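
To see which capabilities a process actually holds, you can decode the CapEff bitmask from /proc/<pid>/status. The decoder below is a sketch covering the first 32 capability bits as numbered in linux/capability.h; the mask a80425fb is the value commonly seen for a root process in a default Docker container, used here as an example input rather than read from a live system.

```python
# Capability names by bit position (bits 0-31 of linux/capability.h).
CAPS = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Turn a CapEff hex string into capability names (bits above 31 ignored)."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAPS) if (mask >> bit) & 1]


print(len(decode_caps("00000000a80425fb")))  # 14
print(decode_caps("0000000000000400"))       # ['NET_BIND_SERVICE']
```

After cap_drop: [ALL] with the single cap_add, the second mask is what you would expect to find.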

Move 4: no-new-privileges. This prctl(PR_SET_NO_NEW_PRIVS) flag prevents any process in the container from gaining privileges via setuid binaries or file capabilities. Combined with cap-drop=ALL, it forecloses the "setuid-helper-as-privilege-escalation-path" class of attacks.

Threading the ML service: our hardened compose service block now drops all capabilities (we do not need any — we bind port 8080, which is above 1024), mounts root read-only, gets /tmp and /var/run/datadog as tmpfs (the Datadog APM socket pattern from production deployments), and sets no-new-privileges. The image runs as uid 10001. If our embedding endpoint gets RCE'd via a model-input attack, the attacker is uid 10001 with no capabilities on a read-only filesystem — they cannot install tools, escalate to root, or persist anything across restart.

HARDENING LAYERS — DEFENCE IN DEPTH
───────────────────────────────────
  layer 1  USER app (uid 10001)           ── no root inside container
  layer 2  --read-only rootfs              ── no writes outside volumes
  layer 3  --tmpfs /tmp --tmpfs /run       ── writes go to RAM, vanish on stop
  layer 4  --cap-drop=ALL                  ── no Linux superpowers
  layer 5  --cap-add=NET_BIND_SERVICE      ── only the one you actually need
  layer 6  --security-opt=no-new-privileges── no setuid escalation
  layer 7  --pids-limit=200                ── fork bomb cannot reach host

The seventh layer — --pids-limit — closes the fork-bomb hole that the others leave open. Default pids-limit on most distros is unlimited; one runaway loop can spawn 30,000 processes and crash the host.
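
A small audit function can keep these layers from silently regressing in review. The sketch below checks a compose service that has already been loaded into a dict; hardening_gaps is a hypothetical helper, and it skips the non-root check on purpose because USER usually lives in the Dockerfile rather than in compose.

```python
def hardening_gaps(service: dict) -> list[str]:
    """List the hardening settings a compose service definition is missing."""
    gaps = []
    if "ALL" not in service.get("cap_drop", []):
        gaps.append("cap_drop: [ALL]")
    if service.get("read_only") is not True:
        gaps.append("read_only: true")
    if "no-new-privileges:true" not in service.get("security_opt", []):
        gaps.append("security_opt: no-new-privileges:true")
    if "pids_limit" not in service:
        gaps.append("pids_limit")
    return gaps


hardened = {
    "cap_drop": ["ALL"],
    "read_only": True,
    "tmpfs": ["/tmp", "/run"],
    "security_opt": ["no-new-privileges:true"],
    "pids_limit": 200,
}
print(hardening_gaps(hardened))  # []
print(hardening_gaps({"read_only": True}))
```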

Teacher voice. Look. Each layer alone is easy to defeat. Seven layers together is a defence-in-depth posture that means a single CVE rarely becomes a host compromise. None of these layers cost you anything in normal operation. The friction is one-time — find the capabilities your app actually uses, write the tmpfs mounts for its writable paths — and the dividend is paid every day after.


Where this lives in the wild

Twenty-four real production systems where the networking, volumes, and gotchas above show up. Split into "networking & runtime choices" and "hardening & operational gotchas."

Networking & runtime choices in production:

  1. Uber Kraken — Uber's open-source P2P Docker registry distributes 20,000 100MB-1GB blobs in under 30 seconds at peak, sitting underneath the overlay networking that Mesos/Aurora used to schedule Uber containers.
  2. Uber Engineering on CPU throttling — Uber's published "Avoiding CPU Throttling in a Containerized Environment" post documents how cgroup CPU quotas interacted poorly with their workloads and how they tuned cpu.cfs_period_us, validating section 5's "limits change kernel behaviour" thesis.
  3. Uber DevPod — Uber's remote dev environment maps compose-shaped multi-container definitions into Kubernetes pods, giving each engineer a personal stack with its own DNS namespace per pod.
  4. Netflix Titus on kernel scaling — InfoQ's coverage of Netflix uncovering kernel-level bottlenecks (global VFS mount locks) at container scale shows that even Netflix's container platform runs into the cgroup/namespace cost ceiling that section 5 names.
  5. Spotify Helios (historical) — Spotify ran dozens of critical backend services on Helios, their homegrown Docker orchestrator that connected helios-agent to the local Docker daemon via the Unix socket — exactly the socket-mount risk Datadog's security default rules flag.
  6. PayPal dce-go — PayPal's open-source Docker Compose Executor for Mesos enforces parent cgroup limits with memory.use_hierarchy=1 so per-container memory pressure escalates correctly to the pod boundary.
  7. High-frequency-trading on Docker host networking — Public engineering write-ups document P99 latency dropping from 12 ms to 9 ms purely by switching from bridge to host networking — exactly the section 1 trade-off, with section 7's security cost.
  8. Cloudflare Containers — Cloudflare's container platform forces non-root execution and rootless Docker-in-Docker (docker:dind-rootless), with iptables manipulation disabled inside containers — section 7's hardening enforced by policy.
  9. Cloudflare Tunnel + Docker — cloudflared as a compose service exposes containers without opening host ports, bypassing the section 2 port-publishing iptables rule entirely.
  10. Datadog Agent socket-mount pattern — Datadog's APM agent uses a /var/run/datadog/ bind mount for the trace socket, the documented production pattern that gives APM visibility without granting access to the full Docker socket.
  11. Datadog security default rule "no docker.sock mount" — Datadog's published default rule flags any container that mounts /var/run/docker.sock because it grants full host control — section 4's hidden bug 6.
  12. Grafana Loki v3.6.0 base-image change — The Loki team dropped busybox from their image and broke every compose healthcheck that called wget, illustrating section 4 bug 5 (healthchecks depending on binaries that get removed).

Hardening & operational gotchas:

  1. Snyk Learn "container does not drop default capabilities" lesson — Snyk's official tutorial walks through the --cap-drop=ALL --cap-add=NET_BIND_SERVICE pattern from section 7, framed as the canonical Linux-capabilities lesson.
  2. Datadog Security Labs "Container security fundamentals: Capabilities" — Datadog's published deep-dive lists the fourteen default Docker capabilities and argues that almost no app needs more than one or two; section 7 follows their recommendation directly.
  3. Documented production logs-fill-disk incident (Medium engineering post, 2026) — A microservice writing debug logs at 5 MB/min accumulated 42 GB over six days, filled /var, and took down Postgres and SSH — section 4 bug 1's source case.
  4. Docker daemon.json log-opts industry default — Most production-grade base AMIs (Bottlerocket, AWS ECS-optimized) ship max-size: "10m", max-file: 3 as the default to prevent the logs-eat-disk class.
  5. OneUptime production blog series on Docker resource limits — Documents the cgroup v1 vs v2 OOM behaviour differences cited in section 5 and the --memory --memory-swap pairing that disables swap.
  6. Preferred Networks tech blog on cgroup v2 OOM — Documents how Kubernetes v1.28+ on cgroup v2 forces memory.oom.group=1, making whole-pod OOM kill the new default — exactly the section 5 distinction.
  7. GitLab Runner Docker executor volume issues — Public GitLab issues (#3207, #29565, #28121) document recurring bind-mount and named-volume pain in CI, especially around per-job lifecycle and UID alignment from section 3.
  8. Authentik on read-only filesystems with Kubernetes secrets — Authentik issue #2535 documents the tmpfs-on-read-only-rootfs pattern that section 7 prescribes and section 3 lists.
  9. Kubernetes Secret tmpfs backing — Kubernetes mounts Secret volumes as tmpfs (RAM-backed, never written to disk) and always read-only — the production-grade version of section 3's tmpfs use case.
  10. Docker for Linux issue #392 (127.0.0.11 failure) — Long-standing public issue documenting embedded-resolver breakage on hosts using systemd-resolved, validating section 2's "VPN broke my DNS" trap and section 4 bug 3.
  11. Docker for Linux issue #325 (custom networks ignore daemon DNS) — Documents that Docker ignores daemon.json DNS settings on user-defined networks under certain conditions, the other half of the section 2 DNS pitfall.
  12. Airbnb Krispr + kube-gen — Airbnb runs hundreds of microservices on Kubernetes with in-house tooling (Krispr, kube-gen) that mutates manifests to enforce capability-drop, non-root, and read-only defaults — section 7 hardening codified as platform policy.

Pause and recall

  1. Name the four Docker network drivers and the single decision criterion that picks between bridge and host.
  2. Why does the embedded DNS resolver at 127.0.0.11 work on user-defined bridges but not on the default bridge network?
  3. Your container's bind-mounted log directory throws "permission denied" for a non-root user inside. What are three different fixes, and which is the cleanest for production?
  4. What is memory.oom.group, why does it matter for multi-process containers, and what is Kubernetes' default on cgroup v2 since v1.28?
  5. What does --read-only accomplish, and what two paths almost always need to be mounted as tmpfs alongside it?
  6. Why does plain Docker not restart a container that goes unhealthy, and what closes that loop in production?
  7. What is the practical difference between --memory=4g alone and --memory=4g --memory-swap=4g?
  8. Which Linux capabilities should a typical web service have, and which command grants exactly that set?

Interview Q&A

Q1. You are designing a multi-host container deployment and your team is debating bridge vs host vs overlay. How do you frame the decision? A. Three questions in order. First, does the workload run on more than one host that needs to share an L2/L3 namespace? If yes, overlay (or a real CNI plugin if Kubernetes is in scope); if no, rule overlay out. Second, does the workload have measured latency requirements below the 50-150 µs bridge tax? If yes, host; if no, bridge by default. Third, does the workload need to bind specific privileged ports on the host, and is the team comfortable losing network isolation? Host only if both are yes. Bridge is the right default for 90% of production workloads; host is the optimisation reserved for measured pain; overlay is for legacy swarm setups where Kubernetes is not on the roadmap. Common wrong answer to avoid: "Always use host for performance." You lose port isolation, you cannot run two replicas on the same host, and you trade away a security property that's worth keeping unless a profiler says otherwise.

Q2. Your container is OOM-killed in production but the container keeps running, just with degraded latency. Explain what's happening and how you fix it. A. You're on a multi-process container: a Gunicorn parent with workers, or any prefork model. The cgroup hit memory.max, the kernel OOM killer woke up, scored processes, and killed one worker. The container's PID 1 is still alive, so the container is "up" even though one worker died. Gunicorn respawns it, memory pressure resumes, another worker dies, latency stays bad, healthcheck stays green. The fix on cgroup v2 is memory.oom.group=1, which makes the kernel kill the entire cgroup on any OOM event, so the container exits and the orchestrator replaces it. Kubernetes v1.28+ sets this automatically on cgroup v2; plain Docker does not. On plain Docker you arrange for PID 1 to exit when a worker is OOM-killed so the restart policy replaces the whole container; an init like tini will not do that for you, because it exits only when its direct child exits. Common wrong answer to avoid: "Raise the memory limit." That treats the symptom. If the workload actually needs more memory, fine, but first understand whether you have an oom.group misconfiguration that's hiding the real failure mode.

Q3. A container's healthcheck shows unhealthy but it's still serving traffic and your load balancer is happily routing to it. Why, and what's the production fix? A. Docker's HEALTHCHECK reports status; it does not act on it. Plain Docker and plain compose only restart containers on process exit, not on unhealthy state. The orchestrator — Kubernetes liveness probes, ECS task health, Nomad checks — is what wires unhealthy to traffic removal and container replacement. If you're running on compose in production, your options are: (a) external watchdog like willfarrell/autoheal that watches health state and restarts unhealthy containers, (b) load balancer that reads health via docker inspect, or (c) move to a real orchestrator. The healthcheck is still valuable as observable state; it just isn't self-acting. Common wrong answer to avoid: "Docker must be broken; healthcheck should restart unhealthy containers." This is documented behaviour, not a bug. Docker has never restarted on unhealthy state.
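
The watchdog option boils down to reading health state and acting on it. As a rough sketch (not how any particular autoheal tool is implemented), the function below takes the JSON array printed by `docker inspect` and returns the containers whose `State.Health.Status` is `unhealthy`; the function name is hypothetical.

```python
import json


def unhealthy_containers(inspect_json: str) -> list[str]:
    """Return names of containers reporting an unhealthy healthcheck."""
    names = []
    for container in json.loads(inspect_json):
        # State.Health is absent when the image defines no HEALTHCHECK.
        health = container.get("State", {}).get("Health")
        if health and health.get("Status") == "unhealthy":
            # docker inspect reports names with a leading slash.
            names.append(container["Name"].lstrip("/"))
    return names
```

A watchdog loop would call this periodically and run `docker restart` on each returned name, which is exactly the action Docker itself never takes.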

Q4. Why is mounting /var/run/docker.sock into a container considered dangerous, and when is it acceptable? A. The Docker socket is the daemon's full control API. Any process that can write to it can create privileged containers, mount host paths, and run arbitrary commands as root on the host. Mounting it into a container is effectively granting that container root on the host. The acceptable cases are narrow: a deliberately-trusted sidecar like the Datadog Agent that needs to enumerate containers for monitoring, a CI runner you've already trusted, a local dev environment. The mitigation pattern is to bind-mount only a specific sub-socket (Datadog's /var/run/datadog/apm.socket pattern from section 7) rather than the full daemon socket, or to put a socket proxy in front of the daemon that filters API calls. Common wrong answer to avoid: "It's fine if the container is non-root." The socket is at /var/run/docker.sock on the host; once the container can write to it, whether the process inside the container is root or not is irrelevant to the host privilege boundary.

Q5. Your team is moving from cgroup v1 to cgroup v2 and you've been asked what changes operationally. What are the three biggest differences? A. First, memory.oom.group: cgroup v2 introduces the option to make OOM kill the whole cgroup atomically, which Kubernetes v1.28+ enables by default. This changes multi-process container failure semantics. Second, memory.high: a soft limit below memory.max that triggers throttling and aggressive reclaim instead of immediate kill, letting you see slow degradation instead of cliff-edge failure. Third, the unified hierarchy: v2 has a single hierarchy for CPU, memory, IO, and pids; v1 had separate hierarchies per controller, which made joint scheduling decisions inconsistent. Operationally, you'll also see limit enforcement become immediate rather than slightly laggy, so OOM kills happen at the limit rather than 50-200 MB past it. Common wrong answer to avoid: "Just bump the kernel version; cgroup v2 is automatic." On most distros it is, but you still need to confirm which cgroup version and driver Docker is using (docker info reports both; native.cgroupdriver=systemd selects the systemd driver), and your monitoring/alerting needs to know about the new metric paths.
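
The "new metric paths" point is the one that tends to break dashboards. A minimal sketch of version-aware file selection follows, assuming the usual convention that a cgroup v2 mount exposes `cgroup.controllers` at its root; the function name and return shape are hypothetical.

```python
from pathlib import Path


def memory_metric_files(root: Path = Path("/sys/fs/cgroup")) -> dict[str, str]:
    """Return the memory usage and limit file names for the mounted cgroup version."""
    # The unified (v2) hierarchy has cgroup.controllers at its root; v1 does not.
    if (root / "cgroup.controllers").exists():
        return {"version": "2", "usage": "memory.current", "limit": "memory.max"}
    return {
        "version": "1",
        "usage": "memory.usage_in_bytes",
        "limit": "memory.limit_in_bytes",
    }
```

A collector that hardcodes the v1 names will silently report nothing after the migration, so a check like this belongs at startup.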

Q6. How would you harden a public-facing web container against compromise? A. Six layers, all configurable in compose: USER to a non-root uid in the Dockerfile; read_only: true on the service; tmpfs: [/tmp, /run] for the small writable paths the app actually needs; cap_drop: [ALL] with cap_add: [NET_BIND_SERVICE] only if you bind a privileged port; security_opt: [no-new-privileges:true] to block setuid escalation; and pids_limit: 200 to bound fork bombs. Then on top of that: a digest-pinned base image (not latest), a multi-stage build that drops build tools, and BuildKit secret mounts for any tokens. Each layer alone is easy to bypass; together they reduce a typical app-layer RCE to "attacker is uid 10001 with no capabilities on a read-only filesystem who can fork at most 199 times" — which is rarely worth pursuing. Common wrong answer to avoid: "Run a vulnerability scanner and patch CVEs." Scanning is good, but it's not hardening. You can ship a CVE-free image as root with full capabilities and a writable rootfs and still get compromised on day one via app-layer RCE.
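
The six layers are easy to check mechanically. The sketch below audits a compose service that has already been loaded into a Python dict (for example with a YAML parser) and lists which layers are missing. The function name is hypothetical, and it assumes the non-root user is visible through the compose `user:` key, whereas the answer above sets it with USER in the Dockerfile.

```python
def hardening_gaps(service: dict) -> list[str]:
    """Return the hardening layers missing from a compose service definition."""
    gaps = []
    # Assumption: the uid is declared via compose `user:`; "uid:gid" is allowed.
    if str(service.get("user", "0")).split(":")[0] in ("", "0", "root"):
        gaps.append("non-root user")
    if service.get("read_only") is not True:
        gaps.append("read_only")
    if not service.get("tmpfs"):
        gaps.append("tmpfs")
    if "ALL" not in service.get("cap_drop", []):
        gaps.append("cap_drop ALL")
    if "no-new-privileges:true" not in service.get("security_opt", []):
        gaps.append("no-new-privileges")
    if not service.get("pids_limit"):
        gaps.append("pids_limit")
    return gaps
```

An empty list means all six layers are present; it says nothing about the image-level measures (digest pinning, multi-stage build, secret mounts), which live outside the compose file.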

Q7. After a routine deploy, your containers cannot resolve internal hostnames but dig on the host works fine. Walk through your debugging. A. This is almost always the systemd-resolved snapshot trap. Docker's embedded resolver at 127.0.0.11 forwards unknown queries to upstream resolvers it snapshotted at daemon start time. If the host uses systemd-resolved (the Ubuntu default), Docker can't use 127.0.0.53 directly because that loopback address is not reachable from inside the container's network namespace; it grabs whatever real upstream resolvers were configured when dockerd started. When something later updates the per-link resolvers (a VPN coming up, a network change during the deploy), Docker keeps forwarding to the stale ones. Debug: docker exec into the container, cat /etc/resolv.conf (shows 127.0.0.11), then dig @127.0.0.11 internal.host versus dig @1.1.1.1 internal.host. Fix: pin explicit DNS in /etc/docker/daemon.json ("dns": ["10.0.0.2", "8.8.8.8"]) and systemctl restart docker. Long-term: stop relying on the embedded resolver snapshot for internal DNS; run a proper internal resolver or use service-mesh DNS. Common wrong answer to avoid: "Restart the containers." Restarting containers doesn't fix the daemon's snapshotted upstream; you have to restart dockerd itself.
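
When pinning DNS, the main risk is clobbering other settings already in daemon.json. A small sketch of a merge that only touches the `dns` key is shown below; the function name is hypothetical and the resolver addresses are the example values from the answer above.

```python
import json


def with_pinned_dns(daemon_json_text: str, resolvers: list[str]) -> str:
    """Return daemon.json content with "dns" set and every other key preserved."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["dns"] = resolvers
    return json.dumps(config, indent=2)
```

The result would be written back to /etc/docker/daemon.json, followed by the daemon restart described above; the restart is still required for the new resolvers to take effect.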

Q8. You're told a container is using "too much memory" by the host monitoring but the app inside reports normal RSS. What are the suspects? A. Three suspects. First, page cache — the kernel counts file-backed page cache against the cgroup's memory in cgroup v1 and partially in v2. Heavy disk reads bloat container memory accounting without the app's RSS changing. Second, tmpfs mounts — anything written to --tmpfs counts against memory.max because tmpfs is RAM-backed. A noisy /tmp can silently consume your limit. Third, child processes the app doesn't track — anything PID 1 forked, plus orphaned zombies if no init reaper is in place. Diagnosis: docker stats (gives total cgroup usage), cat /sys/fs/cgroup/<path>/memory.stat for the breakdown of anon vs file vs kernel, and ps inside the container for processes the app doesn't think it owns. Common wrong answer to avoid: "The app has a memory leak." Possibly, but the host accounting includes things the app's RSS doesn't see; check page cache and tmpfs before pointing at app code.
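
The memory.stat breakdown can be reduced to the three suspects with a few lines of parsing. This sketch assumes the cgroup v2 format, where `file` covers cached filesystem data including tmpfs and `shmem` is the tmpfs and shared-memory part of it; the function name and the output keys are hypothetical.

```python
def memory_breakdown(memory_stat_text: str) -> dict[str, int]:
    """Split cgroup v2 memory.stat into app memory, page cache, and tmpfs."""
    stats = {}
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    shmem = stats.get("shmem", 0)
    return {
        "anon": stats.get("anon", 0),                # roughly what the app sees as RSS
        "page_cache": stats.get("file", 0) - shmem,  # file-backed cache without tmpfs
        "tmpfs_shmem": shmem,                        # RAM-backed mounts such as /tmp
        "kernel_stack_and_slab": stats.get("kernel_stack", 0) + stats.get("slab", 0),
    }
```

If `page_cache` or `tmpfs_shmem` dominates while `anon` is flat, the gap between host monitoring and the app's own RSS is explained without any leak in application code.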


Apply now (10 min)

Step 1 — model the audit. I'll fill one row of the "Docker production readiness" audit for our ML embed-svc so you can copy the shape.

| Checkpoint | Our embed-svc setting | Red flag in a different stack |
| --- | --- | --- |
| Network driver | bridge (compose default network); no network_mode: host | host with multiple replicas competing for 8080 |
| DNS strategy | Explicit dns: [10.0.0.2] in daemon.json | Relying on systemd-resolved snapshot |
| Volumes | Named volume for models, pgdata; bind mount only for dev ./app | Bind-mount production data with mismatched uids |
| Log driver | json-file with max-size: 10m, max-file: 3 | Default unbounded json-file |
| Memory limit | mem_limit: 4g, memswap_limit: 4g, oom.group=1 via tini | --memory=4g only; swap unlimited; multi-worker silent OOM |
| Healthcheck action | External autoheal sidecar reads health and restarts | Healthcheck set but nothing watches it |
| Hardening | Non-root, read-only rootfs, /tmp tmpfs, cap-drop ALL, no-new-privileges, pids-limit 200 | Root user, writable rootfs, default capabilities |

Step 2 — your turn. Pick one running container from your own stack (a sidecar, an app, a database). Fill the same seven rows. For every red, name the first command you would run to fix it. Three or more reds and that container should not be in production tomorrow.

Step 3 — sketch from memory. Redraw the four-driver network diagram from section 1 and the seven-layer hardening stack from section 7. Side by side. Label every box with what it isolates and every arrow with what crosses it. If you can do both cold, you have the model that distinguishes lead-tier "I know Docker" from candidate-tier "I use Docker."


Bridge. You can now pick the right network driver, mount the right kind of volume, set resource limits that fail predictably under pressure, and harden an image so a single CVE rarely becomes a compromise. The last chapter in this module stops asking how to run Docker and starts asking when to use it — Docker versus virtual machines, Docker versus Kubernetes, the interview-shape comparisons you will be asked to defend in any system-design round. → 04-vs-vms-kubernetes-interview.md