02. Dockerfile and compose: the daily Docker grammar

~18 min read. The image internals from chapter 1 explain what Docker is. This chapter is about the two files you actually touch every working day — the Dockerfile that builds the image and the compose.yaml that wires several of those images into a dev stack. We continue the ML inference service from chapter 1 and grow it into a four-container stack (app + postgres + redis + nginx). By the end you will know which Dockerfile instructions matter, which ten mistakes to stop making, when docker compose is enough, and when it stops being enough.

Builds on: 00-eli5.md and 01-images-layers-oci-internals.md.

1) The Dockerfile as a recipe: every instruction is a layer or a hint

A Dockerfile is not a script. It is a declarative recipe that BuildKit walks top to bottom, turning most instructions into a new image layer and some instructions into metadata baked into the image config. Chapter 1 showed you that layers are content-addressed tarballs and the config is a JSON blob; the Dockerfile is the human-readable spec that produces both. Once you internalise which instruction does which thing, you stop writing Dockerfiles that build slowly and you start writing Dockerfiles that BuildKit can cache aggressively.

Let me show you the recipe for our running example. The ML inference service from chapter 1 — FastAPI serving a sentence-transformers embedding model, depending on torch, transformers, numpy, and reading from a Postgres metadata table — has this Dockerfile shape:

# syntax=docker/dockerfile:1.7
ARG PYTHON_VERSION=3.12

# ---------- stage 1: build deps ----------
FROM python:${PYTHON_VERSION}-slim AS builder
ENV PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PIP_NO_CACHE_DIR=0 \
    PYTHONDONTWRITEBYTECODE=1
WORKDIR /build
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --prefix=/install -r requirements.txt

# ---------- stage 2: runtime ----------
FROM python:${PYTHON_VERSION}-slim AS runtime
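# tini is not shipped in python:*-slim; install it so the ENTRYPOINT below resolves
RUN apt-get update && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*
RUN groupadd -r app && useradd -r -g app -u 10001 app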
WORKDIR /app
COPY --from=builder /install /usr/local
COPY --chown=app:app ./app /app
USER app
ENV PYTHONUNBUFFERED=1 \
    MODEL_PATH=/app/models/embed.bin \
    PORT=8080
EXPOSE 8080
STOPSIGNAL SIGTERM
HEALTHCHECK --interval=10s --timeout=3s --start-period=20s --retries=3 \
    CMD python -c "import urllib.request,sys; \
        sys.exit(0 if urllib.request.urlopen('http://127.0.0.1:8080/healthz',timeout=2).status==200 else 1)"
ENTRYPOINT ["tini", "--", "python", "-m", "uvicorn"]
CMD ["app.main:app", "--host", "0.0.0.0", "--port", "8080"]

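Read that file as three things stacked together. Filesystem instructions (FROM, RUN, COPY, ADD) define the layer stack: FROM selects the base image's layers, and each RUN, COPY, or ADD adds a new layer blob on top. Metadata instructions (ENV, EXPOSE, USER, LABEL, STOPSIGNAL, HEALTHCHECK) do not produce layers; they write fields into the image config JSON that chapter 1 showed sits beside the layer tree. WORKDIR and ARG sit in the same group with two caveats: WORKDIR creates the directory if it does not exist, which can add a small layer, and ARG is a build-time variable rather than a stored runtime setting. Execution instructions (ENTRYPOINT and CMD) only describe what runs when a container starts; nothing actually runs at build time.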
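To make the three groups concrete, here is a minimal sketch that tags each instruction of a Dockerfile with its group. The grouping is copied from the paragraph above; the parser is a deliberately naive assumption (it joins backslash continuations and skips comments, nothing more), and the names `classify` and `SAMPLE` are hypothetical.

```python
# Groups follow the paragraph above. The parsing is intentionally naive:
# no heredocs, no parser directives, no flags before the keyword.
FILESYSTEM = {"FROM", "RUN", "COPY", "ADD"}
METADATA = {"ENV", "ARG", "EXPOSE", "USER", "WORKDIR", "LABEL",
            "STOPSIGNAL", "HEALTHCHECK"}
EXECUTION = {"ENTRYPOINT", "CMD"}


def logical_lines(dockerfile):
    """Yield instructions with backslash continuations joined, comments skipped."""
    buffer = ""
    for raw in dockerfile.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            buffer += line[:-1] + " "
            continue
        yield buffer + line
        buffer = ""


def classify(dockerfile):
    """Return (KEYWORD, group) pairs in file order."""
    result = []
    for line in logical_lines(dockerfile):
        keyword = line.split(maxsplit=1)[0].upper()
        if keyword in FILESYSTEM:
            group = "filesystem"
        elif keyword in METADATA:
            group = "metadata"
        elif keyword in EXECUTION:
            group = "execution"
        else:
            group = "other"
        result.append((keyword, group))
    return result


# Hypothetical, shortened Dockerfile used only for illustration.
SAMPLE = """\
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
ENV PORT=8080
ENTRYPOINT ["python", "-m", "uvicorn"]
CMD ["app.main:app"]
"""
```

Calling `classify(SAMPLE)` returns one pair per instruction, which is a quick way to see how few lines of a typical Dockerfile actually touch the filesystem.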

Teacher voice. ENTRYPOINT is what the container is. CMD is the default arguments you can override. Use ENTRYPOINT for the binary, CMD for the flags, and you get override-friendly images. Use only CMD and your docker run myimage bash works; use only ENTRYPOINT in shell form and signals stop working. Both together in exec form is the production-grade shape.
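A small sketch of how exec-form ENTRYPOINT and CMD combine may help here. It models only the rule described above (arguments after the image name replace CMD, the ENTRYPOINT stays); the function name `effective_argv` is hypothetical and the model ignores shell form and the `--entrypoint` flag.

```python
def effective_argv(entrypoint, cmd, run_args=()):
    """Return the argv a container would start with.

    entrypoint: exec-form ENTRYPOINT list ([] if the image has none)
    cmd:        exec-form CMD list (the default arguments)
    run_args:   whatever follows the image name on `docker run`;
                when present it replaces CMD entirely
    """
    args = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + args


# Values from the example Dockerfile in this section.
ENTRYPOINT = ["tini", "--", "python", "-m", "uvicorn"]
CMD = ["app.main:app", "--host", "0.0.0.0", "--port", "8080"]
```

With no extra arguments the container runs uvicorn on port 8080; passing `["app.main:app", "--port", "9000"]` as `run_args` swaps only the flags. For an image that has CMD alone, `effective_argv([], ["python", "app.py"], ["bash"])` yields `["bash"]`, which is why `docker run myimage bash` works in that case.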

The ARG versus ENV distinction trips people up every week. ARG PYTHON_VERSION=3.12 is a build-time variable: you can override it with --build-arg, and it is not set in the environment of the running container. ENV MODEL_PATH=/app/models/embed.bin is baked into the image config and visible at runtime to every process in the container. Build args stay out of the runtime environment, which makes them good for non-secret toggles, but their values can still be recorded in the image history, so they are not safe for secrets either. Env vars are good for runtime config and bad for secrets (section 6 fixes that).
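As a rough illustration of the build-time side, the sketch below resolves ARG defaults against `--build-arg` overrides and expands `${NAME}` references the way the FROM line in the example uses them. It is a simplified model, not BuildKit's implementation: the helper names are hypothetical, only the `${NAME}` form is handled, and overrides for undeclared args are simply ignored.

```python
import re


def resolve_args(declared, build_args):
    """Merge ARG defaults with --build-arg overrides.

    declared:   {"PYTHON_VERSION": "3.12"} taken from ARG lines
    build_args: values passed on the command line; names that were
                never declared with ARG are ignored in this model
    """
    resolved = dict(declared)
    for name, value in build_args.items():
        if name in declared:
            resolved[name] = value
    return resolved


def expand(text, args):
    """Replace ${NAME} with its value; undefined names become empty strings."""
    return re.sub(r"\$\{(\w+)\}", lambda m: args.get(m.group(1), ""), text)
```

For example, `expand("python:${PYTHON_VERSION}-slim", resolve_args({"PYTHON_VERSION": "3.12"}, {"PYTHON_VERSION": "3.11"}))` gives `python:3.11-slim`, while the resolved dictionary never becomes part of the container's runtime environment.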

2) The ten Dockerfile mistakes you must stop making

I have reviewed enough Dockerfiles to confidently say that almost every "slow build" or "huge image" complaint reduces to one of these ten. Read them as a checklist you run mentally before opening a PR.

1. Layer ordering ignores cache invalidation. You COPY . . early, then RUN pip install. Now every code change re-runs pip install. The fix is in our example above: copy requirements.txt first, install, then copy code last. Put slow, rarely changing steps early in the file and fast-changing steps late: dependencies before code.
2. Running as root. The default USER is root. If your app gets RCE, the attacker is root inside the container, and on a misconfigured host that is one CVE away from root on the host. Add a non-root user and USER app before ENTRYPOINT.
3. No .dockerignore. Your build context includes .git/, node_modules/, .venv/, __pycache__/, your 4 GB of test fixtures. The daemon ships all of it over the socket before the build even starts. A two-line .dockerignore can cut context size by around 80% and stops you from accidentally COPYing secrets.
4. apt-get update without apt-get install in the same layer. Split across two RUN lines, the update layer caches forever and your install layer pulls stale package indices. Always chain: RUN apt-get update && apt-get install -y --no-install-recommends ... && rm -rf /var/lib/apt/lists/*.
5. pip install without a cache mount. Without --mount=type=cache,target=/root/.cache/pip, every dependency change re-downloads every wheel. With it, pip downloads only the new wheel. Section 6 walks through this.
6. COPY . . instead of COPY --chown=app:app . .. Files land owned by root; the non-root user cannot write to them. You discover this only when the app first tries to write a log file.
7. No STOPSIGNAL, no tini, app does not handle SIGTERM. Your app becomes PID 1, and PID 1 in Linux has special signal semantics: signals whose default action is to terminate the process are ignored unless the process explicitly installs a handler. Section 4 unpacks this.
8. ADD when you meant COPY. ADD auto-extracts tarballs and fetches URLs, which is usually surprising and occasionally dangerous. Use COPY for files and a RUN curl for URLs. Reserve ADD only for the tarball case where you actually want extraction.
9. No multi-stage build. Your final image ships build-essential, git, gcc, header files, and the entire pip download cache. The runtime stage in our example pulls only the installed packages forward from the builder stage, dropping ~400 MB.
10. Latest tags in FROM. FROM python:latest means your build today and your build tomorrow may produce different binaries. Pin the minor (python:3.12-slim) at minimum; pin the digest (python:3.12-slim@sha256:...) for production-critical images. Chapter 1 made the digest-vs-tag distinction; here is where you act on it.
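Several items on this checklist are mechanical enough to check with a script. The sketch below is a toy linter covering five of them (unpinned base, dependency install after COPY . ., lone apt-get update, ADD, and running as root). It is an illustration under simplifying assumptions, not a replacement for a real linter: the function names and sample files are hypothetical, and flags such as --platform on FROM are not handled.

```python
import re


def instructions(dockerfile):
    """Yield (KEYWORD, rest) pairs with backslash continuations joined."""
    buffer = ""
    for raw in dockerfile.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            buffer += line[:-1] + " "
            continue
        keyword, _, rest = (buffer + line).partition(" ")
        buffer = ""
        yield keyword.upper(), rest.strip()


def lint(dockerfile):
    """Return a list of human-readable findings for a few checklist items."""
    findings = []
    stages = set()
    user = None
    copied_everything = False
    for keyword, rest in instructions(dockerfile):
        if keyword == "FROM":
            parts = rest.split()
            image = parts[0]
            name = image.rsplit("/", 1)[-1]
            tag = name.split(":", 1)[1] if ":" in name else ""
            # Skip digests, variables we cannot resolve, and earlier stage names.
            skip = "@sha256:" in image or "$" in image or image in stages
            if not skip and tag in ("", "latest"):
                findings.append(f"unpinned base image: {image}")
            if len(parts) == 3 and parts[1].upper() == "AS":
                stages.add(parts[2])
            user, copied_everything = None, False  # each stage starts fresh
        elif keyword == "USER":
            user = rest
        elif keyword == "ADD":
            findings.append("ADD used; prefer COPY unless extraction is intended")
        elif keyword == "COPY":
            paths = [t for t in rest.split() if not t.startswith("--")]
            if paths[:-1] == ["."]:
                copied_everything = True
        elif keyword == "RUN":
            if "apt-get update" in rest and "apt-get install" not in rest:
                findings.append("apt-get update without install in the same RUN")
            if copied_everything and re.search(r"\b(pip|npm) (install|ci)\b", rest):
                findings.append("dependency install after COPY . . busts the cache")
    if user in (None, "root", "0"):
        findings.append("final stage runs as root")
    return findings


BAD = """\
FROM python:latest
COPY . .
RUN apt-get update
RUN pip install -r requirements.txt
ADD https://example.com/model.tar.gz /models/
"""

GOOD = """\
FROM python:3.12-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY --chown=app:app ./app /app
USER app
"""
```

Running `lint(BAD)` reports five findings and `lint(GOOD)` reports none; the remaining checklist items (.dockerignore, signal handling, multi-stage) need context a single-file scan does not have.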

Mini-FAQ. "Will fixing all ten make my image small?" It will make it correct. Smallness comes from the multi-stage build, the slim/distroless base, and removing build tools. Correctness comes from the other eight.

3) `docker compose` and the dev/prod parity question

docker compose is a YAML file plus a CLI that takes one declarative description of several containers and turns it into a running local stack. It is the killer feature for local development of multi-service systems, because it lets every engineer on your team type docker compose up and get the same database, the same cache, the same reverse proxy, the same app — wired with the same network names and the same volume mounts. The dev/prod parity question is: how far past local-dev can you ride this?

Our ML service grows. The app needs a Postgres for embedding-job metadata, a Redis for caching frequent query results, and an Nginx in front for TLS termination plus request shaping. The compose file:

# compose.yaml
services:
  app:
    build:
      context: .
      target: runtime
    image: embed-svc:dev
    environment:
      DATABASE_URL: postgresql://embed:embed@postgres:5432/embed
      REDIS_URL: redis://redis:6379/0
      MODEL_PATH: /models/embed.bin
    volumes:
      - ./app:/app:ro          # bind mount for hot reload in dev
      - models:/models:ro       # named volume for the 2 GB model
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      test: ["CMD", "python", "-c",
             "import urllib.request,sys;sys.exit(0 if urllib.request.urlopen('http://127.0.0.1:8080/healthz',timeout=2).status==200 else 1)"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 20s
    deploy:
      resources:
        limits:
          memory: 4g
          cpus: "2.0"
    stop_grace_period: 30s

  postgres:
    image: postgres:16.4-alpine
    environment:
      POSTGRES_USER: embed
      POSTGRES_PASSWORD: embed
      POSTGRES_DB: embed
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U embed -d embed"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7.4-alpine
    command: ["redis-server", "--save", "60", "1", "--maxmemory", "512mb",
              "--maxmemory-policy", "allkeys-lru"]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 2s
      retries: 5

  nginx:
    image: nginx:1.27-alpine
    ports:
      - "8443:8443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      app:
        condition: service_healthy

volumes:
  pgdata:
  models:

Notice five compose features that earn their keep here. depends_on.condition: service_healthy gates startup on the upstream container's healthcheck, not just on its process existing. Named volumes (pgdata, models) survive docker compose down while bind mounts (./app) give you hot-reload in dev. Service names (postgres, redis) become DNS names on the default user-defined bridge network — no IP juggling. The target: runtime lets you build a specific stage from the multi-stage Dockerfile. And stop_grace_period: 30s gives the app time to drain in-flight requests before SIGKILL.
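The startup order implied by depends_on is a topological sort of the service graph. The sketch below derives it for the stack above using the standard library's graphlib. It shows ordering only: real Compose additionally waits for each condition (service_healthy, service_started) before moving on, and the function name `startup_waves` is hypothetical.

```python
from graphlib import TopologicalSorter

# depends_on edges copied from the compose file above.
DEPENDS_ON = {
    "app": {"postgres": "service_healthy", "redis": "service_started"},
    "postgres": {},
    "redis": {},
    "nginx": {"app": "service_healthy"},
}


def startup_waves(depends_on):
    """Group services into waves; services in one wave have no mutual dependency."""
    graph = {service: set(deps) for service, deps in depends_on.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular depends_on
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For this stack the waves are postgres and redis first, then app, then nginx, which matches the order you see in docker compose up output.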

Now the parity question. Compose is good enough for production when: one host serves all traffic, the workload fits one machine, downtime during deploy is acceptable, and you have nothing to roll back to. Compose stops being enough when you need multi-host scheduling, rolling updates with health-gated traffic shift, secret rotation without rebuilding, autoscaling, or cross-AZ redundancy. The Distr.sh 2026 review of "should I run plain compose in production" lists five concrete gaps even single-host compose leaves open: log/image cleanup, action-on-unhealthy (Docker reports unhealthy; compose does not restart unhealthy containers — only crashed ones), socket security, atomic updates, and disk pressure. The Grafana Loki team hit one of these directly when v3.6.0 dropped busybox from the image and broke every compose healthcheck that called wget.

Teacher voice. Compose is the friendly local-dev contract. Past one host, you are picking either Kubernetes (rich, complex), ECS/Fargate (managed, opinionated), or Nomad (compose-shaped, multi-host). Do not pretend compose scales horizontally — it scales by replicating processes on the same host, and compose up --scale app=3 does not load-balance across hosts.

4) Healthchecks, restart policies, graceful shutdown: the PID 1 problem

A container's lifecycle has three moments where bad defaults bite. Startup — when is the container actually ready to serve, versus merely running? Steady state — what counts as healthy, and what does the system do when it stops being so? Shutdown — what happens between SIGTERM and SIGKILL? Each moment maps to a Dockerfile or compose feature.

For startup, HEALTHCHECK plus start_period is the answer. The start_period is the grace window where failing probes do not count against the retry budget — perfect for our ML service which spends 18 seconds loading the 2 GB model before it can answer /healthz. Set start_period: 20s and the container does not flap into unhealthy during normal warm-up.
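The effect of start_period is easiest to see in a small simulation. The model below is a simplification of the behaviour described above, not Docker's code: failures inside the start period are not counted until the first success, and `retries` consecutive counted failures flip the status to unhealthy. The function name and the probe timings are hypothetical.

```python
def simulate_health(probes, start_period, retries):
    """Return the health status after each probe.

    probes:       list of (seconds_since_start, passed) in time order
    start_period: grace window in seconds
    retries:      consecutive counted failures needed for "unhealthy"
    """
    status, streak, started = "starting", 0, False
    history = []
    for at, passed in probes:
        if passed:
            status, streak, started = "healthy", 0, True
        elif started or at >= start_period:
            streak += 1
            if streak >= retries:
                status = "unhealthy"
        # else: a failure inside the start period, which is not counted
        history.append(status)
    return history


# Hypothetical timeline: model loads for ~18 s, probes run every 5 s.
WARMUP = [(5, False), (10, False), (15, False), (20, True)]
```

With `start_period=20` the container goes straight from starting to healthy; with `start_period=0` the same timeline passes through unhealthy at the 15-second probe, which is the flapping the paragraph above warns about.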

For steady state, healthchecks plus restart policies close the loop — but only partially. The container restart policy (restart: unless-stopped, on-failure, always) reacts to process exit, not to unhealthy. A container that goes unhealthy but does not exit will keep running, marked red, doing nothing useful. Plain Docker does not act on unhealthy. Compose v3.x ignores deploy.restart_policy.condition: on-failure outside swarm. In production, the orchestrator (Kubernetes, ECS) is what wires healthcheck failures to traffic removal and pod replacement; compose does not.

For shutdown, you must understand PID 1 in Linux. The kernel treats PID 1 specially: a signal such as SIGTERM or SIGUSR1 only has an effect if the process has installed a handler for it, because the default terminate action is not applied to PID 1. A typical Python or Node app installs no SIGTERM handler. So when Docker sends SIGTERM, nothing happens; Docker waits out the grace period (ten seconds by default, configurable with stop_grace_period in compose or docker stop -t), then sends SIGKILL. In-flight requests die mid-write.

SIGTERM HANDLING WITH AND WITHOUT TINI
──────────────────────────────────────

  docker stop                            docker stop
       │ SIGTERM                              │ SIGTERM
       ▼                                      ▼
  ┌─────────────┐                       ┌─────────────────────┐
  │ PID 1: app  │                       │ PID 1: tini         │
  │ (python)    │                       │  └─ PID 2: app      │
  │             │                       │       (python)      │
  │ SIGTERM     │                       │ tini forwards       │
  │ IGNORED     │  ──── 10 s ────►      │ SIGTERM to child    │
  │ by kernel   │                       │ app drains, exits   │
  │             │                       │ tini reaps, exits   │
  └─────────────┘                       └─────────────────────┘
  ▼                                      ▼
  SIGKILL                                CLEAN EXIT
  (in-flight req dies)                   (in-flight req drains)

Three fixes. Option A: install a tiny init like tini and use it as ENTRYPOINT ["tini", "--", ...] — what our example does. Option B: start your container with docker run --init, which injects tini for you. Option C: handle SIGTERM in the app code (FastAPI/Uvicorn does this if invoked correctly; Node's process.on('SIGTERM', ...) works once you remember PID 1 will receive nothing unless you write the handler). STOPSIGNAL SIGTERM in the Dockerfile is the default anyway but writing it makes the intent explicit and lets you switch to SIGINT for tools like Python that prefer it.
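Option C can be sketched in a few lines of Python. This is a minimal illustration, assuming a hand-rolled serving loop; servers such as Uvicorn install their own handlers, so you would not add this on top of them. All names here (`shutdown_requested`, `install_handlers`, `serve`, `running_as_pid1`) are hypothetical.

```python
import os
import signal
import threading

# Set by the signal handler; the serving loop polls it between requests.
shutdown_requested = threading.Event()


def _on_signal(signum, frame):
    """Handler body stays tiny: just record that shutdown was requested."""
    shutdown_requested.set()


def install_handlers():
    """Must be called from the main thread, before the serving loop starts."""
    signal.signal(signal.SIGTERM, _on_signal)
    signal.signal(signal.SIGINT, _on_signal)


def running_as_pid1():
    """True when no init (tini, --init) sits in front of this process."""
    return os.getpid() == 1


def serve(handle_one_request, poll_seconds=0.1):
    """Handle requests until shutdown is requested, then return the count.

    Returning (instead of being killed) is what lets in-flight work finish
    before the grace period expires.
    """
    handled = 0
    while not shutdown_requested.is_set():
        handle_one_request()
        handled += 1
        shutdown_requested.wait(poll_seconds)
    return handled
```

Once `install_handlers()` has run, a SIGTERM from docker stop sets the event and the loop exits on its next iteration, so the process ends with a normal return rather than a SIGKILL.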

5) The daily debugging trio: `exec`, `logs`, `inspect`

When something is wrong inside a container, three commands are the entire toolbox most days. Memorise the flags.

docker exec -it <container> sh lands you inside the running container. The -i keeps stdin open, the -t allocates a TTY. Together they give you an interactive shell. For our ML service, when an embedding request returns garbage, the first move is docker exec -it embed-svc-app-1 bash to check the model file is mounted at /models/embed.bin and the right size. If your image is distroless and has no shell, docker exec cannot help; that is the trade for the security gain.

docker logs -f --tail 200 --since 5m <container> is the second move. The -f follows new lines, --tail limits the backfill, --since accepts both relative (5m, 1h) and absolute times. Combine with --timestamps to correlate with external monitoring. For multi-container stacks, docker compose logs -f --tail 100 app nginx interleaves output across services. Container logs do not live in the writable layer: by default the json-file logging driver writes them to /var/lib/docker/containers/<id>/<id>-json.log on the host, where they grow without bound unless you set the max-size and max-file log options. That is mistake number 11 if I were extending the section 2 list.

docker inspect <container> is the third move and the one most engineers underuse. It dumps the full state of a container or image as JSON — network IPs, mounts, env vars, healthcheck status with last five outputs, OOM kill flag, exit code, restart count. Pipe through jq and you have an answer to almost any "what is the current state of this thing" question. The healthcheck history alone is worth memorising: docker inspect --format '{{json .State.Health}}' embed-svc-app-1 | jq shows the last probe results with timestamps and exit codes, which is how you debug a flaky /healthz without reproducing the failure locally.
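If jq is not at hand, the same health history is easy to summarise in Python. The sketch below parses the JSON that the --format '{{json .State.Health}}' call prints; the field names (Status, FailingStreak, Log, ExitCode, Output) are the ones Docker emits, while the sample payload and the function name `summarize_health` are made up for illustration.

```python
import json

# Hypothetical output of:
#   docker inspect --format '{{json .State.Health}}' embed-svc-app-1
SAMPLE = """
{"Status": "unhealthy", "FailingStreak": 2, "Log": [
  {"Start": "2024-05-01T10:00:00Z", "End": "2024-05-01T10:00:01Z",
   "ExitCode": 0, "Output": ""},
  {"Start": "2024-05-01T10:00:10Z", "End": "2024-05-01T10:00:13Z",
   "ExitCode": 1, "Output": "connection refused\\n"},
  {"Start": "2024-05-01T10:00:20Z", "End": "2024-05-01T10:00:23Z",
   "ExitCode": 1, "Output": "timed out\\n"}
]}
"""


def summarize_health(raw):
    """Condense a .State.Health document into the fields you usually want."""
    health = json.loads(raw)
    log = health.get("Log") or []
    failures = [entry for entry in log if entry["ExitCode"] != 0]
    return {
        "status": health["Status"],
        "failing_streak": health["FailingStreak"],
        "failed_probes": len(failures),
        "last_failure_at": failures[-1]["Start"] if failures else None,
        "last_output": failures[-1]["Output"].strip() if failures else "",
    }
```

In practice you would feed it the captured stdout of the docker inspect command; the summary makes it obvious whether the probe is timing out or being refused outright.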

Mini-FAQ. "What if exec fails with OCI runtime exec failed?" Almost always means the binary you tried to run (bash, sh) does not exist in the image. Distroless and scratch images have no shell. Use a debug sidecar (docker run --rm -it --pid=container:<id> --network=container:<id> nicolaka/netshoot) to share the PID and network namespaces of the target container, so you get a fully-loaded toolbox without touching the original image.

A fourth honourable mention: docker stats for live CPU/memory/IO per container, which tells you in five seconds whether the OOM kill in inspect was caused by your own memory limit or by the host running out. Section 7 has the version-qualified numbers.

6) BuildKit features you should actually use

BuildKit replaced the legacy builder as Docker's default in version 23.0 and unlocked four features that change Dockerfile-writing economics. If you are not using them, your CI is slower and your images leak secrets you do not know about.

Cache mounts — RUN --mount=type=cache,target=/root/.cache/pip persists a directory across builds but does not include it in the final image. The pip download cache survives requirements.txt changes; only the wheels that are actually new download. For our ML service with torch (~700 MB of wheels), this turns a 4-minute cold install into a 20-second warm install. Apply it to apt (/var/cache/apt), npm (/root/.npm), Go (/go/pkg/mod), Maven (/root/.m2) — any package manager.

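Secret mounts: RUN --mount=type=secret,id=npm_token npm install exposes the secret as a file, by default at /run/secrets/npm_token, for the duration of that RUN command only. Add target=/root/.npmrc (or whatever path the tool reads) when the tool expects its credentials at a fixed location. The secret never lands in any layer, the image config, or the build cache. Provide it from the host with docker build --secret id=npm_token,src=$HOME/.npmrc . (prefix the command with DOCKER_BUILDKIT=1 on engines older than 23.0, where BuildKit is not yet the default). This is the only correct way to pass private-registry credentials, GitHub tokens, or model-bucket signing keys at build time. ARGs show up in docker history; secret mounts do not.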
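A build step that needs the token can read it from the mounted file rather than from an environment variable. The helper below is a minimal sketch of that read, assuming the default /run/secrets/<id> location; the function names, the secret id pip_token, and the demo value are all hypothetical.

```python
import tempfile
from pathlib import Path


def read_build_secret(secret_id, secrets_dir="/run/secrets"):
    """Return the contents of a mounted secret, stripped of trailing newlines.

    Raises FileNotFoundError when the secret was not mounted, which is the
    failure you want at build time rather than a silent empty token.
    """
    path = Path(secrets_dir) / secret_id
    if not path.is_file():
        raise FileNotFoundError(f"secret {secret_id!r} is not mounted at {path}")
    return path.read_text().strip()


def demo():
    """Simulate the mount with a temporary directory (illustration only)."""
    with tempfile.TemporaryDirectory() as directory:
        (Path(directory) / "pip_token").write_text("not-a-real-token\n")
        return read_build_secret("pip_token", secrets_dir=directory)
```

Because the file exists only while the RUN instruction executes, a script using this helper leaves nothing behind in the layer it produces, provided it does not write the value anywhere itself.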

ARG vs SECRET — WHERE THE VALUE ENDS UP
────────────────────────────────────────

  ARG PIP_TOKEN=...               --mount=type=secret,id=pip_token
       │                                       │
       ▼                                       ▼
  ┌──────────────┐                        ┌──────────────┐
  │ build layer  │                        │ build layer  │
  │ (visible in  │                        │ (token in    │
  │  history!)   │                        │  /run/secrets│
  └──────┬───────┘                        │  only)       │
         │                                └──────┬───────┘
         ▼                                       │ unmount on
  ┌──────────────┐                               ▼ RUN exit
  │ image config │                        ┌──────────────┐
  │ (still has   │                        │ image config │
  │  the ARG!)   │                        │ (clean,      │
  └──────────────┘                        │  no secret)  │
                                          └──────────────┘

SSH mounts: RUN --mount=type=ssh git clone git@github.com:private/repo.git forwards your ssh-agent into the build when you pass --ssh default to docker build. Same property as secret mounts: the key never touches a layer. Use it for cloning private dependencies during build.

Bind mounts at build time — RUN --mount=type=bind,source=.,target=/src,readonly lets a single RUN step read from the build context without COPYing it into a layer. Combine with a cache mount to compile in /src while keeping the source out of the final image; only the compiled output ends up in the layer. This is how Go and Rust multi-stage builds get truly small final images.

Teacher voice. If you find yourself passing a token through an ARG, stop and reach for --mount=type=secret. Build args persist forever in docker history. Secret mounts do not. There is no exception to this rule.

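For BuildKit + GitHub Actions specifically, the type=gha cache backend with mode=max exports the layer cache for every stage, including intermediate builder stages, so later CI runs can reuse them. Cache-mount contents (the --mount=type=cache directories) are not part of that export and start empty on a fresh runner. HyperDX measured 50% Docker build-time reductions from this single switch on a typical Node/Python stack.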

7) Comparison table: `docker run` flags and their measured behaviour

Senior engineers get paid to know what each flag actually does at the kernel level. Here is the daily-use subset, qualified by Docker 24.x and BuildKit 0.12+ where behaviour has drifted across versions.

Flag	What it sets	Measured effect	Gotcha
`--memory=4g`	cgroup v2 `memory.max`	OOM kill once the cgroup's usage hits 4 GB and nothing more can be reclaimed (kernel decides which process)	Without `--memory-swap` the container may also use up to 4 GB of swap if the host has swap enabled; usually you want `--memory=4g --memory-swap=4g` to disable swap
`--cpus=2.0`	cgroup v2 `cpu.max=200000 100000`	Throttled to 2 CPU-equivalents averaged over 100ms windows	Bursts above 2 are silently throttled; observed in `nr_throttled` counter
`--cpuset-cpus=0-3`	cgroup `cpuset.cpus`	Pinned to logical CPUs 0-3 only	Plays badly with k8s/Nomad which expect the orchestrator to pin
`--pids-limit=200`	cgroup v2 `pids.max`	New `fork()` returns EAGAIN at limit	Default unlimited on most distros; one fork bomb kills the host
`--network=host`	Skips network namespace	Saves ~30µs/req for tight loopback workloads	Container shares host's ports; collisions; not portable
`--network=bridge` (default)	New netns + veth pair on `docker0`	~50-150µs added per packet vs host	NAT through iptables; visible in `iptables -t nat -L`
`--network=none`	Empty netns, only loopback	No external connectivity	Useful for sandboxing untrusted code
`--read-only`	rootfs mounted ro	Writes to non-volume paths fail with EROFS	Almost always also need `--tmpfs /tmp`
`--ulimit nofile=65536:65536`	RLIMIT_NOFILE inside container	App can open 65k fds	Default 1024 trips Nginx, Postgres, anything with many connections
`--restart=unless-stopped`	docker daemon policy	Restarts on any exit and after a daemon restart, unless the container was stopped manually	Does not react to HEALTHCHECK unhealthy state
`--init`	Inserts tini as PID 1	Signals forwarded, zombies reaped	Costs nothing to enable; redundant if tini is already the ENTRYPOINT in the Dockerfile
`--gpus=all`	NVIDIA runtime hook	Mounts `/dev/nvidia*` and CUDA libs	Needs `nvidia-container-toolkit` on host; chapter 1 covered the registry side

Two numbers worth remembering. Bridge network adds roughly 50-150 microseconds per packet versus host on Docker 24 with default iptables-nft, which is measurable for cache backends and invisible for HTTP APIs. Memory limit enforcement under cgroup v2 (supported since Docker 20.10 and the default hierarchy on current mainstream distributions) is immediate rather than the slightly laggy cgroup v1 behaviour, which means your OOM kills now happen at the limit, not 50-200 MB past it.
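The first two rows of the table are plain arithmetic, sketched below. The helpers compute the values Docker writes to `cpu.max` and `memory.max` for a given flag, assuming the default 100 ms CFS period and binary units for the memory suffixes; the function names are hypothetical.

```python
def cpu_max(cpus, period_us=100_000):
    """Return the cgroup v2 cpu.max string for a --cpus value.

    The quota is the CPU time allowed per period, both in microseconds.
    """
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


_UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def memory_max(limit):
    """Return the byte count for a --memory value such as "4g" or "512m"."""
    text = limit.strip().lower()
    if text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # a bare number is already in bytes
```

So `--cpus=2.0` becomes `200000 100000` and `--memory=4g` becomes 4294967296 bytes, which are the values you can confirm by reading the container's cgroup files on the host.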

Where this lives in the wild

The Dockerfile + compose patterns above are not academic; here are 24 places they show up in real production systems, gathered from engineering blogs and primary sources, split into "build-pattern shops" and "compose-and-stack shops."

Build-pattern shops (Dockerfile-heavy):

Netflix Titus — Netflix's container platform consumes Dockerfile-built images as the unit of deployment, with their Bakery pipeline producing AMIs and container images that Titus schedules across EC2 fleets.
Shopify — Shopify maintains a 125-line shared base Dockerfile defining 25 packages spanning Ruby/Python/Node runtimes, used by all storefront containers; documented in "Docker at Shopify: How We Built Containers that Power Over 100,000 Online Shops."
Uber Makisu: Uber's Makisu performs Dockerfile builds inside unprivileged Kubernetes pods using a distributed layer cache, optimised for the monorepo build patterns described in section 2's cache-ordering rule.
Pinterest — Pinterest dockerised 100% of its API fleet and stateless services, using a base-image strategy similar to Shopify's and storing artefacts in ECR with a replicated secondary registry.
Hugging Face Spaces — Spaces accepts user-provided Dockerfiles and builds GPU-aware images on top of nvidia/cuda bases, applying many of the multi-stage and cache-mount patterns from section 6.
NVIDIA NGC — NVIDIA publishes the canonical CUDA/PyTorch/Triton Dockerfiles on nvcr.io, demonstrating the pin-by-digest, non-root, multi-stage shape used as a starting point by most ML teams.
Replit — Each Repl runs in a Dockerfile-defined image that drops to a non-root runner user, illustrating mistake #2 fixed at scale.
GitLab Runner — GitLab Runner's Docker executor spins per-job containers with cache mounts and registry-backed layer cache (type=registry,ref=$CACHE_IMAGE), the production pattern called out in section 6.
GitHub Actions docker/build-push-action — Uses type=gha,mode=max as the recommended cache backend, measured in HyperDX's case study to cut build times by 50%.
Datadog Agent — Distributed as a Dockerfile-built image with HEALTHCHECK and STOPSIGNAL set, mounted with the host Docker socket to auto-discover sibling containers.
PayPal — Containerised 700+ apps via Docker Enterprise using a standardised base-image Dockerfile family, cutting deploy time by ~90% according to their case study.
Bloomberg Data Science Platform — Builds ML training images via Dockerfiles deployed on Kubernetes with KServe, applying multi-stage builds and digest-pinned bases.

Compose-and-stack shops (compose.yaml-heavy or descendants):

Stripe CLI — Stripe ships stripe/stripe-cli and documents the canonical compose.yaml snippet pairing it with a developer's app for webhook forwarding via stripe listen --forward-to web:3000.
VS Code Dev Containers — devcontainer.json builds on docker-compose.yml to define multi-service dev environments (app + DB + cache), the same shape as our section 3 example.
GitHub Codespaces — Uses dev-container compose files to provision identical multi-service environments per branch.
Pinterest Teletraan and Plank — Both open-source Pinterest projects ship docker-compose.yml for local development, evidence that even mature platforms keep compose for the dev loop.
Uber DevPod — Uber's remote dev environment uses compose-shaped multi-container definitions inside Kubernetes pods to give each engineer a personal monorepo stack with pre-cloned services.
Grafana Loki: Loki's official install path with Docker Compose ships a multi-service compose.yaml; the v3.6.0 base-image change broke wget healthchecks, illustrating the healthcheck-binary trap mentioned in section 3.
Jupyter Docker Stacks — scipy-notebook, tensorflow-notebook, and friends ship as Dockerfiles many teams wrap in compose files for classroom-scale multi-user setups.
OneUptime, Plausible, Mastodon, Sentry — All four open-source self-hostable products ship production docker-compose.yml files as their primary install path, demonstrating section 3's "compose is enough for single-host production" envelope.
n8n — n8n's recommended self-host path is a compose stack pairing app + Postgres + reverse-proxy, exactly the shape of our ML stack.
Supabase CLI — supabase start spins up a compose-style stack (Postgres, Realtime, Auth, Storage, Studio) for local development of apps that will later deploy to managed Supabase.
LocalStack — AWS-emulation container that teams pull into their compose.yaml so app code can talk to fake S3/SQS/Lambda locally with the same SDK calls as production.
Distr.sh's 2026 production-compose guide — Documents the operational gaps (cleanup, healing, socket security, atomic updates, disk pressure) you must close to run plain compose in production — exactly the "stops being enough" boundary section 3 named.

Pause and recall

Which Dockerfile instructions create layers, and which only modify the image config?
In the ML-service Dockerfile, why does COPY requirements.txt . appear before COPY ./app /app?
What is the difference between ARG and ENV, and which one is unsafe for secrets?
Why does restart: unless-stopped not react to a container being marked unhealthy?
What does tini do that a Python app running as PID 1 does not?
Why is --mount=type=secret strictly better than ARG for passing build-time credentials?
Name three failure modes that show up at the boundary where compose stops being enough.
What does docker inspect --format '{{json .State.Health}}' show you, and when would you reach for it?

Interview Q&A

Q1. Walk me through how you would optimise a Dockerfile that takes 12 minutes to build. A. First, look at layer ordering — copy dependency manifests before source code so that pip/npm install caches across code changes. Second, switch to multi-stage builds so build tools do not ship in the runtime image. Third, add BuildKit cache mounts (--mount=type=cache,target=/root/.cache/pip) so dependency caches persist across builds. Fourth, add a .dockerignore so the build context does not include .git, node_modules, fixtures. Fifth, pin a slim base (python:3.12-slim) and verify with docker history that no layer is wasted. Each of these typically cuts 30-60% off the build; together they often turn 12 minutes into 90 seconds. Common wrong answer to avoid: "Use --no-cache to make builds reproducible." That makes them slow, not optimised; reproducibility comes from digest-pinned bases, not from disabling cache.

Q2. What is the difference between ENTRYPOINT and CMD? A. ENTRYPOINT declares what the container is — the binary that always runs. CMD provides default arguments to that binary, overridable from the command line. The production-grade shape is exec-form ENTRYPOINT ["myapp"] plus CMD ["--default-flag"]; this lets docker run myimage --other-flag override CMD while keeping the binary fixed. If you put the binary in CMD alone, anyone can replace it; if you put everything in ENTRYPOINT, you lose argument overrides. Both in exec form preserves signal handling — shell form spawns a /bin/sh -c wrapper that intercepts SIGTERM. Common wrong answer to avoid: "They are interchangeable, use whichever feels natural." They are not — the difference shows up in signal forwarding and CLI overrides.

Q3. Why might a container exit immediately after docker run without any error? A. Because the foreground process exited. Containers live exactly as long as PID 1 lives. Common causes: the command was not found or not executable (exit code 127 or 126, with the message visible only in docker logs); the process exited 0 immediately because it expected an interactive stdin and there was no -i (a bare shell does this); or the entrypoint script launched the real server in the background with & and then returned, so PID 1 ended while the server was still starting. Always check docker logs <container> and docker inspect --format '{{.State.ExitCode}}' <container> to distinguish "started and died" from "never started." Common wrong answer to avoid: "Docker bug, restart the daemon." Containers exiting fast is almost always a Dockerfile or CMD bug, not a daemon bug.
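The exit code from docker inspect carries more information than it first appears to. The sketch below decodes the common conventions: 125, 126, and 127 as reported by docker run, and 128 plus the signal number for processes killed by a signal. The function name `describe_exit_code` is hypothetical, and codes outside these conventions are simply reported as application errors.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short diagnosis."""
    if code == 0:
        return "exited cleanly"
    if code == 125:
        return "docker run itself failed"
    if code == 126:
        return "command found but could not be invoked"
    if code == 127:
        return "command not found"
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"
        return f"killed by {name}"
    return f"application error ({code})"
```

Two values come up constantly: 143 means the process ended on SIGTERM, and 137 means it was SIGKILLed, typically by the stop timeout or the OOM killer.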

Q4. Why is RUN apt-get update && apt-get install better than two separate RUN lines? A. Because each RUN creates a separate cache layer. If apt-get update is in its own layer, BuildKit caches it indefinitely; the next build skips the update and pulls stale package indices into apt-get install, which then fails or installs old vulnerable versions. Chaining them in a single RUN ties their cache invalidation to the same key, so any change to the install list (or a no-cache rebuild) refreshes the package indices first. The same logic applies to pip and npm install pairs. Common wrong answer to avoid: "Fewer layers means a smaller image." Modern BuildKit collapses metadata cheaply; the real reason is cache correctness, not layer count.

Q5. Your container's healthcheck shows it as unhealthy in production, but it keeps running and serving traffic. Why? A. Because Docker's healthcheck reports status; it does not act on it. Plain Docker and plain compose do not restart unhealthy containers — they restart only on process exit. An orchestrator (Kubernetes liveness probes, ECS task health checks, Nomad checks) is what wires unhealthy state to traffic removal and replacement. If you are running compose in production, the fix is either an external watchdog (autoheal container) or moving to a real orchestrator. The healthcheck is still useful — load balancers can read it via docker inspect — but it is not self-acting. Common wrong answer to avoid: "Docker should restart unhealthy containers; it must be misconfigured." Docker has never restarted unhealthy containers; this is documented behaviour, not a bug.

Q6. How do you safely pass a private package-registry token to a Dockerfile? A. Use BuildKit secret mounts: RUN --mount=type=secret,id=npm_token,target=/root/.npmrc npm ci in the Dockerfile, and provide the file from the host with docker buildx build --secret id=npm_token,src=$HOME/.npmrc . (the trailing dot is the build context). The secret is exposed only to that RUN instruction and never lands in any layer or in docker history. Build args (ARG) and env vars are wrong because both persist in the image: docker history exposes ARG values, and the image config retains ENVs. SSH credentials follow the same pattern with --mount=type=ssh. Common wrong answer to avoid: "Use ARG TOKEN and then unset it in the same layer." docker history will still show the ARG value; the unset does not erase the layer.

Q7. When does docker compose stop being a viable production tool? A. The moment you need any of: multi-host scheduling, rolling updates with health-gated traffic shift, automatic recovery from unhealthy state, secret rotation without rebuilds, cross-AZ redundancy, or horizontal scaling beyond one machine. Single-host compose can run production workloads if you accept downtime during deploys and add external tooling for log/image cleanup, watchdogs for unhealthy containers, and disk-pressure monitoring. Past that, Kubernetes, Nomad, or ECS/Fargate are the right next steps. Compose is the friendly local-dev contract; production-grade orchestration is a different category. Common wrong answer to avoid: "Compose is only for development." Plenty of small production systems run on compose successfully; the question is which limitations bite you, not whether compose is "allowed" in production.

Q8. Your Python app inside a container does not respond to docker stop. Why, and how do you fix it? A. Because Python is running as PID 1, and the Linux kernel treats PID 1 of a PID namespace specially: a signal whose disposition is still the default is not delivered, so SIGTERM reaches PID 1 only if the process has explicitly installed a handler for it. CPython installs a handler for SIGINT by default but not for SIGTERM, so the signal is silently dropped. After the default 10-second grace period Docker sends SIGKILL and the app dies hard, losing in-flight requests. Three fixes: (a) run the container with docker run --init, which injects tini as PID 1; (b) use ENTRYPOINT ["tini", "--", "python", ...] in the Dockerfile; or (c) install a signal.signal(signal.SIGTERM, ...) handler in the app code. Production-grade servers like Uvicorn and Gunicorn install their own SIGTERM handlers and shut down gracefully when invoked as the entrypoint, but only when the signal actually reaches them; behind a sh -c wrapper the shell is typically PID 1 and does not forward it. Common wrong answer to avoid: "Increase the stop grace period." That only delays the SIGKILL; it does nothing to make the app actually drain.
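
Fix (c) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the service's actual shutdown code: serve_until_stopped is a hypothetical stand-in for the request loop, and the handler only sets a flag so that the main loop decides when it is safe to exit.

```python
import signal
import threading

# Set by the handler; checked by the main loop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record that a stop was requested; do no real work inside the handler."""
    shutdown_requested.set()


# Must run in the main thread. Once a handler is installed, the kernel
# delivers SIGTERM even when this process is PID 1 in the container.
signal.signal(signal.SIGTERM, handle_sigterm)


def serve_until_stopped(poll_seconds: float = 0.1) -> str:
    """Hypothetical request loop: run until SIGTERM arrives, then drain."""
    while not shutdown_requested.is_set():
        # Stand-in for handling one request or one unit of work.
        shutdown_requested.wait(poll_seconds)
    # Finish in-flight work, close connections, flush buffers here.
    return "drained"


if __name__ == "__main__":
    # Simulate `docker stop` by signalling ourselves, then exit cleanly.
    signal.raise_signal(signal.SIGTERM)
    print(serve_until_stopped())
```

Setting a flag rather than calling sys.exit() inside the handler keeps the drain logic in one place and avoids interrupting a request halfway through. With Uvicorn or Gunicorn you would normally rely on the server's own handler instead of writing this yourself.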

Apply now (10 min)¶

Step 1 — model the exercise. I will fill in the six-row "Dockerfile audit" table for our ML service, so you can copy its shape.

Question	What I look for in our ML service	Red flag in a different image
What user does the app run as?	`USER app` (uid 10001), set before ENTRYPOINT	Image runs as root; no USER line
Is the base pinned?	`python:3.12-slim` with digest pinned for prod	`FROM python:latest`
Is there a multi-stage build?	Builder produces `/install`, runtime copies it	Single stage; ships build-essential and pip cache
Are secrets via ARG or mount?	Build uses `--mount=type=secret,id=hf_token`	`ARG HF_TOKEN` then `RUN ...` — leaks in history
Is PID 1 init-aware?	`ENTRYPOINT ["tini", "--", ...]` plus `STOPSIGNAL SIGTERM`	App is PID 1, no tini, ignores SIGTERM
Is there a meaningful HEALTHCHECK?	Hits `/healthz` with 20s start_period for model load	No HEALTHCHECK or one that always returns 0
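
If you want a first pass before reading a Dockerfile by hand, the six rows can be approximated with simple text checks. The sketch below is a hypothetical helper, not a real linter: it ignores line continuations, accepts any non-latest tag as "pinned" (stricter than nothing, looser than the digest pin the table asks for in prod), and flags ARG names by keyword only, so treat every result as a prompt to look, not a verdict.

```python
import re

# Heuristic: ARG names that look like credentials.
SECRET_ARG = re.compile(r"^ARG\s+\w*(TOKEN|SECRET|PASSWORD|KEY)\w*", re.IGNORECASE)


def audit_dockerfile(text: str) -> dict:
    """Return one boolean per audit row; True means the row looks green."""
    lines = [ln.strip() for ln in text.splitlines()]
    instructions = [ln for ln in lines if ln and not ln.startswith("#")]

    def starting_with(keyword: str) -> list:
        prefix = keyword.upper() + " "
        return [ln for ln in instructions if ln.upper().startswith(prefix)]

    froms = starting_with("FROM")
    stage_names = set()
    pinned = bool(froms)
    for ln in froms:
        # Drop flags such as --platform=... before reading the image name.
        tokens = [t for t in ln.split()[1:] if not t.startswith("--")]
        image = tokens[0]
        is_external = image.lower() not in stage_names and image != "scratch"
        if len(tokens) >= 3 and tokens[1].upper() == "AS":
            stage_names.add(tokens[2].lower())
        if not is_external or "@sha256:" in image:
            continue
        last = image.split("/")[-1]
        tag = last.split(":", 1)[1] if ":" in last else ""
        if tag in ("", "latest"):
            pinned = False

    users = starting_with("USER")
    entrypoints = starting_with("ENTRYPOINT")
    healthchecks = starting_with("HEALTHCHECK")

    return {
        "non_root_user": bool(users)
        and users[-1].split()[1].split(":")[0] not in ("root", "0"),
        "base_pinned": pinned,
        "multi_stage": len(froms) >= 2,
        "no_secret_args": not any(SECRET_ARG.match(ln) for ln in instructions),
        "init_as_pid1": bool(entrypoints) and "tini" in entrypoints[-1],
        "has_healthcheck": bool(healthchecks)
        and healthchecks[-1].split()[1].upper() != "NONE",
    }
```

Calling audit_dockerfile(open("Dockerfile").read()) returns a dict you can print row by row. It cannot tell whether a HEALTHCHECK is meaningful or whether a secret mount is actually used, so the manual pass in Step 2 is still the real exercise.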

Step 2 — your turn. Pick a Dockerfile from your own codebase or any project you have shipped. Fill the same six rows. Mark each green if the practice matches the section 2 + section 4 + section 6 rules, red otherwise. Three or more reds and the image is worth rewriting before the next deploy.

Step 3 — sketch from memory. Redraw the ARG vs SECRET — where the value ends up diagram from section 6 and the SIGTERM handling with and without tini diagram from section 4. Side by side. Label every arrow with what travels along it (value, signal, exit code). If you can do both cold, the build-secret hygiene and the PID-1 trap are now internalised.

Bridge. You can now write a Dockerfile that builds quickly, ships small, runs as a non-root user, handles signals correctly, and wires into a compose stack that mirrors production closely enough for the dev loop. The next chapter takes the four-container stack we just built and asks what breaks when it hits real networks and real disk pressure — bridge vs host vs overlay networks, bind mounts vs named volumes, and the production failure modes that wake you up at 2 a.m. → 03-networking-volumes-prod-gotchas.md