01. Images, layers, OCI: what an image actually is

~18 min read. You typed FROM python:3.12-slim a thousand times. You pulled, you tagged, you pushed. But what actually lives behind that name? A manifest, a config, and a tree of content-addressed blobs the kernel never sees as a single file. By the end of this page the image stops being a magical brick and becomes a thing you can reason about — and optimise — for a 2 GB ML model and a $30,000/year build farm.

Builds on: 00-eli5.md.

1) What an image actually is

Open a registry. Type docker pull python:3.12-slim. The client does not download a file called python:3.12-slim. There is no such file anywhere. What it downloads is a small JSON document called the manifest, then another small JSON document called the config, then a handful of compressed tarballs called layer blobs. Every one of those things is named by the SHA-256 hash of its own bytes. That hash is the digest. The tag python:3.12-slim is just a friendly pointer the registry happens to keep in a side-table. The truth of the image is the digest tree.
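To make "named by the SHA-256 hash of its own bytes" concrete, here is a minimal sketch of how a digest string is derived. The helper name `digest_of` and the two byte strings are illustrative stand-ins, not real blobs from any registry.

```python
import hashlib


def digest_of(data: bytes) -> str:
    """Return an OCI-style digest string: 'sha256:' followed by 64 hex characters."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Two slightly different byte strings standing in for real blobs.
blob_a = b'{"architecture": "amd64"}'
blob_b = b'{"architecture": "arm64"}'

print(digest_of(blob_a))
print(digest_of(blob_b))
# Same bytes always produce the same name; different bytes produce a different one.
print(digest_of(blob_a) == digest_of(blob_a), digest_of(blob_a) == digest_of(blob_b))
```

The name is a pure function of the content, which is why a tag can move while a digest cannot.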

This matters because once you understand the tree, you understand caching, security signing, multi-arch builds, layer reuse, and the entire "why is my build slow" investigation. The OCI image specification (the open standard the Docker image format was rebased onto in 2017) needs only four kinds of objects to describe a single-platform image: descriptors (typed pointers carrying a digest, a media type, and a size), manifests (lists of descriptors that describe one image for one platform), image configs (JSON blobs that describe the rootfs and the runtime defaults), and layer blobs (the tarballs, usually gzipped, that, stacked in order, form the rootfs). That is the entire model for one platform; multi-arch images add one wrapper, the image index, which section 3 covers.

OCI IMAGE MANIFEST TREE (single-platform)
─────────────────────────────────────────

   tag: python:3.12-slim ─────────► (registry side-table, mutable)
                                            │
                                            ▼
   ┌───────────────────────────────────────────────────────────┐
   │  MANIFEST  (sha256:b7f4...)                               │
   │  mediaType: application/vnd.oci.image.manifest.v1+json    │
   │  ┌─────────────────────────────────────────────────────┐  │
   │  │ config descriptor                                   │  │
   │  │   digest: sha256:9c1a...  size: 2 KB                │──┼──► CONFIG JSON
   │  │   mediaType: ...image.config.v1+json                │  │    (env, cmd, rootfs.diff_ids[],
   │  └─────────────────────────────────────────────────────┘  │     history, architecture)
   │  ┌─────────────────────────────────────────────────────┐  │
   │  │ layers[0..N] descriptors                            │  │
   │  │   layer 0  digest: sha256:31b3...  size: 29 MB      │──┼──► BLOB (debian rootfs tar.gz)
   │  │   layer 1  digest: sha256:7e02...  size: 4 MB       │──┼──► BLOB
   │  │   layer 2  digest: sha256:c1de...  size: 11 MB      │──┼──► BLOB (python install)
   │  │   layer 3  digest: sha256:aaaa...  size: 800 MB     │──┼──► BLOB (pip install -r req.txt)
   │  │   layer 4  digest: sha256:bbbb...  size: 2 KB       │──┼──► BLOB (COPY ./app)
   │  └─────────────────────────────────────────────────────┘  │
   └───────────────────────────────────────────────────────────┘

Notice three things in this picture. First, every arrow carries a digest and a size: those two fields are how the client knows what to fetch and how to verify it. Second, the config is the thing that ties layer order, environment variables, default command, and architecture together; it is not one of the layers, even though its digest sits inside the manifest. Third, the manifest's own digest is what you should pin in production. The tag can be moved; the digest cannot.
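The digest-plus-size check described above can be sketched in a few lines. This is a toy model, not a registry client: the blobs are fake byte strings, `descriptor_for` and `verify_blob` are hypothetical helper names, and the manifest is serialised with `sort_keys=True` purely so the example is deterministic (a real registry hashes the exact bytes that were pushed).

```python
import hashlib
import json


def descriptor_for(media_type: str, data: bytes) -> dict:
    """Build a descriptor (mediaType, digest, size) for a blob."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }


def verify_blob(descriptor: dict, data: bytes) -> bool:
    """Check downloaded bytes against the size and digest the descriptor promised."""
    if len(data) != descriptor["size"]:
        return False
    return "sha256:" + hashlib.sha256(data).hexdigest() == descriptor["digest"]


# Toy blobs standing in for a real config JSON and two layer tarballs.
config_blob = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
layer_blobs = [b"fake debian rootfs tarball", b"fake python install tarball"]

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": descriptor_for("application/vnd.oci.image.config.v1+json", config_blob),
    "layers": [
        descriptor_for("application/vnd.oci.image.layer.v1.tar+gzip", blob)
        for blob in layer_blobs
    ],
}

# The digest you would pin is the hash of the manifest's own serialised bytes.
manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
manifest_digest = "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()

print(verify_blob(manifest["layers"][0], layer_blobs[0]))    # untouched bytes
print(verify_blob(manifest["layers"][0], b"tampered bytes"))  # wrong size and hash
```

Because the manifest embeds every layer digest, changing any layer changes the manifest bytes and therefore the manifest digest too.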

Teacher voice. Tags are like sticky notes. Digests are like fingerprints. If you deploy by tag, you are deploying whatever fingerprint the sticky note happens to point at this minute. If you deploy by digest, you are deploying the exact bytes you tested.

2) How layers stack at runtime

Now the running container. Each layer blob, once pulled, is unpacked into a directory on disk, somewhere like /var/lib/docker/overlay2/<id>/diff. These directories are read-only. When you docker run an image, the runtime creates one more directory on top, the writable upper layer, and asks the Linux kernel to glue them all together using a union filesystem. On modern Docker that union filesystem is overlayfs, exposed via the overlay2 storage driver. The container process believes it is looking at a single root filesystem. The kernel is actually serving each read from the topmost layer that contains the file and routing writes to the upper layer.

Meet our worked example. You are containerising a Python ML inference service that loads a 2 GB model file. The service is a FastAPI app, depends on torch, transformers, sentencepiece, and numpy, and ships behind an Nginx sidecar. The model is a fine-tuned sentence-transformers checkpoint stored on S3. The whole thing must run on both linux/arm64 (your dev laptops are Apple Silicon) and linux/amd64 (your prod nodes are m5.large), and survive being killed and restarted 50 times an hour by your autoscaler. This service threads through every section that follows.

LAYERS ON DISK (read-only)             UNION VIEW INSIDE CONTAINER
──────────────────────────             ───────────────────────────

  /var/lib/docker/overlay2/
                                                /
   ┌──────────────────────────┐                 ├── bin/ ◄── debian
   │ L0  debian:slim rootfs   │  ┐              ├── etc/ ◄── debian
   │     /bin /etc /lib ...   │  │              ├── lib/ ◄── debian
   └──────────────────────────┘  │              ├── usr/
                                 │              │   ├── local/bin/python ◄── L2
   ┌──────────────────────────┐  │              │   └── local/lib/python3.12/
   │ L1  apt deps             │  │              │       └── site-packages/
   │     libgomp1, libssl3    │  │  overlayfs   │           ├── torch/      ◄── L3
   └──────────────────────────┘  ├──── merge ──►│           ├── transformers/ ◄── L3
                                 │              │           └── numpy/      ◄── L3
   ┌──────────────────────────┐  │              ├── app/
   │ L2  python 3.12          │  │              │   ├── main.py ◄── L4
   └──────────────────────────┘  │              │   └── model.bin ◄── L5 (2 GB)
                                 │              └── tmp/             ◄── upper (writable)
   ┌──────────────────────────┐  │                   └── inference.log
   │ L3  pip install reqs.txt │  │
   │     torch, transformers  │  │
   └──────────────────────────┘  │
                                 │
   ┌──────────────────────────┐  │
   │ L4  COPY ./app /app      │  │
   └──────────────────────────┘  │
                                 │
   ┌──────────────────────────┐  │
   │ L5  ADD model.bin /app/  │  │   ◄── 2 GB, biggest layer
   └──────────────────────────┘  │
                                 │
   ┌──────────────────────────┐  │
   │ UPPER (per-container)    │  ┘
   │ writable; logs, /tmp     │
   └──────────────────────────┘

Two crucial properties fall out of this. Copy-on-write: the moment the container modifies any file, overlayfs copies that file up into the writable layer and writes it there. The original in the lower layer is untouched. Ten replicas of the same container share one on-disk copy of every read-only layer; only their upper layers diverge. Shared blobs across images: if you build two images that both FROM python:3.12-slim, the four base layers exist exactly once on disk and exactly once in the registry. The client only downloads layers it does not already have, matched by digest. That is why a docker pull of an image with mostly-shared layers is so fast.
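The lookup and copy-on-write rules can be modelled with plain dictionaries. This is only a toy: real overlayfs works on directories and files in the kernel and marks deletions with whiteout entries, whereas `UnionView` and `WHITEOUT` below are hypothetical names invented for the sketch.

```python
WHITEOUT = object()  # toy stand-in for an overlayfs whiteout (a "deleted here" marker)


class UnionView:
    """Toy model of one container's merged view over shared read-only layers."""

    def __init__(self, lower_layers):
        # lower_layers: list of dicts (path -> content), bottom layer first.
        # They are shared between containers and never modified here.
        self.lower_layers = lower_layers
        self.upper = {}  # per-container writable layer

    def read(self, path):
        # The upper layer wins, then the lower layers from top to bottom.
        for layer in [self.upper, *reversed(self.lower_layers)]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path)
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        # Writes never touch the lower layers; the new version lives in upper.
        self.upper[path] = content

    def delete(self, path):
        # Deleting a lower-layer file only hides it from this container.
        self.upper[path] = WHITEOUT


base = {"/etc/os-release": "debian", "/app/main.py": "v1"}
patch = {"/app/main.py": "v2"}  # a later image layer shadowing the base file
layers = [base, patch]

replica_a = UnionView(layers)
replica_b = UnionView(layers)
replica_a.write("/tmp/inference.log", "started")
replica_a.delete("/etc/os-release")

print(replica_a.read("/app/main.py"))     # the topmost image layer wins
print(replica_b.read("/etc/os-release"))  # replica_b is unaffected by replica_a
```

Both replicas point at the same `layers` list, which mirrors how ten containers share one on-disk copy of each read-only layer.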

Mini-FAQ. "If layers are read-only, how does my app write logs?" The writes go to the upper layer, which is per-container and gone when the container is removed. For durable writes, you mount a volume — chapter 3 covers that.

3) Content-addressable storage and the registry

Why is everything named by hash? Because content-addressable storage gives you three properties for free: deduplication, verification, and immutability. Two layers with byte-identical contents have the same digest, so the registry stores one copy and serves both. A client can verify a downloaded layer by hashing the bytes itself — if the hash differs, the bytes are wrong. And nobody can quietly change "the python:3.12-slim image I tested last week" because the digest is the bytes; change the bytes and the digest changes too.

The registry, defined by the OCI distribution specification, is essentially a content-addressed blob store with a thin tag index on top. Every push uploads blobs first, then a manifest that points at them by digest. Every pull is the reverse — fetch the manifest, then fetch only the blobs the local cache does not already have. The wire protocol is dull HTTP. The cleverness is entirely in the digest discipline.
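A toy in-memory model shows the "blob store plus thin tag index" shape and the push and pull order described above. `ToyRegistry` is a hypothetical class for illustration; it skips HTTP, authentication, and media types entirely, and the layer contents are fake.

```python
import hashlib
import json


def digest_of(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


class ToyRegistry:
    """In-memory sketch: content-addressed blobs plus a mutable tag index."""

    def __init__(self):
        self.blobs = {}      # layer digest -> bytes (deduplicated by construction)
        self.manifests = {}  # manifest digest -> manifest bytes
        self.tags = {}       # tag -> manifest digest (the mutable side-table)

    def push(self, tag, layers):
        descriptors = []
        for data in layers:  # blobs go up first
            self.blobs[digest_of(data)] = data
            descriptors.append({"digest": digest_of(data), "size": len(data)})
        manifest_bytes = json.dumps({"layers": descriptors}, sort_keys=True).encode()
        manifest_digest = digest_of(manifest_bytes)
        self.manifests[manifest_digest] = manifest_bytes  # then the manifest
        self.tags[tag] = manifest_digest                  # then the tag pointer
        return manifest_digest

    def pull(self, ref, local_cache):
        """ref may be a tag or a manifest digest; returns (digest, fetched layers)."""
        manifest_digest = self.tags.get(ref, ref)
        manifest = json.loads(self.manifests[manifest_digest])
        fetched = []
        for desc in manifest["layers"]:
            if desc["digest"] in local_cache:
                continue  # already on disk, nothing to download
            data = self.blobs[desc["digest"]]
            if digest_of(data) != desc["digest"]:
                raise ValueError("blob does not match its digest")
            local_cache[desc["digest"]] = data
            fetched.append(desc["digest"])
        return manifest_digest, fetched


registry = ToyRegistry()
base, deps = b"debian rootfs", b"pip packages"
digest_a = registry.push("svc-a:v1", [base, deps, b"service A code"])
digest_b = registry.push("svc-b:v1", [base, deps, b"service B code"])

cache = {}
_, fetched_a = registry.pull("svc-a:v1", cache)
_, fetched_b = registry.pull("svc-b:v1", cache)
print(len(registry.blobs), len(fetched_a), len(fetched_b))

# Moving the tag does not change what the old digest resolves to.
digest_a2 = registry.push("svc-a:v1", [base, deps, b"service A code, hotfix"])
print(registry.tags["svc-a:v1"] == digest_a, digest_a in registry.manifests)
```

Two images sharing a base store the shared layers once, the second pull only fetches what is new, and the original digest keeps resolving to the original bytes after the tag moves.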

For multi-arch images the manifest is wrapped in one more level: an image index (also called a manifest list) is itself a manifest whose entries are platform-specific manifests. When you docker pull python:3.12-slim on Apple Silicon, the client fetches the index, picks the entry whose platform: {os: linux, architecture: arm64} matches, and only then pulls that platform's manifest and blobs. Your prod cluster does the same with amd64. Same tag, two completely different layer trees, selected at pull time. docker buildx build --platform linux/amd64,linux/arm64 --push is what produces that index.
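The platform-selection step can be sketched as a simple search over the index entries. The index below is hand-written with placeholder digests, and `select_manifest` is a hypothetical helper; real clients also weigh fields such as `variant` and `os.version`, which this sketch ignores.

```python
def select_manifest(index: dict, os_name: str, architecture: str) -> dict:
    """Return the index entry whose platform matches the requested os/architecture."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == architecture:
            return entry
    raise LookupError(f"no manifest for {os_name}/{architecture}")


# Hand-written index with placeholder digests (not real image digests).
index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:" + "a" * 64,
            "size": 1234,
            "platform": {"os": "linux", "architecture": "amd64"},
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:" + "b" * 64,
            "size": 1234,
            "platform": {"os": "linux", "architecture": "arm64"},
        },
    ],
}

laptop = select_manifest(index, "linux", "arm64")  # Apple Silicon dev machine
prod = select_manifest(index, "linux", "amd64")    # m5.large node
print(laptop["digest"] != prod["digest"])
```

Only after this choice does the client fetch a platform manifest and its blobs, so the other architecture's layers are never downloaded.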

Teacher voice. A registry is not a file server. It is a Merkle DAG with HTTP wrapping. Treat it that way and the whole system makes sense — including why pinning by digest is the only honest way to deploy.

4) Where caches help and hurt

The Dockerfile build cache is the single biggest lever you have on build speed. Each filesystem-changing instruction in a Dockerfile (RUN, COPY, ADD) produces a layer, while instructions such as ENV and CMD only update the image config; BuildKit hashes each instruction together with its inputs (the previous layer's digest, plus any files it COPYs in, plus the literal command text) and looks that hash up in a cache. Hit, and the cached layer is reused. Miss, and the layer is rebuilt, and so is every layer after it. That last clause is the whole game.

Watch what this means for the ML inference service. A naive Dockerfile looks like this:

FROM python:3.12-slim
COPY . /app
RUN pip install -r /app/requirements.txt
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0"]

Every time you change any file in your repo — even a comment in README.md — the COPY . /app step's input hash changes. The layer cache invalidates. The next instruction, pip install, reruns from scratch, pulling 1.5 GB of torch wheels off PyPI. Your CI build that should be 20 seconds takes 8 minutes. The fix is to copy only the requirements file first, install, and then copy the code:

FROM python:3.12-slim
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt
COPY . /app
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0"]

Now editing app/main.py invalidates only the second COPY. The pip install layer stays cached. Twenty-second builds again. Simple, no?
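The invalidation cascade can be simulated with a chained hash. This is a deliberately simplified model of the description above (parent key, instruction text, copied file contents); BuildKit's real cache keys are more involved, and `cache_keys`, `invalidated`, and the three-file repo are assumptions made for the sketch.

```python
import hashlib


def cache_keys(steps, files, base_digest="sha256:base"):
    """steps: list of (instruction text, [input paths]); files: path -> bytes."""
    keys = []
    parent = base_digest
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode() + b"\0" + instruction.encode())
        for path in sorted(inputs):
            h.update(b"\0" + path.encode() + b"\0" + files[path])
        parent = h.hexdigest()  # each key feeds into the next step's key
        keys.append(parent)
    return keys


def invalidated(steps, files_before, files_after):
    """Return the instructions whose cache key changes between two repo states."""
    before = cache_keys(steps, files_before)
    after = cache_keys(steps, files_after)
    return [step[0] for step, b, a in zip(steps, before, after) if b != a]


repo = {
    "requirements.txt": b"torch\ntransformers\n",
    "app/main.py": b"print('v1')\n",
    "README.md": b"# service\n",
}
everything = list(repo)

naive = [
    ("COPY . /app", everything),
    ("RUN pip install -r /app/requirements.txt", []),
]
reordered = [
    ("COPY requirements.txt /app/requirements.txt", ["requirements.txt"]),
    ("RUN pip install -r /app/requirements.txt", []),
    ("COPY . /app", everything),
]

edited = {**repo, "README.md": b"# service (edited)\n"}
print(invalidated(naive, repo, edited))      # COPY and the pip install both miss
print(invalidated(reordered, repo, edited))  # only the final COPY misses
```

In the naive order a README edit changes the first key, and every later key inherits the change; in the reordered file the pip install key depends only on requirements.txt.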

BuildKit, the modern build backend that has been the default builder since Docker Engine 23.0, gives you two more knobs. --cache-to and --cache-from export and import the layer cache across machines, which is essential for CI, where every job starts on a fresh runner with an empty cache; with cache-to: type=gha,mode=max against the GitHub Actions cache, teams report cutting warm builds from 8 minutes to 1. And cache mounts (RUN --mount=type=cache,target=/root/.cache/pip pip install ...) persist a build-time scratch directory across builds without putting it into the final image. The pip download cache lives in /root/.cache/pip; mount it and a layer rebuild on the same builder re-downloads only the wheels it has not seen before. The cache mount is invisible to the image (it does not become a layer and does not bloat the final size), but it can make the rebuild many times faster.

Mini-FAQ. "Will the cache mount end up in the image?" No. It exists only during the build. The final layer contains the installed packages under /usr/local/lib/python3.12/site-packages/, not the wheel cache.

5) Multi-stage builds and the OCI image spec

Now the model file. Two GB of weights cannot live in requirements.txt. You probably download it during build, or — better — at container start from S3. Either way, the build of the image often pulls in tools (compilers, git, curl, build-essential, sometimes CUDA toolkit) that the runtime does not need. Shipping those tools is a security and bandwidth tax. Multi-stage builds solve this by letting one Dockerfile produce multiple intermediate images and copy only the artefacts from one stage into the next.

Here is a multi-stage Dockerfile for the inference service that demonstrates layer ordering, cache mounts, and stage separation in one go:

# syntax=docker/dockerfile:1.7
# ---------- builder stage ----------
FROM python:3.12 AS builder
WORKDIR /build
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --prefix=/install -r requirements.txt

# ---------- runtime stage ----------
FROM python:3.12-slim AS runtime
WORKDIR /app
COPY --from=builder /install /usr/local
COPY ./app /app
# model fetched at start by entrypoint, NOT baked into the image
ENV MODEL_S3_URI=s3://ml-prod/models/embed-v3.bin
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

The builder stage is allowed to be large — it has build-essential, compiled wheels, git, the lot. None of that ships. The final image is built FROM python:3.12-slim and only inherits the installed packages via COPY --from=builder /install /usr/local. The 2 GB model is intentionally not baked into the image; it is fetched at container start from S3 into a mounted volume. Two reasons for that choice. One, baking a 2 GB layer into the image means every cold-start pull moves 2 GB across the network — terrible for autoscaling. Two, retraining means rebuilding the image, even if no application code changed. Decouple model from image. Treat the model like data, not like code.
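The "fetch at start, into a mounted volume" step could look roughly like the sketch below. It is an assumption-laden illustration, not the service's actual entrypoint: `parse_s3_uri` and `ensure_model` are hypothetical names, and the downloader is injected so the example runs without AWS credentials (in a real entrypoint it might be boto3's `download_file`, which takes a bucket, a key, and a local filename).

```python
import os
import tempfile
from urllib.parse import urlparse


def parse_s3_uri(uri: str):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not an s3://bucket/key URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def ensure_model(uri: str, model_dir: str, download) -> str:
    """Return the local model path, downloading only if the volume lacks it."""
    bucket, key = parse_s3_uri(uri)
    target = os.path.join(model_dir, os.path.basename(key))
    if os.path.exists(target):
        return target  # a restarted container reuses the file already on the volume
    os.makedirs(model_dir, exist_ok=True)
    partial = target + ".part"
    download(bucket, key, partial)
    # Rename only after the download finished, so a container killed mid-download
    # never leaves a truncated file under the final name.
    os.replace(partial, target)
    return target


# Demo with a fake downloader standing in for a real S3 client.
calls = []


def fake_download(bucket, key, filename):
    calls.append((bucket, key))
    with open(filename, "wb") as fh:
        fh.write(b"fake weights")


with tempfile.TemporaryDirectory() as volume:
    first = ensure_model("s3://ml-prod/models/embed-v3.bin", volume, fake_download)
    second = ensure_model("s3://ml-prod/models/embed-v3.bin", volume, fake_download)
    print(os.path.basename(first), len(calls))
```

The second call finds the file on the volume and skips the transfer, which is what makes 50 restarts an hour tolerable when the volume outlives the container.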

The OCI image spec accommodates this cleanly. The manifest does not care that some layers came from a different stage; it only sees the final set of layers and the final config. The stage boundary is a Dockerfile concept, not an OCI concept. BuildKit is what knows to throw away the intermediate stages.

Teacher voice. Multi-stage is not just for smaller images. It is for honest images. The thing you ship is the thing you tested, with no extra compilers, no apt cache, no git history lying around to extend the attack surface.

6) Storage drivers and the layer-count ceiling

Overlayfs has a hard limit. The overlay2 driver natively supports up to 128 lower layers stacked in one mount, though Docker's own docs and a long-running upstream issue note the practical ceiling is 127 because one slot is consumed by the upper layer; Docker's classic layer store is stricter still and caps a layer chain at 125 (the maxLayerDepth constant in the Moby source). Cross that and Docker fails with max depth exceeded. Hit it in production and your cluster cannot start the image at all.

Most teams never see this on their hand-written Dockerfiles. They see it the day they auto-generate Dockerfiles or rebase frequently onto a base image that itself has 40+ layers. It also bites users of docker commit workflows — every commit adds a layer; commit your way through a debug session and you can casually punch through the ceiling.

Two countermeasures. Coalesce RUN lines. Five RUN apt-get install statements become one with && between them, so five layers become one. Use multi-stage builds. The runtime stage starts fresh; whatever layer count the builder accumulated is discarded. For the ML service, our final image is about six layers: the slim base (4 layers inherited from python:3.12-slim), the packages copy, and the app copy, plus one tiny extra layer if WORKDIR has to create /app. Comfortably under the ceiling.

Inodes are the other hidden cliff. Each layer consumes inodes on the backing filesystem; a node with thousands of images and high layer counts can run out of inodes before it runs out of disk space. The kernel returns ENOSPC even though df -h shows half the disk free. ext4 with -N to raise the inode count, or xfs which allocates inodes dynamically, are the production-grade choices.
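A quick way to see the "ENOSPC with half the disk free" situation coming is to compare block usage with inode usage on the same filesystem. The sketch below uses the standard library's `os.statvfs` (Unix only); `filesystem_usage` is a hypothetical helper, and on a Docker host you would point it at the storage directory rather than `/`.

```python
import os


def filesystem_usage(path: str = "/") -> dict:
    """Report the fraction of blocks and of inodes in use for the filesystem at path."""
    st = os.statvfs(path)
    # Some filesystems report 0 total inodes (they allocate them dynamically),
    # so guard the divisions.
    space_used = 1 - st.f_bfree / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1 - st.f_ffree / st.f_files if st.f_files else 0.0
    return {"space_used": space_used, "inodes_used": inodes_used}


usage = filesystem_usage("/")
print(f"space used:  {usage['space_used']:.0%}")
print(f"inodes used: {usage['inodes_used']:.0%}")
if usage["inodes_used"] > usage["space_used"]:
    print("inodes are filling faster than bytes; many small files or layers?")
```

This is the same information `df -h` and `df -i` show side by side; alerting on the inode figure is what catches the cliff before image pulls start failing.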

7) Cost/comparison table

Concrete numbers, qualified by stack. All sizes measured on Docker 24.x with BuildKit enabled, pulled to a fresh m5.large (linux/amd64) in May 2026. Cold-start latency is wall-clock from docker run to first HTTP 200 OK from a "hello world" FastAPI app baked into each base, averaged over five runs.

| Base image | Compressed size | On-disk size | Layer count | Cold start (s) | Notes |
|---|---|---|---|---|---|
| `python:3.12` (Debian full) | ~370 MB | ~1.02 GB | 8 | 2.1 | Full toolchain. Use for builder stages only. |
| `python:3.12-slim` (Debian slim) | ~45 MB | ~150 MB | 4 | 1.3 | Default safe choice for runtime. |
| `python:3.12-alpine` | ~17 MB | ~55 MB | 4 | 1.1 | musl libc; some C-extension wheels recompile from source. |
| `gcr.io/distroless/python3-debian12` | ~25 MB | ~50 MB | 2 | 0.9 | No shell, no apt; debug via `:debug` tag. |
| `cgr.dev/chainguard/python:latest` | ~22 MB | ~45 MB | 2 | 0.9 | Wolfi-based; near-zero CVE count tracked. |
| `scratch` + static Go binary | ~10 MB | ~10 MB | 1 | 0.4 | Only viable for fully-static binaries; no Python. |

The pattern is broadly predictable, though not strictly monotonic: Alpine is the smallest Python base over the wire, yet the two-layer distroless and Chainguard images start faster in this table. In general, smaller bases mean fewer layers, less data over the wire, faster cold starts, and a smaller attack surface, with a corresponding increase in build effort and debugging friction. For the ML inference service the right answer for the runtime stage is python:3.12-slim or distroless; the 2 GB model dominates total image size only if you make the mistake of baking it in, which section 5 already argued you should not.
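The size ratios are easy to work out from the compressed-size column. The short sketch below only restates the table's approximate figures; `reduction_factor` is a hypothetical helper.

```python
# Approximate compressed sizes in MB, copied from the table above.
compressed_mb = {
    "python:3.12": 370,
    "python:3.12-slim": 45,
    "python:3.12-alpine": 17,
    "gcr.io/distroless/python3-debian12": 25,
    "cgr.dev/chainguard/python:latest": 22,
}


def reduction_factor(sizes: dict, baseline: str, candidate: str) -> float:
    """How many times smaller the candidate is than the baseline."""
    return sizes[baseline] / sizes[candidate]


for image in compressed_mb:
    factor = reduction_factor(compressed_mb, "python:3.12", image)
    print(f"{image:40s} {factor:5.1f}x smaller than python:3.12")
```

Going from the full image to slim is roughly an 8x cut in bytes over the wire; the further steps down to distroless or Chainguard buy less in size and more in attack surface and layer count.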

Where this lives in the wild

The same image-layer mechanics drive two very different optimisation pressures in production. Some teams optimise the build cache because their bottleneck is developer iteration time and CI throughput. Others optimise the final image size and layer count because their bottleneck is cold-start latency, attack surface, or registry egress cost.

Companies and projects optimising layer cache for build speed and throughput:

Uber Makisu — Uber's open-source builder runs in unprivileged Kubernetes pods and shares a distributed Redis-backed layer cache across the build cluster, cutting build times by up to 90% for some repos and producing thousands of images a day across four primary languages.
Uber Kraken — Uber's peer-to-peer registry distributes layer blobs between hosts via torrent-style swarming; in its busiest cluster it serves more than 1 million blobs per day, including 100k blobs over 1 GB.
GitHub Actions cache backend (type=gha) — the official Docker cache-to: type=gha,mode=max backend stores BuildKit layer and cache-mount data inside the GitHub-hosted cache; teams like HyperDX report warm builds dropping from 8 minutes to 1 minute.
Blacksmith — a CI provider whose entire pitch is faster Docker layer caching via persistent volumes attached to runners, eliminating the GHA 10 GB cache eviction problem.
Depot.dev — managed BuildKit infrastructure with persistent layer cache across builds; popular with teams whose default GitHub Actions builds were spending most of their time on pip install and npm ci.
moby/buildkit — the build engine itself, embedded in Docker 23.x+, which introduced cache mounts (RUN --mount=type=cache) so package-manager downloads survive layer-cache invalidation without bloating the image.
GitLab Runner Docker executor — spins up a fresh container per CI job, using cache-from/cache-to against a registry to share layers between job containers on different physical runners.
Replit — every user repl runs in a container; the platform aggressively reuses base layers across the millions of repls created so users do not each pay for a fresh image pull.
Hugging Face Spaces — builds an image for every user demo; layer reuse across the standard gradio, streamlit, and docker Space templates is what keeps the build farm tractable.
Vercel preview builds — every preview deploy is effectively a fresh image; layer caching is what keeps preview build times bearable across thousands of pushes a day.
Netflix Titus build pipeline — Titus, Netflix's container platform, launches over three million Docker containers per week; the upstream image build pipeline relies on layer-level cache reuse to make daily releases of those images survivable.
NVIDIA NGC catalog — the nvcr.io/nvidia/cuda base layer is shared across every framework image (PyTorch, TensorFlow, Triton), so a host that has pulled one of them already has the bulky CUDA layer for the others.

Companies and projects optimising final image size or layer count for cold start and security:

Google Distroless — gcr.io/distroless/static-debian13 is around 2 MiB; Google publishes language-specific distroless variants (Python, Java, Node) with no shell or package manager, used internally and across the industry to cut Node images from ~1 GB to ~150 MB.
Chainguard Images — over 2,000 Wolfi-based images with near-zero CVE counts; adopted in production by Anduril, Canva, Cyera, Snowflake, and Wiz; the company has remediated more than 54,000 CVEs across customer fleets.
Cloudflare Workers + Containers — Workers ships static binaries on a tiny edge runtime; Cloudflare Containers (built on the same image model) emphasises minimal linux/amd64 images for fast cold start on the edge.
AWS Lambda container runtime — Lambda pulls container images from ECR on first cold start and caches them at the edge per availability zone; the AWS docs explicitly recommend keeping the image manifest under 25,400 bytes and minimising layers for fastest cold start.
Fly.io Machines — Fly's Firecracker-backed Machines boot from OCI images; their performance numbers (~3-second cold starts for a small image) hinge on small layer counts and small total image size.
AWS Fargate — Fargate pulls the image to a fresh micro-VM on every cold start; image size directly translates to pod start time, which is why teams use slim or distroless bases on Fargate.
Go static binary + FROM scratch — the canonical 5–15 MB image for Go services, used heavily inside Cloudflare, Tailscale (the tailscale/tailscale container is multi-stage with a tiny final stage), and many CNCF projects.
Tailscale's container image — distributed as a multi-stage image whose final stage is Alpine plus the single static tailscaled binary, sized for fast pull onto ephemeral nodes.
Spotify Backstage — distributed as a Docker image so 280+ internal teams can self-host the catalogue; multi-stage Dockerfile keeps the final runtime stage on a slim base.
Pinterest API fleet — dockerised 100% onto Amazon ECR with replicated secondary registry; their published guidance to internal teams pushes slim bases to keep pull times bounded.
PayPal — containerised 700+ applications onto Docker Enterprise running over 200,000 containers in production, with reported deploy time reductions of roughly 90% attributable in large part to image-size discipline.
Bloomberg ML platform — runs training and inference inside containers on Kubernetes; KServe serves models from container images where small runtime stages plus mounted model volumes (rather than baked-in weights) keep the images deployable.

Twenty-four real production cases. Notice the split: the build-cache camp optimises against developer time and CI cost; the image-size camp optimises against cold-start latency and CVE risk. Both come back to the same Merkle DAG.

Pause and recall

Besides descriptors, which three OCI object types make up a single-platform image, and which one carries the architecture and default command?
In the inference-service layer diagram, which layer would invalidate first if you only changed app/main.py, and which would invalidate if you only changed requirements.txt?
Why does pinning by digest defeat the "someone moved my tag" problem in a way that pinning by tag does not?
What is the difference between a layer (which ships in the final image) and a BuildKit cache mount (which does not)?
Why does coalescing five RUN apt-get install lines into one help with the overlayfs layer ceiling, but not necessarily with image size?
For the 2 GB ML model, why is fetching it at container start preferable to baking it as a final layer?
What does a multi-arch image's index look like, and at what point does the client decide which platform's manifest to pull?
Roughly what factor of size reduction does switching from python:3.12 (full) to python:3.12-slim deliver, and what would push you further down to distroless?

Interview Q&A

Q1. What is the difference between an image tag and an image digest, and which should production deploys reference?

A. A tag is a mutable, human-friendly pointer kept in the registry's side-table; python:3.12-slim today and python:3.12-slim tomorrow may resolve to entirely different bytes. A digest is the SHA-256 of the manifest's bytes — it is the manifest. Production deploys should reference the digest (python@sha256:b7f4...) so the exact bytes you tested in CI are the exact bytes that run in prod. Tags are fine for development convenience.

Common wrong answer to avoid: "Tags and digests are interchangeable, the registry resolves both." They are not — tags are mutable, digests are immutable by construction.
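One practical use of the tag/digest distinction is a lint that flags unpinned references in deploy manifests. The sketch below is a simplified parser written for illustration (`parse_reference` and `is_pinned` are hypothetical names); it handles the common `name[:tag][@digest]` shapes, including a registry host with a port, but does not implement the full reference grammar.

```python
def parse_reference(ref: str) -> dict:
    """Split an image reference into name, tag, and digest (simplified)."""
    name, _, digest = ref.partition("@")
    tag = None
    # A colon only introduces a tag if it comes after the last slash;
    # otherwise it belongs to a registry host:port.
    if name.rfind(":") > name.rfind("/"):
        name, _, tag = name.rpartition(":")
    return {"name": name, "tag": tag, "digest": digest or None}


def is_pinned(ref: str) -> bool:
    """True when the reference names exact bytes rather than a movable tag."""
    return parse_reference(ref)["digest"] is not None


placeholder_digest = "sha256:" + "b" * 64  # not a real image digest

for ref in [
    "python:3.12-slim",
    "python@" + placeholder_digest,
    "registry.example.com:5000/team/app",
    "registry.example.com:5000/team/app:1.4@" + placeholder_digest,
]:
    print(f"{ref[:50]:52s} pinned={is_pinned(ref)}")
```

A reference may carry both a tag and a digest; in that case the digest decides which bytes run and the tag is only documentation.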

Q2. What does the Dockerfile build cache actually key off?

A. BuildKit computes a hash from the previous layer's digest, the literal instruction text, and (for COPY/ADD) the contents of the files being copied. That hash is the cache key. A cache hit means the layer is reused as-is; a miss means it is rebuilt, and every layer after it in the Dockerfile is rebuilt too, because their cache keys depend on the parent digest. This is why instruction order matters: put rarely-changing things (system packages, language-runtime install) above frequently-changing things (app code).

Common wrong answer to avoid: "It caches on the instruction text only." That misses the file-contents input for COPY/ADD, which is the whole reason COPY requirements.txt before COPY . works.

Q3. Why do multi-stage builds produce smaller images?

A. Because only the final stage's layers ship. Intermediate stages can carry compilers, build tools, package caches, and source artefacts the runtime never needs; the final stage starts from a slim or distroless base and uses COPY --from=<stage> to pull in only the built artefacts. The OCI manifest only sees the final stage's layer list — the intermediate stages are a BuildKit-internal concept and are discarded.

Common wrong answer to avoid: "Because multi-stage compresses layers harder." Compression has nothing to do with it. The size saving comes from not shipping the build toolchain.

Q4. What is the overlayfs layer ceiling, and how do you stay under it?

A. The overlay2 storage driver supports up to 128 lower layers; the practical ceiling is 127 or slightly lower, because the upper writable layer consumes one slot and Docker's own layer store enforces a cap just under that. You stay under it by coalescing RUN lines with &&, using multi-stage builds so the final stage starts from a clean base, and avoiding workflows like docker commit that pile on layers without producing useful boundaries. The error is max depth exceeded, and it is unrecoverable without rebuilding the image.

Common wrong answer to avoid: "There's no limit, you can have as many layers as you want." There is, and finding it in production at 3 a.m. is unpleasant.

Q5. How does a multi-arch image actually work on the wire?

A. A multi-arch image is an OCI image index (also called a manifest list) whose entries each describe one platform-specific manifest (e.g., linux/amd64, linux/arm64) with the platform-specific manifest's digest, media type, and a platform descriptor. When a client pulls the tag, the registry returns the index. The client picks the entry whose platform matches its host, then fetches that platform's manifest and only its blobs. docker buildx build --platform linux/amd64,linux/arm64 --push is the canonical way to produce the index; older docker manifest create is experimental and discouraged for production.

Common wrong answer to avoid: "Docker builds a fat image with both architectures inside." There is no fat image; the index is a tiny JSON pointing at two completely separate manifests.

Q6. Why is a BuildKit cache mount preferable to a layer for things like the pip download cache?

A. Because the cache mount lives only during the build and never becomes a layer in the final image. A RUN pip install ... step caches the installed packages into a layer (good — you want them in the image), but the wheel downloads under /root/.cache/pip are useless at runtime; if you let them land in a layer you ship them too. Cache mounts let the wheel downloads persist across builds (so re-installing is fast) without polluting the image. The trade-off is that the cache lives on the builder host, so it is per-machine unless you use a remote cache backend.

Common wrong answer to avoid: "Cache mounts are just a faster way to write a layer." They are explicitly the opposite — they do not produce layers.

Q7. How would you architect the image for a Python ML service that loads a 2 GB model file?

A. Multi-stage build: builder stage on full python:3.12 with build-essential to compile any wheels with C extensions, runtime stage on python:3.12-slim or distroless. Use COPY --from=builder /install /usr/local to bring over installed packages. Do not bake the model into the image — fetch it at container start from S3 or a model registry into a mounted volume. This decouples model versioning from image versioning, keeps cold-start pulls small (so autoscaling stays cheap), and means retraining does not require an image rebuild. Layer order: base, system deps, requirements install, app code, entrypoint — top to bottom by frequency of change, so the heavy pip install layer stays cached across most rebuilds.

Common wrong answer to avoid: "Bake the model into the image so the container is self-contained." That sounds clean and is operationally awful — every cold start now moves 2 GB across the network and every retrain rebuilds and re-pushes a 2 GB-larger image.

Q8. Two engineers see different layer digests for what they claim is the "same" image. How do you debug?

A. The image is not the same — by definition, different digests mean different bytes. Walk the manifest tree to localise the diff. docker buildx imagetools inspect <ref> shows the manifest and its layer descriptors; compare layer-by-layer to find which layer's digest differs. Then inspect the differing layer (docker history, or extract the blob and diff its tar contents). Common causes: a COPY . that picked up a different .git directory or a different timestamp on a generated file; a base image tag that moved between the two builds; non-deterministic build steps like pip install without a lockfile pulling different transitive deps. The fix is reproducible builds — pin base images by digest, use lockfiles, and avoid timestamp-sensitive steps.

Common wrong answer to avoid: "They're probably the same, the digests are just formatted differently." Digests are formatted exactly one way (sha256:<64 hex chars>); if they differ, the bytes differ.
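The "compare layer-by-layer" step in Q8 can be sketched as a small diff over two manifests. The manifests here are hand-written with placeholder digests; in practice you would load the JSON that `docker buildx imagetools inspect --raw <ref>` prints for each single-platform manifest. `first_divergent_layer` is a hypothetical helper.

```python
from itertools import zip_longest


def first_divergent_layer(manifest_a: dict, manifest_b: dict):
    """Return (index, digest_a, digest_b) for the first differing layer, or None."""
    pairs = zip_longest(manifest_a["layers"], manifest_b["layers"])
    for i, (a, b) in enumerate(pairs):
        digest_a = a["digest"] if a else None  # None means "no layer at this depth"
        digest_b = b["digest"] if b else None
        if digest_a != digest_b:
            return i, digest_a, digest_b
    return None


def layer(char: str) -> dict:
    """Build a layer descriptor with a placeholder digest made of one repeated hex char."""
    return {"digest": "sha256:" + char * 64, "size": 1}


alice = {"layers": [layer("1"), layer("2"), layer("3"), layer("4")]}
bob = {"layers": [layer("1"), layer("2"), layer("9"), layer("5")]}

print(first_divergent_layer(alice, bob))
```

Because each layer's cache key depends on its parent, the first differing index is the one to investigate; every difference after it may just be a consequence.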

Apply now (10 min)

Step 1: model the exercise. Take the ML inference service from section 2. Here is a worked example of a layer-ordering audit, the shape you will fill in for your own service:

| Dockerfile step (current order) | Frequency of change | Belongs above or below `pip install`? |
|---|---|---|
| `COPY requirements.txt .` | low (weekly) | above (input to pip install) |
| `RUN pip install -r requirements.txt` | low (weekly) | this row is the anchor |
| `COPY ./app /app` | high (per commit) | below (do not invalidate pip install) |
| `ENV MODEL_S3_URI=...` | rare | above (config, never re-evaluated) |
| `ADD model.bin /app/model.bin` | medium (per retrain) | do not include; fetch at start |

Step 2: your turn. Pick a real Dockerfile from a service you own. List every instruction in current order, label each with frequency of change (per-commit / weekly / rarely), and re-order so that frequency increases monotonically from top to bottom. Note any instruction that is a candidate for a multi-stage split (compilers, large build-only deps) or for a BuildKit cache mount (package-manager caches).

Step 3: sketch from memory. Redraw the OCI image manifest tree diagram from section 1. Label every arrow with the field that carries the digest (descriptor, config descriptor, layer descriptor). If you can do this cold, you understand what pulling an image actually means.

Bridge. We now know what an image is — manifest, config, layers, all digest-addressed. The next chapter opens the file you spend most of your day editing — the Dockerfile — plus docker compose for local stacks, and shows the idioms that turn that internals knowledge into shipping code. → 02-dockerfile-compose-day-to-day.md