01. Why dense pixels fail first — the network forgets the picture before it learns¶

~12 min read. The tempting baseline that explodes in size and throws away layout.

Builds on the ELI5 in 00-eli5.md. The patch (an image token: a small square chunk of the image, like a brush stroke) matters here because flattening ignores the local neighborhoods that a vision system should respect.

1) The tempting bad idea¶

Suppose you start with one RGB image of size 224 × 224 × 3. That gives 224 × 224 × 3 = 150,528 raw input values. A beginner says, "Fine. Feed that whole vector into a dense layer." On the surface, that sounds clean. No special tricks. No image-specific design. Just numbers in, numbers out.

Now choose a first dense layer with 512 neurons. Each neuron connects to all 150,528 input features. So the weights alone are: 150,528 × 512 = 77,070,336. Then add 512 bias terms. Total parameters become: 77,070,336 + 512 = 77,070,848. In float32 that is about 294 MiB (roughly 308 MB) for this one layer alone. Not the full model. Just the opening move.
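The arithmetic above is easy to check by machine. Here is a minimal sketch in plain Python; the helper names `dense_layer_params` and `float32_bytes` are made up for this illustration and do not come from any library.

```python
def dense_layer_params(in_features: int, out_features: int) -> int:
    """Weights plus biases for one fully connected layer."""
    return in_features * out_features + out_features


def float32_bytes(param_count: int) -> int:
    """Storage for the parameters alone, at 4 bytes per float32 value."""
    return param_count * 4


height, width, channels = 224, 224, 3
raw_features = height * width * channels        # 150,528
params = dense_layer_params(raw_features, 512)  # 77,070,848
size_bytes = float32_bytes(params)              # 308,283,392

print(f"raw features: {raw_features:,}")
print(f"parameters:   {params:,}")
print(f"float32 size: {size_bytes / 2**20:.1f} MiB ({size_bytes / 1e6:.1f} MB)")
```

Change the 512 to any other width and the count scales linearly with it. The input size is the part you cannot shrink without changing the opening itself.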

See the problem. The network has not learned edges, corners, or shapes yet. Still, it has already become huge. Training cost rises. Data hunger rises. Overfitting risk rises. And we pay this cost after flattening has already damaged the image layout. So the opening is expensive and clumsy together.

2) Flattening destroys the map¶

An image is a grid. Meaning lives in neighborhoods. A cat ear is a local curve. A lane marker is a local bright line. A face contour is a local boundary. When we flatten, we turn a map into a list. After that, space becomes hard to recover.

2D image grid                        flattened vector

┌────┬────┬────┬────┐               ┌────┬────┬────┬────┬────┬────┬────┬────┐
│ A  │ B  │ C  │ D  │               │ A  │ B  │ C  │ D  │ E  │ F  │ G  │ H  │
├────┼────┼────┼────┤      ──→      ├────┼────┼────┼────┼────┼────┼────┼────┤
│ E  │ F  │ G  │ H  │               │ I  │ J  │ K  │ L  │ M  │ N  │ O  │ P  │
├────┼────┼────┼────┤               └────┴────┴────┴────┴────┴────┴────┴────┘
│ I  │ J  │ K  │ L  │
├────┼────┼────┼────┤               Lost after flattening:
│ M  │ N  │ O  │ P  │               A is above E.
└────┴────┴────┴────┘               D is above H.
                                     Those vertical links are not explicit anymore.

In the grid, A is near B and also near E. In the vector, A stays next to B, but its vertical relation with E is no longer special. Tomorrow a different flatten order could move E elsewhere. So adjacency becomes an accident of serialization. Simple, no?

A dense layer treats pixel 1 and pixel 95,301 as just coordinates. It does not know which ones formed a corner together. It does not know which ones belonged to one edge. It must relearn locality from scratch. That is a weak start for the eye. A real visual system needs geography, not just magnitude.
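The "accident of serialization" point can be made concrete with the same 4 × 4 letter grid. The sketch below is plain Python; the function names are hypothetical helpers for this illustration. It flattens the grid in two different orders and measures how far apart grid neighbors land in the resulting list.

```python
grid = [
    ["A", "B", "C", "D"],
    ["E", "F", "G", "H"],
    ["I", "J", "K", "L"],
    ["M", "N", "O", "P"],
]


def flatten_row_major(g):
    """Read the grid row by row, left to right."""
    return [cell for row in g for cell in row]


def flatten_column_major(g):
    """Read the grid column by column, top to bottom."""
    return [g[r][c] for c in range(len(g[0])) for r in range(len(g))]


def distance(flat, a, b):
    """How many positions apart two cells sit in the flattened list."""
    return abs(flat.index(a) - flat.index(b))


row_major = flatten_row_major(grid)
col_major = flatten_column_major(grid)

# A touches both B and E in the grid.
print(distance(row_major, "A", "B"))  # 1
print(distance(row_major, "A", "E"))  # 4
print(distance(col_major, "A", "B"))  # 4
print(distance(col_major, "A", "E"))  # 1

# D and E are far apart in the grid, yet adjacent in the row-major list.
print(distance(row_major, "D", "E"))  # 1
```

Same image, two serializations, two different ideas of "next to". A dense layer sees only the list, so it gets no hint which idea is the real one.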

3) Humans scan in chunks, not spreadsheets¶

Look at how you inspect a photo. You do not first compare the top-left sky pixel with a random shoe pixel. You notice local texture. You notice the curve of a cup rim. You notice a wheel arch. You notice eyebrow shape. These are neighborhood patterns.

That is why the patch is a sane first unit. A patch keeps nearby pixels together long enough for the model to form a local summary. This is closer to how the eye behaves. Local first. Then global composition. No one is saying the human retina runs a transformer. The point is simpler. Vision begins with local grouping. Flattening does the opposite. It throws local grouping away and hopes later layers rebuild it. Yes?

That repair is possible, but costly. The model needs more parameters. It usually needs more data. It wastes learning capacity on rediscovering something the input already had. So what to do? Keep the local structure alive at the start. That is the job of the patch.
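To see what "keep the local structure alive" means in practice, here is a minimal sketch that cuts a 4 × 4 letter grid into 2 × 2 patches. It is plain Python with a hypothetical helper name, `split_into_patches`, and it assumes the grid divides evenly by the patch size. Real encoders work on numeric tensors, but the grouping idea is the same.

```python
def split_into_patches(grid, patch_size):
    """Cut a 2D grid into non-overlapping square patches.

    Each patch is returned as a short list of its own cells,
    so cells that were neighbors in the grid stay in one unit.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % patch_size or cols % patch_size:
        raise ValueError("grid size must be divisible by patch_size")

    patches = []
    for top in range(0, rows, patch_size):
        for left in range(0, cols, patch_size):
            patch = [
                grid[top + dr][left + dc]
                for dr in range(patch_size)
                for dc in range(patch_size)
            ]
            patches.append(patch)
    return patches


grid = [list("ABCD"), list("EFGH"), list("IJKL"), list("MNOP")]
patches = split_into_patches(grid, 2)
for patch in patches:
    print(patch)
# ['A', 'B', 'E', 'F']
# ['C', 'D', 'G', 'H']
# ['I', 'J', 'M', 'N']
# ['K', 'L', 'O', 'P']
```

Notice that A, B, E, and F now travel together. Each patch is still a small list inside, but that list covers only one neighborhood, so the horizontal link A to B and the vertical link A to E both land in the same unit.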

4) Worked example: naive dense versus patch-based opening¶

Take the same 224 × 224 × 3 image. We compare two openings. One is naive dense. One is patch-based.

Option A: naive dense opening¶

Raw features: 224 × 224 × 3 = 150,528
Dense width: 512
Weights: 150,528 × 512 = 77,070,336
Bias: 512
Total: 77,070,336 + 512 = 77,070,848

Now memory for just those parameters in float32: 77,070,848 × 4 bytes = 308,283,392 bytes. That is roughly 294 MiB (about 308 MB). Only the first layer. Not activations. Not gradients. Not optimizer states.

Option B: patch-based opening with `16 × 16` patches¶

Patch size in values: 16 × 16 × 3 = 768
Patch count per side: 224 ÷ 16 = 14
Total patches: 14 × 14 = 196
Projection dimension: 512
Shared projection weights: 768 × 512 = 393,216
Bias: 512
Total projection parameters: 393,216 + 512 = 393,728

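Notice the trick. The same projection matrix is reused for all 196 patches. We do not learn 196 separate projection matrices. So the parameter ratio is: 77,070,848 ÷ 393,728 ≈ 195.7. The naive opening is about 196× larger. Count the weights alone and the ratio is exactly 196, the number of patches, because the dense layer effectively learns a separate 768 × 512 block for every patch location. And still less respectful of image structure. That is the important part.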
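The comparison can be reproduced in a few lines. This is a sketch in plain Python under the same assumptions as the worked example (224 × 224 × 3 input, 16 × 16 patches, width 512); the variable names are chosen for this illustration only.

```python
height = width = 224
channels = 3
patch_size = 16
embed_dim = 512

# Option A: every output neuron sees every raw input value.
dense_weights = height * width * channels * embed_dim        # 77,070,336
dense_total = dense_weights + embed_dim                      # 77,070,848

# Option B: one projection, shared by every patch.
num_patches = (height // patch_size) * (width // patch_size)  # 196
patch_values = patch_size * patch_size * channels             # 768
patch_weights = patch_values * embed_dim                      # 393,216
patch_total = patch_weights + embed_dim                       # 393,728

# Weights alone differ by exactly the number of patches.
assert dense_weights == num_patches * patch_weights

print(f"dense total: {dense_total:,}")
print(f"patch total: {patch_total:,}")
print(f"ratio:       {dense_total / patch_total:.1f}")
```

The assertion is the interesting line. Sharing is where the saving comes from: drop the sharing, learn one projection per patch location, and you are back at the dense weight count.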

5) The real lesson¶

The lesson is not just, "dense is big." The deeper lesson is, "vision has layout, so the first layer should respect layout." Convolutions do this. Patch embeddings do this. Dense flattening does not.

A big dense network can still learn useful features. But it starts from the wrong assumption. It assumes the image is a bag of numbers. It is not. It is a spatial object. That is why the eye prefers structure-aware openings. And that is why the patch becomes the next step.

Where this lives in the wild¶

Apple Face ID on iPhone — perception engineer: keeps local facial geometry intact because eyelid edges and nose contours are neighborhood signals.
Tesla Autopilot camera stack — perception engineer: preserves spatial layout so lane paint, curbs, and brake lights stay localized before scene reasoning.
Google Photos visual search — vision ML engineer: image encoders extract structured local features before matching receipts, pets, and landmarks.
Adobe Firefly prompt-to-image stack — multimodal engineer: uses structured image encoders before linking visual representations to generation systems.
Meta Segment Anything image encoder — research engineer: converts images into spatial tokens so masks align with meaningful regions, not a scrambled vector.

Pause and recall¶

Why does a 224 × 224 × 3 image produce 150,528 raw features?
Why does a first dense layer with 512 neurons cross 77 million parameters?
What spatial relation disappears when a 2D image is flattened into one vector?
In the worked example, why is the patch projection much smaller even with 196 patches?

Interview Q&A¶

Q: Why is flattening raw pixels before a dense layer a weak inductive bias for vision, not just a memory problem?
A: Because it removes explicit neighborhood structure, so the model must relearn locality instead of starting with image-aware units.
Common wrong answer to avoid: "It is bad only because GPUs cannot hold the weights."

Q: Why reuse one patch projection across all patches instead of learning a separate dense map for every patch location?
A: Because sharing keeps parameters low and lets similar local patterns be recognized anywhere in the image.
Common wrong answer to avoid: "Because every patch is identical, so position never matters."

Q: Why can a giant dense layer still underperform a structured encoder even when both have enough capacity?
A: Because capacity does not replace inductive bias; the structured encoder spends its learning budget on useful visual regularities sooner.
Common wrong answer to avoid: "More parameters always recover the same structure with no tradeoff."

Q: Why do modern vision systems keep local grouping early and global mixing later?
A: Because edges, textures, and object parts form locally first, and scene-level meaning depends on composing those local cues.
Common wrong answer to avoid: "Global reasoning should always happen before local feature extraction."
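The second answer above, about reusing one projection, can be checked with a toy. The sketch below is plain Python; the weights, bias, and patch values are made-up numbers for illustration, and `project` is a hypothetical helper, not a library call. It applies one shared linear map to four tiny patches, two of which contain the same edge pattern at different positions.

```python
def project(patch, weights, bias):
    """One linear map: out[j] = sum_i patch[i] * weights[i][j] + bias[j]."""
    return [
        sum(x * w_row[j] for x, w_row in zip(patch, weights)) + bias[j]
        for j in range(len(bias))
    ]


# Toy values (assumed): 4-value patches projected to 2 dimensions.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
bias = [0.5, 0.5]

edge = [1.0, 1.0, 0.0, 0.0]   # bright top row, dark bottom row
blank = [0.0, 0.0, 0.0, 0.0]  # nothing there

patches = [edge, blank, blank, edge]  # same edge at positions 0 and 3
embeddings = [project(p, weights, bias) for p in patches]

print(embeddings[0] == embeddings[3])  # True: same pattern, same embedding
print(embeddings[0] == embeddings[1])  # False: different pattern
```

So the shared map recognizes the edge wherever it appears. It also shows why the common wrong answer is wrong: the projection alone cannot tell position 0 from position 3. Where a patch sits has to be carried some other way, for example by the token order or by a positional signal added afterwards.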

Apply now (5 min)¶

Quick exercise. Take a toy 8 × 8 × 3 image and compute the raw feature count. Then compute the dense parameter count for width 64. Next, split the same image into 4 × 4 patches and compute one shared patch projection into dimension 64. Compare the totals.

Sketch from memory a 4 × 4 grid becoming one long vector. Mark two neighbor relations that vanish after flattening. Under the sketch, write one line on why the patch protects local meaning better than a giant first dense layer.

Bridge. Flattening kills spatial structure. So what to do? Cut the image into small squares instead. → 02-vision-encoders-vit.md