02. Vision Transformers in plain sight — how small squares become visual tokens

~14 min read. The clean recipe that turns pixels into a sequence the model can reason over.

Built on the ELI5 in 00-eli5.md. The eye (the vision encoder, the part that sees raw pixels and outputs numbers) works here by treating each small square as a first-class unit instead of flattening the whole image at once.


1) The basic move: cut first, reason later

We already saw why flattening is a bad opening. So what do we do instead? Cut the image into small squares. Each square becomes a patch. Then let the encoder process a sequence of patches. This is the core move in a Vision Transformer, or ViT.

For a standard image of 224 × 224 × 3, a common patch size is 16 × 16. That means each patch holds: 16 × 16 × 3 = 768 pixel values. Instead of one giant vector of 150,528 values, we create many smaller patch vectors. Simple, no?

How many patches do we get?

224 ÷ 16 = 14 patches along height.
224 ÷ 16 = 14 patches along width.

So total patch count is: 14 × 14 = 196.

Now the image is a sequence of 196 units. That is much nicer for a transformer.
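The same arithmetic as a small helper, in case you want to try other sizes. This is a minimal sketch; `patch_stats` is a hypothetical name, and it assumes square, non-overlapping patches that divide the image evenly.

```python
def patch_stats(height, width, channels, patch):
    """Return (patch_count, values_per_patch) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    rows = height // patch          # patches along height
    cols = width // patch           # patches along width
    return rows * cols, patch * patch * channels


count, values = patch_stats(224, 224, 3, 16)
print(count, values)     # 196 768
print(count * values)    # 150528, the same total as the flattened image
```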

The key idea is not magic. A transformer expects tokens. Language gives word tokens. Vision gives patch tokens. That is why the patch becomes the basic unit inside the eye.

2) Patch count versus pixel count

Look at the compression in unit count. The original image has 150,528 scalar values. The ViT opening turns that into 196 patch tokens. Important detail. Nothing is deleted at this step. Every one of the 150,528 values still lands in exactly one patch. We are regrouping information into local chunks.

Each token still starts from real pixels. But the reasoning unit changes. Instead of asking attention to work over every individual pixel, we ask it to work over patches. That reduces sequence length dramatically.

Compare the two counts directly:

  • Raw pixel scalars: 150,528
  • Patch tokens: 196

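Now compare attention cost intuition. Self-attention scores every token against every other token, so the work grows with the square of the sequence length. If you tried self-attention over 150,528 tokens, that would be roughly 2.3 × 10^10 pairwise scores per layer. With 196 tokens, it is 196 × 196 = 38,416. That is practical. It is one reason ViTs are even possible.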
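To make the quadratic growth concrete, here is a rough count of query-key pairs for the two sequence lengths. It is only a back-of-envelope sketch: `attention_pairs` is a hypothetical helper, and it counts pairs for a single head in a single layer, nothing more.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs one self-attention head scores in one layer."""
    return num_tokens * num_tokens


scalar_tokens = 224 * 224 * 3       # 150,528: one token per raw value
patch_tokens = (224 // 16) ** 2     # 196: one token per 16x16 patch

print(attention_pairs(scalar_tokens))   # 22658678784
print(attention_pairs(patch_tokens))    # 38416
print(attention_pairs(scalar_tokens) // attention_pairs(patch_tokens))  # 589824
```

The ratio is 768 × 768, because each patch token stands in for 768 raw values.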

See this pipeline picture.

input image 224×224×3
┌───────────────────────────────┐
│ split into 16×16 squares      │
└───────────────────────────────┘
196 patches total
        ├── patch 1: 16×16×3 ──→ flatten 768 ──→ linear projection ──→ token z1
        ├── patch 2: 16×16×3 ──→ flatten 768 ──→ linear projection ──→ token z2
        ├── patch 3: 16×16×3 ──→ flatten 768 ──→ linear projection ──→ token z3
        └── ...
add positional embeddings
transformer blocks
visual embedding from the encoder

So the model does two things at once. It keeps local pixels together inside one token. And it makes the sequence short enough for global mixing. Yes?
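The pipeline picture can be sketched in a few lines of NumPy. Treat this as an illustration under assumptions, not as any library's implementation: the image is random data, the projection `W` and the position table `pos` are random stand-ins for learned parameters, the bias term is left out, and D = 8 is a toy value.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns shape (num_patches, patch * patch * C), with patches ordered
    left to right, then top to bottom.
    """
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)      # (rows, cols, patch, patch, c)
    return x.reshape(rows * cols, patch * patch * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for a normalized image
D = 8                                         # toy model dimension (assumption)

patches = patchify(image, 16)                 # (196, 768)
W = rng.normal(scale=0.02, size=(768, D))     # one projection shared by all patches
pos = rng.normal(scale=0.02, size=(196, D))   # one position vector per patch slot

tokens = patches @ W + pos                    # (196, D): what the blocks receive
print(patches.shape, tokens.shape)            # (196, 768) (196, 8)
```

The reshape-then-transpose step is what keeps each 16 × 16 square together. A plain `image.reshape(196, 768)` would slice the image into strips of consecutive rows instead.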

3) Worked example: trace one patch through the pipeline

Let us follow one concrete patch. Take patch p_17 from the image. It is a 16 × 16 × 3 square. After flattening, it becomes a vector of length 768. Suppose the first six values are:

[12, 18, 20, 25, 30, 10, ...]

And suppose we normalize pixel values by dividing by 255. Then the first six normalized values become:

[0.0471, 0.0706, 0.0784, 0.0980, 0.1176, 0.0392, ...]

The full vector still has length 768.

Now project this patch into model dimension D = 4 for a toy example. Real ViTs use much larger D, but 4 is easier to see. With D = 4 the projection has four rows, one per output coordinate. Assume they begin:

w1 = [0.10, 0.20, 0.00, -0.10, 0.05, 0.10, ...]
w2 = [0.00, -0.10, 0.20, 0.10, 0.05, 0.00, ...]
w3 = [0.30, 0.10, -0.20, 0.00, 0.10, 0.05, ...]
w4 = [-0.20, 0.00, 0.10, 0.20, -0.10, 0.10, ...]

To keep arithmetic visible, use only the first six inputs and imagine the rest contribute 0. Then each projected coordinate is a dot product.

z1 = 0.10×0.0471 + 0.20×0.0706 + 0.00×0.0784 + (-0.10)×0.0980 + 0.05×0.1176 + 0.10×0.0392
z1 = 0.00471 + 0.01412 + 0.00000 - 0.00980 + 0.00588 + 0.00392
z1 = 0.01883

z2 = 0.00×0.0471 + (-0.10)×0.0706 + 0.20×0.0784 + 0.10×0.0980 + 0.05×0.1176 + 0.00×0.0392
z2 = 0.00000 - 0.00706 + 0.01568 + 0.00980 + 0.00588 + 0.00000
z2 = 0.02430

z3 = 0.30×0.0471 + 0.10×0.0706 + (-0.20)×0.0784 + 0.00×0.0980 + 0.10×0.1176 + 0.05×0.0392
z3 = 0.01413 + 0.00706 - 0.01568 + 0.00000 + 0.01176 + 0.00196
z3 = 0.01923

z4 = (-0.20)×0.0471 + 0.00×0.0706 + 0.10×0.0784 + 0.20×0.0980 + (-0.10)×0.1176 + 0.10×0.0392
z4 = -0.00942 + 0.00000 + 0.00784 + 0.01960 - 0.01176 + 0.00392
z4 = 0.01018

So the projected patch token is:

z = [0.01883, 0.02430, 0.01923, 0.01018]

That tiny vector is now the model's learned representation of one local square. That token is the patch after entering the eye.
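If you want to check the hand arithmetic, the four dot products collapse into one matrix-vector product. This sketch uses only the six values and six weights per row given in the text, and rounds the normalized pixels to four decimals so the numbers line up with the hand calculation.

```python
import numpy as np

# First six raw pixel values of patch p_17, from the text.
raw = np.array([12, 18, 20, 25, 30, 10], dtype=np.float64)
x = np.round(raw / 255.0, 4)    # four decimals, as in the worked example

# First six entries of each projection row w1..w4, from the text.
W = np.array([
    [0.10, 0.20, 0.00, -0.10, 0.05, 0.10],
    [0.00, -0.10, 0.20, 0.10, 0.05, 0.00],
    [0.30, 0.10, -0.20, 0.00, 0.10, 0.05],
    [-0.20, 0.00, 0.10, 0.20, -0.10, 0.10],
])

z = W @ x                       # one dot product per output coordinate
print(np.round(z, 5))           # agrees with z = [0.01883, 0.02430, 0.01923, 0.01018]
```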

4) Position matters, so add positional embeddings

One patch token alone is not enough. The model must know where that patch came from. A blue patch in the sky means one thing. The same blue patch on a shirt means another. So ViT adds positional embeddings.

Suppose patch p_17 gets position embedding:

pos_17 = [0.40, -0.10, 0.05, 0.20]

Then the input token to the transformer becomes:

x_17 = z + pos_17
x_17 = [0.01883, 0.02430, 0.01923, 0.01018] + [0.40, -0.10, 0.05, 0.20]
x_17 = [0.41883, -0.07570, 0.06923, 0.21018]
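The addition is element-wise, which is why the position vector must have the same length D as the patch token. A tiny check, using the numbers from the text:

```python
import numpy as np

z = np.array([0.01883, 0.02430, 0.01923, 0.01018])   # projected patch token
pos_17 = np.array([0.40, -0.10, 0.05, 0.20])         # position embedding for p_17

x_17 = z + pos_17                                    # element-wise sum, length D = 4
print(np.round(x_17, 5))    # agrees with [0.41883, -0.07570, 0.06923, 0.21018]
```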

Now the token knows two things. What the local patch looked like. Where that patch lived in the image. After this, transformer blocks can mix information across all 196 tokens. A wheel patch can attend to nearby car-body patches. A face patch can attend to hair and shoulder patches. This is where local evidence becomes global understanding.
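Why does the model need to be told the position at all? A small experiment makes it visible. The sketch below uses a bare single-head self-attention with random weights and toy sizes (all assumptions, not a real ViT block). Without position vectors, shuffling the patches only shuffles the outputs. With position vectors tied to slots, the shuffled image produces genuinely different outputs.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over keys
    return weights @ v


rng = np.random.default_rng(0)
n, d = 6, 4                                   # toy sizes, not ViT sizes
z = rng.normal(size=(n, d))                   # patch tokens without position
pos = rng.normal(size=(n, d))                 # one position vector per slot
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.arange(n)[::-1]                     # put the patches in reverse order

# Without positions: shuffling the inputs only shuffles the outputs.
out = self_attention(z, wq, wk, wv)
out_shuffled = self_attention(z[perm], wq, wk, wv)
print(np.allclose(out[perm], out_shuffled))            # True

# With positions tied to slots: the shuffled image gives different outputs.
out_pos = self_attention(z + pos, wq, wk, wv)
out_pos_shuffled = self_attention(z[perm] + pos, wq, wk, wv)
print(np.allclose(out_pos[perm], out_pos_shuffled))    # False
```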

5) Why this opening works so well

ViT keeps the local chunk intact at the start. Then it gives the transformer a manageable sequence. That is the balance. Local grouping first. Global mixing next.

Compare with raw flattening. Flattening says the image is one long number list. ViT says the image is a sequence of meaningful local units. That is much closer to how vision should be organized. And it lets the eye scale using the same transformer logic that language models already use.

One more nice comparison. Patch size 16 × 16 means one token summarizes 768 pixel-channel values. So 196 tokens cover the whole image: 196 × 768 = 150,528, exactly the number of values we started with. That is why patch tokens are not arbitrary bookkeeping. They are the bridge between raw pixels and sequence models. Look. Without the patch, the transformer has no sensible visual word. With patches, the image becomes readable to the model.


Where this lives in the wild

  • OpenAI CLIP ViT-L/14 — multimodal research engineer: uses patch tokens so images can enter the same retrieval pipeline as text embeddings.
  • Google Gemini image tower — foundation model engineer: turns images into structured visual tokens before cross-modal reasoning with text.
  • Meta DINOv2 — representation learning researcher: trains ViT-style encoders so downstream teams can reuse strong patch-based visual features.
  • Apple on-device visual understanding — ML systems engineer: prefers structured image tokens for efficient perception on memory-constrained hardware.
  • Adobe Acrobat document AI — document intelligence engineer: uses patch-based encoders to preserve layout and local text-region structure in scanned pages.

Pause and recall

  • Why does a 224 × 224 image with 16 × 16 patches produce exactly 196 patch tokens?
  • Why is each 16 × 16 × 3 patch represented by 768 values before projection?
  • In the worked example, what was the projected token z before adding position?
  • Why are positional embeddings necessary after patch projection?

Interview Q&A

Q: Why does ViT use patch tokens instead of treating each pixel as a token?
A: Because per-pixel tokens make the sequence far too long, while patches preserve local structure and keep attention computationally manageable.
Common wrong answer to avoid: "Because single pixels contain no information at all."

Q: Why is the projection matrix shared across all patches instead of learned separately for each patch index?
A: Because the same local visual patterns should be representable anywhere in the image, and sharing avoids a huge parameter blow-up.
Common wrong answer to avoid: "Because positional information is useless once we split the image."

Q: Why add positional embeddings if each patch already comes from a fixed image location?
A: Because self-attention by itself ignores token order. After projection the transformer sees only a collection of tokens, so explicit position signals tell it where each patch came from in the 2D image.
Common wrong answer to avoid: "Because attention automatically reconstructs 2D position without being told."

Q: Why can ViT scale well once the patching step is chosen carefully?
A: Because patching reduces the token count enough that standard transformer blocks can model long-range visual relations efficiently.
Common wrong answer to avoid: "Because transformers become cheap on images regardless of sequence length."
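The "parameter blow-up" in the shared-projection answer is easy to put a number on. A rough count, assuming a model dimension of D = 768 purely for illustration and ignoring bias terms; `projection_params` is a hypothetical helper:

```python
def projection_params(values_per_patch, dim, num_patches=1):
    """Weight count for the patch projection, bias ignored."""
    return num_patches * values_per_patch * dim


D = 768   # assumed model dimension, for illustration only

shared = projection_params(768, D)                       # one matrix for all patches
per_patch = projection_params(768, D, num_patches=196)   # one matrix per patch slot

print(shared)        # 589824
print(per_patch)     # 115605504
```

A separate matrix per patch slot costs 196 times more weights, and each one would have to relearn the same edges and textures independently.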


Apply now (5 min)

Quick exercise. Take a 32 × 32 × 3 image and choose patch size 8 × 8. Compute the patch count, the values per patch, and the token count after patching. Then assume projection dimension D = 6 and write the shape of the shared projection matrix.

Sketch from memory the full pipeline: image → patches → flatten → projection → add position → transformer. Under the sketch, write one line on why the encoder needs patch tokens before attention can work well.


Bridge. The eye can see. But it cannot speak. To connect vision to language, we need a shared space. → 03-clip-contrastive-alignment.md