03. What shapes can your model actually draw?¶

Five minutes. One core idea: every model family is a vocabulary of shapes. Pick the wrong vocabulary, no tuning saves you.

Built on 00-eli5.md. The feature list, the decision boundary, and the positive class are the picture to keep open. Picking a model is picking a shape vocabulary.

The mental model — shapes, not formulas¶

See. Look at any 2D spam-classification problem. Plot spam as +, ham as o. The boundary between them has a shape. A line. A curve. A box. A blob.

Every model family can only draw boundaries from its own private vocabulary.

Linear regression and logistic regression — one straight line.
Polynomial regression — smooth curves.
Decision tree — axis-aligned rectangles (slice along suspicious-word count, then along link count, then along suspicious-word count again).
RBF-kernel SVM — soft circular blobs.
kNN — Voronoi cells, jagged regions stitched from training points.
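To make the list concrete, here is a minimal sketch of each vocabulary as a tiny decision function. Every threshold, centre, and training point is a made-up value chosen for illustration, not something fitted to data.

```python
import math

# All numbers below are hypothetical, picked only to show the shape of each rule.

def linear(x, y):
    # One straight line: positive above the line y = 2.5.
    return 0.0 * x + 1.0 * y - 2.5 > 0

def polynomial(x, y):
    # One smooth curve: positive above the parabola y = (x - 2)^2.
    return y - (x - 2.0) ** 2 > 0

def tree(x, y):
    # Axis-aligned rectangles: each test looks at one feature only.
    if x < 1.5:
        return False
    return y > 2.0

def rbf_blob(x, y, centre=(2.0, 2.0), gamma=1.0):
    # One soft bump around a centre; positive where the bump is tall enough.
    d2 = (x - centre[0]) ** 2 + (y - centre[1]) ** 2
    return math.exp(-gamma * d2) > 0.5

TRAIN = [((1.0, 1.0), False), ((3.0, 3.0), True), ((3.0, 1.0), False)]

def nearest_neighbour(x, y):
    # 1-NN: copy the label of the closest training point (one Voronoi cell each).
    closest = min(TRAIN, key=lambda item: (x - item[0][0]) ** 2 + (y - item[0][1]) ** 2)
    return closest[1]

probe = (2.0, 3.0)
for rule in (linear, polynomial, tree, rbf_blob, nearest_neighbour):
    print(rule.__name__, rule(*probe))
```

Same probe point, five different answers depending on the vocabulary. The rule's form decides which regions it can carve out at all.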

Now what is the problem? The data has a shape. The model has a vocabulary. If they match, low bias, easy training. If they mismatch — underfitting. The model cannot draw the boundary it needs no matter how long you train.

Recall the ELI5 picture. The feature list (word_count, link_count) lives in a 2D plane. The positive class "many spammy words and many links" is a corner box. A linear model can only draw a tilted line across that plane — never a box. Wrong vocabulary. Structurally wrong.
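A small sketch of why the corner box is out of reach for a line. The threshold of 5 and the three sample emails are assumptions made up for this example. The trick: pick two ham points and the spam point exactly halfway between them. A linear score at the midpoint is the average of the scores at the two ends, so it cannot be positive when both ends are not.

```python
# Hypothetical threshold: "many" means more than 5 of each.
def corner_box(spam_words, links):
    return spam_words > 5 and links > 5

# Two hams, plus the point exactly halfway between them, which is spam.
points = [((60, 4), False), ((4, 60), False), ((32, 32), True)]
assert all(corner_box(*p) == label for p, label in points)

def linear_rule_fits(w1, w2, b):
    # True only if the line labels all three points correctly.
    return all((w1 * x + w2 * y + b > 0) == label for (x, y), label in points)

# Exhaustive search over a small integer grid of weights and biases.
hits = [
    (w1, w2, b)
    for w1 in range(-5, 6)
    for w2 in range(-5, 6)
    for b in range(-400, 401)
    if linear_rule_fits(w1, w2, b)
]
print(len(hits))  # 0
```

The grid is only a demonstration. The midpoint argument is what rules out every line, not just the ones searched here.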

The five shape vocabularies — side by side¶

   Linear            Polynomial         Tree (axis-aligned)
   y                 y                  y
   ^                 ^                  ^
   | + + + + +       |   + + + +        | o | + | +
   | + + + + +       |  +       +       |---+---+---
   |---------        | +    o    +      | o | + | o
   | o o o o o       |  +       +       |---+---+---
   | o o o o o       |   + + + +        | o | o | o
   +---------> x     +-----------> x    +-----------> x

   RBF kernel               kNN (Voronoi)
   y                        y
   ^                        ^
   |   o o o o              | o |  +  +
   |  o + + + o             | o + + + o
   |  o + + + o             | o + + o |
   |   o o o o              | o o | o |
   +-----------> x          +-----------> x

One straight cut. A smooth curve. A staircase of boxes. A soft round bubble. A stitched-together quilt of nearest-point regions. Five different model families, five different shape vocabularies.

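The voting panel from ELI5 is what happens when you combine many trees: each tree draws its own staircase boundary, and the combined boundary has so many small steps that it reads as a smooth quilt. Random forests average the votes of independently grown trees. Gradient boosting adds up many small trees fitted in sequence, each one correcting the errors left by those before it. One tree = blocky. A hundred trees = almost-smooth.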
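A minimal sketch of the smoothing effect, on one feature only. The 100 evenly spaced thresholds are an assumption for illustration; real forests get their variety from resampled data, not from a hand-picked spread.

```python
def stump(threshold):
    # A depth-1 tree on one feature: a single hard step at the threshold.
    return lambda x: 1.0 if x > threshold else 0.0

def average_vote(stumps, x):
    # Each stump votes 0 or 1; the ensemble reports the mean vote.
    return sum(s(x) for s in stumps) / len(stumps)

one_tree = [stump(0.5)]
# Hypothetical ensemble: 100 stumps with thresholds 0.00, 0.01, ..., 0.99.
forest = [stump(i / 100) for i in range(100)]

xs = [(2 * k + 1) / 40 for k in range(20)]   # 0.025, 0.075, ..., 0.975
single_values = {average_vote(one_tree, x) for x in xs}
forest_values = {average_vote(forest, x) for x in xs}
print(len(single_values), len(forest_values))  # 2 20
```

One stump can only say 0 or 1. The averaged ensemble gives a different value at each of the 20 probe points, which is the staircase turning into a ramp.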

Why a linear model cannot draw a circle — three honest attempts¶

Suppose disease lives inside a circle: a patient is positive if x² + y² < 1 and negative otherwise. That is the unit disc centred at the origin.

   y
   ^
   |    o o o o
   |   o + + + o
   |  o + + + + o
   |   o + + + o
   |    o o o o
   +--------------> x

A linear classifier predicts + when w1·x + w2·y + b > 0 and o otherwise. One line cuts the plane into two half-planes. We hunt for w1, w2, b that put all + on one side and all o on the other.

Attempt 1 — vertical line `x = 0`¶

Pick w1 = 1, w2 = 0, b = 0.

Point	w1·x + w2·y + b	Predicted	Target	Match?
(-0.5, 0) inside	-0.5	o	+	✗
(+0.5, 0) inside	+0.5	+	+	✓
(-2, 0) outside	-2.0	o	o	✓
(+2, 0) outside	+2.0	+	o	✗

Half the disc is on the wrong side. Half the outside is on the wrong side. 50%.

Attempt 2 — horizontal line `y = 0`¶

w1 = 0, w2 = 1, b = 0. Symmetric to attempt 1. Same 50%.

Attempt 3 — diagonal `x + y = 0`¶

w1 = 1, w2 = 1, b = 0.

Point	w1·x + w2·y + b	Predicted	Target	Match?
(-0.5, -0.5) inside	-1.0	o	+	✗
(+0.5, +0.5) inside	+1.0	+	+	✓
(-2, -2) outside	-4.0	o	o	✓
(+2, +2) outside	+4.0	+	o	✗

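Same problem. Any line that passes through the disc has positives on both sides, and every line, wherever you put it, has negatives on both sides. A line through the centre scores 50% on data that is symmetric about the origin. Sliding the line away from the centre trades one kind of error for another but never removes both.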
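The three attempts can be checked in a few lines. This is a sketch on a made-up sample in which every point is paired with its mirror image through the origin, so each line through the centre must get exactly one point of every pair wrong.

```python
def inside_unit_disc(x, y):
    return x * x + y * y < 1

def linear_predict(w1, w2, b, x, y):
    return w1 * x + w2 * y + b > 0

# Hypothetical sample: every point p is paired with its mirror image -p.
half = [(0.3, 0.4), (0.5, -0.2), (1.5, 2.0), (2.0, -1.0)]
points = half + [(-x, -y) for x, y in half]

def accuracy(w1, w2, b):
    correct = sum(
        linear_predict(w1, w2, b, x, y) == inside_unit_disc(x, y)
        for x, y in points
    )
    return correct / len(points)

attempts = {"x = 0": (1, 0, 0), "y = 0": (0, 1, 0), "x + y = 0": (1, 1, 0)}
for name, (w1, w2, b) in attempts.items():
    print(name, accuracy(w1, w2, b))   # 0.5 for all three
```

A point and its mirror image share a label but land on opposite sides of any line through the origin. One of the two is always misclassified.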

The structural reason¶

A line divides the plane into two infinite half-planes. The positive class lives inside a bounded disc. No infinite half-plane equals a bounded disc. Geometry forbids it.

The fix — feature engineering changes the vocabulary¶

Add one new feature: r² = x² + y². Now the data lives in 3D. The decision rule becomes r² < 1, which is a single threshold on r². That is a linear boundary in the new space — a flat plane at r² = 1.

   r² ↑
   |    o o o o o   ← outside the circle, r² > 1
   |  ----------    ← linear cut at r² = 1
   |    + + + +     ← inside, r² < 1
   +---------------> x

Same data. Richer vocabulary. Now linearly separable. Kernel SVMs do this implicitly: the RBF kernel scores similarity from the squared distance between two points, and that is equivalent to fitting a linear separator in an infinite-dimensional feature space. Hidden layers in neural nets do it too: they learn the right features so the final layer's line works.
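A minimal sketch of the fix, reusing the points from the three attempts. The weights are hand-set for illustration, not learned: zero on x and y, minus one on r², bias one.

```python
def lift(x, y):
    # Append the engineered feature r2 = x^2 + y^2.
    return (x, y, x * x + y * y)

# Linear rule in the lifted space: score = 0*x + 0*y - 1*r2 + 1.
# The score is positive exactly when r2 < 1.
W = (0.0, 0.0, -1.0)
B = 1.0

def predict(x, y):
    features = lift(x, y)
    score = sum(w * f for w, f in zip(W, features)) + B
    return score > 0

points = [(-0.5, 0.0), (0.5, 0.0), (-2.0, 0.0), (2.0, 0.0),
          (-0.5, -0.5), (0.5, 0.5), (-2.0, -2.0), (2.0, 2.0)]
for x, y in points:
    print((x, y), "+" if predict(x, y) else "o")
```

The classifier is still a weighted sum plus a bias. Only the input changed.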

Neural nets — learned boundaries with many bends¶

A deep net is not stuck with one line, one circle, or one staircase. Each hidden layer bends the feature space a little. Stack enough bends and the final layer can trace very irregular shapes.

spam region:   ╭─╮   ╭────╮
               │ │╭──╯    ╰─╮
               ╰─╯│  ham     │
                  ╰──────────╯

That flexibility is powerful. It is why a deep net with enough hidden units can approximate almost any decision boundary. But the bill is heavy: much more data, much more tuning, much more compute. Simple, no? More shape freedom usually sends you a bigger bill. For two clean tabular features like word count and link count, a tree or kernel SVM often wins faster. Use deep nets when the raw input itself is messy (email text, attachments, sender history, embeddings) and you need the model to learn the right features for you.
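A tiny sketch of what "bends" means. This is a one-hidden-layer ReLU network with hand-set weights, not trained ones, assumed here purely for illustration. Four bent units are enough to fence off a closed diamond, something no single line can do.

```python
def relu(z):
    return max(0.0, z)

def tiny_net(x, y):
    # Hidden layer: four ReLU units, each a line with one bend at zero.
    hidden = [relu(x), relu(-x), relu(y), relu(-y)]
    # Output layer: a plain linear rule on the hidden units.
    score = 1.0 - sum(hidden)          # equals 1 - (|x| + |y|)
    return score > 0

for point in [(0.0, 0.0), (0.4, 0.4), (0.6, 0.6), (2.0, 0.0), (-2.0, 0.0)]:
    print(point, "+" if tiny_net(*point) else "o")
```

The output layer is linear in the hidden units. The bounded region comes entirely from the bends the hidden layer adds. A trained net finds such weights from data, which is where the extra data and compute bill comes from.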

Vocabulary is not fixed. You expand it through features (manual), kernels (implicit), or layers (learned).

In the wild — where each shape vocabulary ships¶

Booking.com hotel ranking — gradient-boosted trees. Every search returns a list ranked by predicted booking probability. The signal mixes price, location, review score, room type, season, device, and dozens of interactions. Trees split on these one feature at a time, and 500-tree boosted ensembles smooth the staircase into something nearly continuous. A linear ranking model cannot capture "discount + business traveller + Tuesday" as a joint signal unless someone hand-builds that interaction as a feature.
Kaggle tabular winners — XGBoost and LightGBM. Across hundreds of tabular Kaggle competitions, gradient-boosted trees won the majority. Why? Tabular features are heterogeneous — mixed scales, missing values, monotonic and non-monotonic relationships. Rectangle vocabulary plus ensembling matches that shape. Deep nets struggle on this exact terrain.
Spotify and Netflix recommendation — kNN over learned embeddings. First, a deep model (or matrix factorization) learns a vector per user and per item. Then for retrieval, kNN finds the nearest neighbours in that embedding space. Voronoi vocabulary works because embeddings put similar items together — closeness in embedding space is the prediction.
Affirm credit underwriting — boosted trees plus logistic-regression audit. The shipped model is a tree ensemble for accuracy. The audit witness is a logistic regression for ECOA-compliant explanations. Two vocabularies, two roles in one production system.
Gmail spam filtering — deep nets over message embeddings. Raw email text, sender history, attachment signals, and link patterns do not live in a neat hand-engineered feature space. Deep models learn the right shape vocabulary directly from the messy input, then draw a spam-vs-ham boundary in that learned space.

Pause and recall. Without scrolling. Name the five shape vocabularies. State the one feature you add to make a circle linearly separable. What extra power do hidden layers add beyond those five? Why does Booking's ranking use trees and not logistic regression? If any link is fuzzy, scroll back.

Interview Q&A¶

Q: Why do trees dominate tabular data? A: Tabular features are heterogeneous in scale and meaning, and the boundary often involves axis-aligned interactions ("if amount > X and country is Y"). Trees split on one feature at a time, naturally handling mixed scales and conditional logic. Boosting ensembles many weak trees so the staircase boundary becomes smooth where the data demands it. Common wrong answer to avoid: "trees are more powerful than neural nets". They are not. They are better-matched to the shape language of tabular problems specifically. On images, audio, or text, the same trees lose badly.

Q: Why doesn't a linear model with enough features always work? A: It can — but only if you already know which features to engineer. The XOR-style failure goes away once you add x1·x2. The circle failure goes away once you add x² + y². The catch is you have to know the right interaction in advance. Hidden layers and kernels learn or implicitly construct those features for you. Linear with hand-engineered features is workable only when the feature engineer understands the data deeply. Common wrong answer to avoid: "you just need more features". Random extra features add noise, not signal. Wrong vocabulary plus more dimensions is still wrong vocabulary.
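The XOR case can be sketched in a few lines. The grid of candidate weights and the final hand-set rule are assumptions chosen for illustration.

```python
xor_points = [((0, 0), False), ((0, 1), True), ((1, 0), True), ((1, 1), False)]

def fits(rule):
    # True only if the rule labels all four XOR points correctly.
    return all(rule(x1, x2) == label for (x1, x2), label in xor_points)

# Search a grid of plain linear rules on (x1, x2): none fits all four points.
grid = [k / 2 for k in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
plain_hits = sum(
    fits(lambda x1, x2: w1 * x1 + w2 * x2 + b > 0)
    for w1 in grid for w2 in grid for b in grid
)
print(plain_hits)  # 0

# Add the interaction x1*x2 and a linear rule on three features works.
def with_interaction(x1, x2):
    return 1.0 * x1 + 1.0 * x2 - 2.0 * (x1 * x2) - 0.5 > 0

print(fits(with_interaction))  # True
```

The search space did not get cleverer. The one engineered feature did the work, and someone had to know to add it.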

Q: What is a kernel really? A: A kernel is a similarity function k(x, x') that secretly equals a dot product in a much higher-dimensional space. The RBF kernel exp(-γ·||x - x'||²), with width parameter γ > 0, corresponds to an infinite-dimensional feature map. You never compute the features; you just compute similarities and run a linear method on top. The shape vocabulary becomes "linear combinations of bumps centred at training points", which lets you draw soft circular blobs and smooth curves without explicitly engineering them. Common wrong answer to avoid: "A kernel is just a trick to make SVM nonlinear." Too shallow. The real point is that it changes the feature space so a linear separator in that new space becomes a curved boundary in the original space.
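The "secret dot product" can be checked by hand for a kernel whose feature map is small enough to write down. The sketch below swaps in the degree-2 polynomial kernel for that purpose, since the RBF feature map is infinite-dimensional and cannot be listed. The RBF similarity is included alongside for comparison, with gamma = 1 as an assumed default.

```python
import math

def poly_kernel(x, z):
    # Degree-2 polynomial kernel: a similarity computed in the original 2D space.
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    # Explicit 3D features whose dot product reproduces poly_kernel.
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(x, z, gamma=1.0):
    # RBF similarity: 1 for identical points, decaying with squared distance.
    d2 = (x[0] - z[0]) ** 2 + (x[1] - z[1]) ** 2
    return math.exp(-gamma * d2)

x, z = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, z), dot(feature_map(x), feature_map(z)))
print(rbf_kernel(x, x), rbf_kernel(x, z))
```

The two numbers on the first line agree up to floating-point rounding: one was computed in 2D, the other in the lifted 3D space. A kernel method only ever needs the cheap version.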

Q: When would you reach for kNN over a tree? A: When the boundary is smooth and locally varying, when you have learned embeddings (so distance is meaningful), and when you need a strong baseline with zero training cost. Recommendation retrieval is the canonical case: embeddings already encode similarity, so kNN's Voronoi vocabulary is the right match. Avoid kNN when features are heterogeneous, or when latency matters at inference and you cannot build an approximate index (exact kNN is fast at train time and slow at predict time, the opposite of trees). Common wrong answer to avoid: "kNN is just a baseline, never use it in production". Spotify, Pinterest, and YouTube all serve recommendations with approximate kNN at massive scale.
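A minimal sketch of kNN retrieval over embeddings. The item names and 2D vectors are hypothetical; real systems use learned vectors with many more dimensions and an approximate index in place of this brute-force scan.

```python
import math

# Hypothetical 2D item embeddings, made up for illustration.
ITEM_EMBEDDINGS = {
    "song_a": (0.9, 0.1),
    "song_b": (0.8, 0.2),
    "song_c": (0.1, 0.9),
    "song_d": (0.2, 0.8),
}

def nearest_items(query, k=2):
    # Exact brute-force kNN: measure the distance to every item, keep the k closest.
    ranked = sorted(
        ITEM_EMBEDDINGS,
        key=lambda name: math.dist(query, ITEM_EMBEDDINGS[name]),
    )
    return ranked[:k]

print(nearest_items((0.95, 0.05)))      # ['song_a', 'song_b']
print(nearest_items((0.0, 1.0), k=1))   # ['song_c']
```

There is no training step at all. Every query scans the whole catalogue, which is the predict-time cost the answer above warns about.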

Apply now (5 min)¶

Take a blank page. Without looking back, sketch the five boundary shapes from memory:

Linear — one straight line through scattered + and o.
Polynomial — one smooth curve.
Tree — axis-aligned rectangles, three or four boxes labelled + or o.
RBF kernel — a soft circular blob of + surrounded by o.
kNN — a Voronoi-style quilt of jagged regions.

Then add one more line under the drawing: neural net = many learned bends stitched together. Write one sentence per shape: which kind of real-world problem matches this vocabulary? Then write one sentence on when a neural net is worth the extra data and compute bill. If any part feels fuzzy, return to the side-by-side picture above.

The whole module rests on this drawing. If you can produce it cold in 90 seconds, you own the shape-matching mental model.

Bridge. Picking the right shape vocabulary is half the battle. The other half is stopping the model from drawing too wild a shape and memorizing the training set. That is overfitting from ELI5 — and the cure is regularization. Read 04-regularization.md next.