16. K-Nearest Neighbors — ask the closest houses¶

Six minutes. No training. Just distance, memory, and a tiny vote among nearby examples.

Built on the ELI5 in 00-eli5.md. A feature list becomes a point on a map. The nearest old examples form a tiny voting panel that makes the prediction. KNN is the smallest local voting panel possible.

The picture before math¶

See. A new house hits the market. We do not fit a formula. We do not learn weights. We simply open the old sales archive. Then we ask: "Which past houses look most similar?" Those nearest houses vote on the likely price band.

past examples on a feature map
high_value     high_value
    x              x
         ? new house
              o  low_value
                 o  low_value
look at the nearest K points
let them vote

That is KNN. No real training phase. Training is mostly: "Store the data. Keep it clean. Index it if needed." So what is the tradeoff? KNN is lazy during training. KNN works hard during inference.

Classification and regression — same idea¶

For classification, neighbors vote by majority. If 2 of the 3 nearest houses were high-value, predict high-value. For regression, neighbors average a number. If the 3 nearest homes sold for 90, 100, and 110 lakhs, predict 100 lakhs. Same rule. Different output. The feature list is still the point in space. The only new ingredient is distance.

What K actually controls¶

K is how many neighbors you listen to. - Small K → very local. Flexible. Noisy. Can overfit. - Large K → smoother. Stable. Can underfit.

K = 1   → listen to one loud neighbor
K = 3   → small voting panel
K = 25  → broad neighborhood average

So what to do? Pick K on validation data. There is no magic universal number.

Distance metrics — because "near" depends on the problem¶

Near in what sense? That is the full question.

Euclidean distance¶

Straight-line distance.

Euclidean((0,0),(3,4)) = √(3² + 4²) = 5

Good when numeric features live on a meaningful geometric scale.

Manhattan distance¶

Grid-walk distance.

Manhattan((0,0),(3,4)) = |3| + |4| = 7

Useful when movement happens axis by axis. Also useful when outliers should hurt less than Euclidean.

Cosine distance¶

Angle, not length. Two vectors with similar direction are considered close. Very useful for text. A long document and a short document can discuss the same topic. Cosine sees that.

cosine similarity = (a · b) / (||a|| ||b||)

So the picture changes a little. Euclidean asks, "How physically close are these data points?" Cosine asks, "Are their feature patterns pointing in the same direction?"

Worked example — classify one house with `K = 3`¶

Take six labeled houses on a normalized feature map. Let the two features be: - first coordinate = size score - second coordinate = age score High-value class: - h1 = (1,1) - h2 = (2,1) - h3 = (1,2) Low-value class: - l1 = (4,3) - l2 = (5,4) - l3 = (4,4) New house: - q = (3,2) Use Euclidean distance.

Step 1 — compute distances to high-value houses¶

d(q,h1) = √[(3-1)² + (2-1)²] = √5  ≈ 2.24
d(q,h2) = √[(3-2)² + (2-1)²] = √2  ≈ 1.41
d(q,h3) = √[(3-1)² + (2-2)²] = √4  = 2.00

Step 2 — compute distances to low-value houses¶

d(q,l1) = √[(3-4)² + (2-3)²] = √2  ≈ 1.41
d(q,l2) = √[(3-5)² + (2-4)²] = √8  ≈ 2.83
d(q,l3) = √[(3-4)² + (2-4)²] = √5  ≈ 2.24

Step 3 — sort nearest neighbors¶

Nearest three are: 1. h2 at 1.41 2. l1 at 1.41 3. h3 at 2.00 So the vote is: - High-value = 2 - Low-value = 1 Final prediction:

q → High-value

Nice detail. If K = 1, there is a tie in nearest distance and implementation details matter. If K = 5, the broader neighborhood may change the answer. So K is not decoration. It changes the boundary shape.

Why KNN has no real training phase¶

Interviewers ask this all the time. KNN is called a lazy learner. Why? Because it postpones the work. During training, it mostly stores the data. During inference, it computes distance from the query to many stored points. So complexity shifts. - Training: near-zero beyond storage and optional indexing - Inference: roughly O(n · d) per query n is number of training points. d is feature count. That is why KNN feels easy in notebooks and painful in production.

Approximate nearest neighbors¶

Exact KNN is O(n · d) per query. Too slow for millions of points. FAISS (Facebook), Annoy (Spotify), and ScaNN (Google) build index structures for approximate search. They trade a tiny accuracy loss for 100–1000× speedup. If you use KNN in production, you usually use ANN.

Decision boundary shape — very flexible, very local¶

KNN can draw wild boundaries. Why? Because every new query builds its own tiny local rule.

query here    → consult these neighbors
query there   → consult a different set

So the boundary is not one global street like SVM. It is a patchwork. That is the strength. That is also the weakness. On clean local patterns, it is brilliant. On noisy data, it becomes twitchy.

The curse of dimensionality¶

Now the bad news. KNN hates high dimensions. Why? Because in high-D space, everything starts feeling equally far. The nearest neighbor is no longer much nearer than the farthest one. Then the local voting panel stops being local.

2D   → neighbors are meaningfully close
100D → all distances bunch together

Also, each extra feature gives one more way to be different. Many of those features are noise. So the distance becomes polluted. This is why KNN often fails on sparse text or huge embedding spaces unless you do careful metric design and indexing.

Practical knobs that matter¶

A few details decide whether KNN is decent or useless. - Feature scaling. If age is 0–100 and income is 0–10,00,000, income dominates distance. Standardize first. - Odd K for binary classification. Helps avoid ties. - Distance weighting. Give closer neighbors more vote than far ones. - Missing values. KNN does not naturally enjoy them. - Approximate nearest neighbor search. Needed when data is large. So what to do first in practice? Scale the feature list. Always.

When KNN works well¶

KNN wins when: - data is small - dimensions are low or moderate - similar inputs really do imply similar outputs - you want an interpretable local explanation - retraining every day is annoying and you prefer just adding new cases In these cases, you can literally point to three old houses and say: "This new listing looks like them." That is powerful in interviews and in product reviews.

When KNN fails¶

KNN loses when: - data is large - features are high-dimensional - inference latency matters - irrelevant features pollute distance - class boundaries need learned weighting, not raw distance And remember the core cost:

inference cost ≈ O(n · d)

So a million rows and a thousand features? That is not a friendly setup.

Where this lives in the wild¶

Amazon similar-items widgets. Nearest-neighbor retrieval over item features or embeddings is the direct idea behind “customers also viewed” style candidate generation.
Spotify song similarity. Audio and embedding vectors are often searched by nearest neighbors to surface tracks that feel locally similar in style or mood.
Zillow-style price comps. KNN regression is the mental model for nearby-home valuation: similar location, size, age, and sale date imply similar price.
Apple Photos face grouping. After embeddings are built, nearest-neighbor search is a core move for finding visually similar faces or images.
Uber location similarity. Pickup zones or route patterns are often compared by nearest vectors before more complex ranking models refine the answer.

Interview Q&A¶

Q: Why is KNN called a lazy learner?
A: Because it does almost no parameter learning during training. It stores the labeled examples and postpones the real computation until prediction time, when it measures distances to many training points. The model is the dataset.
Common wrong answer to avoid: "Because KNN is simple." Simplicity is not the point. Laziness refers to when computation happens. Q: How do you choose K?
A: By validation. Small K gives low bias and high variance; large K gives high bias and low variance. In practice, sweep a few odd values, standardize features, and pick the K that wins on held-out data.
Common wrong answer to avoid: "Always use 3 or 5." That is folk wisdom, not model selection. Q: How do approximate nearest-neighbor methods like FAISS work?
A: They partition the feature space or build fast graph indexes. IVF (inverted file index) uses coarse buckets, HNSW uses a hierarchical navigable small-world graph, and both search only nearby regions at query time. Accuracy drops a little, but latency drops from seconds to milliseconds.
Common wrong answer to avoid: "FAISS is just KNN in C++." No. FAISS uses fundamentally different algorithms — IVF, HNSW, and product quantization — for sub-linear approximate search. Q: Why does KNN fail in high dimensions?
A: Because distances concentrate. The nearest and farthest points become similarly distant, so “local” neighborhoods stop being meaningful. Noise features also swamp useful ones, unless you learn a better representation first.
Common wrong answer to avoid: "Because there are too many calculations." Computation hurts, yes, but the deeper problem is geometry. Q: When should I use cosine distance instead of Euclidean?
A: When direction matters more than magnitude, especially in text or sparse count vectors. Two documents with very different lengths can still talk about the same topic. Cosine sees similar pattern; Euclidean overreacts to length.
Common wrong answer to avoid: "Cosine is always better for high-dimensional data." No. If magnitude carries signal, cosine throws that away.

Apply now (5 min)¶

Take the worked example. Without a calculator, recompute the six distances from q = (3,2). Then answer: 1. What is the prediction for K = 3? 2. What changes if you use K = 1? 3. Why must you scale features before KNN? Then — without looking — sketch from memory: 1. A query point with its three nearest neighbors circled. 2. One line on Euclidean vs Manhattan vs cosine. 3. One line on why KNN has no real training phase. If you can do all three in 90 seconds, you own KNN.

Bridge. KNN asks neighbors. But what if you could compute the probability directly from the data distribution? That is what Naive Bayes does. Read 17-naive-bayes.md next.