19. K-Means, DBSCAN, and Clustering — find groups before anyone names them¶

Seven minutes. No labels. Just structure, density, centroids, and the danger of seeing patterns that are not really there.

Built on the ELI5 in 00-eli5.md. The dataset — before any labels are assigned — can still have data points naturally forming groups. Clustering tries to spot those groups without being told the names.

The picture before math¶

See. Sometimes you have no labels. No label column. No fraud yes/no. No churned/not churned. Just a table of customers. Can we still learn something? Yes. We can look for structure. Maybe the points naturally form groups. Maybe one group is dense. Maybe a few points are isolated noise. That is clustering.

points on a map
xxx          ooo
 xx          oo
     .  .
left cloud   right cloud   stray noise

Important honesty. Clusters are not truth. They are patterns under a chosen distance notion. Change the scaling or metric. The groups may change. So the algorithm is describing structure, not discovering divine labels.

K-Means — centroid, assign, update, repeat¶

K-Means is the first clustering algorithm people learn. It assumes each cluster has a center. Call that center the centroid. Then it repeats two steps. 1. Assign each point to the nearest centroid. 2. Recompute each centroid as the mean of its assigned points. Repeat until hands_on_labs stop changing.

start centroids
    ↓
assign points to nearest center
    ↓
move centers to the mean
    ↓
repeat

That is the whole loop. Fast. Simple. Very interviewable.

K-Means++ — better starting centroids¶

Standard K-Means picks random starting centroids. Bad luck can give bad clusters. K-Means++ picks the first centroid randomly, then picks each next centroid with probability proportional to its distance from the nearest existing centroid. So the centroids spread out before the loop even starts. It usually converges faster and more reliably. In sklearn, init='k-means++' is the default.

Worked example — K-Means on 6 points with 2 clusters¶

Take six unlabeled points: - A = (1,1) - B = (1,2) - C = (2,1) - D = (8,8) - E = (8,9) - F = (9,8) Set K = 2. Initialize centroids as: - c1 = (1,1) - c2 = (9,9)

Round 1 — hands_on_lab¶

Points near c1: - A, B, C Points near c2: - D, E, F Easy separation.

Round 1 — update centroids¶

New c1 is the mean of A, B, C:

c1 = ((1+1+2)/3, (1+2+1)/3)
   = (4/3, 4/3)
   ≈ (1.33, 1.33)

New c2 is the mean of D, E, F:

c2 = ((8+8+9)/3, (8+9+8)/3)
   = (25/3, 25/3)
   ≈ (8.33, 8.33)

Round 2 — hands_on_lab again¶

Now recheck hands_on_labs. All first three points are still much closer to c1. All last three points are still much closer to c2. So hands_on_labs do not change. Algorithm converges. Final clusters are: - Cluster 1: A, B, C - Cluster 2: D, E, F That is K-Means in one page.

What K-Means is really minimizing¶

K-Means minimizes within-cluster squared distance. Usually written as:

Σ over clusters Σ over points in cluster ||x - centroid||²

Picture first. Each cluster wants tight points around its center. Big spread is punished. So K-Means loves: - roundish clusters - similar densities - similar sizes That hidden assumption matters. If the true groups are curved or unequal, K-Means can look silly.

Choosing K — elbow and silhouette¶

Now the annoying question. How many clusters?

Elbow method¶

Plot total within-cluster error vs K.

error
^
| *
|  *
|   *
|    *
|      *
+-----------------> K
      elbow here

Choose the point after which extra clusters give only small improvement.

Silhouette score¶

Silhouette compares: - how close a point is to its own cluster - how far it is from the next nearest cluster High silhouette is good. Near zero means overlap. Negative means a point may sit in the wrong cluster. So what to do in practice? Use elbow plus silhouette plus domain sense. Never treat one metric as holy.

DBSCAN — density, not centroids¶

K-Means assumes round groups around centroids. DBSCAN asks a different question. "Where are the dense neighborhoods?" It uses two knobs: - eps = neighborhood radius - min_samples = minimum points needed to call a place dense Then it builds clusters from dense connected regions. Points that do not belong anywhere become noise.

curved dense shape
xxxx
   xxxx
      xxxx     . noise
DBSCAN can keep the whole curved chain together
K-Means would try to cut it into round pieces

So DBSCAN wins when clusters have arbitrary shapes. Also when explicit noise detection matters.

Hierarchical clustering — merge step by step¶

Hierarchical clustering does not start with centroids. It builds a tree. At the start, every point is its own cluster. Then clusters merge step by step. The tree is a dendrogram.

A   B   C   D
|   |   |   |
|---|   |   |
|   |---|   |
|-------|   |
|-----------|

Cut the tree high. Few broad clusters. Cut it low. Many fine clusters. Important detail. The merge rule depends on linkage. - single linkage → nearest pair - complete linkage → farthest pair - average linkage → average distance - Ward linkage → variance-based merging Different linkage, different tree.

Gaussian Mixture Models — soft clustering¶

Gaussian Mixture Models (GMM) are soft clustering. Each point gets a probability of belonging to each cluster, not one hard hands_on_lab. Use GMM when clusters overlap or when you want uncertainty estimates instead of a forced winner.

When clustering fails¶

Now the honest admission. Clustering is fragile. It fails when: - dimensions are high and distance becomes noisy - clusters have unequal density - features are poorly scaled - there is no real cluster structure but you force one anyway - K-Means is used on curved clusters So what to do first? Scale the features. Then visualize. Then question whether clustering is even the right move. Underfitting here is using a clustering rule so blunt that real structure disappears. Overfitting is tuning knobs until noise starts looking meaningful.

K-Means vs DBSCAN vs hierarchical — one clean contrast¶

K-Means asks for K and returns centroid-based groups.
DBSCAN asks for density knobs and can mark noise.
Hierarchical returns a merge tree and lets you choose the cut later. Use K-Means when clusters are compact and you mainly want segmentation. Use DBSCAN when shape is weird or noise matters. Use hierarchical when you want multi-scale grouping and a dendrogram story.

Where this lives in the wild¶

Amazon customer segmentation. Purchase-frequency and basket features are clustered to design marketing cohorts and lifecycle campaigns.
Spotify playlist and song grouping. Embeddings for mood, genre, and audio structure are clustered to organize catalogs and discover local music neighborhoods.
Uber city demand zones. Pickup and dropoff patterns are clustered to reveal operational regions before pricing and dispatch models refine the policy.
CrowdStrike threat analysis. Similar malware behaviors or alert patterns are grouped to reduce analyst overload and surface attack families.
23andMe and genomics labs. PCA plus clustering is a standard pipeline for discovering sample groups, cell types, or population structure before deeper biological analysis.

Interview Q&A¶

Q: What assumption does K-Means make?
A: It assumes clusters can be represented well by centroids and that minimizing within-cluster squared distance is meaningful. In practice that means compact, roughly spherical, similarly scaled groups. If the data has long curved shapes or very unequal densities, K-Means can fail badly.
Common wrong answer to avoid: "K-Means works for any kind of clusters if K is correct." K fixes count, not geometry. Q: What is K-Means++ and why does it matter?
A: It initializes centroids by spreading them out. Each new centroid is placed far from the existing ones, so you avoid the common failure where two centroids start inside the same real cluster. That makes convergence faster and the final solution more reliable on average.
Common wrong answer to avoid: "K-Means++ guarantees the global optimum." No. It only improves initialization. K-Means can still converge to a local minimum. Q: When would you choose DBSCAN over K-Means?
A: When clusters are irregularly shaped, when noise points should stay unassigned, or when you do not want to pre-specify K. DBSCAN grows dense regions instead of pulling points toward centroids. It is especially good for “blobs plus outliers” style problems.
Common wrong answer to avoid: "DBSCAN is always better because it finds arbitrary shapes." It struggles when densities differ a lot or when high-dimensional distance becomes meaningless. Q: How do you choose the number of clusters?
A: Use multiple signals. Elbow for diminishing returns, silhouette for separation quality, and domain knowledge for whether the segmentation is actionable. Clustering is exploratory, so the right K is often a business decision as much as a mathematical one.
Common wrong answer to avoid: "Pick the K with the highest silhouette and you're done." That ignores usefulness and stability. Q: Is clustering supervised or unsupervised?
A: Unsupervised. There are no labels during training. The algorithm is finding structure under a distance or density notion, not matching known labels.
Common wrong answer to avoid: "Unsupervised means the clusters are automatically true." No. Clusters are patterns under your metric and scaling choices.

Apply now (5 min)¶

Take the six-point example. Without notes: 1. Recompute the two updated centroids. 2. Explain why the hands_on_labs do not change in round 2. 3. Say in one line why DBSCAN would be better for curved clusters. Then sketch from memory: 1. The assign → update → repeat loop. 2. A curved dense chain with one stray noise point. 3. A tiny dendrogram with a cut line. If you can do all three in 90 seconds, you own clustering.

Bridge. You now have the full classical ML toolkit — supervised and unsupervised. But classical ML has limits no algorithm can fix — unstructured data, automatic feature learning, and problems where the feature list is really an image or a sentence. The next file is the honest admission. Read 20-honest-admission.md next.