18. PCA and Dimensionality Reduction — rotate the data, keep the signal¶
Seven minutes. Too many columns. Too much empty space. We turn the cloud, keep the strongest directions, and throw away the sleepy ones.
Built on the ELI5 in 00-eli5.md. The feature list — when it becomes too long, repetitive, and noisy — needs compression before the model can learn effectively. PCA is the cleanest compression picture.
The picture before math¶
See. Suppose the dataset has 200 features. Many tell the same story. Age and years-in-workforce. Height and arm length. Clicks-last-week and clicks-last-month. So the feature list gets wide, but not truly informative.
Now imagine the data cloud in space. It is not spread equally in every direction. Usually it is stretched strongly along a few directions.
PCA says: "Turn the axes. Align them with the longest stretches. Keep the main directions. Drop the sleepy ones."
[Figure: a tilted data cloud shown with the original axes (x, y) and the rotated axes (PC1 along the longest stretch of the cloud, PC2 across it).]
Why high-dimensional spaces are awkward¶
Now what is the problem? High-D space is mostly empty. Distances become strange. Neighbors become less meaningful. Data needs far more samples to fill the space. This is the curse of dimensionality.
A small numeric picture helps.
- In 1D, ten points can cover a line roughly okay.
- In 10D, ten points are almost nothing.
- In 100D, the space is a desert.
So what to do? Either choose fewer features. Or compress them into fewer directions. PCA is the classical feature-extraction answer.
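The "distances become strange" claim can be checked with a few lines. This is a minimal sketch, assuming points drawn uniformly from the unit cube; the helper name `distance_contrast` and the sample size are illustrative choices, not something the text prescribes. It measures how close the nearest neighbor is compared to the farthest one.

```python
import math
import random

def distance_contrast(dim, n_points=500, seed=0):
    """Ratio of nearest to farthest distance from one random query point.

    Near 0: the nearest neighbor is much closer than the farthest (useful).
    Near 1: all points are about equally far away (neighbors mean little).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

for dim in (1, 10, 100):
    print(dim, round(distance_contrast(dim), 3))
```

As the dimension grows, the ratio climbs toward 1. The nearest and the farthest point start to look alike, which is exactly why neighbor-based methods struggle in raw high-D space.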
PCA in plain steps¶
PCA does four things.
1. Center the data.
2. Factor the centered matrix with SVD: X = UΣVᵀ.
3. Read the columns of V as the principal components.
4. Project onto the top columns of V.
Read that geometrically.
- Centering moves the cloud to the origin.
- V gives the rotated axes.
- Σ² / (n-1) gives the eigenvalues of the covariance matrix, that is, the variance along each component.
- sklearn.decomposition.PCA uses SVD internally by default because it avoids forming the covariance matrix first and is numerically more stable.
So PCA is not magic. It is a rotation plus compression. Simple, no?
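The four steps can be sketched directly in NumPy. This is a minimal illustration, not how any particular library implements it; the function name `pca_svd` and the toy data are assumptions made for the example.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered data matrix.

    Returns the projected data, the top-k components (as rows),
    and the variance along every component.
    """
    X_centered = X - X.mean(axis=0)                              # step 1: center
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)    # step 2: X = U S V^T
    components = Vt[:k]                                          # step 3: rows of Vt are columns of V
    explained_variance = S**2 / (X.shape[0] - 1)                 # eigenvalues of the covariance
    Z = X_centered @ components.T                                # step 4: project onto top-k
    return Z, components, explained_variance

# Toy data (hypothetical): feature 1 is nearly a copy of feature 0,
# feature 2 is small independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               base + 0.1 * rng.normal(size=(200, 1)),
               0.2 * rng.normal(size=(200, 1))])

Z, components, explained_variance = pca_svd(X, k=2)
print(Z.shape)
print(np.round(explained_variance / explained_variance.sum(), 3))
```

The sample variance of each projected column equals the matching entry of `explained_variance`, which is the "Σ² / (n-1)" reading in code form.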
Worked example — 3D to 2D PCA with a covariance matrix¶
Suppose the centered data gives this covariance matrix:
Read the picture first. Feature 1 and feature 2 move together. That is why the covariance 2.0 appears off-diagonal.
Feature 3 barely varies.
That is why its variance is only 0.2.
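To make the steps that follow concrete, here is a sketch with an assumed matrix. Only the 2.0 covariance between features 1 and 2 and the 0.2 variance of feature 3 come from the text; the diagonal entries 4.0 and 3.0 are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical covariance matrix (diagonal entries 4.0 and 3.0 are assumed).
cov = np.array([[4.0, 2.0, 0.0],
                [2.0, 3.0, 0.0],
                [0.0, 0.0, 0.2]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # re-sort: largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

ratio = eigenvalues / eigenvalues.sum()        # explained variance ratio
print(np.round(eigenvalues, 3))                # roughly 5.562, 1.438, 0.2
print(np.round(ratio, 3))
print(round(float(ratio[:2].sum()), 3))        # share kept by PC1 + PC2, about 0.972
```

With these assumed numbers, the first two directions carry about 97% of the variance, so the third one is the sleepy direction.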
Step 1 — find the eigenvalues¶
For the top-left 2 × 2 block, the eigenvalues are:
Step 2 — compute explained variance ratio¶
So keeping only the first two principal components preserves:
The third direction is sleepy. We can drop it.
Step 3 — use the top two eigenvectors¶
Approximate eigenvectors are:
So PC1 is mostly a mix of feature 1 and feature 2. PC2 is the orthogonal leftover direction inside that same plane. PC3 is almost just feature 3.
Step 4 — project one point¶
Take one centered point:
Project onto the first two components:
We drop z3.
So the 3D point becomes the 2D point:
That is PCA.
The model kept almost all useful variation while shrinking the chart.
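The projection step can be sketched the same way. Everything numeric here is an assumption for illustration: the covariance matrix uses hypothetical diagonal entries (4.0 and 3.0), and the centered point `x` is made up. Eigenvector signs are arbitrary, so the printed 2D coordinates may flip sign between runs or libraries; the reconstruction does not.

```python
import numpy as np

# Hypothetical covariance matrix (diagonal entries are assumed values).
cov = np.array([[4.0, 2.0, 0.0],
                [2.0, 3.0, 0.0],
                [0.0, 0.0, 0.2]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)       # ascending order
top_idx = np.argsort(eigenvalues)[::-1][:2]           # indices of the two largest
top2 = eigenvectors[:, top_idx]                       # columns are PC1 and PC2

x = np.array([2.0, 1.0, 0.1])     # hypothetical centered 3D point
z = top2.T @ x                    # its 2D coordinates in the PC1/PC2 plane
x_back = top2 @ z                 # map the 2D point back into 3D

print(np.round(z, 3))
print(np.round(x_back, 3))        # the small third coordinate is gone
```

Mapping back shows what was lost: only the small component along the dropped direction.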
Eigenvalues = variance explained¶
This is the interview line. The eigenvalue of a principal component tells how much variance lies along that direction. Big eigenvalue? Important direction. Tiny eigenvalue? Mostly noise or redundancy. So PCA sorts directions by usefulness. Not by label relevance. That last sentence matters. PCA is unsupervised. It keeps variance. It does not know what helps prediction.
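The interview line has a one-line justification. A short derivation, assuming $X$ is the centered $n \times d$ data matrix, $C = \frac{1}{n-1} X^\top X$ is its covariance matrix, and $v$ is a unit-length eigenvector of $C$ with eigenvalue $\lambda$ (these symbols are introduced here for illustration):

```latex
\operatorname{Var}(Xv)
  = \frac{1}{n-1}\,(Xv)^\top (Xv)
  = v^\top C\, v
  = v^\top (\lambda v)
  = \lambda\, v^\top v
  = \lambda
```

So the variance of the data projected onto a principal direction is exactly that direction's eigenvalue.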
Scree plot — where to cut¶
A scree plot shows eigenvalues from largest to smallest.
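A scree plot needs no plotting library to get the idea across. The sketch below prints one text bar per component; the helper name `text_scree` and the three eigenvalues are hypothetical, chosen only to show the shape.

```python
def text_scree(eigenvalues, width=40):
    """Return one text bar per component, scaled to the largest eigenvalue."""
    ordered = sorted(eigenvalues, reverse=True)     # largest first
    top = ordered[0]
    lines = []
    for i, value in enumerate(ordered, start=1):
        bar = "#" * max(1, round(width * value / top))
        lines.append(f"PC{i} {bar} {value:.2f}")
    return lines

# Hypothetical eigenvalues, for illustration only.
for line in text_scree([5.56, 1.44, 0.20]):
    print(line)
```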
You look for the elbow. After that point, extra components add little. In our example, the elbow is after component 2. So 2D is enough.
Feature selection vs feature extraction¶
Students mix these up. Do not.
Feature selection¶
Pick some original columns. Example: keep age, income, debt. Drop the rest. Columns stay human-readable.
Feature extraction¶
Build new columns from old ones. Example:
Now the new feature is a mixture. Harder to interpret. But often more compact. So:
- selection keeps old axes
- extraction rotates to new axes
PCA is feature extraction.
t-SNE and UMAP — for visualization only¶
These tools are powerful.
They make beautiful 2D plots.
But do not confuse them with PCA.
PCA is linear and stable.
It preserves global variance structure.
t-SNE and UMAP preserve local neighborhoods much more aggressively.
That is why clusters look separated.
That is also why distances and axes become hard to interpret.
So what is the rule?
- PCA → preprocessing, denoising, compression, linear structure
- t-SNE / UMAP → visualization and exploration
Do not feed a t-SNE plot as normal preprocessing into a production model and feel clever.
That is usually a mistake.
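One practical reason shows up in the API. A small check, assuming a scikit-learn installation; to my knowledge its `TSNE` class offers `fit_transform` but no `transform`, so there is no built-in way to place new, unseen rows into an existing embedding. Treat this as a sketch of the current behavior, not a guarantee for every version.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a fixed linear map, so it can transform rows it has never seen.
pca_can_transform_new_rows = hasattr(PCA, "transform")

# t-SNE optimizes positions for the fitted rows only.
tsne_can_transform_new_rows = hasattr(TSNE, "transform")

print(pca_can_transform_new_rows, tsne_can_transform_new_rows)
```

A production model must handle rows that arrive after training. PCA gives the same mapping every time; a t-SNE embedding has to be recomputed.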
When PCA helps¶
PCA helps when:
- features are correlated
- noise lives in low-variance directions
- you want compression before another model
- you want 2D or 3D visualization
- you need to speed up downstream methods like KNN
It is especially helpful before distance-based models. Because KNN in raw high-D space is often miserable.
When PCA hurts¶
It hurts when:
- the task signal lives in low-variance directions
- interpretability of original features is critical
- relationships are strongly nonlinear
- you forget to standardize and one large-scale feature dominates
PCA is not a universal upgrade. It is a geometric tool. Use it when the geometry matches.
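The first failure mode is easy to reproduce. A minimal sketch on synthetic data, where every number is an assumption for illustration: one loud feature carries no label information, one quiet feature carries all of it. Keeping only PC1 throws the signal away.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)

# Feature 0: large variance, unrelated to the label.
loud_noise = rng.normal(scale=10.0, size=n)
# Feature 1: small variance, but it separates the two classes.
quiet_signal = labels + rng.normal(scale=0.1, size=n)
X = np.column_stack([loud_noise, quiet_signal])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]     # the direction PCA ranks first
pc2_scores = Xc @ Vt[1]     # the direction PCA would drop when keeping one component

def class_gap(scores):
    """Gap between class means, in units of the overall standard deviation."""
    gap = scores[labels == 1].mean() - scores[labels == 0].mean()
    return abs(gap) / scores.std()

print(round(class_gap(pc1_scores), 2), round(class_gap(pc2_scores), 2))
```

The class gap along PC1 is close to zero, while along PC2 it is large. PCA ranked the directions by variance, not by usefulness for the label.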
Where this lives in the wild¶
- Eigenfaces in face recognition. Classical face pipelines compress thousands of pixel values into a small PCA basis before matching identities.
- Apple Watch and IoT sensors. Highly correlated accelerometer channels are often compressed before lightweight downstream modeling on-device.
- Goldman Sachs-style risk dashboards. PCA on correlated financial indicators reveals a few dominant latent market factors instead of hundreds of noisy columns.
- Manufacturing quality analytics at Siemens. Correlated sensor streams are reduced into a few dominant modes for anomaly dashboards and root-cause analysis.
- Single-cell genomics at Illumina-scale workflows. PCA is the standard first compression step before UMAP or clustering because raw gene space is enormous and sparse.
Interview Q&A¶
Q: What does PCA optimize?
A: It finds orthogonal directions of maximum variance and projects the data onto them. The first principal component captures the largest possible variance, the second captures the largest remaining variance subject to orthogonality, and so on. It is a compression objective, not a prediction objective.
Common wrong answer to avoid: "PCA finds the features most correlated with the label." It does not use the label at all.
Q: Why should features usually be standardized before PCA?
A: Because PCA is variance-driven. If one feature is measured on a huge numeric scale, it can dominate the covariance matrix even if it is not truly more informative. Standardization makes variance comparisons fair across columns.
Common wrong answer to avoid: "Standardization is optional because PCA rotates the data anyway." Rotation after a distorted scale is still distorted.
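A quick numeric check of this answer, using made-up columns (income in dollars, age in years; both distributions are assumptions for illustration). The sketch compares the first principal direction before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Hypothetical columns on very different numeric scales.
income = rng.normal(50_000, 15_000, size=n)
age = rng.normal(40, 10, size=n)
X = np.column_stack([income, age])

def first_component(data):
    """First principal direction of the data, via SVD of the centered matrix."""
    centered = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]

raw_pc1 = first_component(X)

# Standardize: zero mean and unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_pc1 = first_component(X_std)

print(np.round(np.abs(raw_pc1), 3))   # almost all weight on income
print(np.round(np.abs(std_pc1), 3))   # both columns weighted equally
```

On raw data, PC1 is essentially the income axis, purely because of its units. After standardization, neither column wins by scale alone.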
Q: Should I use t-SNE or UMAP as normal preprocessing before my classifier?
A: Usually no. They are mainly for visualization because they distort global geometry and are not designed to preserve predictive structure for downstream models. Use PCA for stable compression; use t-SNE or UMAP to inspect patterns.
Common wrong answer to avoid: "UMAP always improves classification because it makes clusters visible." Visual separation on a 2D plot is not the same thing as reliable supervised preprocessing.
Q: What is the relationship between PCA and SVD?
A: PCA finds eigenvectors of the covariance matrix. SVD decomposes the centered data matrix directly, and the right singular vectors in V are the principal components. SVD is preferred computationally because you do not need to form the covariance matrix first.
Common wrong answer to avoid: "PCA and SVD are different algorithms." PCA is the goal; SVD is the most common implementation.
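The two routes can be compared side by side. A minimal sketch on random toy data (the data and variable names are assumptions for illustration): eigen-decompose the covariance matrix, take the SVD of the centered data, and check that they agree. Eigenvectors are only defined up to sign, so the comparison uses absolute dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigen-decomposition of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)                # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # flip to descending

# Route 2: SVD of the centered data matrix, no covariance matrix formed.
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

same_values = np.allclose(eigvals, S**2 / (n - 1))
# |v_i . v_i'| should be 1 for each matching pair of directions.
alignment = np.abs(np.sum(eigvecs * Vt.T, axis=0))
same_vectors = np.allclose(alignment, 1.0, atol=1e-6)

print(same_values, same_vectors)
```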
Apply now (5 min)¶
Take the covariance matrix from the worked example. Without notes, answer:
1. Which feature pair is strongly correlated?
2. Which component explains the least variance?
3. If you keep PC1 and PC2, what percent variance remains?
Then sketch from memory:
1. A tilted data cloud with old axes and rotated PC axes.
2. A tiny scree plot with a clear elbow after PC2.
3. One sentence: feature selection vs feature extraction.
If you can do all three in 90 seconds, you own PCA.
Bridge. PCA finds structure by compressing dimensions. But what if there are natural groups in the data with no labels at all? That is clustering. Read 19-clustering.md next.