Sampling Strategies — Analysis¶

The strategies, one sentence each¶

Greedy. Argmax. Deterministic. No exploration.
Temperature sampling. Divide logits by T, softmax, multinomial sample. Higher T → flatter distribution → more exploration.
Top-k. Keep the k highest-logit tokens; sample within them. Caps the candidate set size regardless of distribution shape.
Top-p (nucleus). Keep the smallest set of tokens whose cumulative probability ≥ p; sample within. Adapts to distribution shape.

Why nucleus is preferred when entropy varies¶

A high-confidence step (one or two tokens dominate the softmax) has a sharp distribution. Top-k=50 here lets in 48 garbage tokens that have effectively zero probability — wasteful, occasionally damaging on long-tail sampling.

A low-confidence step (many roughly-equal tokens) has a flat distribution. Top-k=50 here is reasonable. Top-p=0.9 here keeps the high-mass set, which may legitimately be 30-100 tokens.

Top-p adapts: at high confidence the nucleus is small (often just 1-3 tokens); at low confidence the nucleus grows. Top-k is one-size-fits-all; top-p is one-rule-fits-many.

Combining filters¶

Production stacks typically apply top-k first (hard cap on tail), then top-p (mass cap within the cap), then temperature. The order matters because top-p computes its cumulative mass over whatever survives top-k.

In sample():

Apply top-k filter → masks long tail to -inf.
Apply top-p filter on softmax of surviving logits → masks low-mass tokens to -inf.
Apply temperature → softmax → multinomial.

If both k=10 and p=0.95 are set, the candidate set is the intersection.

Numerical stability¶

softmax(x) = exp(x - max(x)) / sum(exp(x - max(x))). The subtraction of max(x) prevents overflow on large logits. Without it, exp(1000) overflows. The test test_softmax_is_numerically_stable proves the implementation handles logits in the thousands without overflow.

Edge cases handled¶

T <= 0 → fall back to greedy (deterministic).
k >= vocab_size → no filter; falls back to plain temperature sampling.
k == 1 → always returns argmax (degenerate top-k).
p == 1.0 → keeps everything; equivalent to plain temperature.
p very small → may keep only one token (the argmax) if it dominates.

Common bugs the tests catch¶

Off-by-one in searchsorted: forgetting + 1 makes top-p exclude the boundary token.
Filtering on raw logits when the user intended filtering on softmax probabilities.
Mutating the input array instead of copying.
Forgetting to renormalise after masking → multinomial sampling with non-normalized probabilities.

What this exercise teaches¶

The math is short. The discipline is in the edge cases and the combination semantics. Live-coding interviewers usually push on:

"What if T = 0?"
"What if k > vocab_size?"
"What if all probability mass is in one token — does top-p still work?"
"Why might top-p be preferred over top-k for production?"
"Show me on the whiteboard how you'd compute the nucleus from sorted softmax probs."

Each of those is one or two lines of code if you've internalised the structure. The test suite mirrors what an interviewer would probe.