Beam Search — Analysis¶
The algorithm¶
- Start with one beam:
(<bos>,), log_prob 0. - Repeat until all beams have emitted
<eos>or hitmax_len: - For each active beam, query
score_fn(beam.tokens)for next-token log probs. - Expand: every beam × every continuation = candidate set.
- Sort candidates by cumulative log prob (raw, not normalised).
- Take the top
beam_widthcandidates; beams ending in<eos>move tofinished. - Rank finished beams by length-normalised score; return top
beam_width.
Why length normalisation matters¶
Without normalisation, every additional token adds a negative log prob (log(p) < 0). Short sequences win the raw-score race because they accumulate less negative mass. Beam search would always prefer <bos> <eos> (one short sequence) over any longer beam — which defeats the purpose.
The standard length penalty is L^alpha in the denominator:
alpha = 0→ no normalisation; short beams dominate.alpha = 1→ average log prob per token; fully length-fair.alpha = 0.7→ the GNMT default; mild preference for length, the empirical sweet spot.
Run output¶
Top 3 beams (length-normalized, alpha=0.7):
norm=-0.773 logp=-3.600 tokens=<bos> a t a t a t a t
norm=-0.795 logp=-3.700 tokens=<bos> a t a t e t a t
norm=-0.795 logp=-3.700 tokens=<bos> a t a n a t a t
Same search, alpha=0.0:
logp=-2.000 tokens=<bos> <eos>
logp=-2.200 tokens=<bos> a t a t a
logp=-2.300 tokens=<bos> a t a t e
The contrast is sharp. With alpha=0.7, the top beam is 9 tokens long (vowel-consonant pattern, matching the scorer's preference). With alpha=0.0, the top beam is 2 tokens — just <bos> <eos>. Same scorer, same search, dramatically different output.
Beam width trade-off¶
beam_width = 1→ greedy; one beam, one path. Fast; no recovery from a bad token choice.beam_width = 4-8→ typical for translation, summarization, captioning. The empirically-tuned sweet spot.beam_width = 32+→ diminishing returns; computationally expensive.
Production note: very large beam widths can produce worse outputs because beam search promotes high-probability but generic continuations. This is why nucleus sampling (Exercise 08) has replaced beam search for open-ended generation in many domains.
When beam search is right¶
- Translation. The output has a clear "best" sequence; beam search converges to it.
- Constrained generation. When the search space is small and you want the best path.
- Caption generation, structured output. Where coherence matters more than diversity.
When beam search is wrong¶
- Open-ended generation (chat, creative writing). Beam search produces bland, "safe" outputs that maximise log prob but lack interesting variation. Sampling (with temperature or nucleus) wins here.
- Long outputs where beam search degenerates — paths converge to similar text; diversity collapses.
Common bugs the tests catch¶
- Forgetting to normalise length. Rankings favour
<eos>immediately after<bos>. - Selecting on normalised score during search instead of raw. Causes incorrect pruning; the in-progress beams haven't reached their final length yet.
- Mixing finished and active beams. Active beams keep growing; finished beams don't.
- Not handling
max_len. Beams that don't emit<eos>need to be promoted to finished at the end. - Forgetting to expand every beam at every step. Easy to skip a beam if logic is off.
What this exercise teaches¶
- Search algorithms with priority queues and pruning.
- Length normalisation as the structural defence against short-beam bias.
- The bridge between deterministic decoding (greedy, beam) and stochastic decoding (temperature, top-p).
Interview probes:
- "Why does beam search favour short sequences without normalisation?"
- "Why isn't beam width = vocabulary size always best?"
- "When would you choose nucleus sampling over beam search?"
- "How would you parallelise beam expansion across beams?"
- "What is the time complexity of one beam search step?" (O(B × V × log(B × V)) with sorting; O(B × V) with a heap.)