Beam Search — Analysis¶

The algorithm¶

Start with one beam: (<bos>,), log_prob 0.
Repeat until all beams have emitted <eos> or hit max_len:
For each active beam, query score_fn(beam.tokens) for next-token log probs.
Expand: every beam × every continuation = candidate set.
Sort candidates by cumulative log prob (raw, not normalised).
Take the top beam_width candidates; beams ending in <eos> move to finished.
Rank finished beams by length-normalised score; return top beam_width.

Why length normalisation matters¶

Without normalisation, every additional token adds a negative log prob (log(p) < 0). Short sequences win the raw-score race because they accumulate less negative mass. Beam search would always prefer <bos> <eos> (one short sequence) over any longer beam — which defeats the purpose.

The standard length penalty is L^alpha in the denominator:

normalized = log_prob / length^alpha

alpha = 0 → no normalisation; short beams dominate.
alpha = 1 → average log prob per token; fully length-fair.
alpha = 0.7 → the GNMT default; mild preference for length, the empirical sweet spot.

Run output¶

Top 3 beams (length-normalized, alpha=0.7):
  norm=-0.773  logp=-3.600  tokens=<bos> a t a t a t a t
  norm=-0.795  logp=-3.700  tokens=<bos> a t a t e t a t
  norm=-0.795  logp=-3.700  tokens=<bos> a t a n a t a t

Same search, alpha=0.0:
  logp=-2.000  tokens=<bos> <eos>
  logp=-2.200  tokens=<bos> a t a t a
  logp=-2.300  tokens=<bos> a t a t e

The contrast is sharp. With alpha=0.7, the top beam is 9 tokens long (vowel-consonant pattern, matching the scorer's preference). With alpha=0.0, the top beam is 2 tokens — just <bos> <eos>. Same scorer, same search, dramatically different output.

Beam width trade-off¶

beam_width = 1 → greedy; one beam, one path. Fast; no recovery from a bad token choice.
beam_width = 4-8 → typical for translation, summarization, captioning. The empirically-tuned sweet spot.
beam_width = 32+ → diminishing returns; computationally expensive.

Production note: very large beam widths can produce worse outputs because beam search promotes high-probability but generic continuations. This is why nucleus sampling (Exercise 08) has replaced beam search for open-ended generation in many domains.

When beam search is right¶

Translation. The output has a clear "best" sequence; beam search converges to it.
Constrained generation. When the search space is small and you want the best path.
Caption generation, structured output. Where coherence matters more than diversity.

When beam search is wrong¶

Open-ended generation (chat, creative writing). Beam search produces bland, "safe" outputs that maximise log prob but lack interesting variation. Sampling (with temperature or nucleus) wins here.
Long outputs where beam search degenerates — paths converge to similar text; diversity collapses.

Common bugs the tests catch¶

Forgetting to normalise length. Rankings favour <eos> immediately after <bos>.
Selecting on normalised score during search instead of raw. Causes incorrect pruning; the in-progress beams haven't reached their final length yet.
Mixing finished and active beams. Active beams keep growing; finished beams don't.
Not handling max_len. Beams that don't emit <eos> need to be promoted to finished at the end.
Forgetting to expand every beam at every step. Easy to skip a beam if logic is off.

What this exercise teaches¶

Search algorithms with priority queues and pruning.
Length normalisation as the structural defence against short-beam bias.
The bridge between deterministic decoding (greedy, beam) and stochastic decoding (temperature, top-p).

Interview probes:

"Why does beam search favour short sequences without normalisation?"
"Why isn't beam width = vocabulary size always best?"
"When would you choose nucleus sampling over beam search?"
"How would you parallelise beam expansion across beams?"
"What is the time complexity of one beam search step?" (O(B × V × log(B × V)) with sorting; O(B × V) with a heap.)