Skip to content

Beam Search — Analysis

The algorithm

  1. Start with one beam: (<bos>,), log_prob 0.
  2. Repeat until all beams have emitted <eos> or hit max_len:
  3. For each active beam, query score_fn(beam.tokens) for next-token log probs.
  4. Expand: every beam × every continuation = candidate set.
  5. Sort candidates by cumulative log prob (raw, not normalised).
  6. Take the top beam_width candidates; beams ending in <eos> move to finished.
  7. Rank finished beams by length-normalised score; return top beam_width.

Why length normalisation matters

Without normalisation, every additional token adds a negative log prob (log(p) < 0). Short sequences win the raw-score race because they accumulate less negative mass. Beam search would always prefer <bos> <eos> (one short sequence) over any longer beam — which defeats the purpose.

The standard length penalty is L^alpha in the denominator:

normalized = log_prob / length^alpha
  • alpha = 0 → no normalisation; short beams dominate.
  • alpha = 1 → average log prob per token; fully length-fair.
  • alpha = 0.7 → the GNMT default; mild preference for length, the empirical sweet spot.

Run output

Top 3 beams (length-normalized, alpha=0.7):
  norm=-0.773  logp=-3.600  tokens=<bos> a t a t a t a t
  norm=-0.795  logp=-3.700  tokens=<bos> a t a t e t a t
  norm=-0.795  logp=-3.700  tokens=<bos> a t a n a t a t

Same search, alpha=0.0:
  logp=-2.000  tokens=<bos> <eos>
  logp=-2.200  tokens=<bos> a t a t a
  logp=-2.300  tokens=<bos> a t a t e

The contrast is sharp. With alpha=0.7, the top beam is 9 tokens long (vowel-consonant pattern, matching the scorer's preference). With alpha=0.0, the top beam is 2 tokens — just <bos> <eos>. Same scorer, same search, dramatically different output.

Beam width trade-off

  • beam_width = 1 → greedy; one beam, one path. Fast; no recovery from a bad token choice.
  • beam_width = 4-8 → typical for translation, summarization, captioning. The empirically-tuned sweet spot.
  • beam_width = 32+ → diminishing returns; computationally expensive.

Production note: very large beam widths can produce worse outputs because beam search promotes high-probability but generic continuations. This is why nucleus sampling (Exercise 08) has replaced beam search for open-ended generation in many domains.

When beam search is right

  • Translation. The output has a clear "best" sequence; beam search converges to it.
  • Constrained generation. When the search space is small and you want the best path.
  • Caption generation, structured output. Where coherence matters more than diversity.

When beam search is wrong

  • Open-ended generation (chat, creative writing). Beam search produces bland, "safe" outputs that maximise log prob but lack interesting variation. Sampling (with temperature or nucleus) wins here.
  • Long outputs where beam search degenerates — paths converge to similar text; diversity collapses.

Common bugs the tests catch

  • Forgetting to normalise length. Rankings favour <eos> immediately after <bos>.
  • Selecting on normalised score during search instead of raw. Causes incorrect pruning; the in-progress beams haven't reached their final length yet.
  • Mixing finished and active beams. Active beams keep growing; finished beams don't.
  • Not handling max_len. Beams that don't emit <eos> need to be promoted to finished at the end.
  • Forgetting to expand every beam at every step. Easy to skip a beam if logic is off.

What this exercise teaches

  • Search algorithms with priority queues and pruning.
  • Length normalisation as the structural defence against short-beam bias.
  • The bridge between deterministic decoding (greedy, beam) and stochastic decoding (temperature, top-p).

Interview probes:

  • "Why does beam search favour short sequences without normalisation?"
  • "Why isn't beam width = vocabulary size always best?"
  • "When would you choose nucleus sampling over beam search?"
  • "How would you parallelise beam expansion across beams?"
  • "What is the time complexity of one beam search step?" (O(B × V × log(B × V)) with sorting; O(B × V) with a heap.)