
07. Temperature and sampling — the creativity dial is real

~12 min read. Prompt quality matters, but decoding settings still decide how strictly the model follows one path.

Built on the ELI5 in 00-eli5.md. The Creativity dial — how strict or exploratory the contractor should be — changes variety, risk, and repeatability.


What temperature is doing

See. The model does not pick words with certainty. It produces probabilities over next tokens. Sampling decides how we choose from that distribution. Temperature rescales the model's raw scores (logits) before they become probabilities, which makes the distribution sharper or flatter. Low temperature makes the high-probability tokens dominate more. High temperature spreads probability mass more widely. That invites variety. It also invites drift.

Picture first.

low temperature                    high temperature
┌────────────────────┐             ┌────────────────────┐
│ top token: 0.70    │             │ top token: 0.40    │
│ next token: 0.20   │             │ next token: 0.25   │
│ rest: tiny         │             │ many tokens alive  │
└─────────┬──────────┘             └─────────┬──────────┘
          ▼                                  ▼
   repeatable wording                  varied wording

Simple, no? A strong prompt plus low temperature often feels disciplined. A strong prompt plus high temperature feels inventive. A weak prompt plus high temperature feels chaotic. This is why decoding cannot rescue bad prompt design. It only changes how the model explores the space your prompt created.
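To make the sharpening concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are made-up numbers for four hypothetical tokens, and real inference stacks do this on tensors, so treat it as an illustration of the arithmetic, not of any particular API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is usually handled as greedy argmax, not as a division.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.0]
low = softmax_with_temperature(logits, 0.5)   # sharper: top token dominates
high = softmax_with_temperature(logits, 1.5)  # flatter: tail tokens stay alive

for name, dist in (("low", low), ("high", high)):
    print(name, [round(p, 3) for p in dist])
```

Note that the order of the tokens is the same in both lists. Only the gaps between them change.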

Now what is the problem? Teams sometimes blame prompt wording for behavior that is really a sampling issue. Or they set temperature to zero, then wonder why brainstorming feels stale. The Creativity dial must match the job.

Temperature, top_p, and top_k

Temperature is only one knob. Top_p keeps the smallest token set whose cumulative probability crosses a threshold. Top_k keeps only the top k candidate tokens. All three change exploration, but in different ways.

full token list
┌────────────────────────────────────┐
│ t1 t2 t3 t4 t5 t6 t7 t8 ...        │
└────────────────────────────────────┘
    │                 │
    ├── top_k=3 ──→ keep t1 t2 t3 only
    └── top_p=0.9 ─→ keep tokens until total mass reaches 0.9

Look. Lowering temperature sharpens the distribution but never reorders it: the top token stays on top, it just wins more often. Top_k truncates the tail by count. Top_p truncates the tail by probability mass. Different model APIs expose different combinations. If you have only temperature, use it carefully. If you have top_p too, remember both knobs interact. Too much freedom on both can create wild drift. Too much restriction on both can make output robotic or repetitive.
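The two truncation rules can be sketched in a few lines. This is a toy version over a dictionary of token probabilities, with hypothetical token names and values. Production samplers work on logits and differ in details such as tie handling.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, then renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}

# Hypothetical distributions: one spread out, one sharply peaked.
spread = {"t1": 0.5, "t2": 0.3, "t3": 0.15, "t4": 0.05}
peaked = {"a": 0.95, "b": 0.03, "c": 0.02}

print(sorted(top_k_filter(spread, 3)))    # always three tokens
print(sorted(top_p_filter(spread, 0.7)))  # as many as needed to reach 0.7
print(sorted(top_k_filter(peaked, 3)))    # still three tokens
print(sorted(top_p_filter(peaked, 0.9)))  # shrinks when the model is confident
```

The design difference shows up on the peaked distribution. Top_k keeps a fixed count no matter how confident the model is. Top_p adapts the count to the shape of the distribution.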

For many production tasks, teams keep top_p near its default, then tune temperature first. That is often simpler. For creative generation, a modest temperature increase may be enough. For deterministic extraction, keep the Creativity dial low and the Reply form strict.

Choosing settings by task

Now let us map task to decoding. Customer-support policy answers usually want low creativity. Classification wants very low creativity. Code transformation wants low to moderate creativity, depending on whether you want preservation or exploration. Brainstorming slogans or product names can tolerate higher creativity. So what to do? Start from the task's failure cost.

high-cost error tasks                 high-variety tasks
┌──────────────────────┐             ┌──────────────────────┐
│ support policy       │             │ campaign ideas       │
│ extraction           │             │ naming concepts      │
│ code migration       │             │ writing prompts      │
└──────────┬───────────┘             └──────────┬───────────┘
           ▼                                    ▼
     lower temperature                    higher temperature

If the user sees one answer and trusts it, keep creativity lower. If the user is selecting from options, you can allow more diversity. If a parser will consume the answer, keep creativity low and structure high. If the goal is idea generation, ask for multiple candidates and allow more spread. Simple, no? Risk tolerance should set the dial.
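The rules of thumb above can be written down as a small lookup. This is only a sketch: `TaskProfile`, `suggest_temperature`, and the specific numbers are hypothetical placeholders, not recommended values for any model. Only the ordering between them is the point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskProfile:
    parser_facing: bool          # a program consumes the reply
    single_trusted_answer: bool  # the user sees one answer and trusts it
    idea_generation: bool        # the user picks from several candidates

def suggest_temperature(task: TaskProfile) -> float:
    """Return a starting point for tuning. Values are illustrative assumptions."""
    if task.parser_facing:
        return 0.0   # strictest: structure matters more than wording
    if task.single_trusted_answer:
        return 0.2   # low: a wrong answer is costly
    if task.idea_generation:
        return 0.9   # higher: variety is the goal
    return 0.5       # middle ground when nothing else applies

extraction = TaskProfile(parser_facing=True, single_trusted_answer=False, idea_generation=False)
support = TaskProfile(parser_facing=False, single_trusted_answer=True, idea_generation=False)
naming = TaskProfile(parser_facing=False, single_trusted_answer=False, idea_generation=True)

for label, task in (("extraction", extraction), ("support", support), ("naming", naming)):
    print(label, suggest_temperature(task))
```

The checks run from highest failure cost to lowest, so a task that is both parser-facing and creative still gets the strict setting.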

One more subtlety. Sampling changes more than wording. It can change reasoning path, citation choice, and even whether the model obeys soft instructions. That is why prompt A/B tests must control decoding settings. Otherwise you are not testing only the prompt.

Worked example — same prompt, different dial

Suppose you want subject lines for a billing email. Prompt:

You are a SaaS billing copy assistant.
Write three subject lines for an email about failed card renewal.
Keep them clear, calm, and under 8 words.

Possible low-temperature output.

1. Update your payment method
2. Action needed: payment failed
3. Renew your plan successfully

Possible higher-temperature output.

1. Your renewal needs a quick fix
2. Payment failed — keep service active
3. Let us get your plan back on track

See the difference. The low-temperature set is safer and plainer. The higher-temperature set explores more phrasing. Neither is automatically better. It depends on the goal.

Now a high-risk example. Prompt:

Classify this message as billing, bug, feature_request, or account_access.
Return the label only.
User: I was charged twice after renewal.

Low-temperature output.

billing

Higher-temperature output could still be correct. But it increases the chance of weird extras like:

billing — this seems related to duplicate payment

That tiny flourish is enough to break a parser. So what to do? For classification, keep the Creativity dial low. Use a strict Reply form. Do not pay extra for useless variety.
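A strict Reply form is easiest to enforce in code. Here is a minimal sketch of a validator for the four labels in the prompt above. The function name `parse_label` is hypothetical, and a real pipeline might retry or fall back instead of raising.

```python
ALLOWED_LABELS = {"billing", "bug", "feature_request", "account_access"}

def parse_label(raw_reply: str) -> str:
    """Accept the reply only if it is exactly one allowed label."""
    label = raw_reply.strip()  # tolerate surrounding whitespace, nothing else
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected reply: {raw_reply!r}")
    return label

print(parse_label("billing\n"))

try:
    parse_label("billing - this seems related to duplicate payment")
except ValueError as err:
    print("rejected:", err)
```

Failing loudly on the flourish is deliberate. A silent prefix match would hide exactly the drift you want to measure.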

Sampling is part of prompt engineering

Some people separate prompt design and decoding design. In practice, they are married. A prompt defines the desired behavior region. Sampling defines how much wandering is allowed inside that region. You need both.

The Revision ledger should record temperature, top_p, and top_k alongside prompt text. Otherwise a future engineer will change the prompt, forget the dial changed too, and misread the experiment. This happens often. Write the full inference config down.
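One way to keep prompt and dial together is to store them in a single record and fingerprint the whole thing. The sketch below assumes a hypothetical `LedgerEntry` shape, and the model name is a placeholder. Adapt the fields to whatever your inference API actually exposes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class LedgerEntry:
    prompt_text: str
    model: str
    temperature: float
    top_p: Optional[float] = None  # None means "left at the API default"
    top_k: Optional[int] = None
    seed: Optional[int] = None

    def fingerprint(self) -> str:
        """Short hash over prompt AND decoding config, so either change is visible."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

prompt = "Classify this message as billing, bug, feature_request, or account_access."
run_a = LedgerEntry(prompt, model="example-model", temperature=0.0)
run_b = LedgerEntry(prompt, model="example-model", temperature=0.7)

# Same prompt text, different dial: the fingerprints differ.
print(run_a.fingerprint() != run_b.fingerprint())
```

If two experiment rows share a fingerprint, they used the same prompt and the same dial. If they differ, the diff tells you which one moved.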

Also, measure repeatability. If your workflow needs consistent output, run the same prompt multiple times. Count how many distinct outputs you get and how often the most common one appears. A beautiful answer that appears once and disappears later is not a stable product behavior.
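A repeatability check can be very small. The sketch below takes any `generate(prompt)` callable, which is an assumption standing in for your real model call, and reports how often the runs agree. The fake model here is a seeded random stub so the example runs without an API.

```python
import random
from collections import Counter

def repeatability(generate, prompt, runs=20):
    """Call generate(prompt) several times and summarize how much the outputs agree."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    outputs = Counter(generate(prompt) for _ in range(runs))
    modal_output, count = outputs.most_common(1)[0]
    return {
        "distinct": len(outputs),      # number of different outputs seen
        "agreement": count / runs,     # share of runs matching the most common output
        "modal_output": modal_output,
    }

def make_fake_model(seed=0):
    """Stub standing in for a sampled model call; mostly stable, sometimes chatty."""
    rng = random.Random(seed)
    def generate(prompt):
        return rng.choice(["billing", "billing", "billing", "billing (duplicate payment)"])
    return generate

report = repeatability(make_fake_model(), "I was charged twice after renewal.")
print(report)
```

For a parser-facing task you would want `distinct` to be 1 and `agreement` to be 1.0. For brainstorming, a low agreement score may be exactly what you asked for.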


Where this lives in the wild

  • Customer-support bots on OpenAI, Claude, or Bedrock — operations teams keep temperature low because grounded policy answers must be repeatable and boring in the best way.
  • Marketing content tools in Jasper or Copy.ai — content designers often raise temperature or sample multiple candidates because variety itself is part of the product value.
  • GitHub Copilot code transformations — lower creativity is preferred for refactors or fixups, while suggestion-generation features can tolerate more exploration.
  • Perplexity or enterprise search answers — grounded response pipelines usually keep sampling conservative so citations and factual phrasing do not drift too much run to run.
  • A/B prompt experiment platforms — ML engineers log temperature, top_p, and seed-like controls because uncontrolled decoding can fake prompt wins or losses.

Pause and recall

  • What does lower temperature usually do to the token distribution?
  • How is top_p different from top_k?
  • Why should task risk, not taste, decide the creativity dial?
  • Why must prompt experiments record decoding settings too?

Interview Q&A

Q: Why can a strong prompt still produce unstable behavior at high temperature? A: Because the prompt narrows the target region, but high-temperature sampling still allows broader exploration within that region and sometimes beyond its softer boundaries.

Common wrong answer to avoid: "A good prompt makes decoding settings irrelevant." Prompt quality helps, but decoding still affects variation and compliance.

Q: Why is temperature usually tuned before top_p or top_k in many production systems? A: Temperature is often the simplest and most interpretable control for strictness versus variety. Extra knobs add interaction effects that can complicate debugging.

Common wrong answer to avoid: "Because top_p and top_k are obsolete." They are not obsolete. They are just additional controls.

Q: Why should classification and extraction tasks usually use low creativity settings? A: These tasks value repeatable label selection and parser-safe output more than linguistic variety. Extra exploration buys little and can break integrations.

Common wrong answer to avoid: "Because the model becomes smarter at low temperature." It becomes more deterministic, not more intelligent.

Q: Why must sampling settings be versioned with prompts? A: Because output quality is a function of both instruction text and decoding behavior. Without both, experiment results are not reproducible.

Common wrong answer to avoid: "The prompt is the only thing that matters." In practice, inference config matters too.


Apply now (5 min)

Exercise. Take one prompt from your domain. Decide whether the task is high-risk, parser-facing, or creative. Then choose a low, medium, or high Creativity dial setting and defend it in one sentence. Add whether top_p or top_k should stay default.

Sketch from memory. Draw two token hills. Make one sharp for low temperature. Make one flatter for high temperature. Write, "Prompt defines the region. Sampling defines the wandering."


Bridge. A single prompt can do only so much. When the job has multiple stages, we should stop forcing one giant instruction and instead chain smaller prompts together. → 08-prompt-chaining.md