Why Sampling Matters
Language models output probabilities for every possible next token. But how do we choose which token to actually generate? This is where sampling strategies come in.
Greedy vs. Sampling
Two Approaches
Greedy Decoding
Always pick the token with the highest probability
Pros:
• Deterministic
• High accuracy
Cons:
• Repetitive
• Boring outputs
• No creativity
Sampling
Randomly pick a token according to the model's probabilities
Pros:
• Creative
• Varied outputs
• Human-like
Cons:
• Random errors
• Inconsistent
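The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not a real decoder: the `probs` dictionary is a made-up toy distribution, and the helper names `greedy_pick` and `sample_pick` are hypothetical.

```python
import random

def greedy_pick(probs):
    """Return the token with the highest probability."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng=random):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy distribution, invented for illustration only.
probs = {" blue": 0.6, " grey": 0.3, " green": 0.1}

print(greedy_pick(probs))                            # same answer on every call
rng = random.Random(0)                               # seeded so runs are repeatable
print([sample_pick(probs, rng) for _ in range(5)])   # a mix that depends on the seed
```

Greedy decoding returns the same token every time, while sampling will occasionally pick the less likely options in proportion to their probability.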
Temperature: Controlling Randomness
Temperature is a hyperparameter that controls how "random" the sampling is. It scales the logits (raw model outputs) before applying softmax.
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
where z_i is the logit for token i and T is the temperature:
- T → 0: Becomes greedy (always pick highest)
- T = 1: Use model's natural probabilities
- T > 1: More random, flatter distribution
- T < 1: Less random, sharper distribution
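To make the effect of T concrete, here is a small sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration. T = 0 is rejected because dividing by zero is undefined; in practice that case is usually handled as plain greedy decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                       # hypothetical logits for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running this shows the top token's share shrinking as T grows, which is the "flatter distribution" described above.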
When to Use Different Temperatures
| Temperature | Use Case | Example |
|---|---|---|
| 0.0 - 0.3 | Code generation, factual Q&A | "Write a Python function to sort a list" |
| 0.3 - 0.7 | Balanced tasks, conversation | "Explain quantum computing" |
| 0.7 - 1.0 | Creative writing, brainstorming | "Write a story about a robot" |
| 1.0 - 1.5 | Highly creative, experimental | "Generate unusual business ideas" |
| > 1.5 | Wild creativity (often incoherent) | "Surrealist poetry" |
Top-k Sampling
One problem with pure sampling: the model might pick extremely unlikely tokens that lead to nonsense. Top-k sampling restricts sampling to only the k most likely tokens.
1. Sort all tokens by probability
2. Keep only the top k tokens
3. Renormalize probabilities to sum to 1
4. Sample from these k tokens
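The four steps above can be sketched as follows. This is a minimal version that works on a dictionary of probabilities; the name `top_k_filter` and the toy distribution are assumptions, and real implementations usually operate on logit tensors instead.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)  # step 1
    kept = ranked[:k]                                                       # step 2
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}                          # step 3

# Toy distribution, invented for illustration only.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)

# Step 4: sample from the surviving tokens.
rng = random.Random(0)
print(rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0])
```

With k=2 only "a" and "b" survive, and their probabilities are rescaled so they still sum to 1.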
Top-k Example
Given prompt: "The color of the sky is"
1. " blue" — 45%
2. " usually" — 15%
3. " often" — 12%
4. " typically" — 8%
5. " azure" — 5%
6. " clear" — 4%
7. " bright" — 3%
8. " cyan" — 2%
... (50,000 more tokens)
50000. " purple" — 0.0001%
With k=5: only consider " blue", " usually", " often", " typically", " azure" (probabilities renormalized to sum to 1)
With k=50: consider the top 50 tokens. More variety, but still filtered
Common values: k between 40 and 50 works well for many applications. A higher k gives more variety.
Top-p (Nucleus) Sampling
Top-k has a problem: the "right" k varies depending on the situation. Sometimes the top 5 tokens cover 95% of probability mass; sometimes you need top 50. Top-p sampling (also called nucleus sampling) solves this dynamically.
1. Sort tokens by probability
2. Keep the smallest set of tokens whose cumulative probability ≥ p
3. Renormalize the kept probabilities and sample from this "nucleus"
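The nucleus selection can be sketched as below. The function name `top_p_filter` and both toy distributions are assumptions made for illustration. The kept probabilities are renormalized so they can be sampled from directly.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:          # the nucleus is complete
            break
    return {token: prob / cumulative for token, prob in kept}

# Two toy distributions, invented for illustration only.
confident = {"a": 0.92, "b": 0.05, "c": 0.03}
uncertain = {"a": 0.30, "b": 0.25, "c": 0.25, "d": 0.20}

print(len(top_p_filter(confident, 0.9)))   # few tokens survive
print(len(top_p_filter(uncertain, 0.9)))   # more tokens survive
```

The same value of p yields a small nucleus for the peaked distribution and a larger one for the flat distribution, which a fixed k cannot do.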
Top-p Example
Same prompt: "The color of the sky is"
" blue" — 45% (cumulative: 45%)
" usually" — 15% (cumulative: 60%)
" often" — 12% (cumulative: 72%)
" typically" — 8% (cumulative: 80%) ← top-p=0.8 would stop here
" azure" — 5% (cumulative: 85%)
" clear" — 4% (cumulative: 89%)
" bright" — 3% (cumulative: 92%) ← top-p=0.9 would stop here
" cyan" — 2% (cumulative: 94%)
With top-p=0.5: Sample from only {" blue", " usually"} (cumulative 60% ≥ 50%)
Why top-p is better: It adapts to the model's confidence. When the model is confident (one token has 90% probability), top-p=0.9 keeps only that single token. When the model is uncertain, the nucleus might include 50 tokens or more.
Combining Strategies
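In practice, these strategies are often combined. A common order is to apply temperature first, then top-k, then top-p, and finally sample from whatever tokens remain.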
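One way to chain the three controls is sketched below. The order used here (temperature, then top-k, then top-p) is an assumption, as are the function name `sample_next_token`, its default values, and the toy logits; actual libraries differ in detail.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Temperature -> top-k -> top-p -> sample. `logits` maps token -> raw score."""
    if temperature <= 0:
        return max(logits, key=logits.get)       # treat T = 0 as greedy decoding

    # 1. Temperature-scaled softmax (max subtracted for numerical stability).
    scaled = {tok: z / temperature for tok, z in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # 2. Top-k: keep at most k candidates and renormalize them.
    if top_k is not None:
        ranked = ranked[:top_k]
        mass = sum(p for _, p in ranked)
        ranked = [(tok, p / mass) for tok, p in ranked]

    # 3. Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # 4. Sample; random.choices treats the weights as relative, so no
    #    further renormalization is needed.
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy logits, invented for illustration only.
logits = {" blue": 5.0, " grey": 2.0, " purple": -3.0}
print(sample_next_token(logits, rng=random.Random(0)))
```

Because each filter only removes candidates, the stricter of top-k and top-p decides how many tokens remain at any given step.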
Recommended Settings by Task
| Task | Temp | Top-p | Top-k |
|---|---|---|---|
| Code generation | 0.2 | 0.95 | 40 |
| Math/reasoning | 0.0 | 1.0 | — |
| General conversation | 0.7 | 0.9 | 50 |
| Creative writing | 0.9 | 0.95 | 100 |
| Brainstorming | 1.0 | 0.99 | 200 |
Other Sampling Parameters
Repetition Penalty
Discourages the model from repeating the same words. Works by reducing the probability of tokens that have already appeared in the output.
Frequency Penalty
Similar to repetition penalty but scales with how many times the token has appeared. Tokens that appear many times get increasingly penalized.
Presence Penalty
A fixed penalty applied to any token that has appeared at all (regardless of frequency). Encourages topic diversity.
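The three penalties can be sketched as adjustments to the logits before sampling. This is one common formulation, given here as an assumption: the repetition penalty divides positive logits and multiplies negative ones, while the frequency and presence penalties are subtracted. The function name `apply_penalties` and its parameter defaults are hypothetical, and real libraries and APIs vary.

```python
from collections import Counter

def apply_penalties(logits, generated, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return new logits with penalties applied to tokens already generated."""
    counts = Counter(generated)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        z = adjusted[token]
        # Repetition penalty: push the score down whatever its sign.
        z = z / repetition_penalty if z > 0 else z * repetition_penalty
        # Frequency penalty scales with the count; presence penalty is flat.
        z -= count * frequency_penalty + presence_penalty
        adjusted[token] = z
    return adjusted

# Toy logits and history, invented for illustration only.
logits = {"a": 2.0, "b": -1.0, "c": 0.5}
history = ["a", "a", "b"]
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.1))
```

Token "a" appeared twice, so it receives a larger frequency penalty than "b", while "c" is untouched because it has not appeared yet.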
Max Tokens
Hard limit on how many tokens to generate. Important for:
- Controlling costs (you pay per token)
- Preventing runaway generation
- Fitting within the context window
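A generation loop with a hard token limit might look like the sketch below. The `generate` function, the `next_token` callback, and the `<eos>` stop token are all hypothetical stand-ins for whatever a real model or API provides.

```python
def generate(next_token, prompt_tokens, max_tokens, stop_token="<eos>"):
    """Call `next_token` until it emits `stop_token` or `max_tokens` is reached."""
    output = []
    while len(output) < max_tokens:
        token = next_token(prompt_tokens + output)
        if token == stop_token:
            break
        output.append(token)
    return output

# Toy "model" that never stops on its own, to show the cap taking effect.
def looping_model(context):
    return " la"

print(generate(looping_model, ["Sing:"], max_tokens=5))
# → [' la', ' la', ' la', ' la', ' la']
```

Without the `max_tokens` check, this toy model would loop forever, which is the runaway generation the limit guards against.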
Practical Exercises
Exercise 1: Temperature Effects
For the prompt "The best programming language is", predict how the completions would differ at:
- Temperature = 0.1
- Temperature = 1.0
- Temperature = 2.0
Exercise 2: Top-k vs Top-p
Imagine a scenario where the model outputs probabilities: [0.5, 0.3, 0.1, 0.05, 0.05] for 5 tokens, and near-zero for all others. Which would be more restrictive: top-k=3 or top-p=0.9?
Exercise 3: Sampling Strategy Design
You're building a code completion tool. The user is writing a function definition. What sampling parameters would you use and why?