Lesson 5: Temperature and Sampling

Why Sampling Matters

Language models output probabilities for every possible next token. But how do we choose which token to actually generate? This is where sampling strategies come in.

        The Core Question: Given P("the") = 0.4, P("a") = 0.3, P("this") = 0.2, etc., 
        which token should we pick? Always pick the highest? Or sample randomly?
      

Greedy vs. Sampling

Two Approaches

Greedy Decoding

Always pick the token with highest probability

              Pros:

              • Deterministic

              • Often accurate when there is one clearly correct answer

              Cons:

              • Repetitive

              • Boring outputs

              • No creativity

Sampling

Randomly pick based on probabilities

              Pros:

              • Creative

              • Varied outputs

              • Human-like

              Cons:

              • Random errors

              • Inconsistent
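
The two approaches can be compared on the toy distribution from the core question. This is a minimal sketch: the `"<other>"` bucket standing in for the "etc." tokens, and the function names, are assumptions made for illustration.

```python
import random

# Toy next-token distribution; the leftover mass ("etc.") is lumped
# into a hypothetical "<other>" bucket so the weights sum to 1.
probs = {"the": 0.4, "a": 0.3, "this": 0.2, "<other>": 0.1}

def greedy_pick(probs):
    """Return the single most probable token (deterministic)."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so a run can be repeated
print(greedy_pick(probs))                           # always "the"
print([sample_pick(probs, rng) for _ in range(5)])  # depends on the seed
```

Greedy decoding returns the same token on every call, while the sampled list changes whenever the seed changes.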

Temperature: Controlling Randomness

Temperature is a hyperparameter that controls how "random" the sampling is. The logits (raw model outputs) are divided by the temperature T before softmax is applied.

P(x) = exp(logit(x) / T) / Σ exp(logit(i) / T)

Where T is temperature:

T → 0: Becomes greedy (always pick highest)
T = 1: Use model's natural probabilities
T > 1: More random, flatter distribution
T < 1: Less random, sharper distribution
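
The formula can be sketched in a few lines of Python. The logit values below are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / T) as a list of probabilities."""
    if temperature <= 0:
        # T = 0 would divide by zero; treat that case as greedy (argmax).
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As T shrinks, the top token's share of the probability grows; as T grows, the three probabilities move closer together.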

When to Use Different Temperatures

Temperature	Use Case	Example
0.0 - 0.3	Code generation, factual Q&A	"Write a Python function to sort a list"
0.3 - 0.7	Balanced tasks, conversation	"Explain quantum computing"
0.7 - 1.0	Creative writing, brainstorming	"Write a story about a robot"
1.0 - 1.5	Highly creative, experimental	"Generate unusual business ideas"
> 1.5	Wild creativity (often incoherent)	"Surrealist poetry"

Top-k Sampling

One problem with pure sampling: the model might pick extremely unlikely tokens that lead to nonsense. Top-k sampling restricts sampling to only the k most likely tokens.

        Top-k Algorithm:

        1. Sort all tokens by probability

        2. Keep only the top k tokens

        3. Renormalize probabilities to sum to 1

        4. Sample from these k tokens

Top-k Example

Given prompt: "The color of the sky is"

All tokens (sorted by probability):

            1. " blue" — 45%

            2. " usually" — 15%

            3. " often" — 12%

            4. " typically" — 8%

            5. " azure" — 5%

            6. " clear" — 4%

            7. " bright" — 3%

            8. " cyan" — 2%

            ... (the rest of a roughly 50,000-token vocabulary)

            50000. " purple" — 0.0001%

Top-k=5:
Only consider: blue, usually, often, typically, azure
(Renormalized probabilities)

Top-k=50:
Consider top 50 tokens
More variety, but still filtered
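
The filtering and renormalization steps can be sketched as follows, using the eight listed tokens from the example. The long tail of the vocabulary is left out, and the function name is an assumption.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# The eight listed tokens from the example; the long tail is omitted.
sky = {" blue": 0.45, " usually": 0.15, " often": 0.12, " typically": 0.08,
       " azure": 0.05, " clear": 0.04, " bright": 0.03, " cyan": 0.02}

filtered = top_k_filter(sky, k=5)
for token, p in filtered.items():
    print(f"{token!r}: {p:.3f}")
```

With k=5 the kept tokens originally cover 85% of the probability mass, so after renormalization " blue" rises from 45% to about 53%.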

Common values: k=40 to k=50 is a frequently used starting range. Higher k allows more variety; lower k keeps the output more focused.

Top-p (Nucleus) Sampling

Top-k has a problem: the "right" k varies depending on the situation. Sometimes the top 5 tokens cover 95% of probability mass; sometimes you need top 50. Top-p sampling (also called nucleus sampling) solves this dynamically.

        Top-p Algorithm:

        1. Sort tokens by probability

        2. Keep the smallest set of tokens whose cumulative probability ≥ p

        3. Renormalize the kept probabilities and sample from this "nucleus"

Top-p Example

Same prompt: "The color of the sky is"

Cumulative probabilities:

            " blue" — 45% (cumulative: 45%)

            " usually" — 15% (cumulative: 60%)

            " often" — 12% (cumulative: 72%)

            " typically" — 8% (cumulative: 80%) ← top-p=0.8 would stop here

            " azure" — 5% (cumulative: 85%)

            " clear" — 4% (cumulative: 89%)

            " bright" — 3% (cumulative: 92%) ← top-p=0.9 would stop here

            " cyan" — 2% (cumulative: 94%)

With top-p=0.9: Sample from {" blue", " usually", " often", " typically", " azure", " clear", " bright"}

With top-p=0.5: Sample from only {" blue", " usually"} (" blue" alone is 45% < 50%; adding " usually" gives 60% ≥ 50%)

Why top-p often works better: It adapts to the model's confidence. When the model is confident (one token has 90% probability), top-p=0.9 keeps only that single token. When the model is uncertain and the probability is spread thin, the nucleus might include 50 tokens or more.
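
A minimal sketch of the nucleus selection, run on the sky example and on a hypothetical "confident" distribution. The function name, the `confident` data, and the small `eps` tolerance (added to guard against floating-point rounding at the threshold) are assumptions.

```python
def top_p_filter(probs, p, eps=1e-9):
    """Keep the smallest top-ranked set with cumulative mass >= p, renormalized."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p - eps:  # eps absorbs float rounding error
            break
    return {token: prob / cumulative for token, prob in kept}

# The eight listed tokens from the example; the long tail is omitted.
sky = {" blue": 0.45, " usually": 0.15, " often": 0.12, " typically": 0.08,
       " azure": 0.05, " clear": 0.04, " bright": 0.03, " cyan": 0.02}

# Hypothetical distribution where the model is very sure of one token.
confident = {" Paris": 0.92, " the": 0.03, " a": 0.03, " located": 0.02}

print(len(top_p_filter(sky, 0.9)))        # 7
print(len(top_p_filter(confident, 0.9)))  # 1
```

The same p yields a nucleus of seven tokens for the spread-out distribution and a single token for the confident one, which is the adaptive behavior a fixed k cannot provide.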

Combining Strategies

In practice, these strategies are often combined:

# Typical sampling configuration
{
    "temperature": 0.7,  # Moderate randomness
    "top_p": 0.9,        # Nucleus sampling
    "top_k": 50,         # Additional safety cap
    "max_tokens": 500    # Limit response length
}
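
One way such a configuration could be applied at each generation step is sketched below. The order used here (temperature, then top-k, then top-p, then sampling) is one common choice rather than a fixed standard, and the function name and logit values are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Sample one token from a {token: logit} dict.

    Assumed order: temperature -> top-k -> top-p -> sample.
    """
    if temperature <= 0:  # treat T = 0 as greedy decoding
        return max(logits, key=logits.get)

    # 1. Temperature-scaled softmax (max subtracted for stability).
    m = max(logits.values())
    weights = {t: math.exp((x - m) / temperature) for t, x in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((t, w / total) for t, w in weights.items()),
                    key=lambda item: item[1], reverse=True)

    # 2. Top-k: cap the number of candidates, then renormalize.
    ranked = ranked[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(t, p / mass) for t, p in ranked]

    # 3. Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p - 1e-9:  # tolerance for float rounding
            break

    # 4. Sample; random.choices rescales the weights itself.
    tokens, probs = zip(*kept)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical logits for four candidate tokens.
logits = {" blue": 3.0, " grey": 1.5, " green": 0.2, " purple": -2.0}
print(sample_next_token(logits, rng=random.Random(0)))
```

Setting `top_k=1` or `temperature=0` makes the function behave like greedy decoding, which is a quick way to check the filters are wired up correctly.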
      

Recommended Settings by Task

Task	Temp	Top-p	Top-k
Code generation	0.2	0.95	40
Math/reasoning	0.0	1.0	—
General conversation	0.7	0.9	50
Creative writing	0.9	0.95	100
Brainstorming	1.0	0.99	200

Other Sampling Parameters

Repetition Penalty

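Discourages the model from repeating the same words. Works by reducing the probability of tokens that have already appeared in the output.

# Repetition penalty = 1.2 means (simplified):
# If token "the" has probability 0.5 and has already appeared:
# Penalized weight = 0.5 / 1.2 ≈ 0.42, then all probabilities are renormalized
# Note: many implementations apply the penalty to logits rather than probabilities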
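
A logit-level version of the penalty can be sketched as follows. This assumes one widely used formulation, in which positive logits are divided by the penalty and negative logits are multiplied by it; the function name and logit values are hypothetical.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty=1.2):
    """Return a copy of {token: logit} with already-seen tokens made less likely."""
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Dividing a negative logit would raise it, so multiply instead.
        adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {"the": 2.4, "cat": 1.0, "sat": -0.6}  # hypothetical logits
print(apply_repetition_penalty(logits, ["the", "sat"]))
```

Both penalized logits move downward, and the unseen token "cat" is left unchanged; softmax is applied afterwards, so the final probabilities still sum to 1.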
      

Frequency Penalty

Similar to repetition penalty but scales with how many times the token has appeared. Tokens that appear many times get increasingly penalized.

Presence Penalty

A fixed penalty applied to any token that has appeared at all (regardless of frequency). Encourages topic diversity.
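
The difference between the two penalties is easiest to see side by side. The sketch below assumes an additive formulation on logits, similar to what some hosted APIs document: the frequency term grows with each repeat, and the presence term is subtracted once. Names and values are hypothetical.

```python
from collections import Counter

def apply_count_penalties(logits, generated_tokens,
                          frequency_penalty=0.5, presence_penalty=0.5):
    """Subtract count-based penalties from a {token: logit} dict."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts[token]
        frequency_term = count * frequency_penalty              # grows per repeat
        presence_term = presence_penalty if count > 0 else 0.0  # flat, applied once
        adjusted[token] = score - frequency_term - presence_term
    return adjusted

logits = {"robot": 2.0, "city": 2.0, "rain": 2.0}  # hypothetical logits
history = ["robot", "robot", "robot", "city"]
print(apply_count_penalties(logits, history))
# {'robot': 0.0, 'city': 1.0, 'rain': 2.0}
```

"robot" appeared three times and is penalized most, "city" appeared once and takes one frequency step plus the presence penalty, and "rain" never appeared so it is untouched.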

Max Tokens

Hard limit on how many tokens to generate. Important for:

Controlling costs (hosted APIs typically charge per token)
Preventing runaway generation
Fitting within the context window

Practical Exercises

Exercise 1: Temperature Effects

For the prompt "The best programming language is", predict how the completions would differ at:

Temperature = 0.1
Temperature = 1.0
Temperature = 2.0

Exercise 2: Top-k vs Top-p

Imagine a scenario where the model outputs probabilities: [0.5, 0.3, 0.1, 0.05, 0.05] for 5 tokens, and near-zero for all others. Which would be more restrictive: top-k=3 or top-p=0.9?

Exercise 3: Sampling Strategy Design

You're building a code completion tool. The user is writing a function definition. What sampling parameters would you use and why?