Lesson 8: Entropy & Information

Information Content

Surprise of an event:

# Information: I(x) = -log₂ P(x)   (base 2 → bits; natural log → nats)

# Rare event (low P) → high information
# Common event (high P) → low information

# Example: P(rain in desert) = 0.01
# I(rain) = -log₂(0.01) ≈ 6.64 bits (≈ 4.61 nats with the natural log)
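
The base of the logarithm only sets the unit: base 2 gives bits, base e gives nats. A minimal sketch of the desert-rain example in both units follows; the function name `surprisal` is an illustrative choice, not something defined elsewhere in this lesson.

```python
import math

def surprisal(p, base=2.0):
    """Information content -log_base(p) of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

p_rain = 0.01
bits = surprisal(p_rain)            # base 2 -> bits
nats = surprisal(p_rain, math.e)    # base e -> nats
print(f"{bits:.2f} bits, {nats:.2f} nats")
```

The two results differ only by the constant factor ln 2, so either unit works as long as it is used consistently.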
      

Entropy

Expected information (uncertainty):

# H(X) = -Σ P(x) log₂ P(x)   (log base 2 → entropy in bits)

# Uniform distribution: high entropy (uncertain)
# Peaked distribution: low entropy (certain)

# Fair coin: H = 1 bit
# Always heads: H = 0 bits
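
To see how entropy falls as a distribution becomes more peaked, the sketch below sweeps the bias of a coin. The helper `binary_entropy` is a hypothetical name used only for this illustration.

```python
import math

def binary_entropy(p_heads):
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # 0 * log(0) is taken as 0
            total -= p * math.log2(p)
    return total

# From a fair coin (most uncertain) to a coin that always lands heads.
for p_heads in (0.5, 0.7, 0.9, 0.99, 1.0):
    print(f"P(heads) = {p_heads:<4}: H = {binary_entropy(p_heads):.3f} bits")
```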
      

Cross-Entropy

# H(p,q) = -sum p(x) log q(x)
# p = true distribution
# q = predicted distribution

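# Measures how well q approximates p (lower is better; the minimum, H(p), is reached at q = p)
# With p the empirical distribution of the data, H(p,q) is the
# average negative log-likelihood of the data under q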
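
The link to negative log-likelihood can be checked numerically. The sketch below uses a made-up set of observations and a made-up model `q` (both are assumptions for illustration) and compares the average negative log-likelihood with the cross-entropy between the empirical distribution and `q`.

```python
import math
from collections import Counter

# Hypothetical toy data and model, chosen only for illustration.
observations = ["cat", "cat", "dog", "cat", "bird", "dog", "cat", "cat"]
q = {"cat": 0.6, "dog": 0.3, "bird": 0.1}   # predicted distribution

# Average negative log-likelihood of the observations under q (in nats).
avg_nll = -sum(math.log(q[x]) for x in observations) / len(observations)

# Empirical distribution p built from the same observations.
counts = Counter(observations)
p = {x: c / len(observations) for x, c in counts.items()}

# Cross-entropy H(p, q) = -sum p(x) log q(x), also in nats.
cross_entropy = -sum(p[x] * math.log(q[x]) for x in p)

print(f"Average NLL:   {avg_nll:.6f}")
print(f"Cross-entropy: {cross_entropy:.6f}")
```

The two numbers agree up to floating-point rounding, because grouping identical observations turns the average over samples into a sum weighted by p(x).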
      

📝 Practice Exercises

Exercise 1: Calculate Information Content

Given the following events and their probabilities, calculate the information content in bits:

Event A: P(A) = 0.5
Event B: P(B) = 0.25
Event C: P(C) = 0.125

Your Task: Write a Python function that calculates I(x) = -log₂(P(x)) for each event.

# Solution Template
import math

def information_content(probability):
    """Calculate information content in bits."""
    return -math.log2(probability)

# Test with the given events
events = {'A': 0.5, 'B': 0.25, 'C': 0.125}
for name, prob in events.items():
    info = information_content(prob)
    print(f"Event {name}: {info} bits")

# Expected output:
# Event A: 1.0 bits
# Event B: 2.0 bits  
# Event C: 3.0 bits
      

Exercise 2: Compute Entropy of a Distribution

A language model predicts the next token with the following probability distribution:

P("the") = 0.4
P("a") = 0.3
P("an") = 0.2
P("this") = 0.1

Your Task: Calculate the entropy H(X) = -Σ P(x) log₂ P(x). What does this value tell you about the model's uncertainty?

# Solution Template
import math

def calculate_entropy(probabilities):
    """Calculate Shannon entropy in bits."""
    entropy = 0.0
    for p in probabilities:
        if p > 0:  # Skip zero-probability outcomes (0 * log 0 is taken as 0)
            entropy -= p * math.log2(p)
    return entropy

# Model predictions
probs = [0.4, 0.3, 0.2, 0.1]
entropy = calculate_entropy(probs)
print(f"Entropy: {entropy:.3f} bits")

# Interpretation:
# - Maximum entropy for 4 outcomes would be log2(4) = 2 bits (uniform)
# - Lower entropy means the model is more confident
# - This model is fairly uncertain: 1.846 bits is about 92% of the maximum
      

Exercise 3: Cross-Entropy Loss

Given the true distribution (one-hot) and predicted probabilities below, compute the cross-entropy loss:

True: "cat" (index 0)
Predicted: [0.7, 0.2, 0.1] for ["cat", "dog", "bird"]

Your Task: Implement cross-entropy H(p,q) = -Σ p(x) log q(x) and explain why minimizing it maximizes likelihood.

# Solution Template
import math

def cross_entropy(true_index, predictions):
    """
    Calculate cross-entropy loss in bits.
    true_index: index of the correct class (p=1 for this index)
    predictions: probability distribution over all classes
    """
    # Since p is one-hot, only the true class contributes
    return -math.log2(predictions[true_index])

# Example: True class is "cat" (index 0)
true_idx = 0
preds = [0.7, 0.2, 0.1]  # Model predictions
loss = cross_entropy(true_idx, preds)
print(f"Cross-entropy loss: {loss:.3f} bits")

# Compare with a better prediction:
better_preds = [0.9, 0.05, 0.05]
better_loss = cross_entropy(true_idx, better_preds)
print(f"Better prediction loss: {better_loss:.3f} bits")

# Key insight: Higher confidence in correct answer → lower loss
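
On why minimizing cross-entropy maximizes likelihood: for independent examples the likelihood is the product of the probabilities assigned to the true classes, and the total one-hot cross-entropy is the negative log of that product. Since -log is strictly decreasing, a lower loss always means a higher likelihood. The sketch below checks this on invented numbers; the two "models" and their probabilities are assumptions for illustration only.

```python
import math

# Hypothetical probabilities two models assigned to the true class
# on the same three training examples (illustrative numbers only).
probs_model_1 = [0.7, 0.6, 0.9]
probs_model_2 = [0.5, 0.4, 0.8]

def likelihood(true_class_probs):
    """Probability of the whole dataset, assuming independent examples."""
    return math.prod(true_class_probs)

def total_cross_entropy(true_class_probs):
    """Sum of one-hot cross-entropy losses, in nats."""
    return sum(-math.log(p) for p in true_class_probs)

for name, probs in [("Model 1", probs_model_1), ("Model 2", probs_model_2)]:
    ce = total_cross_entropy(probs)
    print(f"{name}: likelihood={likelihood(probs):.3f}, "
          f"total CE={ce:.3f}, exp(-CE)={math.exp(-ce):.3f}")
```

In each row the likelihood and exp(-CE) columns match, and the model with the lower total cross-entropy is the one with the higher likelihood.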
      

Exercise 4: KL Divergence Implementation

The Kullback-Leibler (KL) divergence measures how one probability distribution diverges from a second, expected probability distribution:

D_KL(P || Q) = Σ P(x) log(P(x) / Q(x))

Your Task: Implement KL divergence and use it to compare a model's predictions against the true distribution.

# Solution Template
import math

def kl_divergence(p, q):
    """
    Calculate KL divergence D_KL(P || Q).
    p: true distribution (list of probabilities)
    q: predicted distribution (list of probabilities)
    """
    kl = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        kl += pi * math.log2(pi / qi)
    return kl

# True distribution (from training data)
true_dist = [0.5, 0.3, 0.2]

# Model A: slightly off
model_a = [0.45, 0.35, 0.2]

# Model B: overconfident in the first class
model_b = [0.8, 0.1, 0.1]

# kl_divergence(p, q) computes D_KL(P || Q), so the true distribution goes first
print(f"KL(True || Model A): {kl_divergence(true_dist, model_a):.4f}")
print(f"KL(True || Model B): {kl_divergence(true_dist, model_b):.4f}")

# Note: KL divergence is not symmetric!
print(f"KL(Model A || True): {kl_divergence(model_a, true_dist):.4f}")
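
KL divergence ties the earlier quantities together through the identity H(p,q) = H(p) + D_KL(p || q): cross-entropy is the entropy of the true distribution plus the penalty for using q instead of p. A minimal numerical check follows, with compact helper functions written for this sketch and the assumption that q(x) > 0 wherever p(x) > 0.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.5, 0.3, 0.2]
model_a = [0.45, 0.35, 0.2]

lhs = cross_entropy(true_dist, model_a)
rhs = entropy(true_dist) + kl_divergence(true_dist, model_a)
print(f"H(p,q)          = {lhs:.6f} bits")
print(f"H(p) + KL(p||q) = {rhs:.6f} bits")
```

Because H(p) does not depend on the model, minimizing cross-entropy over q is the same as minimizing D_KL(p || q).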
      

Exercise 5: Perplexity Calculation

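Perplexity is a common metric in language modeling, defined as 2^(cross-entropy) when cross-entropy is measured in bits per token (equivalently e^(cross-entropy) when it is measured in nats). Lower perplexity means the model is less "perplexed" by the data.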

Your Task: Calculate perplexity for a sequence of tokens and interpret what it means for model quality.

# Solution Template
import math

def perplexity(cross_entropy):
    """
    Calculate perplexity from cross-entropy.
    Perplexity = 2^H(p,q)
    """
    return 2 ** cross_entropy

# Example: Model's average cross-entropy on a test set
avg_cross_entropy = 2.5  # bits per token

ppl = perplexity(avg_cross_entropy)
print(f"Perplexity: {ppl:.2f}")

# Interpretation:
# - Perplexity of 5.66 means the model is as uncertain
#   as if it had to choose uniformly among 5.66 options
# - Reported perplexities depend on the tokenizer and the test set, so
#   figures are only comparable between models evaluated the same way
# - Random guessing (uniform over a 50k vocabulary): exactly 50000

# Calculate for different models
models = {
    "Random": math.log2(50000),
    "Basic LM": 6.0,
    "Good LM": 3.5,
    "Great LM": 2.5
}

for name, ce in models.items():
    ppl = perplexity(ce)
    print(f"{name}: CE={ce:.2f}, PPL={ppl:.1f}")
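
The task asks for perplexity over a sequence of tokens, which takes one more step than converting a given cross-entropy: average the per-token surprisal first. A minimal sketch, assuming you already have the probability the model assigned to each token that actually occurred (the list below is invented for illustration):

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each token that actually occurred."""
    if not token_probs:
        raise ValueError("need at least one token")
    # Average surprisal in bits per token = empirical cross-entropy.
    avg_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_bits

# Hypothetical per-token probabilities for a 4-token sequence.
token_probs = [0.5, 0.25, 0.125, 0.5]
ppl = sequence_perplexity(token_probs)
print(f"Perplexity: {ppl:.2f}")
```

Here the surprisals are 1, 2, 3 and 1 bits, so the average is 1.75 bits per token and the perplexity is 2^1.75. Equivalently, perplexity is the geometric mean of 1/p over the tokens.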
      

💡 Practical Examples

Example 1: Information in Language Models

When a language model predicts the next token, the information content reveals how "surprising" each prediction is:

# Context: "The capital of France is"
# Model predictions:
#   "Paris": 0.85 probability → I = -log₂(0.85) = 0.23 bits
#   "Lyon": 0.10 probability → I = -log₂(0.10) = 3.32 bits  
#   "Berlin": 0.05 probability → I = -log₂(0.05) = 4.32 bits

# High-probability predictions (like "Paris") convey little information
# because they're expected. Rare predictions convey more information.
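
The same figures can be reproduced in a few lines. The sketch below also computes the expected surprisal, under the assumption that tokens really do occur with the predicted probabilities; that expectation is the entropy of the prediction.

```python
import math

# Probabilities from the example above; context: "The capital of France is"
predictions = {"Paris": 0.85, "Lyon": 0.10, "Berlin": 0.05}

# Surprisal of each candidate token, in bits.
surprisal = {tok: -math.log2(p) for tok, p in predictions.items()}

# Expected surprisal if tokens occur with exactly these probabilities.
expected_bits = sum(p * surprisal[tok] for tok, p in predictions.items())

for tok, bits in surprisal.items():
    print(f"{tok}: {bits:.2f} bits")
print(f"Expected surprisal: {expected_bits:.2f} bits")
```

The expectation stays well below 1 bit because the rare, high-surprisal tokens are weighted by their small probabilities.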
        

Key takeaway: Well-trained models see low information content on average because they assign high probability to the tokens that actually occur.

Example 2: Entropy in Decision Trees

Entropy helps decide which feature splits data most effectively:

# Dataset: 10 animals, classify as mammal or reptile
# Feature: "Has fur?"

# Before split: 5 mammals, 5 reptiles
# H = -0.5*log₂(0.5) - 0.5*log₂(0.5) = 1 bit (maximum uncertainty)

# After split on "Has fur":
#   - Has fur (6 animals): 5 mammals, 1 reptile
#     H = -5/6*log₂(5/6) - 1/6*log₂(1/6) = 0.65 bits
#   
#   - No fur (4 animals): 0 mammals, 4 reptiles  
#     H = 0 bits (pure group!)

# Information gain = 1 - (6/10*0.65 + 4/10*0) = 0.61 bits
# This feature reduces uncertainty significantly.
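
The arithmetic above can be reproduced from class counts alone. This is a minimal sketch with hypothetical helper names (`entropy_from_counts`, `information_gain`), not the API of any particular decision-tree library.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a group described by its class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy_from_counts(child)
        for child in child_counts_list
    )
    return entropy_from_counts(parent_counts) - weighted

# Counts are [mammals, reptiles], taken from the example above.
parent = [5, 5]
has_fur = [5, 1]
no_fur = [0, 4]

gain = information_gain(parent, [has_fur, no_fur])
print(f"Information gain: {gain:.2f} bits")
```

A tree builder would compute this gain for every candidate feature and split on the one with the largest value.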
        

Key takeaway: Lower entropy after a split means the feature is useful for classification.

Example 3: Cross-Entropy in Neural Networks

Cross-entropy loss drives neural network training for classification:

import torch
import torch.nn.functional as F

# 3-class classification: cat, dog, bird
# True label: cat (index 0)
true_label = torch.tensor([0])  # Class index

# Model outputs (logits) before softmax
logits = torch.tensor([[2.0, 1.0, 0.1]])

# Convert to probabilities with softmax
probs = F.softmax(logits, dim=1)
# probs = [[0.659, 0.242, 0.099]]

# Cross-entropy loss (PyTorch combines softmax + NLL)
loss = F.cross_entropy(logits, true_label)
# loss = -ln(0.659) ≈ 0.417 (in nats: PyTorch uses the natural log)

# Compare with confident wrong prediction:
wrong_logits = torch.tensor([[0.1, 2.0, 1.0]])  # Thinks "dog"
wrong_loss = F.cross_entropy(wrong_logits, true_label)
# wrong_loss = -ln(0.099) ≈ 2.32 nats (≈ 3.34 bits, much higher!)

# Training minimizes this loss, pushing model to be confident AND correct
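
To check these numbers without PyTorch, the same computation can be sketched with the standard library. This is a plain-Python re-derivation for illustration, not how `F.cross_entropy` is implemented internally; it uses the natural log, so the results are in nats.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_nats(logits, true_index):
    """Negative natural log of the softmax probability of the true class."""
    return -math.log(softmax(logits)[true_index])

right = cross_entropy_nats([2.0, 1.0, 0.1], 0)   # model favours "cat"
wrong = cross_entropy_nats([0.1, 2.0, 1.0], 0)   # model favours "dog"

print(f"Correct prediction loss: {right:.3f} nats")
print(f"Wrong prediction loss:   {wrong:.3f} nats "
      f"({wrong / math.log(2):.2f} bits)")
```

Dividing by ln 2 converts nats to bits, which is how the same wrong prediction can be reported as either about 2.32 nats or about 3.34 bits.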
        

Key takeaway: Cross-entropy penalizes confident wrong predictions heavily, so the model is rewarded for high confidence only when it is correct.