Information theory basics. Cross-entropy loss derivation.
Information Content
Surprise of an event:
# Information: I(x) = -log₂ P(x) (base 2 gives bits; the natural log gives nats)
# Rare event (low P) → high information
# Common event (high P) → low information
# Example: P(rain in desert) = 0.01
# I(rain) = -log₂(0.01) ≈ 6.64 bits (≈ 4.6 nats with the natural log)
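The unit of information depends on the base of the logarithm: base 2 gives bits, base e gives nats. A minimal sketch of the conversion for the desert-rain example follows; the helper name `surprisal` and its `base` parameter are illustrative choices, not part of the text above.

```python
import math

def surprisal(probability, base=2.0):
    """Information content -log_base(P); base 2 -> bits, base e -> nats."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability) / math.log(base)

p_rain = 0.01  # the desert-rain example
bits = surprisal(p_rain)          # base 2
nats = surprisal(p_rain, math.e)  # base e
print(f"{bits:.2f} bits, {nats:.2f} nats")

# The two values differ only by a constant factor: bits = nats / ln(2)
```

Mixing the two units is an easy mistake to make, so it helps to state the base whenever a number is quoted.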
Entropy
Expected information (uncertainty):
# H(X) = -Σ P(x) log₂ P(x)
# Uniform distribution: high entropy (uncertain)
# Peaked distribution: low entropy (certain)
# Fair coin: H = 1 bit
# Always heads: H = 0 bits
Cross-Entropy
# H(p,q) = -Σ p(x) log q(x)
# p = true distribution
# q = predicted distribution
# Measures how well q approximates p
# With a one-hot p this reduces to -log q(correct class): the negative log-likelihood!
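The link between cross-entropy and negative log-likelihood can be written out in a few lines. The sketch below assumes a one-hot true distribution for each example and independent examples; the symbols y (correct class), N (number of examples), and θ (model parameters) are notation introduced here, and the dependence of q on the input is left implicit.

```latex
\begin{align*}
H(p, q) &= -\sum_{x} p(x) \log q(x) \\
        &= -\log q(y)
           \qquad \text{since } p(y) = 1 \text{ and } p(x) = 0 \text{ for } x \neq y \\[4pt]
\frac{1}{N} \sum_{i=1}^{N} H(p_i, q_\theta)
        &= -\frac{1}{N} \sum_{i=1}^{N} \log q_\theta(y_i)
         = -\frac{1}{N} \log \prod_{i=1}^{N} q_\theta(y_i)
\end{align*}
```

The product on the right is the likelihood of the labels under the model, so the average cross-entropy is the negative log-likelihood divided by N, and minimizing one maximizes the other.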
📝 Practice Exercises
Exercise 1: Calculate Information Content
Given the following events and their probabilities, calculate the information content in bits:
Event A: P(A) = 0.5
Event B: P(B) = 0.25
Event C: P(C) = 0.125
Your Task: Write a Python function that calculates I(x) = -log₂(P(x)) for each event.
# Solution Template
import math
def information_content(probability):
    """Calculate information content in bits."""
    return -math.log2(probability)
# Test with the given events
events = {'A': 0.5, 'B': 0.25, 'C': 0.125}
for name, prob in events.items():
    info = information_content(prob)
    print(f"Event {name}: {info} bits")
# Expected output:
# Event A: 1.0 bits
# Event B: 2.0 bits
# Event C: 3.0 bits
Exercise 2: Compute Entropy of a Distribution
A language model predicts the next token with the following probability distribution:
P("the") = 0.4
P("a") = 0.3
P("an") = 0.2
P("this") = 0.1
Your Task: Calculate the entropy H(X) = -Σ P(x) log₂ P(x). What does this value tell you about the model's uncertainty?
# Solution Template
import math
def calculate_entropy(probabilities):
    """Calculate Shannon entropy in bits."""
    entropy = 0.0
    for p in probabilities:
        if p > 0:  # Skip zero-probability outcomes: log2(0) is undefined and they contribute 0
            entropy -= p * math.log2(p)
    return entropy
# Model predictions
probs = [0.4, 0.3, 0.2, 0.1]
entropy = calculate_entropy(probs)
print(f"Entropy: {entropy:.3f} bits")
# Interpretation:
# - Maximum entropy for 4 outcomes would be log2(4) = 2 bits (uniform)
# - Lower entropy means the model is more confident
# - This model is fairly uncertain: 1.846 bits is about 92% of the 2-bit maximum
Exercise 3: Cross-Entropy Loss
Given the true distribution (one-hot) and predicted probabilities below, compute the cross-entropy loss:
True: "cat" (index 0)
Predicted: [0.7, 0.2, 0.1] for ["cat", "dog", "bird"]
Your Task: Implement cross-entropy H(p,q) = -Σ p(x) log q(x) and explain why minimizing it maximizes likelihood.
# Solution Template
import math
def cross_entropy(true_index, predictions):
    """
    Calculate cross-entropy loss in bits.
    true_index: index of the correct class (p=1 for this index)
    predictions: probability distribution over all classes
    """
    # Since p is one-hot, only the true class contributes
    return -math.log2(predictions[true_index])
# Example: True class is "cat" (index 0)
true_idx = 0
preds = [0.7, 0.2, 0.1] # Model predictions
loss = cross_entropy(true_idx, preds)
print(f"Cross-entropy loss: {loss:.3f} bits")
# Compare with a better prediction:
better_preds = [0.9, 0.05, 0.05]
better_loss = cross_entropy(true_idx, better_preds)
print(f"Better prediction loss: {better_loss:.3f} bits")
# Key insight: Higher confidence in correct answer → lower loss
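The second half of the task, why minimizing cross-entropy maximizes likelihood, can be checked numerically. The sketch below assumes independent examples and uses made-up probabilities that two hypothetical models assign to the correct class; the function names are illustrative.

```python
import math

# Hypothetical probabilities two models assign to the correct class
# on the same four training examples (made-up numbers for illustration).
model_a = [0.7, 0.6, 0.9, 0.8]
model_b = [0.5, 0.4, 0.6, 0.7]

def likelihood(correct_class_probs):
    """Probability of the whole dataset, assuming independent examples."""
    return math.prod(correct_class_probs)

def total_cross_entropy(correct_class_probs):
    """Sum of per-example one-hot cross-entropies, in bits."""
    return sum(-math.log2(p) for p in correct_class_probs)

for name, probs in [("A", model_a), ("B", model_b)]:
    print(f"Model {name}: likelihood={likelihood(probs):.4f}, "
          f"total CE={total_cross_entropy(probs):.3f} bits")

# The total loss equals -log2 of the likelihood, so the model with the
# lower loss is the model with the higher likelihood.
```

Because the logarithm is monotonic, ranking models by summed cross-entropy and ranking them by likelihood always give opposite orders.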
Exercise 4: KL Divergence Implementation
The Kullback-Leibler (KL) divergence measures how one probability distribution diverges from a second, expected probability distribution:
D_KL(P || Q) = Σ P(x) log(P(x) / Q(x))
Your Task: Implement KL divergence and use it to compare a model's predictions against the true distribution.
# Solution Template
import math
def kl_divergence(p, q):
    """
    Calculate KL divergence D_KL(P || Q) in bits.
    p: true distribution (list of probabilities)
    q: predicted distribution (list of probabilities)
    """
    kl = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # Q rules out an event that P allows
            kl += pi * math.log2(pi / qi)
    return kl
# True distribution (from training data)
true_dist = [0.5, 0.3, 0.2]
# Model A: slightly off
model_a = [0.45, 0.35, 0.2]
# Model B: overconfident in the first class
model_b = [0.8, 0.1, 0.1]
print(f"KL(True || Model A): {kl_divergence(true_dist, model_a):.4f}")
print(f"KL(True || Model B): {kl_divergence(true_dist, model_b):.4f}")
# Note: KL divergence is not symmetric!
print(f"KL(Model A || True): {kl_divergence(model_a, true_dist):.4f}")
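KL divergence ties entropy and cross-entropy together through the identity H(p,q) = H(p) + D_KL(p || q). A short numerical check on the exercise's `true_dist` and `model_a` values is sketched below; the three helper functions are illustrative and assume q is nonzero wherever p is nonzero.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_dist(p, q):
    """Cross-entropy H(p, q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(p || q) in bits; same assumption on q as above."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.5, 0.3, 0.2]
model_a = [0.45, 0.35, 0.2]

h_p = entropy(true_dist)
h_pq = cross_entropy_dist(true_dist, model_a)
d_kl = kl(true_dist, model_a)
print(f"H(p)={h_p:.4f}, H(p,q)={h_pq:.4f}, D_KL={d_kl:.4f}")
print(f"H(p) + D_KL = {h_p + d_kl:.4f}")
```

Since H(p) does not depend on the model, minimizing cross-entropy with respect to q is the same as minimizing the KL divergence from the true distribution.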
Exercise 5: Perplexity Calculation
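Perplexity is a common metric in language modeling, defined as 2^(cross-entropy) when the cross-entropy is measured in bits per token (equivalently, e^(cross-entropy) when it is measured in nats). Lower perplexity means the model is less "perplexed" by the data.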
Your Task: Calculate perplexity for a sequence of tokens and interpret what it means for model quality.
# Solution Template
import math
def perplexity(cross_entropy):
    """
    Calculate perplexity from cross-entropy in bits.
    Perplexity = 2^H(p,q)
    """
    return 2 ** cross_entropy
# Example: Model's average cross-entropy on a test set
avg_cross_entropy = 2.5 # bits per token
ppl = perplexity(avg_cross_entropy)
print(f"Perplexity: {ppl:.2f}")
# Interpretation:
# - Perplexity of 5.66 means the model is as uncertain
# as if it had to choose uniformly among 5.66 options
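# - Perplexity values are only comparable for the same tokenizer and test set
# - Rough, unsourced ballpark for humans on English: ~10-20
# - Rough, unsourced ballpark for a GPT-4-class model: ~8-12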
# - Random guessing (vocab size 50k): ~50000
# Calculate for different models
models = {
    "Random": math.log2(50000),
    "Basic LM": 6.0,
    "Good LM": 3.5,
    "Great LM": 2.5,
}
for name, ce in models.items():
    ppl = perplexity(ce)
    print(f"{name}: CE={ce:.2f}, PPL={ppl:.1f}")
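The task asks for perplexity over a sequence of tokens, which means first averaging the per-token cross-entropy. A minimal sketch, assuming the model's probability for each observed token is already available; `sequence_perplexity` and the sample probabilities are hypothetical.

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity from the probabilities a model gave each observed token."""
    if not token_probs:
        raise ValueError("need at least one token")
    # Average cross-entropy in bits per token
    avg_ce = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_ce

# Hypothetical per-token probabilities for a five-token sentence
token_probs = [0.5, 0.25, 0.125, 0.25, 0.5]
ppl = sequence_perplexity(token_probs)
print(f"Perplexity: {ppl:.2f}")
```

This is the same as the inverse geometric mean of the token probabilities, which is why a model that assigns probability 1/k to every token has perplexity exactly k.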
💡 Practical Examples
Example 1: Information in Language Models
When a language model predicts the next token, the information content reveals how "surprising" each prediction is:
# Context: "The capital of France is"
# Model predictions:
# "Paris": 0.85 probability → I = -log₂(0.85) = 0.23 bits
# "Lyon": 0.10 probability → I = -log₂(0.10) = 3.32 bits
# "Berlin": 0.05 probability → I = -log₂(0.05) = 4.32 bits
# High-probability predictions (like "Paris") convey little information
# because they're expected. Rare predictions convey more information.
Key takeaway: Well-trained models see low information content on average because they assign high probability to the tokens that actually occur.
Example 2: Entropy in Decision Trees
Entropy helps decide which feature splits data most effectively:
# Dataset: 10 animals, classify as mammal or reptile
# Feature: "Has fur?"
# Before split: 5 mammals, 5 reptiles
# H = -0.5*log₂(0.5) - 0.5*log₂(0.5) = 1 bit (maximum uncertainty)
# After split on "Has fur":
# - Has fur (6 animals): 5 mammals, 1 reptile
# H = -5/6*log₂(5/6) - 1/6*log₂(1/6) = 0.65 bits
#
# - No fur (4 animals): 0 mammals, 4 reptiles
# H = 0 bits (pure group!)
# Information gain = 1 - (6/10*0.65 + 4/10*0) = 0.61 bits
# This feature reduces uncertainty significantly.
Key takeaway: Lower entropy after a split means the feature is useful for classification.
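The information-gain arithmetic for the "Has fur?" split can be reproduced in a few lines. This is a sketch under the counts given in the example; the function names are illustrative and not taken from any particular decision-tree library.

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy in bits of a class-count list such as [5, 1]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(
        (sum(child) / total) * entropy_from_counts(child)
        for child in child_counts_list
    )
    return entropy_from_counts(parent_counts) - weighted

# Counts are [mammals, reptiles]
parent = [5, 5]
has_fur = [5, 1]
no_fur = [0, 4]
gain = information_gain(parent, [has_fur, no_fur])
print(f"Information gain: {gain:.2f} bits")
```

A tree builder would compute this gain for every candidate feature and split on the one with the largest value.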
Example 3: Cross-Entropy in Neural Networks
Cross-entropy loss drives neural network training for classification:
import torch
import torch.nn.functional as F
# 3-class classification: cat, dog, bird
# True label: cat (index 0)
true_label = torch.tensor([0]) # Class index
# Model outputs (logits) before softmax
logits = torch.tensor([[2.0, 1.0, 0.1]])
# Convert to probabilities with softmax
probs = F.softmax(logits, dim=1)
# probs = [[0.659, 0.242, 0.099]]
# Cross-entropy loss (PyTorch combines softmax + NLL)
loss = F.cross_entropy(logits, true_label)
# loss = -ln(0.659) ≈ 0.417 (PyTorch uses the natural log, so the unit is nats)
# Compare with confident wrong prediction:
wrong_logits = torch.tensor([[0.1, 2.0, 1.0]]) # Thinks "dog"
wrong_loss = F.cross_entropy(wrong_logits, true_label)
# wrong_loss = -ln(0.099) ≈ 2.32 nats (much higher!)
# Training minimizes this loss, pushing the model to be confident AND correct
Key takeaway: Cross-entropy penalizes confident wrong predictions heavily, so training discourages the model from being overconfident in the wrong class.
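To see what the combined softmax and negative log-likelihood step computes, the same numbers can be reproduced without PyTorch. The sketch below handles a single example with a class-index label; the helper names are illustrative, and the result is in nats because it uses the natural log.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, true_index):
    """Natural-log cross-entropy for one example with a class-index label."""
    return -math.log(softmax(logits)[true_index])

loss = cross_entropy_from_logits([2.0, 1.0, 0.1], 0)
wrong_loss = cross_entropy_from_logits([0.1, 2.0, 1.0], 0)
print(f"loss: {loss:.3f} nats = {loss / math.log(2):.3f} bits")
print(f"wrong_loss: {wrong_loss:.3f} nats = {wrong_loss / math.log(2):.3f} bits")
```

Dividing by ln(2) converts nats to bits, which makes these losses directly comparable with the bit-based exercises earlier in the section.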