🚧 Lesson 8 of 25 in Level 05

Entropy & Information

Information theory basics. Cross-entropy loss derivation.

Information Content

Surprise of an event:

# Information: I(x) = -log₂ P(x), measured in bits
# Rare event (low P) → high information
# Common event (high P) → low information

# Example: P(rain in desert) = 0.01
# I(rain) = -log₂(0.01) ≈ 6.64 bits
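The unit of information depends on the base of the logarithm: base 2 gives bits, the natural log gives nats. A minimal sketch of the conversion, using a helper named `surprisal` that is introduced here for illustration only, applied to the desert-rain example:

```python
import math

def surprisal(p, base=2.0):
    """Information content of an event with probability p.

    base=2 gives bits, base=math.e gives nats.
    `surprisal` is an illustrative name, not something the lesson defines.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

p_rain = 0.01
bits = surprisal(p_rain)          # base-2 logarithm
nats = surprisal(p_rain, math.e)  # natural logarithm
print(f"{bits:.2f} bits, {nats:.2f} nats")
```

The same event is about 6.64 bits or about 4.61 nats, so a number without a unit is ambiguous.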

Entropy

Expected information (uncertainty):

# H(X) = -sum P(x) log₂ P(x)
# Uniform distribution: high entropy (uncertain)
# Peaked distribution: low entropy (certain)

# Fair coin: H = 1 bit
# Always heads: H = 0 bits
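The two coin cases can be checked numerically. The sketch below assumes a two-outcome distribution and uses a hypothetical helper `binary_entropy`; the biased coin with P(heads) = 0.9 is an extra data point added for contrast.

```python
import math

def binary_entropy(p_heads):
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # the term p * log2(p) is treated as 0 when p == 0
            total -= p * math.log2(p)
    return total

for p in (0.5, 0.9, 1.0):
    print(f"P(heads)={p}: H = {binary_entropy(p):.3f} bits")
```

Entropy falls from 1 bit toward 0 as the coin becomes more predictable.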

Cross-Entropy

# H(p,q) = -sum p(x) log q(x)
# p = true distribution
# q = predicted distribution
# Measures how well q approximates p
# For a one-hot p, this equals the negative log-likelihood of the true class
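To see why a one-hot target reduces cross-entropy to a negative log-likelihood, it helps to write out the sum. In this sketch, y denotes the true class (notation introduced here), so p(y) = 1 and p(x) = 0 for every other x, with the usual convention that a term with p(x) = 0 contributes nothing:

```latex
\begin{aligned}
H(p, q) &= -\sum_{x} p(x) \log q(x) \\
        &= -\,p(y) \log q(y) \;-\; \sum_{x \neq y} p(x) \log q(x) \\
        &= -\,1 \cdot \log q(y) \;-\; 0 \\
        &= -\log q(y)
\end{aligned}
```

Only the probability the model assigns to the true class survives, which is why the loss ignores how the remaining probability mass is spread.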

📝 Practice Exercises

Exercise 1: Calculate Information Content

Given the following events and their probabilities, calculate the information content in bits:

  • Event A: P(A) = 0.5
  • Event B: P(B) = 0.25
  • Event C: P(C) = 0.125

Your Task: Write a Python function that calculates I(x) = -log₂(P(x)) for each event.

# Solution Template
import math

def information_content(probability):
    """Calculate information content in bits."""
    return -math.log2(probability)

# Test with the given events
events = {'A': 0.5, 'B': 0.25, 'C': 0.125}
for name, prob in events.items():
    info = information_content(prob)
    print(f"Event {name}: {info} bits")

# Expected output:
# Event A: 1.0 bits
# Event B: 2.0 bits
# Event C: 3.0 bits

Exercise 2: Compute Entropy of a Distribution

A language model predicts the next token with the following probability distribution:

  • P("the") = 0.4
  • P("a") = 0.3
  • P("an") = 0.2
  • P("this") = 0.1

Your Task: Calculate the entropy H(X) = -Σ P(x) log₂ P(x). What does this value tell you about the model's uncertainty?

# Solution Template
import math

def calculate_entropy(probabilities):
    """Calculate Shannon entropy in bits."""
    entropy = 0.0
    for p in probabilities:
        if p > 0:  # Avoid log(0)
            entropy -= p * math.log2(p)
    return entropy

# Model predictions
probs = [0.4, 0.3, 0.2, 0.1]
entropy = calculate_entropy(probs)
print(f"Entropy: {entropy:.3f} bits")

# Interpretation:
# - Maximum entropy for 4 outcomes would be log2(4) = 2 bits (uniform)
# - Lower entropy means the model is more confident
# - At 1.846 bits this model is close to the 2-bit maximum,
#   so it is still fairly uncertain about the next token

Exercise 3: Cross-Entropy Loss

Given the true distribution (one-hot) and predicted probabilities below, compute the cross-entropy loss:

  • True: "cat" (index 0)
  • Predicted: [0.7, 0.2, 0.1] for ["cat", "dog", "bird"]

Your Task: Implement cross-entropy H(p,q) = -Σ p(x) log q(x) and explain why minimizing it maximizes likelihood.

# Solution Template
import math

def cross_entropy(true_index, predictions):
    """
    Calculate cross-entropy loss in bits.
    true_index: index of the correct class (p=1 for this index)
    predictions: probability distribution over all classes
    """
    # Since p is one-hot, only the true class contributes
    return -math.log2(predictions[true_index])

# Example: True class is "cat" (index 0)
true_idx = 0
preds = [0.7, 0.2, 0.1]  # Model predictions
loss = cross_entropy(true_idx, preds)
print(f"Cross-entropy loss: {loss:.3f} bits")

# Compare with a better prediction:
better_preds = [0.9, 0.05, 0.05]
better_loss = cross_entropy(true_idx, better_preds)
print(f"Better prediction loss: {better_loss:.3f} bits")

# Key insight: Higher confidence in correct answer → lower loss
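For the second half of the task, one way to see the link to likelihood is to sum the loss over several examples. The sketch below assumes independent examples and uses made-up probabilities for the correct class; the total cross-entropy is then the negative log of the dataset likelihood, so lowering one raises the other.

```python
import math

# Hypothetical probabilities the model assigns to the correct class
# on three independent examples (illustrative values only).
p_correct = [0.7, 0.9, 0.6]

# Likelihood of the whole dataset under the independence assumption.
likelihood = math.prod(p_correct)

# Sum of per-example cross-entropy losses with one-hot targets, in bits.
total_loss = sum(-math.log2(p) for p in p_correct)

# total_loss equals -log2(likelihood) up to floating-point rounding.
print(f"likelihood        = {likelihood:.3f}")
print(f"total loss        = {total_loss:.3f} bits")
print(f"-log2(likelihood) = {-math.log2(likelihood):.3f} bits")
```

Because the logarithm is monotonic, the parameters that minimize the summed loss are the same ones that maximize the likelihood.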

Exercise 4: KL Divergence Implementation

The Kullback-Leibler (KL) divergence measures how one probability distribution diverges from a second, expected probability distribution:

D_KL(P || Q) = Σ P(x) log(P(x) / Q(x))

Your Task: Implement KL divergence and use it to compare a model's predictions against the true distribution.

# Solution Template
import math

def kl_divergence(p, q):
    """
    Calculate KL divergence D_KL(P || Q) in bits.
    p: true distribution (list of probabilities)
    q: predicted distribution (list of probabilities)
    """
    kl = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # Terms with P(x) = 0 contribute nothing
        if qi == 0:
            return math.inf  # Q rules out an outcome that P allows
        kl += pi * math.log2(pi / qi)
    return kl

# True distribution (from training data)
true_dist = [0.5, 0.3, 0.2]

# Model A: slightly off
model_a = [0.45, 0.35, 0.2]

# Model B: overconfident in the first class
model_b = [0.8, 0.1, 0.1]

# The first argument is P, so these compute D_KL(True || Model)
print(f"KL(True || Model A): {kl_divergence(true_dist, model_a):.4f}")
print(f"KL(True || Model B): {kl_divergence(true_dist, model_b):.4f}")

# Note: KL divergence is not symmetric!
print(f"KL(Model A || True): {kl_divergence(model_a, true_dist):.4f}")
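KL divergence ties the earlier quantities together through the identity H(p,q) = H(p) + D_KL(p || q). The sketch below checks it numerically; the helper names are chosen for this example, and it assumes q is nonzero wherever p is nonzero.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_dist(p, q):
    """Cross-entropy H(p, q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL divergence D_KL(p || q) in bits, under the same assumption."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # true distribution
q = [0.8, 0.1, 0.1]  # model distribution

lhs = cross_entropy_dist(p, q)
rhs = entropy(p) + kl(p, q)
print(f"H(p,q)         = {lhs:.4f} bits")
print(f"H(p) + KL(p||q) = {rhs:.4f} bits")
```

Since H(p) does not depend on the model, minimizing cross-entropy with respect to q is the same as minimizing the KL divergence from p to q.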

Exercise 5: Perplexity Calculation

Perplexity is a common metric in language modeling, defined as 2^(cross-entropy) when the cross-entropy is measured in bits per token. Lower perplexity means the model is less "perplexed" by the data.

Your Task: Calculate perplexity for a sequence of tokens and interpret what it means for model quality.

# Solution Template
import math

def perplexity(cross_entropy):
    """
    Calculate perplexity from cross-entropy in bits.
    Perplexity = 2^H(p,q)
    """
    return 2 ** cross_entropy

# Example: Model's average cross-entropy on a test set
avg_cross_entropy = 2.5  # bits per token
ppl = perplexity(avg_cross_entropy)
print(f"Perplexity: {ppl:.2f}")

# Interpretation:
# - Perplexity of 5.66 means the model is as uncertain
#   as if it had to choose uniformly among 5.66 options
# - Random guessing (uniform over a 50k vocabulary): 50000
# - Reported perplexities depend on the tokenizer and test set,
#   so compare models only under the same setup

# Calculate for different models (illustrative cross-entropy values)
models = {
    "Random": math.log2(50000),
    "Basic LM": 6.0,
    "Good LM": 3.5,
    "Great LM": 2.5
}
for name, ce in models.items():
    ppl = perplexity(ce)
    print(f"{name}: CE={ce:.2f}, PPL={ppl:.1f}")
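The task asks for perplexity over a sequence of tokens, which needs one step before the exponentiation: averaging the per-token surprise. A minimal sketch, assuming we already have the probability the model assigned to each observed token (the values and the name `sequence_perplexity` are made up for illustration):

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity from the probabilities assigned to the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average cross-entropy in bits per token.
    avg_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_bits

# Hypothetical probabilities for a four-token sequence.
token_probs = [0.5, 0.25, 0.125, 0.5]
ppl = sequence_perplexity(token_probs)
print(f"Perplexity: {ppl:.2f}")
```

A model that assigns 0.25 to every observed token gets a perplexity of exactly 4, matching the "choosing uniformly among N options" reading.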

💡 Practical Examples

Example 1: Information in Language Models

When a language model predicts the next token, the information content reveals how "surprising" each prediction is:

# Context: "The capital of France is"
# Model predictions:
# "Paris":  0.85 probability → I = -log₂(0.85) = 0.23 bits
# "Lyon":   0.10 probability → I = -log₂(0.10) = 3.32 bits
# "Berlin": 0.05 probability → I = -log₂(0.05) = 4.32 bits

# High-probability predictions (like "Paris") convey little information
# because they're expected. Rare predictions convey more information.

Key takeaway: A well-trained model assigns high probability to the tokens that actually occur, so the average information content (surprise) of the observed text is low.

Example 2: Entropy in Decision Trees

Entropy helps decide which feature splits data most effectively:

# Dataset: 10 animals, classify as mammal or reptile
# Feature: "Has fur?"

# Before split: 5 mammals, 5 reptiles
# H = -0.5*log₂(0.5) - 0.5*log₂(0.5) = 1 bit (maximum uncertainty)

# After split on "Has fur":
# - Has fur (6 animals): 5 mammals, 1 reptile
#   H = -5/6*log₂(5/6) - 1/6*log₂(1/6) = 0.65 bits
#
# - No fur (4 animals): 0 mammals, 4 reptiles
#   H = 0 bits (pure group!)

# Information gain = 1 - (6/10*0.65 + 4/10*0) = 0.61 bits
# This feature reduces uncertainty significantly.
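The same arithmetic can be sketched as a small function working from class counts. The helper names are hypothetical, and the code assumes every child group contains at least one sample.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a group described by its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = sum(parent_counts)
    weighted = sum(
        (sum(child) / n) * entropy_from_counts(child)
        for child in child_counts_list
    )
    return entropy_from_counts(parent_counts) - weighted

# Counts are [mammals, reptiles]: parent, then "has fur" and "no fur" groups.
gain = information_gain([5, 5], [[5, 1], [0, 4]])
print(f"Information gain: {gain:.2f} bits")
```

A tree-building routine would compute this gain for each candidate feature and split on the one with the largest value.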

Key takeaway: Lower entropy after a split means the feature is useful for classification.

Example 3: Cross-Entropy in Neural Networks

Cross-entropy loss drives neural network training for classification:

import torch
import torch.nn.functional as F

# 3-class classification: cat, dog, bird
# True label: cat (index 0)
true_label = torch.tensor([0])  # Class index

# Model outputs (logits) before softmax
logits = torch.tensor([[2.0, 1.0, 0.1]])

# Convert to probabilities with softmax
probs = F.softmax(logits, dim=1)
# probs = [[0.659, 0.242, 0.099]]

# Cross-entropy loss (PyTorch combines log-softmax + NLL and uses the natural log)
loss = F.cross_entropy(logits, true_label)
# loss = -ln(0.659) ≈ 0.417 nats

# Compare with confident wrong prediction:
wrong_logits = torch.tensor([[0.1, 2.0, 1.0]])  # Thinks "dog"
wrong_loss = F.cross_entropy(wrong_logits, true_label)
# wrong_loss = -ln(0.099) ≈ 2.32 nats (much higher!)

# Training minimizes this loss, pushing model to be confident AND correct

Key takeaway: Cross-entropy penalizes confident wrong predictions heavily, so training pushes the model to be confident only when it is right.
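The loss values for the two sets of logits can be reproduced by hand without PyTorch. This is a sketch of the standard log-sum-exp formulation for a single example, not PyTorch's actual implementation; the function name is chosen for illustration and the result is in nats.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy in nats for one example, computed from raw logits."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log softmax(logits)[target] = log_sum_exp - logits[target]
    return log_sum_exp - logits[target_index]

right = softmax_cross_entropy([2.0, 1.0, 0.1], 0)  # model favors "cat"
wrong = softmax_cross_entropy([0.1, 2.0, 1.0], 0)  # model favors "dog"
print(f"correct-leaning loss: {right:.3f} nats")
print(f"wrong-leaning loss:   {wrong:.3f} nats")
```

Working from logits rather than probabilities avoids taking the log of a value that has already underflowed to zero.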