🚧 Lesson 9 of 25 in Level 05

KL Divergence

Measuring distribution differences. Applications in ML.

KL Divergence

Kullback-Leibler (KL) divergence measures how much a probability distribution P differs from a second distribution Q defined over the same outcomes:

# D_KL(P || Q) = sum P(x) log(P(x)/Q(x))
#              = sum P(x) log P(x) - sum P(x) log Q(x)
#              = -H(P) + H(P,Q)
# Always ≥ 0
# = 0 iff P = Q
# Not symmetric: D_KL(P||Q) ≠ D_KL(Q||P)
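As a quick numerical check of the identities above, the following sketch computes entropy, cross-entropy, and KL divergence with the standard library only. The helper names (`entropy`, `cross_entropy`, `kl`) and the two example distributions are assumptions made for illustration; they are not part of the lesson.

```python
import math

def entropy(p):
    """H(P) = -sum P(x) log P(x), in nats; terms with P(x) = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log Q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = sum P(x) log(P(x) / Q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions (assumed for this sketch)
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl(p, q))                          # direct definition
print(cross_entropy(p, q) - entropy(p))  # same value via -H(P) + H(P,Q)
print(kl(p, p))                          # identical distributions give 0
```

The first two printed values should agree up to floating-point rounding, and the third should be zero.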

Interpretation

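Extra bits needed: the average number of additional bits required to encode samples from P when using a code optimized for Q instead of one optimized for P. The result is in bits when the logarithm is base 2 and in nats when it is the natural logarithm.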
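The "extra bits" reading can be checked with a small sketch. It assumes two illustrative distributions whose probabilities are powers of two, so that the ideal code length -log2(prob) for each symbol is a whole number of bits. The variable names are hypothetical.

```python
import math

# Illustrative distributions (assumed), chosen so code lengths are integers
p = [0.5, 0.25, 0.25]   # distribution the samples actually come from
q = [0.25, 0.25, 0.5]   # distribution the code was designed for

len_for_p = [-math.log2(pi) for pi in p]  # ideal lengths for P: 1, 2, 2 bits
len_for_q = [-math.log2(qi) for qi in q]  # ideal lengths for Q: 2, 2, 1 bits

# Average bits per symbol when the samples follow P
avg_with_p_code = sum(pi * n for pi, n in zip(p, len_for_p))  # H(P)
avg_with_q_code = sum(pi * n for pi, n in zip(p, len_for_q))  # H(P, Q)
extra_bits = avg_with_q_code - avg_with_p_code

# KL divergence in bits (log base 2)
kl_bits = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(avg_with_p_code, avg_with_q_code, extra_bits, kl_bits)
# prints: 1.5 1.75 0.25 0.25
```

The code built for Q costs 1.75 bits per symbol on average instead of 1.5, and the 0.25-bit overhead equals D_KL(P || Q) computed directly.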

Applications in ML

📝 Practice Exercises

Exercise 1: Calculate KL Divergence

Given two discrete distributions:

P = [0.5, 0.3, 0.2]  # True distribution
Q = [0.4, 0.4, 0.2]  # Approximation

Task: Calculate D_KL(P || Q). Show your work.

Hint: D_KL(P||Q) = Σ P(i) × log(P(i)/Q(i))

Exercise 2: Python Implementation

Complete the function to compute KL divergence:

import numpy as np

def kl_divergence(p, q):
    """
    Compute KL divergence D_KL(P || Q).

    p, q: arrays of probabilities (same length, each summing to 1)
    Returns: scalar KL divergence
    """
    # Your code here

# Test with:
# p = np.array([0.5, 0.3, 0.2])
# q = np.array([0.4, 0.4, 0.2])
# Expected: ~0.0253 nats (or ~0.0365 bits if log2)

Bonus: Add input validation to ensure p and q are valid probability distributions.

💡 Solutions

Exercise 1 Solution:

Using the natural logarithm:

D_KL = 0.5×log(0.5/0.4) + 0.3×log(0.3/0.4) + 0.2×log(0.2/0.2)
     = 0.5×log(1.25) + 0.3×log(0.75) + 0.2×log(1)
     = 0.5×0.2231 + 0.3×(-0.2877) + 0
     = 0.1116 - 0.0863
     ≈ 0.0253 (nats) or 0.0365 (bits)

Exercise 2 Solution:

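import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # Validation
    assert p.shape == q.shape, "P and Q must have the same shape"
    assert np.isclose(p.sum(), 1), "P must sum to 1"
    assert np.isclose(q.sum(), 1), "Q must sum to 1"
    assert (p >= 0).all() and (q >= 0).all(), "Probabilities must be non-negative"

    # Avoid division by zero by clipping Q away from 0.
    # Note: where Q(x) = 0 but P(x) > 0 the true divergence is infinite;
    # clipping returns a large finite value instead.
    q = np.clip(q, eps, 1)

    # Terms with P(x) = 0 contribute 0 (0 × log 0 is taken as 0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))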

🎯 Knowledge Check Quiz

Test your understanding of KL Divergence with these quick questions:

Question 1

What is the minimum possible value of KL divergence D_KL(P || Q)?

  • A) -∞
  • B) 0
  • C) 1
  • D) It depends on the distributions

Answer: B) 0 — KL divergence is always non-negative and equals 0 only when P = Q.

Question 2

Is KL divergence symmetric? That is, does D_KL(P || Q) = D_KL(Q || P)?

  • A) Yes, always
  • B) No, not in general
  • C) Only when P = Q
  • D) Only for Gaussian distributions

Answer: B) No. KL divergence is not symmetric in general: D_KL(P||Q) measures the "cost" of using Q to approximate P, which usually differs from the cost of using P to approximate Q. The two directions trivially agree when P = Q, where both are 0.
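To see the asymmetry concretely, this sketch evaluates both directions for the distributions from Exercise 1. The helper name `kl` is an assumption made for illustration.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

forward = kl(p, q)  # roughly 0.0253 nats
reverse = kl(q, p)  # roughly 0.0258 nats

print(forward, reverse, forward == reverse)
```

The two values are close here because P and Q are similar, but they are not equal. The gap typically grows as the distributions move further apart.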

Question 3

In a Variational Autoencoder (VAE), what role does KL divergence play?

  • A) It measures reconstruction error
  • B) It regularizes the latent distribution to match a prior
  • C) It computes the learning rate
  • D) It initializes the weights

Answer: B) It regularizes the latent distribution — the KL term ensures the learned latent distribution stays close to the prior (usually standard normal).
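For intuition, the KL term has a simple closed form under one common setup: a diagonal Gaussian latent distribution compared against a standard normal prior. The sketch below assumes that setup and a log-variance parameterization. The function name `gaussian_kl_to_standard_normal` is hypothetical, and the lesson does not specify which VAE variant it has in mind.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2),
    with sigma^2 = exp(log_var).
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# Latent distribution equal to the prior: no penalty
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Mean shifted by 1 in one dimension, unit variance: penalty of 0.5 nats
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

The penalty grows as the mean moves away from 0 or the variance moves away from 1, which is what pulls the latent distribution toward the prior.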

Question 4

What does KL divergence measure in information theory terms?

  • A) The entropy of distribution P
  • B) The mutual information between P and Q
  • C) The extra bits needed to encode P using a code optimized for Q
  • D) The correlation between P and Q

Answer: C) Extra bits needed — KL divergence quantifies the inefficiency of using Q to approximate P, measured in bits (or nats).

Question 5

In policy gradient methods like PPO, why is KL divergence used?

  • A) To increase exploration
  • B) To limit how far the policy can change in one update
  • C) To compute the reward function
  • D) To normalize the gradients

Answer: B) To limit policy changes — KL constraints prevent the policy from changing too drastically, ensuring stable training.
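As a rough illustration of the idea, the sketch below compares an old and a new policy over three discrete actions in a single state and flags updates whose KL divergence exceeds a threshold. The logits, the threshold `target_kl`, and the choice of direction D_KL(old || new) are all assumptions made for this example. Actual PPO implementations differ in how, and whether, they use a KL term.

```python
import math

def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """D_KL(P || Q) in nats; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical action logits for one state
old_policy = softmax([1.0, 0.5, 0.0])
small_step = softmax([1.1, 0.5, 0.0])  # slight change to the first logit
large_step = softmax([4.0, 0.5, 0.0])  # drastic change to the first logit

target_kl = 0.01  # hypothetical threshold, chosen only for this sketch

for name, new_policy in [("small", small_step), ("large", large_step)]:
    d = kl(old_policy, new_policy)
    status = "ok" if d <= target_kl else "too large"
    print(name, round(d, 4), status)
```

The small update stays under the threshold while the large one exceeds it by a wide margin, which is the kind of signal a KL penalty or early-stopping rule would act on.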