Lesson 9: KL Divergence

KL Divergence

The Kullback-Leibler (KL) divergence measures how much a probability distribution P differs from a second distribution Q defined over the same outcomes:

# D_KL(P || Q) = sum P(x) log(P(x)/Q(x))
#              = sum P(x) log P(x) - sum P(x) log Q(x)
#              = H(P,Q) - H(P)    (H(P,Q) is the cross-entropy, H(P) the entropy of P)

# Always ≥ 0
# = 0 iff P = Q
# Not symmetric in general: D_KL(P||Q) ≠ D_KL(Q||P)
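
The properties above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names (entropy, cross_entropy, kl) and the two example distributions are illustrative assumptions, not part of the lesson.

```python
import math

def entropy(p):
    """H(P) = -sum P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log Q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = sum P(x) log(P(x)/Q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]   # illustrative values
Q = [0.1, 0.3, 0.6]

forward = kl(P, Q)    # D_KL(P || Q)
reverse = kl(Q, P)    # D_KL(Q || P)

# Decomposition: D_KL(P || Q) = H(P, Q) - H(P)
decomposition_gap = abs(forward - (cross_entropy(P, Q) - entropy(P)))

print(f"D_KL(P||Q) = {forward:.4f} nats")
print(f"D_KL(Q||P) = {reverse:.4f} nats")
print(f"D_KL(P||P) = {kl(P, P):.4f} nats")
print(f"decomposition gap = {decomposition_gap:.2e}")
```

Both directions come out positive but different, D_KL(P||P) is zero, and the decomposition gap is at floating-point precision.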
      

Interpretation

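Extra bits needed: the average number of additional bits required to encode samples from P when using a code optimized for Q instead of one optimized for P. The unit is bits when the logarithm is base 2 and nats when it is the natural logarithm.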
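
The coding interpretation can be made concrete with a small sketch. It assumes an idealized code that gives a symbol of probability r a length of -log2(r) bits. The distributions are illustrative and chosen as powers of 1/2 so that every length is a whole number.

```python
import math

P = [0.5, 0.25, 0.25]   # true source distribution (illustrative)
Q = [0.25, 0.25, 0.5]   # distribution the code was designed for

# An ideal code assigns a symbol with probability r a length of -log2(r) bits.
len_code_p = [-math.log2(p) for p in P]   # code matched to P: 1, 2, 2 bits
len_code_q = [-math.log2(q) for q in Q]   # code matched to Q: 2, 2, 1 bits

# Average bits per symbol when the data actually come from P.
avg_bits_matched = sum(p * l for p, l in zip(P, len_code_p))      # entropy H(P)
avg_bits_mismatched = sum(p * l for p, l in zip(P, len_code_q))   # cross-entropy H(P, Q)

extra_bits = avg_bits_mismatched - avg_bits_matched
kl_bits = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(avg_bits_matched, avg_bits_mismatched, extra_bits, kl_bits)
```

Here the matched code costs 1.5 bits per symbol and the mismatched code 1.75, so the 0.25 extra bits equal D_KL(P || Q) computed with log base 2.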
      

Applications in ML

VAEs: Regularize latent distribution
Knowledge distillation: Match student to teacher
RL: Limit policy changes (TRPO, PPO)
Variational inference: Approximate posteriors
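
As one concrete instance of the list above, the VAE regularizer is commonly computed in closed form. This assumes the encoder outputs a diagonal Gaussian N(mu, sigma^2) and the prior is a standard normal, which gives KL = 0.5 × Σ (mu² + sigma² - 1 - log sigma²). The sketch below is a plain-Python illustration under those assumptions. The function name and the log-variance parametrization are hypothetical choices, not something the lesson specifies.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    mu, log_var: sequences of per-dimension means and log-variances.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# A latent distribution identical to the prior incurs no penalty.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 (unit variance) costs 0.5 nats.
kl_shifted = gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

print(kl_at_prior, kl_shifted)
```

The penalty grows as the means move away from 0 or the variances move away from 1, which is what pulls the latent distribution toward the prior.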

📝 Practice Exercises

Exercise 1: Calculate KL Divergence

Given two discrete distributions:

P = [0.5, 0.3, 0.2]  # True distribution
Q = [0.4, 0.4, 0.2]  # Approximation
        

Task: Calculate D_KL(P || Q). Show your work.

Hint: D_KL(P||Q) = Σ P(i) × log(P(i)/Q(i))

Exercise 2: Python Implementation

Complete the function to compute KL divergence:

import numpy as np

def kl_divergence(p, q):
    """
    Compute KL divergence D_KL(P || Q)
    p, q: arrays of probabilities (same length, sum to 1)
    Returns: scalar KL divergence (in nats if you use np.log)
    """
    # Your code here
    raise NotImplementedError
    
# Test with:
# p = np.array([0.5, 0.3, 0.2])
# q = np.array([0.4, 0.4, 0.2])
# Expected: ~0.0253 nats (or ~0.0365 bits if log2)
        

Bonus: Add input validation to ensure p and q are valid probability distributions.

💡 Solutions

Exercise 1 Solution:

D_KL = 0.5×log(0.5/0.4) + 0.3×log(0.3/0.4) + 0.2×log(0.2/0.2)
     = 0.5×log(1.25) + 0.3×log(0.75) + 0.2×log(1)
     = 0.5×0.2231 + 0.3×(-0.2877) + 0
     = 0.1116 - 0.0863
     ≈ 0.0253 nats (natural log), or 0.0253 / ln 2 ≈ 0.0365 bits
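
The hand calculation can be cross-checked in a few lines of standard-library Python. Differences in the last digit come from rounding the logarithms in the manual steps.

```python
import math

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

# Natural log gives nats; dividing by ln(2) converts to bits.
kl_nats = sum(p * math.log(p / q) for p, q in zip(P, Q))
kl_bits = kl_nats / math.log(2)

print(f"{kl_nats:.4f} nats, {kl_bits:.4f} bits")   # 0.0253 nats, 0.0365 bits
```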
        

Exercise 2 Solution:

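import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Validation
    assert p.shape == q.shape, "P and Q must have the same shape"
    assert (p >= 0).all() and (q >= 0).all(), "Probabilities must be non-negative"
    assert np.isclose(p.sum(), 1), "P must sum to 1"
    assert np.isclose(q.sum(), 1), "Q must sum to 1"
    # Clip q away from 0 so the log stays finite
    # (the true KL is infinite where q = 0 and p > 0)
    q = np.clip(q, eps, 1)
    # Terms with p = 0 contribute 0 by convention; skip them to avoid 0 * log(0) = nan
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))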
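
Clipping q keeps the result finite, but mathematically D_KL(P || Q) is infinite whenever Q(x) = 0 at a point where P(x) > 0, and outcomes with P(x) = 0 contribute nothing. The following is a minimal standard-library sketch of those conventions; kl_exact is a hypothetical helper name, not part of the exercise.

```python
import math

def kl_exact(p, q):
    """D_KL(P || Q) in nats, using the usual limiting conventions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf     # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

# A zero in P is harmless: that outcome is simply skipped.
zero_in_p = kl_exact([0.5, 0.5, 0.0], [0.25, 0.25, 0.5])

# A zero in Q where P has mass makes the divergence infinite.
zero_in_q = kl_exact([0.25, 0.25, 0.5], [0.5, 0.5, 0.0])

print(zero_in_p, zero_in_q)
```

Whether to return infinity or a large clipped value is a design choice. Clipping is convenient inside loss functions, while the exact convention is safer when the number itself is being reported.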
        

🎯 Knowledge Check Quiz

Test your understanding of KL Divergence with these quick questions:

Question 1

What is the minimum possible value of KL divergence D_KL(P || Q)?

A) -∞
B) 0
C) 1
D) It depends on the distributions

Answer: B) 0 — KL divergence is always non-negative and equals 0 only when P = Q.

Question 2

Is KL divergence symmetric? That is, does D_KL(P || Q) = D_KL(Q || P)?

A) Yes, always
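B) No, not in general
C) Only when P = Q
D) Only for Gaussian distributions

Answer: B) No. KL divergence is not symmetric in general. D_KL(P||Q) measures the "cost" of using Q to approximate P, which usually differs from the cost of using P to approximate Q. The two values coincide only for particular pairs, such as P = Q (both are 0) or P = [0.8, 0.2] and Q = [0.2, 0.8].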

Question 3

In a Variational Autoencoder (VAE), what role does KL divergence play?

A) It measures reconstruction error
B) It regularizes the latent distribution to match a prior
C) It computes the learning rate
D) It initializes the weights

Answer: B) It regularizes the latent distribution — the KL term ensures the learned latent distribution stays close to the prior (usually standard normal).

Question 4

What does KL divergence measure in information theory terms?

A) The entropy of distribution P
B) The mutual information between P and Q
C) The extra bits needed to encode P using a code optimized for Q
D) The correlation between P and Q

Answer: C) Extra bits needed — KL divergence quantifies the inefficiency of using Q to approximate P, measured in bits (or nats).

Question 5

In policy gradient methods like PPO, why is KL divergence used?

A) To increase exploration
B) To limit how far the policy can change in one update
C) To compute the reward function
D) To normalize the gradients

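Answer: B) To limit policy changes. Keeping the KL divergence between the old and new policies small (a hard constraint in TRPO, a penalty term or a clipped surrogate objective in PPO) prevents the policy from changing too drastically in one update, which stabilizes training.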