KL Divergence
The Kullback-Leibler (KL) divergence D_KL(P || Q) measures how a probability distribution P differs from a reference distribution Q.
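For discrete distributions P and Q over the same outcomes, the definition that matches the hint in Exercise 1 can be written as the first formula below. The second is the continuous form, which replaces the sum with an integral over densities p and q.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
\qquad
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
```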
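Interpretation
D_KL(P || Q) is the expected number of extra bits (or nats) needed to encode samples from P with a code optimized for Q. It is always non-negative, equals 0 only when P = Q, and is not symmetric in P and Q.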
Applications in ML
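- VAEs: Regularize the latent distribution toward a prior
- Knowledge distillation: Match the student's output distribution to the teacher's
- RL: Limit how far the policy changes per update (TRPO, PPO)
- Variational inference: Fit a tractable distribution to an intractable posterior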
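As one concrete instance of the VAE item, the KL term has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. The sketch below uses only the standard library. The helper name `gaussian_kl_to_standard_normal` and the inputs `mu` (mean vector) and `log_var` (log-variance vector) are hypothetical names chosen for illustration, not any particular library's API.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats (hypothetical helper)."""
    total = 0.0
    for m, lv in zip(mu, log_var):
        # Per-dimension closed form: 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2))
        total += 0.5 * (math.exp(lv) + m * m - 1.0 - lv)
    return total

# A posterior equal to the prior gives zero divergence.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting one mean by 1 with unit variance costs 0.5 nats.
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

In a VAE loss this quantity would typically be added to the reconstruction term, so a larger value means the encoder's output has drifted further from the prior.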
📝 Practice Exercises
Exercise 1: Calculate KL Divergence
Given two discrete distributions:
Task: Calculate D_KL(P || Q). Show your work.
Hint: D_KL(P||Q) = Σ P(i) × log(P(i)/Q(i))
Exercise 2: Python Implementation
Complete the function to compute KL divergence:
Bonus: Add input validation to ensure p and q are valid probability distributions.
💡 Solutions
Exercise 1 Solution:
Exercise 2 Solution:
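One possible solution is sketched below. It assumes `p` and `q` are equal-length sequences of floats, uses the natural logarithm (so the result is in nats), and includes the bonus validation. The function name `kl_divergence` and the tolerance parameter `tol` are illustrative choices rather than part of the exercise statement.

```python
import math

def kl_divergence(p, q, tol=1e-9):
    """Return D_KL(p || q) in nats for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Bonus: check that each input is a valid probability distribution.
    for name, dist in (("p", p), ("q", q)):
        if any(x < 0 for x in dist):
            raise ValueError(f"{name} has a negative probability")
        if abs(sum(dist) - 1.0) > tol:
            raise ValueError(f"{name} does not sum to 1")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))               # 0.0
print(round(kl_divergence([0.5, 0.5], [0.25, 0.75]), 4))   # 0.1438
```

Returning infinity when Q assigns zero probability to an outcome that P can produce follows the usual convention. Smoothing `q` with a small epsilon is a common alternative, but it changes the value being computed.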
🎯 Knowledge Check Quiz
Test your understanding of KL Divergence with these quick questions:
Question 1
What is the minimum possible value of KL divergence D_KL(P || Q)?
- A) -∞
- B) 0
- C) 1
- D) It depends on the distributions
Answer: B) 0 — KL divergence is always non-negative and equals 0 only when P = Q.
Question 2
Is KL divergence symmetric? That is, does D_KL(P || Q) = D_KL(Q || P)?
- A) Yes, always
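- B) No, not in general
- C) Only when P = Q
- D) Only for Gaussian distributions
Answer: B) No, not in general. KL divergence is not symmetric: D_KL(P || Q) measures the "cost" of using Q to approximate P, which generally differs from the cost of using P to approximate Q. The two directions coincide only in special cases, such as P = Q, where both are 0.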
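The asymmetry is easy to check numerically. The sketch below uses two hypothetical distributions chosen only for illustration, and a small helper `kl` that assumes all probabilities are strictly positive.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats, assuming strictly positive probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.9, 0.1]  # hypothetical skewed distribution
q = [0.5, 0.5]  # hypothetical uniform distribution

forward = kl(p, q)
reverse = kl(q, p)
print(f"{forward:.4f}")  # 0.3681
print(f"{reverse:.4f}")  # 0.5108
```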
Question 3
In a Variational Autoencoder (VAE), what role does KL divergence play?
- A) It measures reconstruction error
- B) It regularizes the latent distribution to match a prior
- C) It computes the learning rate
- D) It initializes the weights
Answer: B) It regularizes the latent distribution. The KL term penalizes the learned latent distribution for drifting away from the prior (usually a standard normal).
Question 4
What does KL divergence measure in information theory terms?
- A) The entropy of distribution P
- B) The mutual information between P and Q
- C) The expected extra bits needed to encode samples from P using a code optimized for Q
- D) The correlation between P and Q
Answer: C) Expected extra bits. KL divergence quantifies the average coding overhead of using a code built for Q when the data actually follow P, measured in bits with base-2 logarithms or in nats with natural logarithms.
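The coding interpretation can be sketched as cross-entropy minus entropy, using base-2 logarithms so the result is in bits. The distributions and helper names below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import math

def entropy_bits(p):
    """Average code length (bits) of an optimal code for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average code length (bits) when data follow p but the code is built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # hypothetical true distribution
q = [0.25, 0.25, 0.5]  # hypothetical model distribution

extra_bits = cross_entropy_bits(p, q) - entropy_bits(p)
print(entropy_bits(p))           # 1.5
print(cross_entropy_bits(p, q))  # 1.75
print(extra_bits)                # 0.25
```

Here the 0.25 extra bits per symbol equal D_KL(P || Q) computed with log base 2.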
Question 5
In policy gradient methods like PPO, why is KL divergence used?
- A) To increase exploration
- B) To limit how far the policy can change in one update
- C) To compute the reward function
- D) To normalize the gradients
Answer: B) To limit policy changes. A KL constraint (TRPO) or a KL penalty (one PPO variant) keeps the updated policy close to the previous one, which helps keep training stable.