RLHF Overview
Reinforcement Learning from Human Feedback (RLHF) proceeds in three stages:
- Collect comparisons: Humans rank model outputs
- Train reward model: Predict human preferences
- Optimize policy: Use PPO to maximize reward
The Reward Model
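The reward model assigns a scalar score r(y) to a response and is trained on pairwise preferences with the Bradley-Terry loss, -log σ(r(y_w) - r(y_l)), where y_w is the preferred response and y_l the rejected one. The following is a minimal sketch of that loss for a single comparison, assuming the two scores are already available as plain floats; the function name and example scores are illustrative, not taken from any particular library.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one comparison: -log(sigmoid(r(y_w) - r(y_l)))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m). Writing it as
    # max(-m, 0) + log1p(exp(-|m|)) avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores: the loss is small when the preferred response
# already scores higher, and large when the order is reversed.
print(bradley_terry_loss(2.0, 0.0))
print(bradley_terry_loss(0.0, 2.0))
```

When both responses receive the same score, the loss equals log 2, which corresponds to the model assigning a 50% chance to either ordering.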
PPO Training
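In this stage the policy is optimized to score well under the reward model while staying close to a reference model. One common way to write that objective is shown below. The notation is an assumption made for illustration: π_θ is the policy being trained, π_ref the frozen reference model, r_φ the reward model, D the prompt distribution, and β the KL penalty coefficient.

```latex
J(\theta) =
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]
  \;-\; \beta \, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```

A larger β keeps the policy closer to the reference model; a smaller β lets it chase reward more aggressively.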
📝 Exercises
Exercise 1: Reward Model Training
Given three responses to the prompt "Explain quantum computing in simple terms", human annotators ranked them as follows:
- Response A: Ranked 1st (most preferred) - Clear analogy, accurate concepts
- Response B: Ranked 2nd - Accurate but overly technical
- Response C: Ranked 3rd (least preferred) - Contains factual errors
Task: Write the loss function terms for training the reward model on these comparisons. How many pairwise comparison terms are needed?
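As a starting point, a full ranking can be expanded into every (preferred, rejected) pair, with one Bradley-Terry term per pair. The sketch below enumerates those pairs for a ranking listed from best to worst; the helper name is hypothetical and the terms are printed as text rather than evaluated.

```python
from itertools import combinations


def pairwise_terms(ranking):
    """Return one loss term per (winner, loser) pair.

    `ranking` is ordered from most to least preferred, so in every pair
    produced by combinations() the first item outranks the second.
    """
    return [
        f"-log sigmoid(r({winner}) - r({loser}))"
        for winner, loser in combinations(ranking, 2)
    ]


terms = pairwise_terms(["A", "B", "C"])
for term in terms:
    print(term)
print(len(terms), "pairwise terms")
```

In general a ranking of K responses yields K(K-1)/2 pairs, so the three responses here give three terms: A over B, A over C, and B over C.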
Exercise 2: KL Divergence Penalty
During PPO training, the KL penalty coefficient β = 0.1. The model generates a response with:
- Reward from reward model: 2.5
- KL divergence from reference model: 3.0
Task: Calculate the final objective value. Should β be increased or decreased if the model is becoming too repetitive?
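The arithmetic can be checked with a short sketch, assuming the objective for a single response takes the common form reward minus β times the KL divergence (the function name is illustrative):

```python
def kl_penalized_objective(reward: float, kl: float, beta: float) -> float:
    """Per-response objective: reward minus the scaled KL penalty."""
    return reward - beta * kl


value = kl_penalized_objective(reward=2.5, kl=3.0, beta=0.1)
print(f"{value:.2f}")  # 2.20
```

If repetition is a sign that the policy has drifted away from the reference model, increasing β is the usual direction to try, since it penalizes that drift more heavily.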
Exercise 3: PPO Clip Objective Implementation
Implement the PPO clipped surrogate objective function used during RLHF training:
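One possible reference sketch is given below in plain Python. It assumes per-sample log-probabilities under the new and old policies and precomputed advantage estimates are supplied as equal-length, non-empty lists; the function and parameter names, and the default epsilon of 0.2, are illustrative choices rather than values from this lesson.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (to be maximized).

    For each sample: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s) = exp(logp_new - logp_old).
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum removes any incentive to push the ratio
        # outside the [1 - eps, 1 + eps] band in the favorable direction.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Hypothetical batch: the first sample's ratio exceeds 1 + epsilon with a
# positive advantage, so its contribution is capped at 1.2 * advantage.
print(ppo_clip_objective([0.5, -0.1], [0.0, 0.0], [1.0, -0.5]))
```

In a training loop the negative of this value would typically be used as the loss, since optimizers minimize.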
Exercise 4: Reward Hacking Detection
A model discovers that using the word "fantastic" in responses increases the reward score, regardless of content quality. This is reward hacking.
Task: Design a simple detection mechanism that compares the frequency of "fantastic" in RLHF-tuned outputs versus the base model outputs. Write a function that flags potential reward hacking.
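A minimal sketch of such a check follows. It compares how often the word appears per 1,000 tokens in the two sets of outputs and flags a large increase. The tokenization, the smoothing constant, and the thresholds are all assumptions chosen for illustration and would need tuning on real data.

```python
import re


def word_rate(texts, word):
    """Occurrences of `word` per 1,000 tokens across a list of strings."""
    tokens = [
        token
        for text in texts
        for token in re.findall(r"[a-z']+", text.lower())
    ]
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(word.lower()) / len(tokens)


def flag_reward_hacking(tuned_outputs, base_outputs, word="fantastic",
                        ratio_threshold=3.0, min_rate=1.0):
    """Flag when the tuned model uses `word` far more than the base model."""
    tuned_rate = word_rate(tuned_outputs, word)
    base_rate = word_rate(base_outputs, word)
    # Additive smoothing (assumed value) avoids division by zero when the
    # base model never uses the word.
    ratio = (tuned_rate + 0.1) / (base_rate + 0.1)
    flagged = tuned_rate >= min_rate and ratio >= ratio_threshold
    return {"tuned_rate": tuned_rate, "base_rate": base_rate,
            "ratio": ratio, "flagged": flagged}


# Hypothetical sample outputs.
tuned = ["This is fantastic, a fantastic answer.", "Fantastic!"]
base = ["This is a good answer.", "Fine."]
print(flag_reward_hacking(tuned, base))
```

A flag from a check like this is only a prompt for manual review: a frequency shift can also come from legitimate style changes, and it says nothing about whether content quality actually dropped.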
🎯 Quiz
Question 1
What is the primary purpose of the reward model in RLHF?
- To generate text completions directly
- To predict human preference scores for model outputs
- To calculate the KL divergence penalty
- To fine-tune the base model weights
Question 2
Why is the KL divergence penalty (β) important in PPO training?
- It speeds up training convergence
- It prevents the model from diverging too far from the original model
- It increases the reward model's accuracy
- It reduces memory requirements
Question 3
In the Bradley-Terry loss for reward models, what do y_w and y_l represent?
- Winner and loser responses from human rankings
- Weights and learning rate
- Yes and no labels
- Young and old model versions
Question 4
What happens if the KL penalty coefficient β is set too low during PPO?
- Training becomes slower
- The model may overfit to the reward model and produce incoherent text
- The model becomes more helpful
- Memory usage decreases significantly
🔑 Key Takeaways
- RLHF Pipeline: Consists of three stages: collect human comparisons, train a reward model to predict preferences, then optimize the policy with PPO to maximize predicted reward.
- Reward Model: Trained on pairwise human preferences using the Bradley-Terry loss, -log σ(r(y_w) - r(y_l)), where y_w is the preferred response and y_l the rejected one. It outputs a scalar score estimating how strongly humans would prefer a response.
- KL Divergence Penalty: The β coefficient keeps the model from drifting too far from the reference model, which helps limit reward hacking and degenerate outputs.
- PPO Clip: Limits the size of each policy update to prevent destructively large changes. The clipped surrogate objective keeps training stable.
- Reward Hacking: Models may exploit spurious patterns in the reward model. Monitoring output distributions and using KL penalties helps mitigate this.