🚧 Lesson 5 of 35 in Level 04

RLHF

Reinforcement Learning from Human Feedback (RLHF): the fine-tuning method used to align ChatGPT with human preferences.

RLHF Overview

RLHF has three stages:

  1. Collect comparisons: Humans rank model outputs
  2. Train reward model: Predict human preferences
  3. Optimize policy: Use PPO to maximize reward

The Reward Model

# Train reward model on human comparisons
Input:  prompt + response
Output: scalar reward (higher = better)
Loss:   -log σ(reward(x, y_w) - reward(x, y_l))
# y_w = preferred, y_l = less preferred
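The pairwise loss above can be sketched in plain Python. This is a minimal illustration, not a training loop: the function name `pairwise_loss` and the example reward values are assumptions made for this sketch, and in practice the two rewards would come from a neural reward model scoring the same prompt with two different responses.

```python
import math

def pairwise_loss(reward_preferred, reward_rejected):
    """Bradley-Terry loss for one comparison: -log σ(r_w - r_l).

    Hypothetical helper for illustration. Uses the identity
    -log σ(m) = log(1 + e^(-m)) in a form that avoids overflow.
    """
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up reward values: the loss is small when the preferred response
# already scores higher, and large when the ordering is wrong.
print(pairwise_loss(2.0, 0.5))  # preferred response scored higher
print(pairwise_loss(0.5, 2.0))  # preferred response scored lower
print(pairwise_loss(1.0, 1.0))  # equal scores give log(2)
```

When the two rewards are equal the loss is log 2 (about 0.693), and it shrinks toward zero as the preferred response's margin grows.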

PPO Training

# Maximize reward while staying close to original model
objective = reward - β * KL(model || original)
# Prevents model from gaming the reward or diverging
Result: When the reward model captures annotator preferences well, the tuned model's responses tend to be more helpful, truthful, and aligned with human intent.
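As a rough sketch of the KL-penalized objective, the snippet below estimates the KL term from the log-probabilities that the tuned policy and the original (reference) model assign to the tokens of one sampled response. The function name, the per-token log-probability lists, and the single-sample KL estimate are all assumptions made for illustration; the lesson itself specifies only the high-level formula.

```python
def kl_penalized_objective(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Return (objective, kl_estimate) for one sampled response.

    Hypothetical helper. The KL estimate is the sum over tokens of
    log π(token) - log π_original(token), with tokens sampled from the policy.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate, kl_estimate

# Made-up numbers: the policy assigns higher probability than the
# original model to each of the three tokens it generated.
policy_lp = [-0.5, -1.0, -0.25]
reference_lp = [-1.0, -1.5, -0.75]

objective, kl = kl_penalized_objective(2.0, policy_lp, reference_lp, beta=0.1)
print(f"KL estimate: {kl:.2f}, objective: {objective:.2f}")
```

With these values the KL estimate is 1.5, so the penalty removes 0.15 from the reward of 2.0. A larger β would make the same drift cost more.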

📝 Exercises

Exercise 1: Reward Model Training

Given three responses to the prompt "Explain quantum computing in simple terms", human annotators ranked them as follows:

Task: Write the loss function terms for training the reward model on these comparisons. How many pairwise comparison terms are needed?

Hint: Use the Bradley-Terry model. Each pair (winner, loser) contributes one term: -log σ(r(winner) - r(loser))
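One way to see how a ranking turns into pairwise terms is to enumerate the (winner, loser) pairs directly. The sketch below assumes a strict ranking with no ties, and the labels "A", "B", "C" are placeholders, not the actual annotator ranking for this exercise.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (winner, loser) pairs.

    Hypothetical helper. combinations() preserves input order, so the
    first element of each pair is always the higher-ranked response.
    """
    return list(combinations(ranking, 2))

# Placeholder labels, ordered best to worst.
pairs = ranking_to_pairs(["A", "B", "C"])
for winner, loser in pairs:
    print(f"-log σ(r({winner}) - r({loser}))")
print(f"{len(pairs)} pairwise terms")
```

A ranking of n responses yields n(n-1)/2 pairs, so three responses give three loss terms.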

Exercise 2: KL Divergence Penalty

During PPO training, the KL penalty coefficient β = 0.1. The model generates a response with:

Task: Calculate the final objective value. Should β be increased or decreased if the model is becoming too repetitive?

# Solution template
objective = reward - β * KL
# Calculate and interpret the result

Exercise 3: PPO Clip Objective Implementation

Implement the PPO clipped surrogate objective function used during RLHF training:

def ppo_clip_loss(ratio, advantages, epsilon=0.2):
    """
    Calculate PPO clipped surrogate loss.

    Args:
        ratio: π_θ(a|s) / π_θ_old(a|s), the probability ratio
        advantages: A(s,a), the advantage estimates
        epsilon: clipping parameter (typically 0.1 or 0.2)

    Returns:
        Clipped surrogate objective (to be maximized)
    """
    # Your implementation here
    pass

# Test with example values
ratio = 1.3      # New policy is 30% more likely
advantage = 2.0  # Positive advantage (good action)
epsilon = 0.2
print(f"PPO Loss: {ppo_clip_loss(ratio, advantage, epsilon)}")
Expected Output: The clipped objective should limit how far the policy can update in a single step, preventing destructive large updates.
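For checking your work, here is one possible scalar sketch of the clipped surrogate, min(ratio * A, clip(ratio, 1 - ε, 1 + ε) * A). It uses a different name (`ppo_clip_objective_sketch`, chosen for this example) and handles a single ratio and advantage only; a real implementation would typically operate on batches of tensors and average the result.

```python
def ppo_clip_objective_sketch(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one (ratio, advantage) pair.

    Illustrative scalar version only; to be maximized.
    """
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

# Same values as the exercise's test: the unclipped term is 2.6,
# the clipped term is 1.2 * 2.0, and the minimum of the two is returned.
print(ppo_clip_objective_sketch(1.3, 2.0, 0.2))

# With a negative advantage the unclipped term is the smaller one,
# so clipping does not soften the penalty.
print(ppo_clip_objective_sketch(1.3, -2.0, 0.2))
```

For the exercise's values the result is about 2.4 rather than 2.6: once the ratio passes 1 + ε, pushing it higher earns no additional objective.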

Exercise 4: Reward Hacking Detection

A model discovers that using the word "fantastic" in responses increases the reward score, regardless of content quality. This is reward hacking.

Task: Design a simple detection mechanism that compares the frequency of "fantastic" in RLHF-tuned outputs versus the base model outputs. Write a function that flags potential reward hacking.

def detect_reward_hacking(rlhf_outputs, base_outputs, threshold=2.0):
    """
    Detect potential reward hacking by comparing word frequencies.

    Args:
        rlhf_outputs: List of strings from RLHF-tuned model
        base_outputs: List of strings from base model
        threshold: Flag if ratio exceeds this value

    Returns:
        dict with 'is_hacking', 'rlhf_freq', 'base_freq', 'ratio'
    """
    # Your implementation here
    pass

# Example test
rlhf = ["This is fantastic!", "Fantastic solution!", "A fantastic approach"]
base = ["This is good.", "Nice solution!", "A valid approach"]
result = detect_reward_hacking(rlhf, base)
print(f"Potential reward hacking detected: {result['is_hacking']}")
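One possible approach, sketched under a few assumptions: frequency is measured as occurrences per token, matching is case-insensitive, and a word that never appears in the base outputs but does appear in the RLHF outputs is treated as an infinite ratio. The names `word_frequency` and `detect_reward_hacking_sketch`, and the extra `word` parameter, are introduced here for illustration and are not part of the exercise's required signature.

```python
import re

def word_frequency(outputs, word):
    """Occurrences of `word` per token across a list of strings."""
    tokens = [t for text in outputs for t in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens)

def detect_reward_hacking_sketch(rlhf_outputs, base_outputs,
                                 threshold=2.0, word="fantastic"):
    """Flag `word` if it is much more frequent after RLHF tuning."""
    rlhf_freq = word_frequency(rlhf_outputs, word)
    base_freq = word_frequency(base_outputs, word)
    if base_freq == 0.0:
        # Assumption: an unseen base word counts as an unbounded increase.
        ratio = float("inf") if rlhf_freq > 0.0 else 1.0
    else:
        ratio = rlhf_freq / base_freq
    return {
        "is_hacking": ratio > threshold,
        "rlhf_freq": rlhf_freq,
        "base_freq": base_freq,
        "ratio": ratio,
    }

rlhf = ["This is fantastic!", "Fantastic solution!", "A fantastic approach"]
base = ["This is good.", "Nice solution!", "A valid approach"]
print(detect_reward_hacking_sketch(rlhf, base))
```

On the exercise's example, "fantastic" accounts for 3 of 8 tokens in the RLHF outputs and none in the base outputs, so the sketch flags it. A single-word check like this is only a starting point; a frequency shift alone does not prove the reward model is being exploited.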

🎯 Quiz

Question 1

What is the primary purpose of the reward model in RLHF?

  1. To generate text completions directly
  2. To predict human preference scores for model outputs
  3. To calculate the KL divergence penalty
  4. To fine-tune the base model weights
Answer: 2 — The reward model is trained on human comparison data to predict how much humans would prefer a given response, outputting a scalar reward value.

Question 2

Why is the KL divergence penalty (weighted by β) important in PPO training?

  1. It speeds up training convergence
  2. It prevents the model from diverging too far from the original model
  3. It increases the reward model's accuracy
  4. It reduces memory requirements
Answer: 2 — The KL penalty keeps the optimized model close to the reference model, preventing it from gaming the reward or producing degenerate outputs.

Question 3

In the Bradley-Terry loss for reward models, what do y_w and y_l represent?

  1. Winner and loser responses from human rankings
  2. Weights and learning rate
  3. Yes and no labels
  4. Young and old model versions
Answer: 1 — y_w is the preferred (winner) response and y_l is the less preferred (loser) response, based on human annotator rankings.

Question 4

What happens if the KL penalty coefficient β is set too low during PPO?

  1. Training becomes slower
  2. The model may overfit to the reward model and produce incoherent text
  3. The model becomes more helpful
  4. Memory usage decreases significantly
Answer: 2 — A low β allows the model to drift too far from the original, potentially gaming the reward model or generating nonsensical high-reward outputs.

🔑 Key Takeaways

  • RLHF Pipeline: Consists of three stages—collect human comparisons, train a reward model to predict preferences, then optimize the policy using PPO to maximize predicted rewards.
  • Reward Model: Trained on pairwise human preferences using the Bradley-Terry loss: -log σ(r(y_w) - r(y_l)). It outputs scalar scores estimating how much humans would prefer a response.
  • KL Divergence Penalty: The KL term, scaled by the coefficient β, penalizes the model for drifting too far from the reference model, which limits reward hacking and degenerate outputs.
  • PPO Clip: Clips the probability ratio to [1 - ε, 1 + ε] so that a single update cannot move the policy too far. This clipped surrogate objective keeps training stable.
  • Reward Hacking: Models may exploit spurious patterns in the reward model. Monitoring output distributions and using KL penalties helps mitigate this.