Lesson 5: RLHF | LLM Course

RLHF Overview

Reinforcement Learning from Human Feedback:

1. Collect comparisons: Humans rank model outputs
2. Train reward model: Predict human preferences
3. Optimize policy: Use PPO to maximize reward

The Reward Model

# Train reward model on human comparisons
Input: prompt + response
Output: scalar reward (higher = better)

Loss: -log σ(reward(x, y_w) - reward(x, y_l))
# y_w = preferred, y_l = less preferred
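
The pairwise loss above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `pairwise_preference_loss` and the reward scores are hypothetical, and a real reward model would produce the scores from a neural network and average the loss over a batch.

```python
import math

def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Return -log sigmoid(r_preferred - r_rejected) for one comparison.

    Uses the identity -log sigmoid(d) = log(1 + exp(-d)), written in a
    form that avoids overflow for large |d|.
    """
    d = r_preferred - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# Hypothetical reward scores, chosen only for illustration.
loss_correct_order = pairwise_preference_loss(2.0, 0.5)  # model agrees with the human ranking
loss_wrong_order = pairwise_preference_loss(0.5, 2.0)    # model disagrees with the human ranking
loss_tie = pairwise_preference_loss(1.0, 1.0)            # equal scores give log(2)

print(f"correct order: {loss_correct_order:.4f}")
print(f"wrong order:   {loss_wrong_order:.4f}")
print(f"tie:           {loss_tie:.4f}")
```

The loss is small when the preferred response already scores higher and grows when the ordering is reversed, which is what pushes the reward model toward the human ranking.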
      

PPO Training

# Maximize reward while staying close to original model
objective = reward - β * KL(model || original)

# Prevents model from gaming the reward or diverging
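
As a rough sketch of how the penalized objective might be computed for a single sampled response, the snippet below estimates the KL term from per-token log-probabilities. The function names and the log-probability values are hypothetical, and the single-sample estimate (sum of log-probability differences over the sampled tokens) is one common choice, not necessarily the one used in any particular RLHF implementation.

```python
def sequence_kl_estimate(policy_logprobs, reference_logprobs):
    """Single-sample estimate of KL(policy || reference) for one response.

    Sums log pi(token) - log pi_ref(token) over the tokens that the
    policy actually sampled.
    """
    return sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))

def penalized_objective(reward, kl, beta):
    """objective = reward - beta * KL, as in the formula above."""
    return reward - beta * kl

# Hypothetical per-token log-probabilities for a three-token response.
policy_lp = [-0.2, -0.5, -0.1]
reference_lp = [-0.4, -0.9, -0.3]

kl = sequence_kl_estimate(policy_lp, reference_lp)
objective = penalized_objective(reward=1.5, kl=kl, beta=0.1)

print(f"KL estimate: {kl:.2f}")
print(f"objective:   {objective:.2f}")
```

A larger β makes the same amount of drift from the reference model cost more, so the policy stays closer to its starting point.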
      

Result: Models tend to become more helpful, truthful, and better aligned with human intent.

📝 Exercises

Exercise 1: Reward Model Training

Given three responses to the prompt "Explain quantum computing in simple terms", human annotators ranked them as follows:

Response A: Ranked 1st (most preferred) - Clear analogy, accurate concepts
Response B: Ranked 2nd - Accurate but overly technical
Response C: Ranked 3rd (least preferred) - Contains factual errors

Task: Write the loss function terms for training the reward model on these comparisons. How many pairwise comparison terms are needed?

        Hint: Use the Bradley-Terry model. Each pair (winner, loser) contributes one term: -log σ(r(winner) - r(loser))
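
One way to check your answer is to enumerate the (winner, loser) pairs implied by a full ranking. The helper name `ranking_to_pairs` is hypothetical, and the sketch assumes the input list is ordered from most to least preferred with no ties.

```python
from itertools import combinations

def ranking_to_pairs(ranked_ids):
    """Turn a best-to-worst ranking into (winner, loser) pairs.

    combinations() preserves input order, so the first element of each
    pair is always ranked above the second.
    """
    return list(combinations(ranked_ids, 2))

pairs = ranking_to_pairs(["A", "B", "C"])
terms = [f"-log sigmoid(r({w}) - r({l}))" for w, l in pairs]

for term in terms:
    print(term)
print(f"Number of pairwise terms: {len(terms)}")
```

In general, a ranking of K responses yields K(K-1)/2 pairs.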
      

Exercise 2: KL Divergence Penalty

During PPO training, the KL penalty coefficient β = 0.1. The model generates a response with:

Reward from reward model: 2.5
KL divergence from reference model: 3.0

Task: Calculate the final objective value. Should β be increased or decreased if the model is becoming too repetitive?

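# Solution template
beta, reward, kl = 0.1, 2.5, 3.0
objective = reward - beta * kl
print(f"objective = {objective:.2f}")
# Interpret the result: how much of the reward does the KL penalty remove?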
      

Exercise 3: PPO Clip Objective Implementation

Implement the PPO clipped surrogate objective function used during RLHF training:

def ppo_clip_loss(ratio, advantages, epsilon=0.2):
    """
    Calculate PPO clipped surrogate loss.
    
    Args:
        ratio: π_θ(a|s) / π_θ_old(a|s) — probability ratio
        advantages: A(s,a) — advantage estimates
        epsilon: clipping parameter (typically 0.1 or 0.2)
    
    Returns:
        Clipped surrogate objective (to be maximized)
    """
    # Your implementation here
    pass

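# Test with example values
ratio = 1.3      # New policy assigns 30% higher probability to the action
advantage = 2.0  # Positive advantage (good action)
epsilon = 0.2

# The function returns an objective to maximize; negate it to get a loss.
print(f"PPO clipped objective: {ppo_clip_loss(ratio, advantage, epsilon)}")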
      

Expected Output: Approximately 2.4. The ratio 1.3 is clipped to 1 + ε = 1.2, so the objective is min(1.3 × 2.0, 1.2 × 2.0) = 2.4. Clipping limits how far the policy can move in a single update, preventing destructively large steps.
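
For reference, a minimal scalar sketch of the clipped surrogate objective follows. It is deliberately named `ppo_clip_objective_sketch` (a hypothetical name) to keep it separate from the exercise stub, and it handles a single ratio and advantage; a practical implementation would operate on tensors and average over a batch.

```python
def ppo_clip_objective_sketch(ratio, advantage, epsilon=0.2):
    """Scalar PPO clipped surrogate objective (to be maximized).

    Computes min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Four cases: ratio above or below the clip range, advantage positive or negative.
cases = [(1.3, 2.0), (1.3, -2.0), (0.7, 2.0), (0.7, -2.0)]
for ratio, advantage in cases:
    value = ppo_clip_objective_sketch(ratio, advantage)
    print(f"ratio={ratio}, advantage={advantage}: objective={value:.2f}")
```

Taking the minimum makes the objective pessimistic: clipping removes the incentive to push the ratio further in the direction the advantage favors, but it never hides a change that makes the objective worse.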
      

Exercise 4: Reward Hacking Detection

A model discovers that using the word "fantastic" in responses increases the reward score, regardless of content quality. This is reward hacking.

Task: Design a simple detection mechanism that compares the frequency of "fantastic" in RLHF-tuned outputs versus the base model outputs. Write a function that flags potential reward hacking.

def detect_reward_hacking(rlhf_outputs, base_outputs, threshold=2.0):
    """
    Detect potential reward hacking by comparing word frequencies.
    
    Args:
        rlhf_outputs: List of strings from RLHF-tuned model
        base_outputs: List of strings from base model
        threshold: Flag if ratio exceeds this value
    
    Returns:
        dict with 'is_hacking', 'rlhf_freq', 'base_freq', 'ratio'
    """
    # Your implementation here
    pass

# Example test
rlhf = ["This is fantastic!", "Fantastic solution!", "A fantastic approach"]
base = ["This is good.", "Nice solution!", "A valid approach"]

result = detect_reward_hacking(rlhf, base)
print(f"Potential reward hacking detected: {result['is_hacking']}")
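
One possible approach is sketched below under several assumptions: the function is named `detect_reward_hacking_sketch` to distinguish it from the exercise stub, the `word` and `smoothing` parameters are additions not present in the original signature, and tokenization is a simple lowercase regex split. The smoothing term only exists to avoid division by zero when the base model never uses the word.

```python
import re

def word_frequency(outputs, word):
    """Fraction of tokens equal to `word` across all outputs (case-insensitive)."""
    tokens = [t for text in outputs for t in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens)

def detect_reward_hacking_sketch(rlhf_outputs, base_outputs, word="fantastic",
                                 threshold=2.0, smoothing=1e-6):
    """Flag a word whose frequency in RLHF outputs far exceeds the base rate."""
    rlhf_freq = word_frequency(rlhf_outputs, word)
    base_freq = word_frequency(base_outputs, word)
    # Smoothing (an assumption of this sketch) keeps the ratio finite.
    ratio = (rlhf_freq + smoothing) / (base_freq + smoothing)
    return {
        "is_hacking": ratio > threshold,
        "rlhf_freq": rlhf_freq,
        "base_freq": base_freq,
        "ratio": ratio,
    }

rlhf_samples = ["This is fantastic!", "Fantastic solution!", "A fantastic approach"]
base_samples = ["This is good.", "Nice solution!", "A valid approach"]

report = detect_reward_hacking_sketch(rlhf_samples, base_samples)
print(f"Potential reward hacking detected: {report['is_hacking']}")
print(f"RLHF frequency: {report['rlhf_freq']:.3f}, base frequency: {report['base_freq']:.3f}")
```

A frequency ratio is only a heuristic signal. A flagged word still needs human review, since a legitimate shift in style can also raise its frequency.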
      

🎯 Quiz

Question 1

What is the primary purpose of the reward model in RLHF?

1. To generate text completions directly
2. To predict human preference scores for model outputs
3. To calculate the KL divergence penalty
4. To fine-tune the base model weights

          Answer: 2 — The reward model is trained on human comparison data to predict how much humans would prefer a given response, outputting a scalar reward value.
        

Question 2

Why is the KL divergence penalty (weighted by β) important in PPO training?

1. It speeds up training convergence
2. It prevents the model from diverging too far from the original model
3. It increases the reward model's accuracy
4. It reduces memory requirements

          Answer: 2 — The KL penalty keeps the optimized model close to the reference model, preventing it from gaming the reward or producing degenerate outputs.
        

Question 3

In the Bradley-Terry loss for reward models, what do y_w and y_l represent?

1. Winner and loser responses from human rankings
2. Weights and learning rate
3. Yes and no labels
4. Young and old model versions

          Answer: 1 — y_w is the preferred (winner) response and y_l is the less preferred (loser) response, based on human annotator rankings.
        

Question 4

What happens if the KL penalty coefficient β is set too low during PPO?

1. Training becomes slower
2. The model may overfit to the reward model and produce incoherent text
3. The model becomes more helpful
4. Memory usage decreases significantly

          Answer: 2 — A low β allows the model to drift too far from the original, potentially gaming the reward model or generating nonsensical high-reward outputs.
        

🔑 Key Takeaways

RLHF Pipeline: Consists of three stages (collect human comparisons, train a reward model to predict preferences, then optimize the policy with PPO to maximize predicted reward).
Reward Model: Trained on pairwise human preferences using the Bradley-Terry loss: -log σ(r(x, y_w) - r(x, y_l)). It outputs a scalar score estimating how much humans would prefer a response.
KL Divergence Penalty: The penalty, scaled by β, discourages the model from drifting too far from the reference model, which reduces the risk of reward hacking and degenerate outputs.
PPO Clip: Limits policy updates to prevent destructive large changes. The clipped surrogate objective keeps training stable.
Reward Hacking: Models may exploit spurious patterns in the reward model. Monitoring output distributions and using KL penalties helps mitigate this.