Lesson 7: Optimization Challenges

Common Problems

Training Instabilities

Loss spikes: Sudden increases in loss
Gradient explosion: Gradient norms grow very large and can overflow to Inf/NaN
Divergence: Loss increases steadily
Plateau: Loss stops decreasing

Solutions

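Gradient clipping: Cap the global gradient norm (e.g., max_norm = 1.0)
Warmup: Gradual LR increase
Weight decay: Shrink weights each step (equivalent to L2 regularization for plain SGD; decoupled in AdamW)
Learning rate tuning: Find a workable LR, e.g., with an LR range test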

Learning Rate Schedules

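# Cosine decay with warmup
import math

warmup_steps = 2000
total_steps = 100000

def lr_at(step, base_lr):
    """Return the learning rate for a given training step."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr to 0; clamp so the LR stays at 0 past total_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))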
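
One possible way to attach this schedule to a PyTorch optimizer is torch.optim.lr_scheduler.LambdaLR, which multiplies the optimizer's configured learning rate by a factor returned from a function of the step count. The sketch below is an illustration under assumptions: the model, the SGD optimizer, the learning rate of 3e-4, and the helper name warmup_cosine_factor are placeholders that do not come from the lesson.

```python
import math
import torch

def warmup_cosine_factor(step, warmup_steps=2000, total_steps=100000):
    """Multiplier in [0, 1] applied to the optimizer's base learning rate."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(4, 1)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4)   # lr here plays the role of base_lr
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine_factor)

for step in range(3):
    # The forward pass and loss.backward() would go here
    optimizer.step()
    scheduler.step()   # advances the schedule by one step

current_lr = scheduler.get_last_lr()[0]
```

Calling scheduler.step() once per optimizer step keeps the schedule measured in training steps rather than epochs.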
      

📝 Hands-On Exercises

Exercise 1: Implement Gradient Clipping

Write a function that clips gradients by their global norm before applying them to model parameters.

# Exercise: Implement gradient clipping by global norm
def clip_gradients_by_norm(parameters, max_norm=1.0):
    """
    Clips gradients by their global L2 norm.
    
    Args:
        parameters: List of parameter tensors with gradients
        max_norm: Maximum allowed global norm
    
    Returns:
        clipped_parameters: Parameters with clipped gradients
    """
    # Your implementation here:
    # 1. Calculate total gradient norm across all parameters
    # 2. If norm > max_norm, scale all gradients by max_norm / total_norm
    # 3. Return the parameters
    pass

# Test your implementation
test_params = [
    torch.randn(10, requires_grad=True),
    torch.randn(5, requires_grad=True)
]
test_params[0].grad = torch.ones(10) * 5
test_params[1].grad = torch.ones(5) * 3

# Expected: total_norm = sqrt(10*25 + 5*9) = sqrt(295) ≈ 17.2
# After clipping with max_norm=1.0, gradients should be scaled by ~0.058
        

Key Points: Gradient clipping prevents exploding gradients by capping the total gradient magnitude before the optimizer step.

Exercise 2: Learning Rate Warmup Scheduler

Implement a learning rate scheduler that linearly warms up from 0 to base_lr, then uses cosine decay.

# Exercise: Implement warmup + cosine decay scheduler
class WarmupCosineScheduler:
    def __init__(self, optimizer, warmup_steps, total_steps, base_lr, min_lr=0):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.base_lr = base_lr
        self.min_lr = min_lr
        self.current_step = 0
    
    def step(self):
        """Update learning rate for current step."""
        self.current_step += 1
        
        # Your implementation here:
        # 1. If in warmup phase: lr = base_lr * (current_step / warmup_steps)
        # 2. If past warmup: use cosine decay from base_lr to min_lr
        # 3. Update optimizer's learning rate
        
        lr = self._compute_lr()
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        return lr
    
    def _compute_lr(self):
        # Implement the LR calculation
        pass

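# Test: because step() increments the counter before computing the LR, the first
# call should return base_lr / warmup_steps, the call at warmup_steps should
# return base_lr, and the LR should decay smoothly to min_lr by total_steps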
        

Key Points: Warmup reduces early training instability by gradually increasing LR. Cosine decay then shrinks the step size smoothly so the model can settle late in training.

💡 Solutions & Tips

Gradient Clipping: Use torch.nn.utils.clip_grad_norm_() in PyTorch, or compute total_norm manually by summing (grad ** 2).sum() over all parameters and taking the square root of the total
Warmup: Start with very small LR (near 0) to let model stabilize before full updates
Detection: Monitor for NaN/Inf in loss and gradients; use torch.isnan() and torch.isinf()
Recovery: Save checkpoints frequently; if training diverges, restart from last good checkpoint with lower LR
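
The clipping and detection tips above can be combined in a single training step. The following is a minimal sketch, assuming a placeholder linear model, a synthetic batch, and a hypothetical helper named training_step; only torch.nn.utils.clip_grad_norm_ and torch.isfinite are the library calls being illustrated.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(16, 8), torch.randn(16, 1)   # synthetic batch

def training_step(max_norm=1.0):
    """One step with a finite-loss check and global-norm clipping."""
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    if not torch.isfinite(loss):
        # Skip the update rather than write NaN/Inf into the weights
        return None, None
    loss.backward()
    # Rescales all gradients in place if their global L2 norm exceeds max_norm;
    # the return value is the norm measured before clipping
    norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item(), norm_before.item()

loss_value, grad_norm = training_step()
```

Logging the returned pre-clipping norm over time is a cheap way to see how often clipping is actually active.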

🎯 Knowledge Check Quiz

Question 1: Gradient Clipping

When training a large transformer model, you notice the loss suddenly jumps to NaN after a few hundred steps. What is the most likely cause and what technique should you apply?

Answer: This is likely gradient explosion: gradient magnitudes grow from step to step until they exceed numerical limits and become Inf or NaN. Apply gradient clipping by limiting the global gradient norm (typically max_norm=1.0) before the optimizer step. This caps the total gradient magnitude while preserving direction.

Question 2: Learning Rate Warmup

Why is learning rate warmup important in the early stages of training large language models?

Answer: Early in training, model parameters are randomly initialized and gradients can be noisy/unstable. Starting with a large learning rate causes erratic updates that can destabilize training. Warmup gradually increases LR from near-zero, allowing the model to settle into a stable region of parameter space before applying full-strength updates.

Question 3: Loss Plateau

Your model's loss stops decreasing after epoch 50, but validation metrics suggest underfitting. Which of the following is NOT a good solution?

A) Increase learning rate
B) Add learning rate decay/scheduler
C) Reduce weight decay
D) Use a more complex optimizer (e.g., AdamW → LAMB)

Answer: A) Increase learning rate is NOT a good solution. A plateau suggests the optimizer is stuck in a flat region or local minimum, and raising the LR without other changes risks oscillation or divergence rather than steady improvement. Better approaches: use LR decay/annealing to fine-tune (B), reduce regularization to allow more fitting (C), or switch to optimizers designed for large batch training (D).

Question 4: Cosine Decay Schedule

In the cosine decay formula lr = base_lr * 0.5 * (1 + cos(π * progress)), what happens when progress = 0.5 (halfway through training after warmup)?

Answer: At progress = 0.5, cos(π * 0.5) = cos(π/2) = 0. Therefore: lr = base_lr * 0.5 * (1 + 0) = base_lr * 0.5. The learning rate will be exactly half of the base learning rate. Cosine decay provides smooth, gradual reduction compared to step decay.
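
The arithmetic can be checked numerically. This is a small sketch; the function name cosine_lr and the value base_lr = 1e-3 are chosen here only for illustration.

```python
import math

def cosine_lr(progress, base_lr):
    """Cosine decay; progress runs from 0 (end of warmup) to 1 (end of training)."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

base_lr = 1e-3  # illustrative value
for progress in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"progress={progress:.2f}  lr={cosine_lr(progress, base_lr):.6f}")
```

The printed values start at base_lr, pass through half of it at the midpoint, and reach zero at the end, with the slowest change near both ends of the schedule.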

Question 5: Detecting Instability

Which monitoring checks should you implement to catch training instabilities early? (Select all that apply)

A) Check if loss is NaN or Inf after each step
B) Monitor gradient norm statistics
C) Verify learning rate is positive
D) Track ratio of update magnitude to parameter magnitude

Answer: A, B, and D are correct. A) NaN/Inf detection catches immediate failures. B) Gradient norms reveal exploding/vanishing gradients before they cause NaN. D) The update-to-parameter ratio (a common rule of thumb is around 0.001) indicates whether steps are too aggressive. C) adds little: a correctly implemented schedule yields non-negative values by construction, so a positivity check catches scheduler bugs rather than training instabilities.
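
Checks A, B, and D could be implemented along the following lines. This is a sketch under assumptions: the helper names grad_global_norm and update_to_param_ratio are hypothetical, the model and data are placeholders, and plain SGD is used so the update is easy to reason about.

```python
import torch

def grad_global_norm(params):
    """Global L2 norm over all gradients that are present."""
    squared = [p.grad.detach().pow(2).sum() for p in params if p.grad is not None]
    return torch.sqrt(torch.stack(squared).sum()).item()

def update_to_param_ratio(before, after):
    """Norm of the applied update divided by the norm of the weights before it."""
    update_sq = sum((a - b).pow(2).sum() for a, b in zip(after, before))
    param_sq = sum(b.pow(2).sum() for b in before)
    return (update_sq.sqrt() / param_sq.sqrt()).item()

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)                             # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()

loss_ok = bool(torch.isfinite(loss))                      # check A
grad_norm = grad_global_norm(list(model.parameters()))    # check B
before = [p.detach().clone() for p in model.parameters()]
optimizer.step()
after = [p.detach() for p in model.parameters()]
ratio = update_to_param_ratio(before, after)              # check D
```

In a real run these three values would be logged every step (or every few steps) so that a trend is visible before the loss itself turns into NaN.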

🎯 Additional Quiz: Advanced Optimization

Question 1: Gradient Accumulation

You're training with a batch size of 4 due to memory constraints, but want to simulate batch size 32. How does gradient accumulation help, and what is the trade-off?

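Answer: Perform 8 forward/backward passes (32/4) and accumulate gradients before calling optimizer.step(), dividing each micro-batch loss by 8 so the accumulated gradient matches the mean over all 32 samples. This simulates larger batch training without extra activation memory. Trade-off: each optimizer update now requires 8 sequential forward/backward passes, so wall-clock time per "effective batch" is roughly 8x that of a single batch-of-4 step.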
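
The equivalence can be verified on a toy problem. The sketch below assumes a placeholder linear model, synthetic data, and a mean-reduced MSE loss; it compares the gradient from one batch of 32 with the gradient accumulated over 8 micro-batches of 4.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)                            # placeholder model
loss_fn = torch.nn.MSELoss()                             # mean reduction
data, targets = torch.randn(32, 8), torch.randn(32, 1)   # one effective batch of 32

micro_batch_size, accum_steps = 4, 8                     # 4 * 8 = 32

# Reference: gradient from a single full batch of 32
model.zero_grad()
loss_fn(model(data), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 8 micro-batches of 4, each loss divided by accum_steps
model.zero_grad()
for i in range(accum_steps):
    rows = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
    loss = loss_fn(model(data[rows]), targets[rows]) / accum_steps
    loss.backward()                                      # .grad buffers sum across calls
accum_grad = model.weight.grad.clone()
# optimizer.step() and optimizer.zero_grad() would follow here, once per 8 micro-batches
```

The match holds for layers whose output depends only on the individual sample; layers that use batch statistics, such as batch normalization, still see batches of 4.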

Question 2: Mixed Precision Training

Why might loss scaling be necessary when using FP16 (half-precision) training?

Answer: FP16 has a much smaller dynamic range than FP32 (roughly 6e-8 to 65504). Small gradients can underflow to zero, stopping learning. Loss scaling multiplies the loss by a factor (e.g., 1024) before the backward pass, making gradients larger and representable in FP16. Gradients are unscaled before the optimizer step.
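
The underflow can be reproduced with the standard library alone, since Python's struct module supports IEEE 754 half precision through the "e" format. The gradient value 1e-8 and the helper name to_fp16 are assumptions chosen for illustration; real mixed-precision training would rely on a framework's loss scaler.

```python
import struct

def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

tiny_grad = 1e-8       # illustrative gradient, below FP16's smallest subnormal (~6e-8)
loss_scale = 1024.0    # scaling factor from the answer above

unscaled = to_fp16(tiny_grad)               # underflows to zero
scaled = to_fp16(tiny_grad * loss_scale)    # large enough to be represented in FP16
recovered = scaled / loss_scale             # unscale in higher precision before the update

print(unscaled, scaled, recovered)
```

The recovered value is close to the original gradient but not identical, because the scaled value still had to be rounded to the nearest FP16 number.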

Question 3: Learning Rate Finder

Describe the LR range test procedure for finding an optimal learning rate.

Answer: 1) Start with a very small LR (e.g., 1e-7). 2) Run training for a few batches, exponentially increasing LR after each batch. 3) Record loss at each step. 4) Plot loss vs LR on a logarithmic LR axis. 5) Choose the LR where loss decreases fastest (steepest downward slope), which typically lies somewhat below the LR at which loss starts increasing or exploding.
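
The procedure can be sketched on a toy problem without any framework. Everything here is an assumption made for illustration: the quadratic loss 0.5 * w**2 stands in for a real model, the stop threshold of 4x the initial loss is arbitrary, and the function names lr_range_test and steepest_descent_lr are hypothetical.

```python
import math

def lr_range_test(start_lr=1e-7, end_lr=10.0, num_steps=100):
    """Sweep LR exponentially on a toy quadratic loss and record (lr, loss) pairs."""
    growth = (end_lr / start_lr) ** (1 / (num_steps - 1))   # per-step multiplier
    w, lr = 5.0, start_lr        # toy parameter; loss(w) = 0.5 * w**2, gradient = w
    history = []
    for _ in range(num_steps):
        loss = 0.5 * w * w
        if not math.isfinite(loss) or (history and loss > 4 * history[0][1]):
            break                # guard: stop if the loss explodes
        history.append((lr, loss))
        w -= lr * w              # one plain gradient-descent step
        lr *= growth
    return history

def steepest_descent_lr(history):
    """LR at which the loss fell fastest per unit of log10(LR)."""
    best_lr, best_slope = None, 0.0
    for (lr0, loss0), (lr1, loss1) in zip(history, history[1:]):
        slope = (loss1 - loss0) / (math.log10(lr1) - math.log10(lr0))
        if slope < best_slope:
            best_lr, best_slope = lr0, slope
    return best_lr

history = lr_range_test()
suggested_lr = steepest_descent_lr(history)
```

On a real model the recorded losses are noisy, so the curve is usually smoothed before the slope is read off.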

Question 4: Weight Decay vs L2 Regularization

What is the difference between weight decay and L2 regularization in AdamW vs Adam?

Answer: In Adam, L2 regularization adds the penalty term to the gradient, so it passes through the adaptive update and is divided by the square root of the second-moment estimate. Parameters with large gradient histories are therefore regularized less. AdamW decouples weight decay from the gradient and applies it directly to the weights, outside the adaptive scaling. This makes weight decay behave consistently regardless of gradient magnitude.
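
A single-step, single-parameter calculation makes the difference visible. This is a simplified sketch and not PyTorch's implementation: it assumes zero-initialized moment estimates (so only the first step is modeled), and the function names and numeric values are chosen for illustration.

```python
import math

def adam_direction(grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam update direction for the first step (zero-initialized moments)."""
    m_hat = (1 - beta1) * grad / (1 - beta1)          # first-moment estimate
    v_hat = (1 - beta2) * grad * grad / (1 - beta2)   # second-moment estimate
    return m_hat / (math.sqrt(v_hat) + eps)

def adam_l2_step(w, grad, lr, weight_decay):
    """Adam with L2 regularization: the penalty is folded into the gradient."""
    return w - lr * adam_direction(grad + weight_decay * w)

def adamw_step(w, grad, lr, weight_decay):
    """AdamW: the decay term acts on the weight directly, outside the adaptive scaling."""
    return w - lr * weight_decay * w - lr * adam_direction(grad)

w, lr, weight_decay = 1.0, 0.01, 0.1   # illustrative values
for grad in (0.5, 50.0):
    print(grad, adam_l2_step(w, grad, lr, weight_decay), adamw_step(w, grad, lr, weight_decay))
```

In this first step the Adam-with-L2 update has magnitude close to lr whatever the penalty contributes, because the penalty is normalized away together with the gradient, while the AdamW update always includes the separate lr * weight_decay * w shrinkage.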

Question 5: Gradient Checkpointing

How does gradient checkpointing reduce memory usage, and what is the cost?

Answer: Instead of storing all activations for the backward pass, store only the activations at checkpoint boundaries. During the backward pass, recompute the intermediate activations by re-running the forward pass for each segment. Cost: roughly 20-30% slower training due to recomputation, in exchange for fitting models or batch sizes around 2-3x larger; the exact figures depend on the architecture and on where checkpoints are placed.
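
In PyTorch this is available through torch.utils.checkpoint.checkpoint. The sketch below assumes a small placeholder block and random input, and only demonstrates that the checkpointed pass produces the same gradients as the standard pass; it does not measure memory or speed.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = torch.nn.Sequential(               # placeholder segment to checkpoint
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16)
)
x = torch.randn(4, 16, requires_grad=True)

# Standard pass: intermediate activations inside `block` are kept for backward
block(x).sum().backward()
grad_plain = x.grad.clone()

# Checkpointed pass: only the segment's input is kept; activations are
# recomputed by re-running the segment's forward pass during backward
x.grad = None
block.zero_grad()
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_checkpointed = x.grad.clone()
```

In a deep model the same call would typically wrap each transformer layer or group of layers, so that only the inputs to those segments stay in memory.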