Common Problems
Training Instabilities
- Loss spikes: Sudden increases in loss
- Gradient explosion: Gradients grow very large and can overflow to Inf or NaN
- Divergence: Loss increases steadily
- Plateau: Loss stops decreasing
Solutions
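- Gradient clipping: Cap the global gradient norm (e.g., max_norm = 1.0)
- Warmup: Gradual LR increase over the first training steps
- Weight decay: Shrinks weights toward zero (equivalent to L2 regularization for plain SGD)
- Learning rate tuning: Find an LR that trains quickly without diverging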
Learning Rate Schedules
📝 Hands-On Exercises
Exercise 1: Implement Gradient Clipping
Write a function that clips gradients by their global norm before applying them to model parameters.
Key Points: Gradient clipping prevents exploding gradients by capping the total gradient magnitude before the optimizer step.
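One possible starting point for this exercise is sketched below in framework-free Python. It assumes gradients are given as plain lists of floats, one list per parameter; the name clip_by_global_norm and the eps term are illustrative choices, not part of any library.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads: list of per-parameter gradients, each a flat list of floats
    (an assumption made for this sketch).
    Returns (clipped_grads, total_norm_before_clipping).
    """
    # Global norm: square root of the sum of squares over every entry.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is capped at 1.0, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # combined norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Because every entry is multiplied by the same factor, the direction of the overall gradient is preserved and only its length changes.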
Exercise 2: Learning Rate Warmup Scheduler
Implement a learning rate scheduler that linearly warms up from 0 to base_lr, then uses cosine decay.
Key Points: Warmup prevents early training instability by gradually increasing LR. Cosine decay helps fine-tune in later stages.
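A minimal sketch of such a scheduler follows, written as a pure function of the step index. The parameter names (warmup_steps, total_steps, min_lr) are assumptions for illustration; the cosine part uses the same formula that appears in the quiz below.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        # Linear ramp: 0 at step 0, base_lr at step == warmup_steps.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

for step in (0, 50, 100, 600, 1100):
    print(step, warmup_cosine_lr(step, base_lr=1e-3, warmup_steps=100, total_steps=1100))
```

With these example settings the LR reaches base_lr at step 100, is about half of base_lr at step 600 (halfway through the decay phase), and reaches min_lr at step 1100.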
💡 Solutions & Tips
- Gradient Clipping: Use torch.nn.utils.clip_grad_norm_() in PyTorch, or compute total_norm manually: sum (grad ** 2).sum() over all parameters, then take the square root
- Warmup: Start with a very small LR (near 0) to let the model stabilize before full updates
- Detection: Monitor for NaN/Inf in loss and gradients; use torch.isnan() and torch.isinf()
- Recovery: Save checkpoints frequently; if training diverges, restart from the last good checkpoint with a lower LR
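The detection and recovery tips can be combined into a single guarded update. The sketch below is framework-free Python: it uses math.isfinite where a PyTorch loop would use the tensor checks named above, and the helper names (is_finite_step, guarded_sgd_step) and the lr_backoff factor are hypothetical.

```python
import math

def is_finite_step(loss, grads):
    """True only if the loss and every gradient entry are finite (no NaN/Inf)."""
    return math.isfinite(loss) and all(math.isfinite(g) for g in grads)

def guarded_sgd_step(params, grads, loss, lr, checkpoint, lr_backoff=0.5):
    """Apply a plain SGD update, or roll back if the step is not finite.

    Returns (new_params, new_lr, recovered).
    """
    if not is_finite_step(loss, grads):
        # Divergence detected: discard this step, restore the last good
        # parameters, and continue with a lower learning rate.
        return list(checkpoint), lr * lr_backoff, True
    new_params = [p - lr * g for p, g in zip(params, grads)]
    return new_params, lr, False

# A NaN gradient triggers a rollback to the checkpoint and halves the LR.
print(guarded_sgd_step([1.0, 2.0], [0.5, float("nan")], 0.3, 0.1, [0.9, 1.9]))
```

In a real training loop the checkpoint would live on disk and also include optimizer state; this sketch keeps only the parameters in memory.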
🎯 Knowledge Check Quiz
Question 1: Gradient Clipping
When training a large transformer model, you notice the loss suddenly jumps to NaN after a few hundred steps. What is the most likely cause and what technique should you apply?
Answer: This is likely gradient explosion — gradients are growing exponentially and exceeding numerical limits. Apply gradient clipping by limiting the global gradient norm (typically max_norm=1.0) before the optimizer step. This caps the total gradient magnitude while preserving direction.
Question 2: Learning Rate Warmup
Why is learning rate warmup important in the early stages of training large language models?
Answer: Early in training, model parameters are randomly initialized and gradients can be noisy/unstable. Starting with a large learning rate causes erratic updates that can destabilize training. Warmup gradually increases LR from near-zero, allowing the model to settle into a stable region of parameter space before applying full-strength updates.
Question 3: Loss Plateau
Your model's loss stops decreasing after epoch 50, but validation metrics suggest underfitting. Which of the following is NOT a good solution?
- A) Increase learning rate
- B) Add learning rate decay/scheduler
- C) Reduce weight decay
- D) Use a more complex optimizer (e.g., AdamW → LAMB)
Answer: A) Increasing the learning rate is the least reliable option here. A plateau often means the optimizer is stuck in a flat region or bouncing around a minimum, and a larger LR tends to cause oscillation rather than improvement. Better approaches: use LR decay/annealing to fine-tune (B), reduce regularization to allow more fitting (C), or try an optimizer designed for large-batch training such as LAMB (D).
Question 4: Cosine Decay Schedule
In the cosine decay formula lr = base_lr * 0.5 * (1 + cos(π * progress)), what happens when progress = 0.5 (halfway through training after warmup)?
Answer: At progress = 0.5, cos(π * 0.5) = cos(π/2) = 0. Therefore: lr = base_lr * 0.5 * (1 + 0) = base_lr * 0.5. The learning rate will be exactly half of the base learning rate. Cosine decay provides smooth, gradual reduction compared to step decay.
Question 5: Detecting Instability
Which monitoring checks should you implement to catch training instabilities early? (Select all that apply)
- A) Check if loss is NaN or Inf after each step
- B) Monitor gradient norm statistics
- C) Verify learning rate is positive
- D) Track ratio of update magnitude to parameter magnitude
Answer: A, B, and D are correct. A) NaN/Inf detection catches immediate failures. B) Gradient norms reveal exploding/vanishing gradients before they cause NaN. D) The update-to-parameter ratio (a common rule of thumb is around 0.001) indicates whether steps are too aggressive. C) adds little, since a correctly implemented LR scheduler does not produce negative values.
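As an illustration of check D, the ratio can be computed from a snapshot of the parameters taken before and after an optimizer step. This is a sketch over flat lists of floats; update_ratio and l2_norm are hypothetical helper names.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def update_ratio(params_before, params_after, eps=1e-12):
    """Ratio of the update's norm to the parameters' norm for one step."""
    update = [a - b for a, b in zip(params_after, params_before)]
    # eps guards against division by zero for all-zero parameters.
    return l2_norm(update) / (l2_norm(params_before) + eps)

before = [3.0, 4.0]        # norm 5.0
after = [3.003, 4.004]     # update norm is about 0.005
print(update_ratio(before, after))  # close to 0.001
```

A ratio far above the rule-of-thumb value suggests the steps are too aggressive; a ratio far below it suggests the model is barely moving.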
🎯 Additional Quiz: Advanced Optimization
Question 1: Gradient Accumulation
You're training with a batch size of 4 due to memory constraints, but want to simulate batch size 32. How does gradient accumulation help, and what is the trade-off?
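Answer: Perform 8 forward/backward passes (32/4) on micro-batches of 4, accumulating gradients, then call optimizer.step() once. Scale each micro-batch loss by 1/8 so the accumulated gradient matches the average over all 32 samples. This simulates larger-batch training without the activation memory of a batch of 32. Trade-off: each optimizer update requires 8 sequential forward/backward passes, so one "effective batch" takes about 8x as long as a single micro-batch step.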
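The equivalence can be checked on a toy problem. The sketch below assumes a one-parameter model y_hat = w * x with mean squared error, and a dataset size that divides evenly into micro-batches; all names are illustrative.

```python
import math

def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients into one effective-batch gradient."""
    num_micro = len(xs) // micro_batch_size  # assumes an even split
    total = 0.0
    for i in range(num_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Dividing by num_micro plays the role of scaling the loss by
        # 1 / accumulation_steps before the backward pass.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total

xs = [float(i) for i in range(32)]
ys = [2.0 * x + 1.0 for x in xs]
full = mse_grad(0.5, xs, ys)                                # one batch of 32
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=4)   # 8 batches of 4
print(math.isclose(full, accum, rel_tol=1e-9))
```

The two gradients agree up to floating-point rounding, which is why the optimizer sees the same update either way. Layers whose behavior depends on batch statistics, such as batch normalization, are not covered by this argument.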
Question 2: Mixed Precision Training
Why might loss scaling be necessary when using FP16 (half-precision) training?
Answer: FP16 has a much smaller dynamic range than FP32 (roughly 6e-8 to 65504, with full precision only above about 6e-5). Small gradients can underflow to zero, stopping learning. Loss scaling multiplies the loss by a factor (e.g., 1024) before the backward pass, making gradients larger and representable in FP16. Gradients are unscaled before the optimizer step.
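The underflow can be reproduced without a deep learning framework. This sketch uses the half-precision format code "e" from Python's standard struct module to round values to FP16; the gradient value 1e-8 and the scale 1024 are example numbers only.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value, returned as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8          # a small gradient value, chosen for illustration
loss_scale = 1024.0  # example scale factor

unscaled = to_fp16(grad)              # too small for FP16: becomes 0.0
scaled = to_fp16(grad * loss_scale)   # large enough to survive in FP16
recovered = scaled / loss_scale       # unscale in higher precision

print(unscaled, scaled, recovered)
```

Without scaling the gradient is lost entirely. With scaling it survives the FP16 round trip and is recovered to within a fraction of a percent after unscaling in higher precision.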
Question 3: Learning Rate Finder
Describe the LR range test procedure for finding an optimal learning rate.
Answer: 1) Start with a very small LR (e.g., 1e-7). 2) Run training for a few batches, exponentially increasing the LR after each batch. 3) Record the loss at each step. 4) Plot loss vs LR. 5) Choose an LR where the loss decreases fastest (steepest downward slope), which typically lies somewhat below the LR at which the loss bottoms out and starts increasing or exploding.
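The procedure can be sketched on a toy one-parameter problem. Everything here is an assumption for illustration: the quadratic loss, the sweep bounds, and the helper names lr_range_test and steepest_lr. A real run would use mini-batch losses, which are noisy and usually smoothed before picking a value.

```python
def lr_range_test(grad_fn, loss_fn, w0, min_lr=1e-7, max_lr=10.0, num_steps=100):
    """Sweep the LR exponentially, taking one SGD step per LR value."""
    factor = (max_lr / min_lr) ** (1.0 / (num_steps - 1))
    w, lr = w0, min_lr
    history = []
    for _ in range(num_steps):
        history.append((lr, loss_fn(w)))  # loss before stepping with this LR
        w = w - lr * grad_fn(w)
        lr *= factor
    return history

def steepest_lr(history):
    """Return the LR whose step produced the largest drop in loss."""
    best_lr, best_drop = None, 0.0
    for (lr, loss), (_, next_loss) in zip(history, history[1:]):
        drop = loss - next_loss
        if drop > best_drop:
            best_lr, best_drop = lr, drop
    return best_lr

# Toy problem: minimize (w - 3)^2 starting from w = 0.
history = lr_range_test(lambda w: 2.0 * (w - 3.0), lambda w: (w - 3.0) ** 2, w0=0.0)
print(steepest_lr(history))
```

Because the LR grid is evenly spaced on a log scale, the largest drop between consecutive steps corresponds to the steepest slope on the usual loss vs log(LR) plot.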
Question 4: Weight Decay vs L2 Regularization
What is the difference between weight decay and L2 regularization in AdamW vs Adam?
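Answer: In Adam, L2 regularization adds the penalty term to the gradient before the adaptive update, so it is divided by the square root of the second-moment estimate along with the rest of the gradient. The effective decay therefore depends on gradient history and is weakened for parameters with large gradients. AdamW decouples weight decay from the gradient, applying it directly to the weights outside the adaptive step. This makes weight decay behave consistently regardless of gradient magnitude.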
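The difference shows up even in a single scalar update. The sketch below implements one step of each variant following the standard Adam update equations; it is a simplified illustration with example hyperparameters, not the code of any particular library.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01, decoupled=False):
    """One scalar Adam step with either L2 (coupled) or AdamW-style decay."""
    if not decoupled:
        # L2 regularization: the penalty joins the gradient and is later
        # normalized by the second-moment estimate like everything else.
        grad = grad + weight_decay * w
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay: shrink the weight directly, outside the adaptive step.
        w_new -= lr * weight_decay * w
    return w_new, m, v

# Zero loss gradient, so only the decay term moves the weight.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(w_l2, w_adamw)
```

With a zero loss gradient, the L2 variant moves the weight by roughly lr (about 0.1) because the normalization cancels the size of the penalty, while the decoupled variant shrinks it by exactly lr * weight_decay * w (0.001 here).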
Question 5: Gradient Checkpointing
How does gradient checkpointing reduce memory usage, and what is the cost?
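Answer: Instead of storing all activations for the backward pass, store activations only at checkpoint layers and recompute the intermediate ones during the backward pass by re-running the corresponding forward segments. Cost: extra compute from the recomputation (often quoted as roughly 20-30% slower training) in exchange for lower activation memory, which can allow substantially larger models or batch sizes (2-3x is a commonly cited figure).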
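The mechanism can be illustrated on a chain of scalar layers x_next = tanh(w * x). The sketch below compares a backward pass that stores every activation with one that stores only every few layer inputs and recomputes the rest; the function names and the simple activation count are illustrative assumptions, not a memory model of a real framework.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def backward_segment(weights, acts, upstream):
    """Backprop through a run of layers; acts[i] is the input to weights[i]."""
    grad_ws = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        local = 1.0 - acts[i + 1] ** 2  # tanh derivative from its output
        grad_ws[i] = upstream * local * acts[i]
        upstream = upstream * local * weights[i]
    return upstream, grad_ws

def grads_full(weights, x0):
    """Store every activation; returns (weight grads, activations stored)."""
    acts = [x0]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    _, grad_ws = backward_segment(weights, acts, 1.0)
    return grad_ws, len(acts)

def grads_checkpointed(weights, x0, every):
    """Store only the input of every `every`-th layer; recompute the rest."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad_ws = [0.0] * len(weights)
    upstream, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg = weights[start:start + every]
        acts = [checkpoints[start]]
        for w in seg:  # re-run the forward pass for this segment only
            acts.append(layer(w, acts[-1]))
        upstream, seg_grads = backward_segment(seg, acts, upstream)
        grad_ws[start:start + every] = seg_grads
        peak = max(peak, len(checkpoints) + len(acts) - 1)
    return grad_ws, peak

weights = [0.5 + 0.1 * i for i in range(16)]
full, stored_full = grads_full(weights, 0.3)
ckpt, stored_ckpt = grads_checkpointed(weights, 0.3, every=4)
print(stored_full, stored_ckpt)
print(all(math.isclose(a, b) for a, b in zip(full, ckpt)))
```

Both versions produce the same gradients; the checkpointed one holds fewer activations at any one time and pays for it by running each segment's forward pass a second time.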