Lesson 8: Transformer Training

Training Challenges

Transformers can be harder to train stably than RNNs:

No recurrent structure to stabilize gradients
Attention weights can become sharp (close to one-hot) quickly, which saturates the softmax and shrinks its gradients
Deep stacks amplify any instability
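
To make the "sharp attention" point concrete, here is a small sketch in plain Python. It applies a softmax to the same set of scores at increasing magnitudes and prints the resulting weights and their entropy. The scores and the scale factors are made-up illustrative values, not taken from a real model, and the helper names softmax and entropy are introduced here only for the example.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; lower entropy means a sharper distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical attention scores for one query over four keys.
logits = [1.0, 0.5, 0.0, -0.5]

for scale in (1.0, 4.0, 16.0):
    probs = softmax([scale * x for x in logits])
    print(scale, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

As the scores grow in magnitude, nearly all of the weight lands on a single key and the entropy falls toward zero. In that regime the softmax passes very little gradient to the other keys, which is one way instability can show up early in training.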

Learning Rate Warmup

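import math

# Warmup: linearly increase LR from 0 to base_lr over the first warmup_steps steps.
# Then: decay with inverse square root (shown here) or cosine.
# Both branches give base_lr at step == warmup_steps, so the schedule is continuous.
def learning_rate(step, base_lr, warmup_steps):
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)
    return base_lr * math.sqrt(warmup_steps / step)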
      

Why warmup? Early in training, gradients are large and unstable, and Adam's running gradient statistics are based on only a few steps.
Starting with a small LR helps prevent early divergence.
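
The schedule comments mention cosine decay as an alternative to inverse square root. A minimal sketch of that variant follows, assuming the total number of training steps is known in advance. The function name warmup_cosine_lr and the total_steps and min_lr parameters are hypothetical names chosen for this example.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    # Linear warmup from 0 to base_lr over the first warmup_steps steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example values (assumed for illustration): 4000 warmup steps out of 100000.
for step in (0, 2000, 4000, 52000, 100000):
    print(step, warmup_cosine_lr(step, base_lr=1e-3, warmup_steps=4000, total_steps=100000))
```

Unlike inverse square root decay, which never reaches zero, the cosine schedule needs a fixed training length and ends at min_lr.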
      

Optimizer Choice

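Adam (or AdamW) is the standard choice. The original Transformer paper used:

β1 = 0.9 (decay rate for the first-moment, or momentum, estimate)
β2 = 0.98 (decay rate for the second-moment, or variance, estimate)
ε = 1e-9 (added to the denominator for numerical stability)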

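Weight decay helps prevent overfitting. AdamW applies it directly to the weights, decoupled from the gradient-based update, rather than adding an L2 penalty to the loss.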
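
To show how these hyperparameters enter the update, here is a sketch of one AdamW step for a single scalar weight in plain Python. It follows the usual decoupled form (shrink the weight, then apply the Adam step). The function name adamw_step and the lr and weight_decay values are assumptions for illustration, and real training would use a library optimizer over tensors rather than this scalar version.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.98,
               eps=1e-9, weight_decay=0.01):
    # t is the 1-based step count, used for bias correction.
    # Decoupled weight decay: shrink the weight directly, independent of grad.
    w = w * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adam update: step size is scaled by the running gradient magnitude.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = w**2, whose gradient is 2 * w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 6):
    w, m, v = adamw_step(w, 2.0 * w, m, v, t)
    print(t, w)
```

Because the decay term multiplies the weight directly, its strength does not get rescaled by the adaptive denominator, which is the practical difference from adding an L2 penalty to the loss under Adam.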

Want to go deeper? Explore the Training Transformers at Scale supplement for detailed learning rate schedules, gradient clipping, mixed precision training, distributed training strategies, and interactive exercises.