Training Challenges
Transformers are harder to train than RNNs:
- No recurrent structure to stabilize gradients
- Attention weights can become sharp (near one-hot) quickly, which shrinks the gradient through the softmax
- Deep stacks amplify any instability
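To make "sharp" concrete, here is a minimal sketch of how softmax weights collapse toward one-hot as the logits grow. The logits and scale factors are hypothetical values chosen for illustration; scaling them stands in for query/key norms growing during training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical attention logits for one query over four keys.
logits = [1.0, 0.5, 0.0, -0.5]

# Larger scales mimic larger query/key norms (an assumption of this sketch).
for scale in (1.0, 4.0, 16.0):
    weights = softmax([scale * x for x in logits])
    print(scale, round(max(weights), 3), round(entropy(weights), 3))
```

As the scale grows, the largest weight approaches 1 and the entropy approaches 0, so nearly all attention lands on a single key.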
Learning Rate Warmup
import math

# Warmup: increase LR linearly over the first warmup_steps steps.
# Then: decay with inverse square root (shown here) or cosine.
# step counts from 1, so the first update uses a nonzero LR.
if step < warmup_steps:
    lr = base_lr * (step / warmup_steps)
else:
    lr = base_lr * math.sqrt(warmup_steps / step)
Why warmup? Early in training, gradients are large and unstable, and Adam's moment estimates are based on only a few steps.
Starting with a small LR prevents early divergence.
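The cosine alternative mentioned above could be sketched as follows. This is one possible formulation, not a definitive one: the function name lr_at and the total_steps and min_lr parameters are assumptions introduced for this example.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr.

    step is 0-based. total_steps and min_lr are assumptions of this sketch.
    """
    if step < warmup_steps:
        # (step + 1) keeps the very first update from using a zero LR.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to at most 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: print the LR at a few points of a 1000-step run.
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at(s, base_lr=1e-3, warmup_steps=100, total_steps=1000))
```

Unlike inverse square root decay, the cosine schedule needs the total number of training steps up front, and it reaches min_lr exactly at the end of the run.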
Optimizer Choice
Adam (or AdamW) is standard:
- β1 = 0.9 (momentum: decay rate of the running mean of gradients)
- β2 = 0.98 (variance: decay rate of the running mean of squared gradients)
- ε = 1e-9
AdamW applies weight decay directly to the weights, decoupled from the gradient-based update, which helps prevent overfitting.
Want to go deeper? Explore the Training Transformers at Scale supplement for detailed learning rate schedules, gradient clipping, mixed-precision training, distributed training strategies, and interactive exercises.