🚧 Lesson 8 of 10 in Level 03

Transformer Training

How to train transformers efficiently: learning rate schedules, warmup, and optimizer choice.

Training Challenges

Transformers are harder to train than RNNs:

Learning Rate Warmup

# Warmup: linearly increase LR for the first warmup_steps
# Then: decay with inverse square root (cosine is a common alternative)
import math

if step < warmup_steps:
    lr = base_lr * (step / warmup_steps)
else:
    lr = base_lr * math.sqrt(warmup_steps / step)

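Why warmup? Early in training the weights are still close to their random initialization, gradients tend to be large and unstable, and Adam's moment estimates rest on only a few steps. Starting with a small LR and ramping it up reduces the risk of early divergence.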
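
The schedule described here can be wrapped in a small function so it is easy to inspect or plot. This is a minimal sketch, not a reference implementation: the function name `lr_at_step` and the optional `total_steps` argument are assumptions made for illustration. Without `total_steps` it uses inverse square root decay after warmup; with it, it uses cosine decay down to zero at `total_steps`.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps=None):
    """Learning rate for a given step: linear warmup, then decay.

    Hypothetical helper for illustration. If total_steps is None the
    decay is inverse square root; otherwise it is cosine decay to 0.
    """
    # Linear warmup from 0 up to base_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps

    if total_steps is None:
        # Inverse square root decay. The max(..., 1) guards are an
        # assumption to avoid division by zero when warmup_steps is 0.
        return base_lr * math.sqrt(max(warmup_steps, 1) / max(step, 1))

    # Cosine decay: progress runs from 0 (end of warmup) to 1 (total_steps).
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example: sample the schedule at a few steps.
schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=4000)
            for s in (0, 2000, 4000, 16000)]
```

Both branches return `base_lr` at the end of warmup, so the schedule has no jump at the boundary. In a training loop, the returned value would be written into the optimizer's learning rate before each update.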

Optimizer Choice

Adam (or AdamW) is standard:

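AdamW applies weight decay directly to the weights, decoupled from the gradient-based Adam update. This acts as regularization and helps prevent overfitting.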
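
To make the decoupled decay concrete, here is a sketch of a single AdamW update written in plain Python over lists of floats. The function name `adamw_step` and the default hyperparameter values are assumptions chosen for illustration; in practice you would use your framework's built-in optimizer rather than this code.

```python
import math


def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (illustrative sketch, hypothetical helper).

    params, grads, m, v are equal-length lists of floats; t is the
    1-based step count. Returns new (params, m, v) lists.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Decoupled weight decay: shrink the weight directly,
        # independent of the gradient statistics.
        p = p - lr * weight_decay * p
        # Adam update using the bias-corrected moments.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Example: one step on a single weight with gradient 1.0.
params, m, v = adamw_step([1.0], [1.0], [0.0], [0.0], t=1)
```

Note that the decay term never passes through the moment estimates. With a zero gradient the weight still shrinks by a factor of `1 - lr * weight_decay`, which is the behavior that separates AdamW from adding an L2 penalty to the loss under Adam.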

Want to go deeper? Explore the Training Transformers at Scale supplement for detailed learning rate schedules, gradient clipping, mixed precision training, distributed training strategies, and interactive exercises.