Distributed Training
Training large models requires multiple GPUs:
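- Data Parallel: Each GPU holds a full copy of the model and processes a different slice of the batch; gradients are averaged across GPUs
- Model Parallel: Each GPU holds only part of the model (for example, tensor parallelism splits the weights of individual layers)
- Pipeline Parallel: Consecutive groups of layers live on different GPUs, and micro-batches flow through them in stages
- FSDP: Fully Sharded Data Parallel shards parameters, gradients, and optimizer state across GPUs (a common choice in PyTorch when the model is too large to replicate)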
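To make the trade-off between plain data parallelism and FSDP concrete, here is a rough back-of-the-envelope sketch in plain Python. It assumes mixed-precision training with Adam at about 16 bytes of model state per parameter (FP16 weights and gradients plus FP32 master weights and two optimizer moments), ignores activations and buffers, and uses a hypothetical 7-billion-parameter model on 8 GPUs. The function name is illustrative only.

```python
import math

# Assumption: 2 (FP16 weights) + 2 (FP16 grads) + 12 (FP32 master weights + two Adam moments)
BYTES_PER_PARAM = 16

def model_state_gb(num_params: float, num_gpus: int, sharded: bool) -> float:
    """Approximate per-GPU model-state memory in GB (10^9 bytes), excluding activations."""
    total_bytes = num_params * BYTES_PER_PARAM
    # Data parallel replicates everything; full sharding divides it across GPUs.
    per_gpu_bytes = total_bytes / num_gpus if sharded else total_bytes
    return per_gpu_bytes / 1e9

num_params = 7e9  # hypothetical model size
num_gpus = 8      # hypothetical cluster size

data_parallel_gb = model_state_gb(num_params, num_gpus, sharded=False)
fsdp_gb = model_state_gb(num_params, num_gpus, sharded=True)

print(f"Data parallel: {data_parallel_gb:.0f} GB per GPU")
print(f"Fully sharded: {fsdp_gb:.0f} GB per GPU")
```

Under these assumptions the replicated state would not fit on a single 80 GB GPU while the sharded state would, which is the usual reason to move from data parallel to FSDP.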
Mixed Precision
# Run forward/backward in FP16/BF16; keep FP32 weights for the optimizer update
with torch.cuda.amp.autocast():
    output = model(inputs)
    loss = criterion(output, targets)
# Typically 2-3x faster on tensor-core GPUs, with roughly half the activation memory and minimal accuracy loss
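The reason weight updates stay in FP32 can be seen with a small experiment. The sketch below does not use PyTorch; it emulates FP16 and FP32 rounding with Python's struct module, and the weight and update values are hypothetical. It shows a small update being rounded away entirely in FP16 while surviving in FP32.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 1.0
update = 1e-4  # hypothetical learning_rate * gradient

# Apply the same update with the result stored in each precision.
fp16_weight = to_fp16(to_fp16(weight) + to_fp16(update))
fp32_weight = to_fp32(to_fp32(weight) + to_fp32(update))

print(fp16_weight == weight)  # True: the update was rounded away
print(fp32_weight > weight)   # True: the update survived
```

Near 1.0, adjacent FP16 values are about 0.001 apart, so any update smaller than half that gap disappears. Keeping the weights and the update step in FP32 avoids this.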
Gradient Accumulation
Simulate large batch sizes with limited memory:
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
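Why the loss is divided by accumulation_steps can be checked by hand. The following sketch uses a hypothetical one-parameter model y = w * x with a mean-squared-error loss and hand-written gradients (no PyTorch), and compares the gradient of one batch of 8 samples with the gradient accumulated over 4 micro-batches of 2.

```python
import math

def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # targets from the hypothetical rule y = 2x
w = 0.5
accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps

# Gradient of the loss over the full batch of 8 samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same data in 4 micro-batches, each gradient scaled by 1 / accumulation_steps.
accumulated_grad = 0.0
for k in range(accumulation_steps):
    chunk = slice(k * micro_batch_size, (k + 1) * micro_batch_size)
    accumulated_grad += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

Without the division, the accumulated gradient would be accumulation_steps times larger than the full-batch gradient, which acts like an unintended increase in the learning rate. The equality holds exactly here because the micro-batches are equal in size.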
Key Takeaways
- Distributed training is essential for large models — choose data parallel for most cases, FSDP for very large models
- Mixed precision (FP16/BF16) provides 2-3x speedup with minimal accuracy loss by using lower precision for compute, FP32 for weight updates
- Gradient accumulation lets you simulate large batch sizes on limited memory by splitting batches and accumulating gradients
- Modern training stacks (DeepSpeed, FSDP, Accelerate) handle most distributed complexity automatically
- Always profile your training — communication overhead can dominate at scale
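As a rough illustration of why communication can dominate, the sketch below estimates the gradient synchronization cost of data parallel training with a ring all-reduce, in which each GPU sends about 2 * (N - 1) / N times the payload size. The model size, link bandwidth, and compute time are hypothetical, latency is ignored, and no overlap between communication and computation is assumed, so treat the output as an upper-bound estimate and not a measurement.

```python
import math

def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes

grad_bytes = 1e9 * 2      # hypothetical: 1B parameters with FP16 gradients
bandwidth = 25e9          # hypothetical: 25 GB/s effective per-GPU bandwidth
compute_seconds = 0.30    # hypothetical: forward + backward time per step

sent_bytes = ring_allreduce_bytes_per_gpu(grad_bytes, num_gpus=8)
comm_seconds = sent_bytes / bandwidth
comm_fraction = comm_seconds / (comm_seconds + compute_seconds)

print(f"Communication: {comm_seconds:.2f} s per step ({comm_fraction:.0%} of step time)")
```

Real frameworks overlap much of this transfer with the backward pass, which is one reason a profiler trace is more trustworthy than an estimate like this one.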
Hands-On Exercises
Exercise 1: Implement Gradient Accumulation
Complete the training loop below to implement gradient accumulation with accumulation_steps=4:
def train_with_accumulation(model, dataloader, optimizer, criterion, accumulation_steps=4):
    model.train()
    total_loss = 0
    for i, (inputs, targets) in enumerate(dataloader):
        # TODO: Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        # TODO: Scale loss and compute gradients
        # Hint: Divide loss by accumulation_steps before backward()
        # TODO: Update weights only every accumulation_steps batches
        # Hint: Use (i + 1) % accumulation_steps == 0
    return total_loss / len(dataloader)
Solution:
def train_with_accumulation(model, dataloader, optimizer, criterion, accumulation_steps=4):
    model.train()
    total_loss = 0
    optimizer.zero_grad()  # Clear any stale gradients before starting
    for i, (inputs, targets) in enumerate(dataloader):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        # Scale loss so the accumulated gradient matches one large-batch gradient
        loss = loss / accumulation_steps
        loss.backward()
        # Undo the scaling when tracking the loss for reporting
        total_loss += loss.item() * accumulation_steps
        # Update weights every accumulation_steps batches, and once more on the
        # last batch so a trailing partial group is not silently dropped
        is_last_batch = (i + 1) == len(dataloader)
        if (i + 1) % accumulation_steps == 0 or is_last_batch:
            optimizer.step()
            optimizer.zero_grad()
    return total_loss / len(dataloader)
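One edge case worth checking in any accumulation loop is a dataloader whose length is not a multiple of accumulation_steps. The sketch below is plain Python with a hypothetical helper, count_optimizer_steps. It counts how many optimizer steps run, and how many trailing batches never reach an update, when the loop relies on the modulo test alone versus when it also steps on the last batch.

```python
def count_optimizer_steps(num_batches: int, accumulation_steps: int, flush_last: bool):
    """Return (optimizer_steps, batches_left_unapplied) for an accumulation loop."""
    steps = 0
    pending = 0  # batches whose gradients are accumulated but not yet applied
    for i in range(num_batches):
        pending += 1
        is_last = (i + 1) == num_batches
        if (i + 1) % accumulation_steps == 0 or (flush_last and is_last):
            steps += 1
            pending = 0
    return steps, pending

modulo_only = count_optimizer_steps(10, 4, flush_last=False)
with_flush = count_optimizer_steps(10, 4, flush_last=True)

print(modulo_only)  # (2, 2): two batches never contribute to an update
print(with_flush)   # (3, 0): the final partial group is applied
```

When the final group is smaller than accumulation_steps, its gradient is scaled as if it were a full group, so that last update is slightly smaller than the others. For most training runs this is an acceptable approximation.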
Exercise 2: Mixed Precision Training Setup
Set up automatic mixed precision (AMP) training with PyTorch:
import torch
from torch.cuda.amp import autocast, GradScaler

def train_amp(model, dataloader, optimizer, criterion):
    scaler = GradScaler()
    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        # TODO: Wrap forward pass with autocast()
        # TODO: Scale loss and backward with scaler
        # TODO: Step optimizer with scaler
    return model
Solution:
import torch
from torch.cuda.amp import autocast, GradScaler

def train_amp(model, dataloader, optimizer, criterion):
    scaler = GradScaler()
    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        # Automatic mixed precision context for the forward pass and loss
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        # Scale loss and compute gradients
        scaler.scale(loss).backward()
        # Unscale gradients and step (the step is skipped if they contain inf/NaN)
        scaler.step(optimizer)
        # Adjust the scale factor for the next iteration
        scaler.update()
    return model
Key Points:
- autocast() runs compatible ops in lower precision (FP16 by default on CUDA; BF16 when requested via dtype)
- GradScaler handles loss scaling to prevent gradient underflow
- Always call scaler.update() after each step
- ~2x speedup on modern GPUs with minimal accuracy loss
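The gradient underflow that GradScaler guards against can be reproduced without a GPU. This sketch emulates FP16 rounding with Python's struct module and does not call PyTorch. The gradient value is hypothetical, and the scale factor of 2^16 matches GradScaler's default initial scale.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8        # hypothetical tiny gradient value
loss_scale = 2.0 ** 16  # 65536

# Stored directly in FP16, the gradient is below the smallest
# representable magnitude and flushes to zero.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss multiplies every gradient by the same factor (chain rule),
# which lifts this one into FP16's representable range.
scaled_fp16 = to_fp16(true_grad * loss_scale)

# Dividing by the scale in higher precision recovers the original value.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16)  # 0.0
print(recovered)
```

The recovered value agrees with the original gradient to within FP16's relative precision, whereas the unscaled gradient is lost entirely.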