🚧 Lesson 6 of 35 in Level 04

Training at Scale

Distributed training, mixed precision, gradient accumulation, and efficiency.

Distributed Training

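Training large models typically requires multiple GPUs. Data parallelism, which replicates the model on every GPU and splits each batch across them, covers most cases. Fully Sharded Data Parallel (FSDP), which also shards parameters, gradients, and optimizer state across GPUs, suits models too large to fit on a single device.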
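To make the idea of spreading one batch across several devices concrete, here is a toy, framework-free sketch. It assumes a one-parameter model y = w * x with mean squared error, and every name in it (grad_mse, data_parallel_grad, num_workers) is hypothetical. It only shows that averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient; it does not launch real processes or use GPUs.

```python
# Toy illustration of data parallelism; not a real multi-GPU setup.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one shard."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute one gradient per 'worker',
    then average them (the role an all-reduce plays in real data parallelism)."""
    assert len(xs) % num_workers == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[k * shard:(k + 1) * shard], ys[k * shard:(k + 1) * shard])
        for k in range(num_workers)
    ]
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # targets generated with a true weight of 2.0
w = 0.5

full = grad_mse(w, xs, ys)
parallel = data_parallel_grad(w, xs, ys, num_workers=4)
print(full, parallel)  # the two values agree up to floating-point rounding
```

In a real framework the averaging step is a collective communication operation, which is why communication cost grows in importance as the number of workers increases.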

Mixed Precision

    # Use FP16/BF16 for forward/backward, FP32 for updates
    with torch.cuda.amp.autocast():
        output = model(input)
        loss = criterion(output, target)
    # 2-3x speedup, half memory, minimal accuracy loss
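Why keep FP32 for the updates? A small sketch using only the Python standard library can illustrate the rounding problem. The helpers to_fp16 and to_fp32 are hypothetical and emulate storage precision by round-tripping through struct; the update size of 1e-4 is an assumed stand-in for learning rate times gradient.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


weight = 1.0
update = 1e-4  # assumed value for (learning rate * gradient)

# Weight stored in FP16: the spacing between FP16 values near 1.0 is about
# 0.001, so the small update is rounded away and the weight does not move.
fp16_weight = to_fp16(to_fp16(weight) + to_fp16(update))

# Weight stored in an FP32 master copy: the update is retained.
master_weight = to_fp32(to_fp32(weight) + to_fp32(update))

print(fp16_weight, master_weight)
```

The FP16 weight stays at exactly 1.0 while the FP32 copy moves, which is the motivation for doing compute in lower precision but applying updates to higher-precision weights.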

Gradient Accumulation

Simulate large batch sizes with limited memory:

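    for i, batch in enumerate(dataloader):
        # Assumes model(batch) returns the loss; scale it so the summed
        # gradients average over the effective batch
        loss = model(batch) / accumulation_steps
        loss.backward()  # gradients accumulate across iterations
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()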
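The division by accumulation_steps is easy to get wrong, so here is a framework-free check of what it achieves. This is a toy sketch assuming a one-parameter model y = w * x with mean squared error and equal-sized micro-batches; the function names are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size, scale_loss=True):
    """Sum micro-batch gradients, optionally scaling each one by
    1 / accumulation_steps (mirrors loss / accumulation_steps)."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal micro-batches"
    accumulation_steps = len(xs) // micro_batch_size
    grad = 0.0
    for k in range(accumulation_steps):
        lo, hi = k * micro_batch_size, (k + 1) * micro_batch_size
        g = grad_mse(w, xs[lo:hi], ys[lo:hi])
        grad += g / accumulation_steps if scale_loss else g
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full_batch = grad_mse(w, xs, ys)
scaled = accumulated_grad(w, xs, ys, micro_batch_size=2)
unscaled = accumulated_grad(w, xs, ys, micro_batch_size=2, scale_loss=False)
print(full_batch, scaled, unscaled)
```

With scaling, the accumulated gradient matches the full-batch gradient. Without it, the gradient is accumulation_steps times too large, which acts like an unintended increase in learning rate.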

Key Takeaways

  • Distributed training is essential for large models: choose data parallelism for most cases and FSDP for very large models
  • Mixed precision (FP16/BF16) can provide a 2-3x speedup with minimal accuracy loss by using lower precision for compute and FP32 for weight updates
  • Gradient accumulation simulates a large batch on limited memory by splitting it into micro-batches and summing their gradients before each optimizer step
  • Modern training stacks (DeepSpeed, FSDP, Accelerate) handle most distributed complexity automatically
  • Always profile your training — communication overhead can dominate at scale

Hands-On Exercises

Exercise 1: Implement Gradient Accumulation

Complete the training loop below to implement gradient accumulation with accumulation_steps=4:

    def train_with_accumulation(model, dataloader, optimizer, criterion, accumulation_steps=4):
        model.train()
        total_loss = 0

        for i, (inputs, targets) in enumerate(dataloader):
            # TODO: Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            # TODO: Scale loss and compute gradients
            # Hint: Divide loss by accumulation_steps before backward()

            # TODO: Update weights only every accumulation_steps batches
            # Hint: Use (i + 1) % accumulation_steps == 0

        return total_loss / len(dataloader)

Solution:

    def train_with_accumulation(model, dataloader, optimizer, criterion, accumulation_steps=4):
        model.train()
        total_loss = 0
        num_batches = len(dataloader)
        optimizer.zero_grad()  # Clear at start

        for i, (inputs, targets) in enumerate(dataloader):
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            # Scale loss so the accumulated gradients average over the effective batch
            loss = loss / accumulation_steps
            loss.backward()
            total_loss += loss.item() * accumulation_steps  # track the unscaled loss

            # Update weights every accumulation_steps batches, and once more on the
            # last batch so a trailing partial group is not silently dropped
            if (i + 1) % accumulation_steps == 0 or (i + 1) == num_batches:
                optimizer.step()
                optimizer.zero_grad()

        return total_loss / num_batches

Exercise 2: Mixed Precision Training Setup

Set up automatic mixed precision (AMP) training with PyTorch:

    import torch
    from torch.cuda.amp import autocast, GradScaler

    def train_amp(model, dataloader, optimizer, criterion):
        scaler = GradScaler()
        model.train()

        for inputs, targets in dataloader:
            optimizer.zero_grad()

            # TODO: Wrap forward pass with autocast()

            # TODO: Scale loss and backward with scaler

            # TODO: Step optimizer with scaler

        return model

Solution:

    import torch
    from torch.cuda.amp import autocast, GradScaler

    def train_amp(model, dataloader, optimizer, criterion):
        scaler = GradScaler()
        model.train()

        for inputs, targets in dataloader:
            optimizer.zero_grad()

            # Automatic mixed precision context
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, targets)

            # Scale loss and compute gradients
            scaler.scale(loss).backward()

            # Unscale and step
            scaler.step(optimizer)
            scaler.update()

        return model

Key Points:

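  • autocast() runs eligible ops in lower precision (FP16 by default on CUDA; pass dtype=torch.bfloat16 for BF16) and keeps precision-sensitive ops in FP32
  • GradScaler handles loss scaling to prevent gradient underflow
  • Always call scaler.update() after scaler.step(optimizer) in every iteration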
  • ~2x speedup on modern GPUs with minimal accuracy loss
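The gradient underflow that loss scaling guards against can be reproduced without a GPU. The sketch below uses only the standard library: to_fp16 is a hypothetical helper that emulates FP16 storage, the gradient value 1e-8 is an assumed example, and the fixed scale of 1024 is chosen for illustration only (GradScaler adjusts its scale dynamically during training).

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


true_grad = 1e-8  # assumed tiny gradient, below the smallest positive FP16 value
scale = 1024.0    # hypothetical fixed loss scale for this illustration

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales the gradient by the same factor, so it lands
# in the representable FP16 range.
scaled_fp16 = to_fp16(true_grad * scale)

# Unscaling happens in higher precision before the optimizer step.
recovered = scaled_fp16 / scale

print(unscaled_fp16, scaled_fp16, recovered)
```

The unscaled gradient is lost entirely, while the scaled and then unscaled value stays within a fraction of a percent of the original.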