🚧 Lesson 5 of 8 in Level 02

Optimization

Gradient descent, learning rates, and optimizers. How to actually update weights.

Gradient Descent

The basic algorithm for training neural networks:

Initialize weights randomly
Repeat until convergence:
    1. Compute gradients: ∇L = ∂Loss/∂W
    2. Update weights: W = W - α · ∇L

Where α (learning rate) controls step size.

Intuition: Imagine walking down a hill in fog. You can only feel the slope under your feet. Gradient descent says: "Walk in the direction that feels steepest downward."
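
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer: the toy loss L(w) = (w - 3)^2, the function name gradient_descent, the tolerance, and the fixed starting point (used instead of random initialization so the run is repeatable) are all assumptions made for the example.

```python
def gradient_descent(grad_fn, w0, lr=0.1, tol=1e-8, max_steps=10_000):
    """Minimise a function of one weight with plain gradient descent."""
    w = w0
    for _ in range(max_steps):
        g = grad_fn(w)        # 1. compute the gradient dL/dw
        if abs(g) < tol:      # treat a near-zero gradient as convergence
            break
        w = w - lr * g        # 2. update: W = W - alpha * gradient
    return w


# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

With these settings the weight should end up very close to 3, the minimum of the toy loss.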

Types of Gradient Descent

| Variant | How it works | Pros/Cons |
|---|---|---|
| Batch GD | Use all training data | Stable but slow |
| Stochastic GD | One example at a time | Fast but noisy |
| Mini-batch GD | Small batches (32-512) | Best of both (standard) |
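
The three variants differ only in how many examples feed each gradient estimate. As a rough sketch, assuming a hypothetical one-weight model y ≈ w · x with mean squared error and synthetic data (none of which comes from the lesson), a single function covers all three by changing batch_size:

```python
import random


def minibatch_gd(xs, ys, batch_size, lr=0.3, epochs=50, seed=0):
    """Fit y ~ w * x by gradient descent on mean squared error."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the current batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


# Synthetic data with true slope 2.0 plus a little noise (an assumption).
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(256)]
ys = [2.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

w_batch = minibatch_gd(xs, ys, batch_size=len(xs))  # batch GD: 1 update per epoch
w_sgd = minibatch_gd(xs, ys, batch_size=1)          # stochastic GD: 256 updates per epoch
w_mini = minibatch_gd(xs, ys, batch_size=32)        # mini-batch GD: 8 updates per epoch
print(w_batch, w_sgd, w_mini)
```

All three estimates should land near the true slope of 2. The stochastic run typically jitters the most around it, because each update sees a single noisy example.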

Challenges in Optimization

1. Local Minima

The loss landscape has many valleys. Gradient descent might get stuck in a local minimum instead of finding the global minimum.

Neural Networks are Different

Surprisingly, in very high-dimensional spaces (millions of parameters), poor local minima appear to be rare in practice. Most critical points (points where the gradient is zero) are saddle points instead.

Saddle Points: Points where some directions go up, others go down. Gradient descent can escape these by following the downward directions.
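
A small sketch can make the escape behaviour concrete. It assumes the textbook saddle f(x, y) = x² - y², which curves up along x and down along y; the function name and starting points are illustrative only.

```python
def descend_saddle(x, y, lr=0.1, steps=50):
    """Gradient descent on f(x, y) = x**2 - y**2, which has a saddle at (0, 0)."""
    for _ in range(steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y


# A tiny offset along the downward direction (y) is enough to escape.
x_end, y_end = descend_saddle(1.0, 1e-3)
print(x_end, y_end)

# Starting with y exactly 0 gives a zero gradient along y, so the
# iterate slides into the saddle and stays there.
x_stuck, y_stuck = descend_saddle(1.0, 0.0)
print(x_stuck, y_stuck)
```

In the first run x shrinks toward 0 while y grows away from the saddle. In the second run y never moves, which is why a little noise (for example from mini-batches) helps in practice.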

2. Learning Rate Selection
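
The step size α from the update rule has to be chosen with care. As a hedged illustration, assume the toy loss L(w) = w², for which each update multiplies w by (1 - 2α); the three rates below are picked only to show the three regimes.

```python
def final_weight(lr, w=1.0, steps=100):
    """Run gradient descent on the toy loss L(w) = w**2 and return the last w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w  # the gradient of w**2 is 2w
    return w


for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: |w| after 100 steps = {abs(final_weight(lr)):.3g}")
```

On this toy problem the smallest rate barely moves w from its starting value, the middle rate drives it to nearly zero, and the largest rate overshoots more on every step and diverges.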

3. Flat Regions

Some areas have very small gradients. Training can slow to a crawl.
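
One common source of flat regions is a saturating activation. The sketch below assumes a hypothetical toy loss L(w) = (sigmoid(w) - 0.5)², whose minimum is at w = 0 but which is almost flat for large |w|.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad(w):
    """Gradient of the toy loss L(w) = (sigmoid(w) - 0.5)**2."""
    s = sigmoid(w)
    return 2.0 * (s - 0.5) * s * (1.0 - s)


def descend(w, lr=0.1, steps=1000):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


w_from_slope = descend(2.0)     # starts where the gradient is reasonably large
w_from_plateau = descend(10.0)  # starts on the flat, saturated part
print(w_from_slope, w_from_plateau)
```

With the same learning rate and step count, the run that starts at 2.0 should reach the neighbourhood of the minimum, while the run that starts at 10.0 hardly moves because its gradient is tiny.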

Momentum

Momentum helps accelerate in consistent directions and dampen oscillations:

velocity = β · velocity + ∇L        # Accumulate velocity
weights = weights - α · velocity    # Update with momentum

β (momentum coefficient) is typically 0.9.

Physics Analogy: Imagine a ball rolling down a hill. It builds up momentum in consistent directions and rolls through small bumps. Without momentum, it's like walking step by step.
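
The two update lines translate directly into code. The sketch below assumes a toy ill-conditioned loss f(x, y) = 0.5 · (x² + 100 · y²), steep along y and shallow along x, and compares β = 0 (plain gradient descent) with β = 0.9. The loss, the starting point, and the step count are illustrative choices.

```python
def grad(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    return [w[0], 100.0 * w[1]]


def loss(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def run(beta, lr=0.01, steps=200):
    """Momentum update from the lesson; beta=0 reduces to plain gradient descent."""
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        velocity = [beta * v + gi for v, gi in zip(velocity, g)]  # accumulate velocity
        w = [wi - lr * v for wi, v in zip(w, velocity)]           # update with momentum
    return w


loss_plain = loss(run(beta=0.0))
loss_momentum = loss(run(beta=0.9))
print(loss_plain, loss_momentum)
```

In this setup the learning rate is capped by the steep direction, so plain gradient descent crawls along the shallow one. The accumulated velocity lets the momentum run make faster progress there and finish with a much lower loss.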

Modern Optimizers

AdaGrad

Adapts the learning rate per parameter by dividing it by the square root of the accumulated squared gradients. Good for sparse data.

❌ The accumulated sum only grows, so the effective learning rate can decay too much
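
A minimal single-parameter sketch of the AdaGrad update shows why the step size decays. The function name, the constant gradient of 1.0, and the hyperparameter values are assumptions chosen to make the effect easy to see.

```python
import math


def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update for a single parameter."""
    accum = accum + g * g                     # running sum of squared gradients
    step = lr * g / (math.sqrt(accum) + eps)  # per-parameter scaled step
    return w - step, accum


w, accum = 0.0, 0.0
step_sizes = []
for _ in range(100):
    w_new, accum = adagrad_step(w, 1.0, accum)  # constant gradient of 1.0
    step_sizes.append(abs(w_new - w))
    w = w_new

print(step_sizes[0], step_sizes[-1])
```

Even though the gradient never changes, the step shrinks roughly as lr / √t, because the accumulator keeps growing.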

RMSprop

Fixes AdaGrad's decaying learning rate by using an exponential moving average of squared gradients instead of a running sum. Works well for RNNs.

✓ Good for non-stationary objectives
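
As a hedged single-parameter sketch, the RMSprop update replaces the running sum with an exponential moving average. The function name, the decay of 0.9, and the constant gradient are assumptions for illustration.

```python
import math


def rmsprop_step(w, g, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update for a single parameter."""
    avg_sq = decay * avg_sq + (1.0 - decay) * g * g  # moving average of g^2
    step = lr * g / (math.sqrt(avg_sq) + eps)
    return w - step, avg_sq


w, avg_sq = 0.0, 0.0
step_sizes = []
for _ in range(200):
    w_new, avg_sq = rmsprop_step(w, 1.0, avg_sq)  # constant gradient of 1.0
    step_sizes.append(abs(w_new - w))
    w = w_new

print(step_sizes[-1])
```

With a constant gradient the moving average settles near g², so the step size levels off near lr instead of shrinking toward zero.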

Adam (Adaptive Moment Estimation)

Combines momentum with per-parameter adaptive learning rates. One of the most widely used optimizers.

✓ Default choice for most problems

AdamW

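Adam with decoupled weight decay: the decay shrinks the weights directly instead of being added to the gradient as an L2 penalty. Often generalizes better than Adam with L2 regularization.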

✓ Standard for training transformers

Adam Algorithm

# Adam update rule
m = β₁·m + (1-β₁)·g        # First moment (momentum)
v = β₂·v + (1-β₂)·g²       # Second moment (adaptive rate)
m_hat = m / (1-β₁^t)       # Bias correction
v_hat = v / (1-β₂^t)
w = w - α · m_hat / (√v_hat + ε)

Typical values: β₁=0.9, β₂=0.999, ε=1e-8
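
The update rule maps onto a short single-parameter function. This is a sketch under stated assumptions: the function name adam_step and the toy loss (w - 3)² are made up for the example, the learning rate of 0.1 is larger than usual so the toy run finishes quickly, and the optional weight_decay argument illustrates AdamW-style decoupled decay rather than reproducing any particular library's implementation.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g       # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g   # second moment (adaptive rate)
    m_hat = m / (1.0 - beta1 ** t)          # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay (AdamW-style): shrink w directly, outside
    # the adaptive scaling. With weight_decay=0.0 this is plain Adam.
    w = w - lr * weight_decay * w
    return w, m, v


# Minimise the toy loss (w - 3)^2 with plain Adam.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)
print(w)
```

The step counter must start at 1, since the bias-correction terms divide by 1 - β^t, which is zero at t = 0.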

Learning Rate Scheduling

Instead of using a fixed learning rate, adjust it during training.

Warmup is crucial for transformers: start with a very small LR, increase it linearly over the first few thousand steps, then decay it. This helps prevent instability early in training.
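
A warmup schedule can be written as a function of the step number. In this sketch the name lr_at_step, the peak rate, the 4000 warmup steps, the total step count, and the linear shape of the decay are all assumptions; the lesson only says "then decay", and other shapes (such as cosine or inverse square root) are also common.

```python
def lr_at_step(step, peak_lr=1e-3, warmup_steps=4000, total_steps=100_000):
    """Linear warmup to peak_lr, then linear decay to zero (decay shape assumed)."""
    if step < warmup_steps:
        # Ramp from a very small rate up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1999, 3999, 52_000, 100_000):
    print(step, lr_at_step(step))
```

In a training loop the returned value would be used as α for that step's weight update.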

Exercises

Exercise 1: Gradient Descent Step

Weight w = 5.0, gradient ∂L/∂w = 2.0, learning rate α = 0.1. What is w after one update?

Exercise 2: Momentum

With momentum β=0.9, if gradients are consistently 1.0, what does the velocity converge to?

Exercise 3: Learning Rate

Why might we want to decrease the learning rate as training progresses?