Lesson 5: Optimization

Gradient Descent

The basic algorithm for training neural networks:

Initialize weights randomly
Repeat until convergence:
    1. Compute gradients: ∇L = ∂Loss/∂W
    2. Update weights: W = W - α · ∇L

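Where α (the learning rate) controls the step size. The gradient points in the direction of steepest increase, so subtracting it moves the weights downhill.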
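The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a training routine: the one-parameter loss L(w) = (w - 3)² and the names `loss_grad` and `gradient_descent` are assumptions made for the example, and it uses a fixed start and a fixed number of steps in place of random initialization and a convergence check.

```python
def loss_grad(w):
    """Gradient of the illustrative loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=100):
    """Repeat the update W = W - alpha * grad for a fixed number of steps."""
    for _ in range(steps):
        w = w - lr * loss_grad(w)
    return w

w_final = gradient_descent(w=0.0)
print(w_final)  # ends very close to the minimizer w = 3
```

With this loss each update shrinks the distance to the minimum by a factor of 0.8, which is why 100 steps are more than enough here.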
      

Intuition: Imagine walking down a hill in fog. You can only feel the slope under your feet. Gradient descent says: "Walk in the direction that feels steepest downward."
      

Types of Gradient Descent

Variant	Data used per update	Trade-offs
Batch GD	All training data	Stable but slow
Stochastic GD	One example at a time	Fast but noisy
Mini-batch GD	Small batches (32-512 examples)	Best of both (the standard choice)
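The three variants differ only in how many examples feed each update, so a single loop with a batch-size parameter can cover all of them. The sketch below assumes a toy one-weight model y = w · x with noise-free synthetic data; `minibatches` and `train` are hypothetical helper names, not part of any library.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled batches; batch_size=len(data) is batch GD, 1 is SGD."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def train(data, batch_size, lr=0.01, epochs=50, seed=0):
    """Fit y = w * x by descending the mean squared error of each batch."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            # Average gradient of (w*x - y)^2 over the examples in the batch
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

data = [(float(x), 2.0 * x) for x in range(1, 9)]  # synthetic data, true w = 2
w_batch = train(data, batch_size=len(data))  # batch GD: 1 update per epoch
w_sgd = train(data, batch_size=1)            # stochastic GD: 8 updates per epoch
w_mini = train(data, batch_size=4)           # mini-batch GD: 2 updates per epoch
```

Because the toy data has no noise, all three runs recover the same weight; on real data the smaller batches would give noisier but cheaper updates.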

Challenges in Optimization

1. Local Minima

The loss landscape has many valleys. Gradient descent might get stuck in a local minimum instead of finding the global minimum.

Neural Networks are Different

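Surprisingly, in very high-dimensional spaces (millions of parameters), poor local minima appear to be rare. For a critical point to be a minimum, the loss must curve upward in every direction at once, which becomes unlikely as the number of dimensions grows. Most critical points are saddle points instead.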

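Saddle Points: Points where the loss curves up in some directions and down in others. The gradient is zero exactly at the saddle, but gradient descent can escape by following the downward directions, although progress may be slow while the gradient is small.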
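A two-variable toy function makes the escape behavior concrete. The function f(x, y) = x² - y² is an assumption chosen for illustration (it has a saddle at the origin and no minimum at all, so y keeps growing once it escapes); `descend` is a hypothetical helper name.

```python
def descend(x, y, lr=0.1, steps=50):
    """Gradient descent on f(x, y) = x^2 - y^2, which has a saddle at (0, 0)."""
    for _ in range(steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y

stuck = descend(1.0, 0.0)     # starts exactly on the ridge: y never moves
escaped = descend(1.0, 1e-3)  # a tiny offset in y is amplified every step
print(stuck, escaped)
```

Only the start that sits exactly on the ridge slides into the saddle and stays there. Any small perturbation along the downward direction grows, which is one reason noisy mini-batch updates tend not to stay trapped.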
        

2. Learning Rate Selection

Too large: Updates overshoot the minimum and training may diverge
Too small: Convergence is very slow and training can stall in flat regions
Just right: Steady convergence to a minimum
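The three regimes are easy to reproduce on a one-dimensional example. The sketch below assumes the loss L(w) = w², whose gradient is 2w, so each update multiplies w by (1 - 2α); the specific learning rates are illustrative choices, and `run` is a hypothetical helper.

```python
def run(lr, w=1.0, steps=20):
    """Minimize L(w) = w^2 (gradient 2w) and return the final distance |w| from the minimum."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return abs(w)

too_large = run(lr=1.1)    # each step multiplies w by -1.2, so |w| grows
too_small = run(lr=0.001)  # each step multiplies w by 0.998, so w barely moves
reasonable = run(lr=0.3)   # each step multiplies w by 0.4, so w shrinks quickly
print(too_large, too_small, reasonable)
```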

3. Flat Regions

Some regions of the loss landscape (plateaus) have very small gradients. Because each update is proportional to the gradient, training can slow to a crawl there.

Momentum

Momentum accelerates progress along directions where the gradient is consistent and dampens oscillations along directions where it keeps changing sign:

velocity = β · velocity + ∇L        # Accumulate velocity
weights = weights - α · velocity    # Update with momentum

β (momentum coefficient) is typically 0.9
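To see the acceleration, the sketch below compares plain gradient descent with the momentum update above on a gentle slope. The loss L(w) = 0.005 · w² is an assumption picked so that gradients are small and point the same way for many steps; `plain_gd` and `momentum_gd` are hypothetical names.

```python
def grad(w):
    """Gradient of the illustrative loss L(w) = 0.005 * w^2 (a gentle slope)."""
    return 0.01 * w

def plain_gd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum_gd(w, lr=0.1, beta=0.9, steps=100):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)  # accumulate velocity
        w = w - lr * velocity                 # update with momentum
    return w

w_plain = plain_gd(1.0)
w_momentum = momentum_gd(1.0)
print(w_plain, w_momentum)  # momentum ends noticeably closer to the minimum at 0
```

Both runs use the same learning rate and the same number of steps; the only difference is that the momentum run reuses past gradients.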
      

Physics Analogy: Imagine a ball rolling down a hill. It builds up speed in consistent directions and rolls through small bumps. Without momentum, the optimizer is like a walker who stops after every step and carries no speed into the next one.
      

Modern Optimizers

AdaGrad

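Adapts the learning rate for each parameter based on its accumulated squared gradients: parameters with large past gradients take smaller steps. Good for sparse data.

❌ The accumulated sum only grows, so the learning rate can decay too much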
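A single-parameter sketch shows where the decay comes from. It assumes the commonly stated form of AdaGrad (divide the step by the square root of the running sum of squared gradients); `adagrad_step`, the constant gradient of 1.0, and the hyperparameter values are illustrative assumptions.

```python
import math

def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update for a single parameter."""
    accum = accum + g * g                      # sum of all squared gradients so far
    w = w - lr * g / (math.sqrt(accum) + eps)  # step scaled by the gradient history
    return w, accum

w, accum = 0.0, 0.0
step_sizes = []
for _ in range(100):
    w_new, accum = adagrad_step(w, g=1.0, accum=accum)
    step_sizes.append(abs(w_new - w))
    w = w_new

print(step_sizes[0], step_sizes[-1])  # the step shrinks even though the gradient never changes
```

With a constant gradient the step after t updates is roughly lr / √t, so it keeps shrinking for as long as training runs.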

RMSprop

Fixes AdaGrad's decaying learning rate by using an exponential moving average of squared gradients instead of a running sum. Works well for RNNs.

✓ Good for non-stationary objectives
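The same single-parameter experiment with a moving average in place of the sum shows the difference. This is a sketch under assumptions: `rmsprop_step` is a hypothetical name, and the decay rate `rho=0.9` and the constant gradient of 1.0 are illustrative values not given in the lesson.

```python
import math

def rmsprop_step(w, g, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update for a single parameter."""
    avg_sq = rho * avg_sq + (1 - rho) * g * g   # moving average of squared gradients
    w = w - lr * g / (math.sqrt(avg_sq) + eps)  # step scaled by recent gradient size
    return w, avg_sq

w, avg_sq = 0.0, 0.0
step_sizes = []
for _ in range(100):
    w_new, avg_sq = rmsprop_step(w, g=1.0, avg_sq=avg_sq)
    step_sizes.append(abs(w_new - w))
    w = w_new

print(step_sizes[-1])  # settles near lr instead of shrinking toward zero
```

Because old gradients are forgotten, the average tracks the recent gradient magnitude and the step size levels off. The first few steps are larger only because the average starts at zero.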

Adam (Adaptive Moment Estimation)

Combines momentum with per-parameter adaptive learning rates. One of the most widely used optimizers.

✓ Default choice for most problems

AdamW

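Adam with decoupled weight decay: the decay is applied directly to the weights instead of being added to the gradient as an L2 penalty. This often generalizes better than Adam with L2 regularization.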

✓ Standard for training transformers
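The sketch below shows one way to write a single AdamW step for a scalar weight, reusing the Adam update rule given later in this lesson. The function name `adamw_step` and the `weight_decay=0.01` value are assumptions for illustration; the key point is only that the decay term touches the weight directly and never passes through m or v.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar weight; t is the 1-based step count."""
    w = w - lr * weight_decay * w        # decoupled decay, applied to the weight itself
    m = beta1 * m + (1 - beta1) * g      # first moment uses the raw gradient only
    v = beta2 * v + (1 - beta2) * g * g  # second moment uses the raw gradient only
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# With a zero gradient, the only change to the weight is the decay itself.
w_after, m_after, v_after = adamw_step(w=1.0, g=0.0, m=0.0, v=0.0, t=1)
print(w_after)
```

Adam with L2 regularization would instead add `weight_decay * w` to `g` before the moment updates, so the penalty would be rescaled by the adaptive term along with the gradient.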

Adam Algorithm

# Adam update rule (g = gradient at step t; m and v start at 0; t starts at 1)
m = β₁·m + (1-β₁)·g        # First moment (momentum)
v = β₂·v + (1-β₂)·g²       # Second moment (adaptive rate)

m_hat = m / (1-β₁^t)       # Bias correction
v_hat = v / (1-β₂^t)

w = w - α · m_hat / (√v_hat + ε)

Typical values: β₁=0.9, β₂=0.999, ε=1e-8
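The update rule translates almost line for line into Python. The sketch below handles a single scalar weight and assumes an illustrative loss L(w) = (w - 3)²; `adam_minimize` is a hypothetical helper, and the learning rate of 0.1 is chosen only to keep the example short.

```python
import math

def adam_minimize(grad_fn, w, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a scalar weight for a fixed number of steps."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):            # t starts at 1 for bias correction
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (adaptive rate)
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad(w):
    """Gradient of the illustrative loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_one_step = adam_minimize(grad, w=0.0, steps=1, lr=0.1)
w_final = adam_minimize(grad, w=0.0, steps=500, lr=0.1)
print(w_one_step, w_final)
```

On the first step the bias-corrected moments reduce to g and g², so the weight moves by almost exactly the learning rate regardless of how large the gradient is. That is the sense in which Adam's step size is adaptive.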
      

Learning Rate Scheduling

Instead of keeping the learning rate fixed, adjust it during training:

Step decay: Multiply the LR by a fixed factor every N epochs
Exponential decay: LR = LR₀ · e^(-kt), where k is the decay rate and t is the epoch or step
Cosine annealing: Smooth decrease following a cosine curve
Warmup: Start small, gradually increase, then decay
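The first three schedules above can each be written as a small function of the current epoch or step. This is a sketch with assumed parameter names and defaults (`drop`, `every`, `k`, `lr_min`), not the API of any particular library.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the LR by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, t, k=0.1):
    """LR = LR0 * e^(-k t)."""
    return lr0 * math.exp(-k * t)

def cosine_annealing(lr0, t, total, lr_min=0.0):
    """Decrease from lr0 at t=0 to lr_min at t=total along half a cosine wave."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / total))

for epoch in (0, 10, 25, 50):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch, total=50))
```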

Warmup is crucial for transformers: Start with a very small LR, increase it linearly over the first few thousand steps, then decay. This helps prevent instability early in training.
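A warmup schedule can be sketched as a piecewise function of the step count. The lesson does not say which decay follows the warmup, so the linear decay to zero here is an assumption, as are the function name `warmup_then_linear_decay` and the example values (peak LR 1e-3, 1,000 warmup steps, 10,000 total steps).

```python
def warmup_then_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 500, 1000, 5500, 10000):
    print(step, warmup_then_linear_decay(step, peak_lr=1e-3,
                                         warmup_steps=1000, total_steps=10000))
```

The two branches meet at `step == warmup_steps`, where both give exactly `peak_lr`, so the schedule has no jump at the end of the warmup.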
      

Exercises

Exercise 1: Gradient Descent Step

Weight w = 5.0, gradient ∂L/∂w = 2.0, learning rate α = 0.1. What is w after one update?

Exercise 2: Momentum

Using the update velocity = β · velocity + ∇L with β = 0.9, if the gradient is consistently 1.0, what value does the velocity converge to?

Exercise 3: Learning Rate

Why might we want to decrease the learning rate as training progresses?