Gradient Descent
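The basic algorithm for training neural networks: repeatedly move each weight a small step against the gradient of the loss, w ← w − α · ∂L/∂w, where α is the learning rate.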
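The update at the heart of gradient descent can be sketched in a few lines. This is a minimal illustration, not a training loop for a real network: the toy loss L(w) = (w − 3)² and the names `gradient_descent` and `toy_grad` are assumptions made for the example.

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Repeatedly apply w <- w - lr * grad_fn(w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w


# Hypothetical toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_final = gradient_descent(toy_grad, w0=0.0, lr=0.1, steps=100)
print(w_final)  # approaches the minimizer w = 3
```

In a neural network, `w` would be the full set of weights and `grad_fn` would come from backpropagation, but the update rule has the same shape.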
Types of Gradient Descent
| Variant | How it works | Pros/Cons |
|---|---|---|
| Batch GD | Uses the entire training set for each update | Stable but slow |
| Stochastic GD | Uses one example per update | Fast but noisy |
| Mini-batch GD | Uses a small batch (32-512 examples) per update | Best of both (standard) |
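The three variants differ only in how many examples feed each update, so one batching helper can express all of them. The sketch below assumes a hypothetical helper `minibatches` and uses a list of integers as a stand-in for training examples.

```python
import random


def minibatches(data, batch_size, shuffle=True, seed=0):
    """Yield successive batches of `data`, each with at most `batch_size` items."""
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


examples = list(range(10))  # stand-in for 10 training examples

batch_gd = list(minibatches(examples, batch_size=len(examples)))  # batch GD
sgd = list(minibatches(examples, batch_size=1))                   # stochastic GD
mini = list(minibatches(examples, batch_size=4))                  # mini-batch GD

# Number of weight updates each variant performs per epoch.
print(len(batch_gd), len(sgd), len(mini))
```

Each yielded batch would be used to compute one gradient and apply one update, which is why smaller batches give more frequent but noisier steps.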
Challenges in Optimization
1. Local Minima
The loss landscape has many valleys. Gradient descent might get stuck in a local minimum instead of finding the global minimum.
Neural Networks are Different
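Surprisingly, in very high-dimensional spaces (millions of parameters), local minima are rare. For a critical point to be a minimum, the loss must curve upward along every direction at once, which becomes increasingly unlikely as the number of dimensions grows. Most critical points are therefore saddle points.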
2. Learning Rate Selection
- Too large: Overshoots the minimum and may diverge
- Too small: Very slow convergence; can stall before reaching a good solution
- Just right: Steady convergence to minimum
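The three regimes above can be seen on a one-dimensional example. The sketch assumes a toy loss L(w) = w² (gradient 2w), so each update multiplies w by (1 − 2·lr); the function name `run` and the specific learning rates are illustrative choices.

```python
def run(lr, steps=50, w0=1.0):
    """Minimize the toy loss L(w) = w^2 (gradient 2w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


too_large = run(lr=1.1)     # each step multiplies w by -1.2, so |w| grows
too_small = run(lr=0.001)   # each step multiplies w by 0.998, barely moving
reasonable = run(lr=0.1)    # each step multiplies w by 0.8, shrinking quickly

print(abs(too_large), abs(too_small), abs(reasonable))
```

On real loss surfaces the thresholds depend on the local curvature, so a suitable learning rate is usually found by experiment.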
3. Flat Regions
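Some regions of the loss surface (plateaus) have very small gradients. Because each update is proportional to the gradient, the weights barely move there and training can slow to a crawl.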
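A small experiment makes the slowdown concrete. The sketch assumes a toy loss L(w) = 1 − exp(−w²), chosen only because it is nearly flat far from its minimum at w = 0; the names `grad` and `descend` are hypothetical.

```python
import math


def grad(w):
    """Gradient of the toy loss L(w) = 1 - exp(-w^2), nearly flat far from 0."""
    return 2.0 * w * math.exp(-w * w)


def descend(w, lr=0.1, steps=100):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


from_plateau = descend(4.0)   # starts where the gradient is about 1e-6
from_slope = descend(1.0)     # starts where the gradient is about 0.7

print(4.0 - from_plateau)  # tiny: the weight has barely moved
print(from_slope)          # close to the minimum at w = 0
```

Both runs use the same learning rate and step count; only the starting gradient differs.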
Momentum
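Momentum keeps a running velocity built from past gradients. This accelerates progress in directions where successive gradients agree and dampens oscillations in directions where they keep changing sign.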
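One common way to write the momentum update is v ← β·v + g followed by w ← w − α·v, where g is the current gradient. The text does not fix a formulation, so this form is an assumption; another widely used variant scales the gradient by (1 − β). The sketch below compares plain gradient descent with momentum on a hypothetical toy loss that is steep in one direction and shallow in the other.

```python
def grad(x, y):
    """Gradient of the toy loss L(x, y) = 0.5 * (x^2 + 50 * y^2)."""
    return x, 50.0 * y


def plain_gd(x, y, lr=0.02, steps=100):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


def momentum_gd(x, y, lr=0.02, beta=0.9, steps=100):
    vx, vy = 0.0, 0.0  # velocity starts at zero
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx  # assumed form: v <- beta * v + g
        vy = beta * vy + gy
        x, y = x - lr * vx, y - lr * vy
    return x, y


def distance_to_minimum(point):
    x, y = point
    return (x * x + y * y) ** 0.5


dist_plain = distance_to_minimum(plain_gd(1.0, 1.0))
dist_momentum = distance_to_minimum(momentum_gd(1.0, 1.0))
print(dist_plain, dist_momentum)  # momentum ends closer to the minimum at (0, 0)
```

With the same learning rate, plain gradient descent is held back by the shallow x direction, while the accumulated velocity lets momentum cover that direction faster.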
Modern Optimizers
AdaGrad
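Adapts the learning rate per parameter by dividing each step by the square root of that parameter's accumulated squared gradients, so rarely updated parameters take larger steps. Good for sparse data.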
RMSprop
Fixes AdaGrad's ever-shrinking learning rate by replacing the running sum of squared gradients with an exponential moving average. Works well for RNNs.
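The difference between the two accumulators is easiest to see side by side. This is a scalar sketch under assumed default hyperparameters (`lr`, `decay`, `eps`); the function names are hypothetical and do not correspond to any library API.

```python
import math


def adagrad_step(w, g, cache, lr=0.1, eps=1e-8):
    """AdaGrad: divide by the root of the running SUM of squared gradients."""
    cache = cache + g * g
    w = w - lr * g / (math.sqrt(cache) + eps)
    return w, cache


def rmsprop_step(w, g, avg, lr=0.1, decay=0.9, eps=1e-8):
    """RMSprop: divide by the root of a moving AVERAGE of squared gradients."""
    avg = decay * avg + (1.0 - decay) * g * g
    w = w - lr * g / (math.sqrt(avg) + eps)
    return w, avg


# Feed both a constant gradient of 1.0 and compare the size of the last step.
w_a, cache = 0.0, 0.0
w_r, avg = 0.0, 0.0
for _ in range(100):
    prev_a, prev_r = w_a, w_r
    w_a, cache = adagrad_step(w_a, 1.0, cache)
    w_r, avg = rmsprop_step(w_r, 1.0, avg)

adagrad_last_step = abs(w_a - prev_a)   # keeps shrinking as the sum grows
rmsprop_last_step = abs(w_r - prev_r)   # levels off near lr
print(adagrad_last_step, rmsprop_last_step)
```

Because AdaGrad's sum never decreases, its steps keep getting smaller, whereas the moving average lets RMSprop forget old gradients.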
Adam (Adaptive Moment Estimation)
Combines momentum with per-parameter adaptive learning rates. A very widely used default optimizer.
AdamW
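Adam with decoupled weight decay: the decay is applied directly to the weights instead of being added to the gradient as an L2 penalty. Often generalizes better than Adam with L2 regularization.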
Adam Algorithm
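A minimal sketch of one Adam update for a single scalar weight is shown below. The default hyperparameters (lr = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the commonly quoted ones; the function name `adam_step` and the optional `weight_decay` argument (an AdamW-style decoupled decay, off by default) are assumptions for illustration.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment (momentum term)
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment (squared gradients)
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for zero init
    v_hat = v / (1.0 - beta2 ** t)
    # Adaptive step plus optional decoupled weight decay (assumed AdamW-style).
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v


# First step: the update size is about lr, whatever the gradient's scale.
w1, m1, v1 = adam_step(0.0, -6.0, 0.0, 0.0, t=1, lr=0.1)
print(w1)

# Minimize the hypothetical toy loss L(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)
print(w)  # ends near the minimizer w = 3
```

The state `m` and `v` must be carried between calls, one pair per parameter; in practice these are arrays with the same shape as the weights.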
Learning Rate Scheduling
Instead of using a fixed learning rate, adjust it during training:
- Step decay: Drop the LR by a fixed factor every N epochs
- Exponential decay: LR = LR₀ · e^(-kt), where k is the decay rate and t is the epoch (or step)
- Cosine annealing: Smooth decrease following cosine curve
- Warmup: Start small, gradually increase, then decay
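The schedules listed above can each be written as a small function of the current epoch or step. This is a sketch under assumed parameter names and defaults (`drop`, `every`, `k`, `total`, `warmup`, `lr_min`); the linear warmup followed by cosine decay is one possible combination, not the only one.

```python
import math


def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the LR by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)


def exponential_decay(lr0, t, k=0.05):
    """LR = LR0 * e^(-k t)."""
    return lr0 * math.exp(-k * t)


def cosine_annealing(lr0, t, total, lr_min=0.0):
    """Decrease smoothly from lr0 at t = 0 to lr_min at t = total."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * t / total))


def warmup_then_cosine(lr0, t, total, warmup=5):
    """Linear warmup for `warmup` steps, then cosine annealing."""
    if t < warmup:
        return lr0 * (t + 1) / warmup
    return cosine_annealing(lr0, t - warmup, total - warmup)


for epoch in (0, 10, 50, 99):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch, total=100),
          warmup_then_cosine(0.1, epoch, total=100))
```

In a training loop, the chosen function would be called once per epoch (or step) and its result passed to the optimizer as the current learning rate.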
Exercises
Exercise 1: Gradient Descent Step
Weight w = 5.0, gradient ∂L/∂w = 2.0, learning rate α = 0.1. What is w after one update?
Exercise 2: Momentum
With momentum β=0.9, if gradients are consistently 1.0, what does the velocity converge to?
Exercise 3: Learning Rate
Why might we want to decrease the learning rate as training progresses?