The Learning Problem
We have a neural network with millions of parameters (weights and biases). How do we find the values that minimize the loss function?
The Solution: Gradient Descent
- Compute the loss (how wrong are our predictions?)
- Calculate gradients (how does each weight affect the loss?)
- Update weights in the direction that reduces loss
- Repeat
The Chain Rule
Backpropagation is just the chain rule from calculus, applied repeatedly through the network.
Simple Chain Rule Example
If y = f(g(x)), then dy/dx = dy/dg · dg/dx
To find how y changes with x, multiply the derivatives along the path.
In Neural Networks
A neural network is a composition of functions:
To get ∂Loss/∂W₁, we apply the chain rule through all layers:
Forward Pass vs Backward Pass
Forward Pass
Compute predictions
(Compute activations)
Compute Loss
How wrong are we?
(or cross-entropy, etc.)
Backward Pass
Compute gradients
(Chain rule backwards)
Update Weights
Gradient descent
(α = learning rate)
Intuition for Gradients
What is a Gradient?
The gradient tells us:
- Direction: Which way increases the loss most
- Magnitude: How sensitive the loss is to this weight
Computational Graph
Modern deep learning frameworks (PyTorch, TensorFlow, JAX) build a computational graph during the forward pass. This graph records all operations, enabling automatic differentiation.
The framework automatically:
- Tracks all operations on tensors with requires_grad=True
- Builds a computation graph
- Applies chain rule backwards when .backward() is called
- Accumulates gradients in .grad attribute
Why Backprop is Efficient
Computing gradients naively would be O(n²) where n is the number of parameters. Backprop makes it O(n).
Without backprop: Compute each gradient independently → O(n²)
With backprop: Share intermediate computations → O(n)
Exercises
Exercise 1: Chain Rule Practice
Let y = (3x + 2)². Find dy/dx using the chain rule.
Exercise 2: Gradient Flow
In a network with 3 layers, if the gradient at the output is 0.5, and the local gradients through each layer are 0.8, 0.6, and 0.4 respectively, what is the gradient at the first layer?
Exercise 3: Weight Update
A weight w = 2.0 has gradient ∂Loss/∂w = -0.5. With learning rate α = 0.1, what is the new weight value?