🚧 Lesson 3 of 8 in Level 02

Backpropagation

How neural networks learn. The chain rule, gradient computation, and weight updates.

The Learning Problem

We have a neural network with millions of parameters (weights and biases). How do we find the values that minimize the loss function?

The Core Challenge: We can't try all possible weight combinations (exponential search space). We need an efficient way to improve weights based on their effect on the loss.

The Solution: Gradient Descent

  1. Compute the loss (how wrong are our predictions?)
  2. Calculate gradients (how does each weight affect the loss?)
  3. Update weights in the direction that reduces loss
  4. Repeat
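
The four steps above can be sketched on a toy problem with a single weight. This is a minimal illustration, not a real network: the loss (w - 3)², the starting value, the learning rate, and the names `loss`, `grad`, and `gradient_descent` are all assumptions chosen for the example.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (chosen for illustration)
    return (w - 3.0) ** 2


def grad(w):
    # Analytic derivative of the toy loss: d/dw (w - 3)^2 = 2 (w - 3)
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.1, steps=50):
    history = []
    for _ in range(steps):          # 4. repeat
        history.append(loss(w))     # 1. compute the loss
        g = grad(w)                 # 2. compute the gradient
        w = w - lr * g              # 3. step in the direction that reduces loss
    return w, history


w_final, history = gradient_descent(w=0.0)
print(w_final, history[0], history[-1])
```

With these settings the weight should move from 0 toward 3 while the recorded loss shrinks at every step.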

The Chain Rule

Backpropagation is just the chain rule from calculus, applied repeatedly through the network.

Simple Chain Rule Example

If y = f(g(x)), then dy/dx = dy/dg · dg/dx

x → g(x) → f(g(x)) = y

To find how y changes with x, multiply the derivatives along the path.
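
One way to convince yourself of this is to compare the chain-rule result with a numerical estimate. The sketch below assumes g(x) = x² and f(u) = sin(u); both functions and the helper names are picked only for illustration.

```python
import math


def g(x):
    return x ** 2          # inner function


def f(u):
    return math.sin(u)     # outer function


def dy_dx_chain(x):
    dg_dx = 2.0 * x                # derivative of the inner function
    dy_dg = math.cos(g(x))         # derivative of the outer function, evaluated at g(x)
    return dy_dg * dg_dx           # multiply the derivatives along the path


def dy_dx_numeric(x, h=1e-6):
    # Central finite difference on the composed function
    return (f(g(x + h)) - f(g(x - h))) / (2.0 * h)


x = 1.5
analytic = dy_dx_chain(x)
numeric = dy_dx_numeric(x)
print(analytic, numeric)
```

The two printed values should agree to several decimal places.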

In Neural Networks

A neural network is a composition of functions:

output = f_L(f_{L-1}(...f_2(f_1(x))...))
where each f_i is: activation(W_i · input + b_i)

To get ∂Loss/∂W₁, we apply the chain rule through all layers:

∂Loss/∂W₁ = ∂Loss/∂f_L · ∂f_L/∂f_{L-1} · ... · ∂f_2/∂f_1 · ∂f_1/∂W₁
(here f_L is the network output)
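
To make the product of derivatives concrete, here is a hedged sketch for a two-layer network where every quantity is a single number. The tanh activation, the squared-error loss, and all parameter values are assumptions made for this example. The result is checked against a finite-difference estimate.

```python
import math


def forward(x, w1, b1, w2, b2):
    a1 = math.tanh(w1 * x + b1)     # f_1: first layer
    a2 = math.tanh(w2 * a1 + b2)    # f_2: second layer, the network output
    return a1, a2


def loss_fn(output, target):
    return (output - target) ** 2


def grad_w1(x, target, w1, b1, w2, b2):
    a1, a2 = forward(x, w1, b1, w2, b2)
    dloss_da2 = 2.0 * (a2 - target)        # dLoss/df_2
    da2_da1 = (1.0 - a2 ** 2) * w2         # df_2/df_1 (tanh' = 1 - tanh^2)
    da1_dw1 = (1.0 - a1 ** 2) * x          # df_1/dW_1
    return dloss_da2 * da2_da1 * da1_dw1   # chain rule: multiply along the path


# Illustrative values only
x, target = 0.5, 0.2
w1, b1, w2, b2 = 0.8, 0.1, -1.2, 0.3

analytic = grad_w1(x, target, w1, b1, w2, b2)

# Finite-difference check: nudge w1 and watch the loss
h = 1e-6
loss_plus = loss_fn(forward(x, w1 + h, b1, w2, b2)[1], target)
loss_minus = loss_fn(forward(x, w1 - h, b1, w2, b2)[1], target)
numeric = (loss_plus - loss_minus) / (2.0 * h)

print(analytic, numeric)
```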

Forward Pass vs Backward Pass

Forward Pass

Compute predictions

Input → Layer 1 → Layer 2 → ... → Output
(Compute activations)

Compute Loss

How wrong are we?

Loss = (prediction - target)²
(or cross-entropy, etc.)

Backward Pass

Compute gradients

Output → ... → Layer 2 → Layer 1
(Chain rule backwards)

Update Weights

Gradient descent

W_new = W_old - α·∂Loss/∂W
(α = learning rate)
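
The four stages can be put together in a short training loop. The sketch below assumes a one-input linear model pred = w·x + b, a mean squared error loss, and a tiny made-up dataset. The gradients are derived by hand, so no framework is needed.

```python
# Toy data generated from y = 2x + 1 (made up for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def train(xs, ys, lr=0.05, steps=1000):
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        # Forward pass: compute predictions
        preds = [w * x + b for x in xs]

        # Compute loss: mean squared error
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        losses.append(loss)

        # Backward pass: chain rule through loss -> prediction -> parameter
        dloss_dpred = [2.0 * (p - y) / n for p, y in zip(preds, ys)]
        grad_w = sum(d * x for d, x in zip(dloss_dpred, xs))   # dpred/dw = x
        grad_b = sum(dloss_dpred)                               # dpred/db = 1

        # Update weights: W_new = W_old - lr * gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


w, b, losses = train(xs, ys)
print(round(w, 4), round(b, 4), losses[0], losses[-1])
```

With this learning rate the parameters should approach w = 2 and b = 1, the values used to generate the data.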

Intuition for Gradients

What is a Gradient?

The gradient tells us:

  • Direction: Which way increases the loss most
  • Magnitude: How sensitive the loss is to this weight
Key Insight: We want to decrease loss, so we move in the opposite direction of the gradient. Hence "gradient descent."

# Weight update rule for each weight w:
gradient = ∂Loss/∂w      # How much does w affect loss?
w = w - α * gradient     # Move opposite to gradient

Computational Graph

Modern deep learning frameworks (PyTorch, TensorFlow, JAX) build a computational graph during the forward pass. This graph records all operations, enabling automatic differentiation.

import torch

# Define computation
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3*x + 1  # Some computation

# Compute gradients
y.backward()   # Backpropagation!
print(x.grad)  # dy/dx = 2x + 3 = 7 when x=2

The framework automatically:

  1. Tracks all operations on tensors with requires_grad=True
  2. Builds a computation graph
  3. Applies chain rule backwards when .backward() is called
  4. Accumulates gradients in .grad attribute
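
To see what these four steps might look like under the hood, here is a toy scalar autodiff in plain Python. It is a teaching sketch only: the `Value` class and its attributes are hypothetical and far simpler than what PyTorch, TensorFlow, or JAX actually do.

```python
class Value:
    """A number that remembers how it was computed (toy example)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0                   # gradients accumulate here
        self._parents = parents           # graph edges: inputs of this operation
        self._local_grads = local_grads   # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __rmul__ = __mul__

    def __pow__(self, exponent):
        # exponent is assumed to be a plain number, not a Value
        local = exponent * self.data ** (exponent - 1)
        return Value(self.data ** exponent, (self,), (local,))

    def backward(self):
        # Order the graph so every node is handled before the nodes it was built from
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0   # d(self)/d(self)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated


x = Value(2.0)
y = x ** 2 + 3 * x + 1   # the forward pass builds the graph
y.backward()             # the backward pass applies the chain rule
print(x.grad)            # prints 7.0, matching dy/dx = 2x + 3 at x = 2
```

Note that `x` is used twice in the expression, so its gradient is the sum of two contributions (4 from x² and 3 from 3x). This is why gradients are accumulated rather than overwritten.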

Why Backprop is Efficient

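Computing each gradient independently, for example by nudging one parameter at a time and rerunning the forward pass, costs roughly O(n²), where n is the number of parameters: n evaluations at O(n) each. Backprop computes all n gradients in O(n), about the cost of a small constant number of forward passes.

The Trick: Reuse computations! The gradient with respect to each layer's output (∂Loss/∂Layer_L, ∂Loss/∂Layer_{L-1}, etc.) is needed by every layer before it. Backprop works backwards from the output, computes each of these values once, and caches it for the earlier layers.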

Without backprop: Compute each gradient independently → O(n²)
With backprop: Share intermediate computations → O(n)
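
The difference can be counted directly on a simplified case. The sketch below assumes a chain of scalar "layers" with one weight each and no activation (output = w_L · ... · w_1 · x), so n equals the number of layers. It is an illustration of the reuse idea under those assumptions, not a general complexity proof, and the function names are made up for the example.

```python
def grads_independent(ws, x):
    """Recompute the whole product for every weight: about n^2 multiplications."""
    grads, mults = [], 0
    for i in range(len(ws)):
        g = x
        for j, w in enumerate(ws):
            if j != i:          # d(output)/d(w_i) is the product of everything except w_i
                g *= w
                mults += 1
        grads.append(g)
    return grads, mults


def grads_backprop(ws, x):
    """One forward pass that caches activations, one backward pass that reuses them."""
    mults = 0
    acts = [x]                  # acts[i] is the input to layer i
    for w in ws:
        acts.append(w * acts[-1])
        mults += 1

    grads = [0.0] * len(ws)
    upstream = 1.0              # d(output)/d(output)
    for i in reversed(range(len(ws))):
        grads[i] = upstream * acts[i]   # local gradient of w_i * input w.r.t. w_i is the input
        upstream = upstream * ws[i]     # pass the gradient on to the previous layer
        mults += 2
    return grads, mults


ws = [1.5, -0.5, 2.0, 0.25, -1.0, 3.0, 0.5, -2.0]
x = 0.7
slow, slow_mults = grads_independent(ws, x)
fast, fast_mults = grads_backprop(ws, x)
print(slow_mults, fast_mults)   # prints 56 24 for these 8 weights
```

Both functions return the same gradients. Doubling the number of weights roughly quadruples the first count but only doubles the second.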

Exercises

Exercise 1: Chain Rule Practice

Let y = (3x + 2)². Find dy/dx using the chain rule.

Exercise 2: Gradient Flow

In a network with 3 layers, if the gradient at the output is 0.5, and the local gradients through each layer are 0.8, 0.6, and 0.4 respectively, what is the gradient at the first layer?

Exercise 3: Weight Update

A weight w = 2.0 has gradient ∂Loss/∂w = -0.5. With learning rate α = 0.1, what is the new weight value?