Lesson 3: Backpropagation

The Learning Problem

We have a neural network with millions of parameters (weights and biases). How do we find the values that minimize the loss function?

        The Core Challenge: We can't try all possible weight combinations, because the number of combinations grows exponentially with the number of weights. 
        We need an efficient way to improve each weight based on its effect on the loss.
      

The Solution: Gradient Descent

Compute the loss (how wrong are our predictions?)
Calculate gradients (how does each weight affect the loss?)
Update weights in the direction that reduces loss
Repeat
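
The four steps above can be sketched on a one-parameter problem. The toy loss (w - 3)², the starting point, and the learning rate below are assumptions chosen for illustration, not values from the lesson; the gradient is written out by hand because the loss is simple enough to differentiate directly.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (assumed example)
    return (w - 3.0) ** 2

def grad(w):
    # Hand-derived derivative: d/dw (w - 3)^2 = 2 (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0       # initial guess
alpha = 0.1   # learning rate

for step in range(100):       # step 4: repeat
    current = loss(w)         # step 1: compute the loss
    g = grad(w)               # step 2: how does w affect the loss?
    w = w - alpha * g         # step 3: move in the direction that reduces the loss

print(w, loss(w))
```

Each iteration shrinks the distance to the minimum by a constant factor, so w ends up very close to 3.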

The Chain Rule

Backpropagation is just the chain rule from calculus, applied repeatedly through the network.

Simple Chain Rule Example

If y = f(g(x)), then dy/dx = dy/dg · dg/dx

x → g(x) → f(g(x)) = y

To find how y changes with x, multiply the derivatives along the path.
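
One way to convince yourself of this rule is to compare it against a numerical estimate. The sketch below assumes g(x) = x² and f(u) = sin(u), both picked only for illustration, and checks the chain-rule product against a central finite difference.

```python
import math

def g(x):
    return x ** 2            # inner function (assumed example)

def f(u):
    return math.sin(u)       # outer function (assumed example)

def dy_dx_chain(x):
    dg_dx = 2.0 * x          # derivative of the inner function
    dy_dg = math.cos(g(x))   # derivative of the outer function, evaluated at g(x)
    return dy_dg * dg_dx     # multiply the derivatives along the path

def dy_dx_numeric(x, h=1e-6):
    # Central finite difference: an approximation used only as a cross-check
    return (f(g(x + h)) - f(g(x - h))) / (2.0 * h)

x = 1.5
analytic = dy_dx_chain(x)
numeric = dy_dx_numeric(x)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places.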

In Neural Networks

A neural network is a composition of functions:

output = f_L(f_{L-1}(...f_2(f_1(x))...))

Where each f_i is: activation(W_i · input + b_i)

To get ∂Loss/∂W₁, we apply the chain rule through all layers. Here f_i also stands for the output of layer i, so the network output is f_L:

∂Loss/∂W₁ = ∂Loss/∂f_L · ∂f_L/∂f_{L-1} · ... · ∂f_2/∂f_1 · ∂f_1/∂W₁
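
To make the product of derivatives concrete, here is a minimal sketch for a network with two scalar layers. The tanh activation, the squared-error loss, and every numeric value are assumptions for illustration. The hand-written chain-rule product is compared with a finite-difference estimate of the same gradient.

```python
import math

def forward(w1, b1, w2, b2, x):
    a1 = math.tanh(w1 * x + b1)    # layer 1 output: activation(W_1 * input + b_1)
    a2 = math.tanh(w2 * a1 + b2)   # layer 2 output: activation(W_2 * a1 + b_2)
    return a1, a2

def loss_fn(w1, b1, w2, b2, x, target):
    _, a2 = forward(w1, b1, w2, b2, x)
    return (a2 - target) ** 2

# Assumed parameter values and a single assumed training example
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2
x, target = 1.0, 0.7

a1, a2 = forward(w1, b1, w2, b2, x)

# Chain rule, one factor per arrow in the network (tanh'(z) = 1 - tanh(z)^2)
dloss_da2 = 2.0 * (a2 - target)      # dLoss/d(output)
da2_da1 = (1.0 - a2 ** 2) * w2       # d(layer 2 output)/d(layer 1 output)
da1_dw1 = (1.0 - a1 ** 2) * x        # d(layer 1 output)/dW_1
dloss_dw1 = dloss_da2 * da2_da1 * da1_dw1

# Finite-difference cross-check
h = 1e-6
numeric = (loss_fn(w1 + h, b1, w2, b2, x, target)
           - loss_fn(w1 - h, b1, w2, b2, x, target)) / (2.0 * h)

print(dloss_dw1, numeric)
```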
      

Forward Pass vs Backward Pass

Forward Pass

Compute predictions

            Input → Layer 1 → Layer 2 → ... → Output

            (Compute activations)

Compute Loss

How wrong are we?

            Loss = (prediction - target)²

            (or cross-entropy, etc.)

Backward Pass

Compute gradients

            Output → ... → Layer 2 → Layer 1

            (Chain rule backwards)

Update Weights

Gradient descent

            W_new = W_old - α·∂Loss/∂W

            (α = learning rate)
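
The four stages can be put together in a short training loop. This is a sketch under stated assumptions: a one-weight, one-bias linear model prediction = w·x + b, toy data generated from y = 2x + 1, a mean squared-error loss, and gradients derived by hand rather than by a framework.

```python
# Toy data generated from y = 2x + 1 (assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0   # initial parameters
alpha = 0.05      # learning rate

for epoch in range(2000):
    grad_w, grad_b, total_loss = 0.0, 0.0, 0.0
    for x, target in zip(xs, ys):
        prediction = w * x + b           # forward pass
        error = prediction - target
        total_loss += error ** 2         # compute loss: (prediction - target)^2
        grad_w += 2.0 * error * x        # backward pass: dLoss/dw
        grad_b += 2.0 * error            # backward pass: dLoss/db
    n = len(xs)
    w -= alpha * grad_w / n              # update: W_new = W_old - alpha * dLoss/dW
    b -= alpha * grad_b / n

print(w, b)
```

With these settings the parameters approach w = 2 and b = 1, the values used to generate the data.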

Intuition for Gradients

What is a Gradient?

The gradient tells us:

Direction: Which way increases the loss most
Magnitude: How sensitive the loss is to this weight

          Key Insight: We want to decrease loss, so we move in the opposite direction 
          of the gradient. Hence "gradient descent."
        

# Weight update rule (pseudocode)
for each weight w:
    gradient = ∂Loss/∂w    # How much does w affect loss?
    w = w - α * gradient   # Move opposite to gradient
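
Direction and magnitude can both be seen in a runnable two-weight example. The loss 10·w1² + 0.1·w2² is an assumption chosen so that the loss is far more sensitive to w1 than to w2; the gradient is derived by hand.

```python
def loss(w1, w2):
    # Assumed toy loss: very sensitive to w1, barely sensitive to w2
    return 10.0 * w1 ** 2 + 0.1 * w2 ** 2

def gradient(w1, w2):
    # Hand-derived partial derivatives of the toy loss
    return 20.0 * w1, 0.2 * w2

w1, w2 = 1.0, 1.0
alpha = 0.01

g1, g2 = gradient(w1, w2)   # magnitude: |g1| is much larger than |g2|

before = loss(w1, w2)
after_descent = loss(w1 - alpha * g1, w2 - alpha * g2)   # step opposite to the gradient
after_ascent = loss(w1 + alpha * g1, w2 + alpha * g2)    # step along the gradient

print(before, after_descent, after_ascent)
```

Stepping against the gradient lowers the loss, stepping along it raises the loss, and the larger gradient component belongs to the weight the loss is more sensitive to.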
        

Computational Graph

Modern deep learning frameworks (PyTorch, TensorFlow, JAX) record the operations performed during the forward pass as a computational graph. PyTorch, for example, builds this graph dynamically as the code runs. The recorded graph is what enables automatic differentiation.

import torch

# Define computation
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3*x + 1  # Some computation

# Compute gradients
y.backward()  # Backpropagation!

print(x.grad)  # dy/dx = 2x + 3 = 7 when x=2
      

The framework automatically:

Tracks all operations on tensors with requires_grad=True
Builds a computation graph
Applies chain rule backwards when .backward() is called
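Accumulates gradients in the .grad attribute (they add up across calls to .backward() until reset, which is why training loops zero them each step)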
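
To show roughly what such a framework does internally, here is a toy reverse-mode autodiff for scalars in plain Python. The `Value` class and all of its attributes are hypothetical names invented for this sketch; it is not PyTorch's implementation, and it supports only addition and multiplication.

```python
class Value:
    """Hypothetical scalar that records how it was computed (toy stand-in for a tensor)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents            # nodes this value was computed from
        self._local_grads = local_grads    # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __radd__ = __add__   # lets `1 + value` work
    __rmul__ = __mul__   # lets `3 * value` work

    def backward(self):
        # Order the graph so every node comes after the nodes it was computed from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0   # d(output)/d(output)
        # Walk the graph backwards, applying the chain rule at each node.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad   # accumulate into .grad


x = Value(2.0)
y = x * x + 3 * x + 1   # same computation as the PyTorch example
y.backward()
print(x.grad)           # 7.0, matching dy/dx = 2x + 3 at x = 2
```

Because x feeds into the result along several paths (twice through x * x, once through 3 * x), its gradient is the sum of the contributions from each path, which is why `backward` accumulates with `+=`.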

Why Backprop is Efficient

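Computing gradients naively, one parameter at a time (for example by perturbing each weight and rerunning the forward pass), costs roughly O(n²), where n is the number of parameters: there are n gradients, and each one takes O(n) work. Backprop computes all n gradients in O(n), about the cost of a small constant number of forward passes.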

        The Trick: Reuse computations! The gradient for W₁ depends on intermediate values 
        (∂Loss/∂Layer_L, ∂Loss/∂Layer_{L-1}, etc.) that the gradients for the later layers' weights need as well. Backprop goes 
        backwards from the output, computing each of these values once and reusing it for every earlier layer.
      

Without backprop: Compute each gradient independently → O(n²)
With backprop: Share intermediate computations → O(n)
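
The difference can be counted directly on a toy model. The sketch below assumes a chain of scalar layers with no activation, y = w_L · ... · w_1 · x, so that the number of parameters n equals the number of layers. It counts multiplications for both strategies and checks that they produce the same gradients. The function names are made up for this example.

```python
import math

def naive_gradients(weights, x):
    """Compute each dy/dw_i independently, redoing the full product every time."""
    grads, mults = [], 0
    for i in range(len(weights)):
        g = x
        for j, w in enumerate(weights):
            if j != i:          # dy/dw_i = x * product of all other weights
                g *= w
                mults += 1
        grads.append(g)
    return grads, mults

def backprop_gradients(weights, x):
    """One forward pass that caches layer inputs, then one backward pass."""
    mults = 0
    inputs = [x]                            # inputs[i] is the input to layer i
    for w in weights:
        inputs.append(inputs[-1] * w)       # forward pass
        mults += 1
    grads = [0.0] * len(weights)
    upstream = 1.0                          # dy/dy
    for i in reversed(range(len(weights))):
        grads[i] = upstream * inputs[i]     # dy/dw_i = dy/d(layer i output) * layer i input
        upstream *= weights[i]              # pass the gradient back to layer i's input
        mults += 2
    return grads, mults

n_layers = 50
weights = [1.0 + 0.01 * k for k in range(n_layers)]   # assumed values
naive, naive_mults = naive_gradients(weights, 2.0)
shared, backprop_mults = backprop_gradients(weights, 2.0)

print(naive_mults, backprop_mults)
```

With 50 layers this prints 2450 and 150: the naive count grows as n·(n-1), the backprop count as 3n. Real networks have matrix-valued layers, so the constants differ, but the sharing pattern is the same.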

Exercises

Exercise 1: Chain Rule Practice

Let y = (3x + 2)². Find dy/dx using the chain rule.

Exercise 2: Gradient Flow

In a network with 3 layers, if the gradient at the output is 0.5, and the local gradients through each layer are 0.8, 0.6, and 0.4 respectively, what is the gradient at the first layer?

Exercise 3: Weight Update

A weight w = 2.0 has gradient ∂Loss/∂w = -0.5. With learning rate α = 0.1, what is the new weight value?