Lesson 5: Gradient Computation

The Chain Rule

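For a composition of functions, the derivative is the product of the outer function's derivative and the inner function's derivative: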

# If y = f(g(x)), then:
dy/dx = dy/dg * dg/dx

# Example: y = (2x + 1)^2
# Let u = 2x + 1, then y = u^2
# dy/dx = 2u * 2 = 4(2x + 1)
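
The worked example above can also be traced step by step in code. This is a minimal sketch; `chain_rule_derivative` is a hypothetical helper name chosen for illustration. It evaluates the inner function first, then multiplies the two local derivatives.

```python
def chain_rule_derivative(x):
    """Derivative of y = (2x + 1)^2 via the chain rule (illustrative helper)."""
    u = 2 * x + 1      # inner function u = g(x)
    dy_du = 2 * u      # outer derivative: d(u^2)/du
    du_dx = 2          # inner derivative: d(2x + 1)/dx
    return dy_du * du_dx

# Compare against the closed form 4(2x + 1) at a few points
for x in [0.0, 1.0, -0.5]:
    print(f"x={x}: chain rule={chain_rule_derivative(x)}, closed form={4 * (2 * x + 1)}")
```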
      

Backpropagation

Backpropagation applies the chain rule to a neural network, one layer at a time:

# Loss L depends on weights through many layers
# ∂L/∂W1 = ∂L/∂y * ∂y/∂h2 * ∂h2/∂h1 * ∂h1/∂W1

# Forward pass: compute each layer's output and store the activations
# Backward pass: multiply the local gradients together, from the loss back to each weight
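
The four-factor product above can be sketched on a tiny scalar network. Everything specific here is an assumption made for illustration: the tanh activations, the squared-error loss, the weight values, and the function names `forward`, `loss`, and `grad_w1`. The forward pass stores `h1` and `h2`, and the backward step reuses them.

```python
import math

def forward(x, w1, w2, w3):
    """Forward pass through a scalar chain; returns the output and stored activations."""
    h1 = math.tanh(w1 * x)
    h2 = math.tanh(w2 * h1)
    y = w3 * h2
    return y, h1, h2

def loss(x, t, w1, w2, w3):
    """Squared-error loss (assumed for this sketch)."""
    y, _, _ = forward(x, w1, w2, w3)
    return (y - t) ** 2

def grad_w1(x, t, w1, w2, w3):
    """dL/dw1 as the product dL/dy * dy/dh2 * dh2/dh1 * dh1/dw1."""
    y, h1, h2 = forward(x, w1, w2, w3)   # forward pass stores h1 and h2
    dL_dy = 2 * (y - t)
    dy_dh2 = w3
    dh2_dh1 = (1 - h2 ** 2) * w2         # tanh'(z) = 1 - tanh(z)^2
    dh1_dw1 = (1 - h1 ** 2) * x
    return dL_dy * dy_dh2 * dh2_dh1 * dh1_dw1

x, t = 0.5, 1.0
w1, w2, w3 = 0.8, -1.2, 0.6

# Central-difference estimate of dL/dw1 as an independent check
eps = 1e-6
numeric = (loss(x, t, w1 + eps, w2, w3) - loss(x, t, w1 - eps, w2, w3)) / (2 * eps)
print(f"chain rule: {grad_w1(x, t, w1, w2, w3):.8f}")
print(f"numerical:  {numeric:.8f}")
```

The two printed values should agree closely, which suggests the chained product is the same quantity a direct perturbation of `w1` measures.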
      

Example: Simple Network

# y = σ(Wx + b)
# ∂y/∂W = σ'(Wx+b) * x
# ∂y/∂b = σ'(Wx+b) * 1

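# The gradient of the loss points in the direction of steepest increase,
# so gradient descent moves each weight a small step in the opposite direction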
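
As a rough illustration of the formulas above, the sketch below computes both partial derivatives for a single sigmoid neuron and takes one gradient descent step. The input values, the squared-error loss, and the learning rate `lr` are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron with a scalar output: y = sigmoid(w . x + b)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
target = 1.0
lr = 0.5  # learning rate (assumed value)

def loss(w, b):
    """Squared-error loss (assumed for this sketch)."""
    return (sigmoid(w @ x + b) - target) ** 2

z = w @ x + b
y = sigmoid(z)
sigma_prime = y * (1 - y)      # sigma'(z) for the sigmoid

dy_dw = sigma_prime * x        # dy/dW = sigma'(Wx + b) * x
dy_db = sigma_prime * 1.0      # dy/db = sigma'(Wx + b) * 1

# Chain with the loss derivative, then step against the gradient
dL_dy = 2 * (y - target)
w_new = w - lr * dL_dy * dy_dw
b_new = b - lr * dL_dy * dy_db

print(f"loss before update: {loss(w, b):.6f}")
print(f"loss after update:  {loss(w_new, b_new):.6f}")
```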
      

📝 Quick Quiz

Q1: If y = f(g(x)), what is dy/dx according to the chain rule?

A) dy/dx = dy/dg + dg/dx
B) dy/dx = dy/dg × dg/dx ✓
C) dy/dx = dy/dg - dg/dx
D) dy/dx = (dy/dg) / (dg/dx)

Q2: For y = (3x + 2)³, what is dy/dx?

Let u = 3x + 2, then y = u³
dy/dx = 3u² × 3 = 9(3x + 2)² ✓

Q3: In backpropagation, why do we store activations during the forward pass?

A) To reduce memory usage
B) To compute gradients during the backward pass ✓
C) To initialize weights
D) To speed up inference

Q4: For y = σ(Wx + b), what is ∂y/∂b?

∂y/∂b = σ'(Wx + b) × 1 = σ'(Wx + b) ✓

Q5: What does the gradient tell us in neural network training?

A) The final prediction accuracy
B) The direction and magnitude of weight updates ✓
C) The number of layers needed
D) The input data distribution

💻 Coding Exercises

Exercise 1: Manual Backpropagation

Implement a simple 2-layer neural network and compute gradients manually:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

# Network: x → h → y
# h = sigmoid(W1*x + b1)
# y = sigmoid(W2*h + b2)

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    z2 = W2 @ h + b2
    y = sigmoid(z2)
    return y, h, z1, z2  # save for backprop

def backward(x, target, y, h, z1, z2, W2):
    # Output layer gradients
    dL_dy = 2 * (y - target)  # MSE derivative
    dy_dz2 = sigmoid_deriv(z2)
    dL_dz2 = dL_dy * dy_dz2
    
    dL_dW2 = dL_dz2 @ h.T
    dL_db2 = dL_dz2
    
    # Hidden layer gradients (chain rule)
    dL_dh = W2.T @ dL_dz2
    dh_dz1 = sigmoid_deriv(z1)
    dL_dz1 = dL_dh * dh_dz1
    
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1
    
    return dL_dW1, dL_db1, dL_dW2, dL_db2

# Test with sample data (seeded so the random weights are reproducible)
np.random.seed(0)
x = np.array([[0.5], [0.3]])
target = np.array([[1.0]])
W1, b1 = np.random.randn(3, 2), np.random.randn(3, 1)
W2, b2 = np.random.randn(1, 3), np.random.randn(1, 1)

y, h, z1, z2 = forward(x, W1, b1, W2, b2)
grads = backward(x, target, y, h, z1, z2, W2)
print(f"Output: {y}")
print("Gradients computed for all weights!")
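
One way to gain confidence in manually derived gradients is a finite-difference check. The sketch below is self-contained and rebuilds the same 2-layer sigmoid network. The helper names `loss`, `analytic_dW1`, and `numerical_dW1`, and the seeded generator, are assumptions for illustration. Only `dL/dW1` is checked; the other gradients could be verified the same way.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(x, target, W1, b1, W2, b2):
    """Squared-error loss of the 2-layer network (consistent with dL_dy = 2 * (y - target))."""
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return float(np.sum((y - target) ** 2))

def analytic_dW1(x, target, W1, b1, W2, b2):
    """dL/dW1 using the same chain-rule steps as the exercise's backward()."""
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    dL_dz2 = 2 * (y - target) * y * (1 - y)
    dL_dz1 = (W2.T @ dL_dz2) * h * (1 - h)
    return dL_dz1 @ x.T

def numerical_dW1(x, target, W1, b1, W2, b2, eps=1e-6):
    """Central-difference estimate of dL/dW1, one entry at a time."""
    grad = np.zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            W_plus, W_minus = W1.copy(), W1.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (loss(x, target, W_plus, b1, W2, b2)
                          - loss(x, target, W_minus, b1, W2, b2)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
x = np.array([[0.5], [0.3]])
target = np.array([[1.0]])
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal((1, 1))

max_diff = np.max(np.abs(analytic_dW1(x, target, W1, b1, W2, b2)
                         - numerical_dW1(x, target, W1, b1, W2, b2)))
print(f"max |analytic - numerical| for dL/dW1: {max_diff:.2e}")
```

A large difference would usually point to a missing factor or a transposed matrix in the backward pass.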

Exercise 2: Chain Rule Verification

Verify the chain rule numerically using finite differences:

import numpy as np

def f(x):
    """f(x) = (2x + 1)^2"""
    return (2*x + 1)**2

def df_analytical(x):
    """Analytical derivative: 4(2x + 1)"""
    return 4 * (2*x + 1)

def df_numerical(x, eps=1e-5):
    """Numerical derivative using central finite differences"""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Test at multiple points
test_points = [0.0, 1.0, -0.5, 2.5]
print("Chain Rule Verification:")
print("-" * 40)
for x in test_points:
    analytical = df_analytical(x)
    numerical = df_numerical(x)
    error = abs(analytical - numerical)
    print(f"x={x:5.2f}: analytical={analytical:10.6f}, "
          f"numerical={numerical:10.6f}, error={error:.2e}")

# Now verify for composite: g(x) = sin(x^2)
def g(x):
    return np.sin(x**2)

def dg_chain_rule(x):
    # dg/dx = cos(x^2) * 2x
    return np.cos(x**2) * 2*x

def dg_numerical(x, eps=1e-5):
    return (g(x + eps) - g(x - eps)) / (2 * eps)

print("\nComposite Function g(x) = sin(x²):")
print("-" * 40)
for x in [0.5, 1.0, 2.0]:
    chain = dg_chain_rule(x)
    numerical = dg_numerical(x)
    error = abs(chain - numerical)
    print(f"x={x}: chain_rule={chain:.6f}, numerical={numerical:.6f}, "
          f"error={error:.2e}")

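Expected Output: The analytical and numerical derivatives should agree closely, with errors typically on the order of 1e-9 or smaller for these inputs. Agreement at that level confirms the chain rule derivations.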
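
The size of the error depends on the step `eps`. For central differences, the truncation error shrinks roughly with `eps` squared, while floating-point rounding error grows as `eps` gets very small. The sketch below, with step sizes picked only for illustration, prints the error for g(x) = sin(x²) at x = 2.0 so the trade-off can be observed directly.

```python
import numpy as np

def g(x):
    return np.sin(x**2)

def dg_exact(x):
    # Chain rule: cos(x^2) * 2x
    return np.cos(x**2) * 2 * x

x = 2.0
errors = {}
for eps in [1e-2, 1e-4, 1e-5, 1e-6, 1e-10]:
    numerical = (g(x + eps) - g(x - eps)) / (2 * eps)
    errors[eps] = abs(numerical - dg_exact(x))
    print(f"eps={eps:.0e}: error={errors[eps]:.2e}")
```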