🚧 Lesson 5 of 25 in Level 05

Gradient Computation

Manual backpropagation: applying the chain rule by hand.

The Chain Rule

For composed functions:

# If y = f(g(x)), then:
#   dy/dx = dy/dg * dg/dx
#
# Example: y = (2x + 1)^2
# Let u = 2x + 1, then y = u^2
# dy/dx = 2u * 2 = 4(2x + 1)
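
As a minimal sketch of the example above, the derivative can be assembled from the two local derivatives and compared with the closed form 4(2x + 1). The function name `chain_rule_derivative` is made up for illustration.

```python
def chain_rule_derivative(x):
    """Differentiate y = (2x + 1)^2 by multiplying local derivatives."""
    u = 2 * x + 1      # inner function u = g(x)
    dy_du = 2 * u      # outer derivative: d(u^2)/du
    du_dx = 2          # inner derivative: d(2x + 1)/dx
    return dy_du * du_dx


for x in [0.0, 1.0, -0.5]:
    closed_form = 4 * (2 * x + 1)
    print(x, chain_rule_derivative(x), closed_form)
```

Each printed row should show the same value twice, since the product of local derivatives is exactly the closed form.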

Backpropagation

Applying chain rule to neural networks:

# Loss L depends on weights through many layers
# ∂L/∂W1 = ∂L/∂y * ∂y/∂h2 * ∂h2/∂h1 * ∂h1/∂W1
# Compute forward (store activations)
# Compute backward (chain gradients)
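
The four-factor product above can be sketched on a tiny scalar network. The architecture (two sigmoid layers followed by a linear output), the squared-error loss, and all names and values here are assumptions chosen for illustration, not part of the lesson. The forward pass returns a cache of activations, and the backward pass reuses them to form each local derivative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2, w3):
    """Forward pass: keep every intermediate value for the backward pass."""
    h1 = sigmoid(w1 * x)
    h2 = sigmoid(w2 * h1)
    y = w3 * h2
    return y, (x, h1, h2)


def backward(target, y, cache, w2, w3):
    """Backward pass: multiply local derivatives from the loss back to w1."""
    x, h1, h2 = cache
    dL_dy = 2 * (y - target)          # L = (y - target)^2
    dy_dh2 = w3                       # y = w3 * h2
    dh2_dh1 = h2 * (1 - h2) * w2      # h2 = sigmoid(w2 * h1)
    dh1_dw1 = h1 * (1 - h1) * x       # h1 = sigmoid(w1 * x)
    return dL_dy * dy_dh2 * dh2_dh1 * dh1_dw1


def loss(x, target, w1, w2, w3):
    y, _ = forward(x, w1, w2, w3)
    return (y - target) ** 2


x, target = 0.5, 1.0
w1, w2, w3 = 0.8, -1.2, 0.6
y, cache = forward(x, w1, w2, w3)
grad_w1 = backward(target, y, cache, w2, w3)

# Central finite difference on w1 as an independent check.
eps = 1e-6
numeric = (loss(x, target, w1 + eps, w2, w3)
           - loss(x, target, w1 - eps, w2, w3)) / (2 * eps)
print(f"backprop: {grad_w1:.8f}, numeric: {numeric:.8f}")
```

Because the sigmoid derivative can be written as s(1 - s), the stored activations `h1` and `h2` are all the backward pass needs, which is why the forward pass keeps them.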

Example: Simple Network

# y = σ(Wx + b)
# ∂y/∂W = σ'(Wx + b) * x
# ∂y/∂b = σ'(Wx + b) * 1
# Gradient tells us how to update weights
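
A minimal sketch of these two gradients for a single neuron with a two-element input follows. The helper names `neuron` and `neuron_grads`, the sample values, and the step size `lr` are illustrative assumptions. The last lines nudge W along ∂y/∂W to show that the gradient points in the direction that increases y; in training the same idea is applied to a loss, with the opposite sign.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(W, x, b):
    """y = sigmoid(W . x + b) for a single neuron with vector input x."""
    z = sum(w_i * x_i for w_i, x_i in zip(W, x)) + b
    return sigmoid(z)


def neuron_grads(W, x, b):
    """Analytical gradients: dy/dW_i = sigmoid'(z) * x_i, dy/db = sigmoid'(z)."""
    y = neuron(W, x, b)
    ds = y * (1 - y)                   # sigmoid'(z) written via the output y
    return [ds * x_i for x_i in x], ds


W, x, b = [0.4, -0.7], [0.5, 0.3], 0.1
dy_dW, dy_db = neuron_grads(W, x, b)

# Central finite difference on the bias as a sanity check.
eps = 1e-6
numeric_db = (neuron(W, x, b + eps) - neuron(W, x, b - eps)) / (2 * eps)
print("dy/dW:", dy_dW)
print(f"dy/db: {dy_db:.8f} (numeric {numeric_db:.8f})")

# Moving W a small step along dy/dW should increase the output y.
lr = 0.1
W_new = [w_i + lr * g_i for w_i, g_i in zip(W, dy_dW)]
print(neuron(W, x, b), "->", neuron(W_new, x, b))
```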

šŸ“ Quick Quiz

Q1: If y = f(g(x)), what is dy/dx according to the chain rule?

A) dy/dx = dy/dg + dg/dx
B) dy/dx = dy/dg × dg/dx ✓
C) dy/dx = dy/dg - dg/dx
D) dy/dx = (dy/dg) / (dg/dx)

Q2: For y = (3x + 2)³, what is dy/dx?

Let u = 3x + 2, then y = u³
dy/dx = 3u² × 3 = 9(3x + 2)² ✓

Q3: In backpropagation, why do we store activations during the forward pass?

A) To reduce memory usage
B) To compute gradients during the backward pass ✓
C) To initialize weights
D) To speed up inference

Q4: For y = σ(Wx + b), what is ∂y/∂b?

∂y/∂b = σ'(Wx + b) × 1 = σ'(Wx + b) ✓

Q5: What does the gradient tell us in neural network training?

A) The final prediction accuracy
B) The direction and magnitude of weight updates ✓
C) The number of layers needed
D) The input data distribution

💻 Coding Exercises

Exercise 1: Manual Backpropagation

Implement a simple 2-layer neural network and compute gradients manually:

import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)


# Network: x → h → y
# h = sigmoid(W1*x + b1)
# y = sigmoid(W2*h + b2)

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    z2 = W2 @ h + b2
    y = sigmoid(z2)
    return y, h, z1, z2  # save for backprop


def backward(x, target, y, h, z1, z2, W2):
    # Output layer gradients
    dL_dy = 2 * (y - target)  # MSE derivative
    dy_dz2 = sigmoid_deriv(z2)
    dL_dz2 = dL_dy * dy_dz2
    dL_dW2 = dL_dz2 @ h.T
    dL_db2 = dL_dz2

    # Hidden layer gradients (chain rule)
    dL_dh = W2.T @ dL_dz2
    dh_dz1 = sigmoid_deriv(z1)
    dL_dz1 = dL_dh * dh_dz1
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1

    return dL_dW1, dL_db1, dL_dW2, dL_db2


# Test with sample data
x = np.array([[0.5], [0.3]])
target = np.array([[1.0]])
W1, b1 = np.random.randn(3, 2), np.random.randn(3, 1)
W2, b2 = np.random.randn(1, 3), np.random.randn(1, 1)

y, h, z1, z2 = forward(x, W1, b1, W2, b2)
grads = backward(x, target, y, h, z1, z2, W2)
print(f"Output: {y}")
print("Gradients computed for all weights!")
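
One way to gain confidence in hand-written gradients like these is a finite-difference gradient check. The sketch below is self-contained and restates the exercise's equations in compact form; the helper names `loss`, `analytic_grads`, and `numeric_grad`, the fixed seed, and the step size `eps` are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def loss(x, target, W1, b1, W2, b2):
    """Squared-error loss of the 2-layer sigmoid network."""
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return float(np.sum((y - target) ** 2))


def analytic_grads(x, target, W1, b1, W2, b2):
    """Manual backprop, using sigmoid'(z) = s * (1 - s) on stored outputs."""
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    dL_dz2 = 2 * (y - target) * y * (1 - y)
    dL_dz1 = (W2.T @ dL_dz2) * h * (1 - h)
    return dL_dz1 @ x.T, dL_dz1, dL_dz2 @ h.T, dL_dz2


def numeric_grad(param, loss_fn, eps=1e-6):
    """Central finite difference, perturbing one entry of `param` at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        plus = loss_fn()
        param[idx] = original - eps
        minus = loss_fn()
        param[idx] = original          # restore the entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


rng = np.random.default_rng(0)         # fixed seed for repeatable runs
x = np.array([[0.5], [0.3]])
target = np.array([[1.0]])
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal((1, 1))


def loss_fn():
    return loss(x, target, W1, b1, W2, b2)


analytic = analytic_grads(x, target, W1, b1, W2, b2)
max_diffs = []
for name, param, grad in zip(["W1", "b1", "W2", "b2"],
                             [W1, b1, W2, b2], analytic):
    diff = np.max(np.abs(grad - numeric_grad(param, loss_fn)))
    max_diffs.append(diff)
    print(f"{name}: max |analytic - numeric| = {diff:.2e}")
```

If a backward-pass formula were wrong (for example, a missing transpose or a dropped sigmoid derivative), the corresponding difference would be many orders of magnitude larger than the others.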

Exercise 2: Chain Rule Verification

Verify the chain rule numerically using finite differences:

import numpy as np


def f(x):
    """f(x) = (2x + 1)^2"""
    return (2*x + 1)**2


def df_analytical(x):
    """Analytical derivative: 4(2x + 1)"""
    return 4 * (2*x + 1)


def df_numerical(x, eps=1e-5):
    """Numerical derivative using finite differences"""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# Test at multiple points
test_points = [0.0, 1.0, -0.5, 2.5]
print("Chain Rule Verification:")
print("-" * 40)
for x in test_points:
    analytical = df_analytical(x)
    numerical = df_numerical(x)
    error = abs(analytical - numerical)
    print(f"x={x:5.2f}: analytical={analytical:10.6f}, "
          f"numerical={numerical:10.6f}, error={error:.2e}")


# Now verify for composite: g(x) = sin(x^2)
def g(x):
    return np.sin(x**2)


def dg_chain_rule(x):
    # dg/dx = cos(x^2) * 2x
    return np.cos(x**2) * 2*x


def dg_numerical(x, eps=1e-5):
    return (g(x + eps) - g(x - eps)) / (2 * eps)


print("\nComposite Function g(x) = sin(x²):")
print("-" * 40)
for x in [0.5, 1.0, 2.0]:
    chain = dg_chain_rule(x)
    numerical = dg_numerical(x)
    print(f"x={x}: chain_rule={chain:.6f}, numerical={numerical:.6f}")

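Expected Output: For f, the analytical and numerical derivatives should differ only by floating-point round-off (errors of roughly 1e-9 or smaller). For g, the chain-rule and numerical values should agree to about six decimal places, confirming the chain rule implementation.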