Lesson 4: Gradients & Derivatives

Derivatives

The derivative measures the instantaneous rate of change of a function with respect to its input:

# f(x) = x^2
# f'(x) = 2x

# At x=3, slope is 6
# A small change Δx in x → a change of about 6·Δx in f(x)
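The slope claim above can be checked numerically. The following is a minimal sketch, assuming a symmetric (central) finite difference is an acceptable stand-in for the exact derivative; the helper name `central_difference` and the step size `h` are illustrative choices, not part of the lesson.

```python
def f(x):
    return x ** 2

def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric finite difference (illustrative helper)."""
    return (func(x + h) - func(x - h)) / (2 * h)

slope = central_difference(f, 3.0)
print(slope)  # approximately 6

# A small step dx changes f by roughly slope * dx
dx = 0.01
actual_change = f(3.0 + dx) - f(3.0)
print(actual_change, slope * dx)  # both close to 0.06
```

The two printed changes differ slightly because the linear approximation ignores the dx² term.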
      

Partial Derivatives

For functions of multiple variables:

# f(x,y) = x^2 + y^2
# ∂f/∂x = 2x  (treat y as constant)
# ∂f/∂y = 2y  (treat x as constant)

# Gradient (vector of partial derivatives):
# ∇f = [∂f/∂x, ∂f/∂y] = [2x, 2y]
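As a sketch of how the gradient formula above might be evaluated in code, the example below compares the analytic gradient [2x, 2y] with a finite-difference estimate. The function names and the test point (3, 4) are assumptions made for illustration.

```python
import numpy as np

def f(v):
    x, y = v
    return x ** 2 + y ** 2

def analytic_gradient(v):
    # From the formulas above: [2x, 2y]
    return 2 * v

def numerical_gradient(func, v, h=1e-5):
    """Central-difference estimate of each partial derivative (illustrative helper)."""
    grad = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        step = np.zeros_like(v, dtype=float)
        step[i] = h  # perturb one variable, hold the others constant
        grad[i] = (func(v + step) - func(v - step)) / (2 * h)
    return grad

point = np.array([3.0, 4.0])
print(analytic_gradient(point))      # [6. 8.]
print(numerical_gradient(f, point))  # approximately [6. 8.]
```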
      

The Jacobian

The matrix of all first-order partial derivatives of a vector-valued function:

# For f: R^n → R^m, Jacobian is m×n matrix
# J[i,j] = ∂f_i/∂x_j

# Used in backpropagation: the chain rule multiplies the Jacobians of successive layers
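To illustrate how Jacobians chain, here is a small sketch with two made-up functions, `g1` (ℝ² → ℝ²) and `g2` (ℝ² → ℝ¹). Neither comes from the lesson; they are chosen only so that the Jacobians are easy to write by hand. The Jacobian of the composition g2(g1(p)) is the matrix product of the individual Jacobians.

```python
import numpy as np

def g1(p):
    x, y = p
    return np.array([x * y, x + y])

def jac_g1(p):
    x, y = p
    return np.array([[y, x], [1.0, 1.0]])   # 2x2: rows are outputs, columns are inputs

def g2(q):
    u, v = q
    return np.array([u ** 2 + v])

def jac_g2(q):
    u, v = q
    return np.array([[2 * u, 1.0]])         # 1x2

p = np.array([1.0, 2.0])

# Chain rule: evaluate the outer Jacobian at g1(p), then multiply
J_composite = jac_g2(g1(p)) @ jac_g1(p)
print(J_composite)  # [[9. 5.]]
```

The composition here is (xy)² + x + y, whose gradient [2xy² + 1, 2x²y + 1] evaluates to [9, 5] at (1, 2), matching the product.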
      

Practice Exercises

Exercise 1: Compute Gradients Manually

For the function f(x, y, z) = x²y + yz³, compute:

∂f/∂x
∂f/∂y
∂f/∂z

Answer: ∂f/∂x = 2xy, ∂f/∂y = x² + z³, ∂f/∂z = 3yz²
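One way to sanity-check a hand-computed answer like this is to compare it with finite differences at a sample point. The sketch below assumes the point (1, 2, 3), which is an arbitrary choice; `numerical_partials` is a hypothetical helper.

```python
import numpy as np

def f(v):
    x, y, z = v
    return x ** 2 * y + y * z ** 3

def analytic_partials(v):
    x, y, z = v
    return np.array([2 * x * y, x ** 2 + z ** 3, 3 * y * z ** 2])

def numerical_partials(func, v, h=1e-5):
    """Central differences, one variable at a time (illustrative helper)."""
    out = np.zeros(len(v))
    for i in range(len(v)):
        step = np.zeros(len(v))
        step[i] = h
        out[i] = (func(v + step) - func(v - step)) / (2 * h)
    return out

point = np.array([1.0, 2.0, 3.0])
print(analytic_partials(point))       # [ 4. 28. 54.]
print(numerical_partials(f, point))   # approximately the same values
```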

Exercise 2: Implement Gradient Descent Step

Complete the Python function to perform one gradient descent update:

def gradient_descent_step(x, y, learning_rate=0.1):
    """
    For f(x,y) = x² + y², perform one gradient descent step.
    Returns: new_x, new_y
    """
    # Compute gradients
    grad_x = ____  # ∂f/∂x
    grad_y = ____  # ∂f/∂y
    
    # Update parameters
    new_x = x - learning_rate * grad_x
    new_y = y - learning_rate * grad_y
    
    return new_x, new_y

# Test: starting from (3, 4), after one step with lr=0.1:
# Expected: new_x = 2.4, new_y = 3.2
        

Solution: grad_x = 2*x, grad_y = 2*y. Starting at (3,4): new_x = 3 - 0.1*6 = 2.4, new_y = 4 - 0.1*8 = 3.2
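With the blanks filled in, the step can be repeated to watch the loss shrink. This is a minimal sketch that assumes 20 iterations and the same starting point (3, 4); the iteration count is arbitrary.

```python
def f(x, y):
    return x ** 2 + y ** 2

def gradient_descent_step(x, y, learning_rate=0.1):
    grad_x = 2 * x  # ∂f/∂x
    grad_y = 2 * y  # ∂f/∂y
    return x - learning_rate * grad_x, y - learning_rate * grad_y

x, y = 3.0, 4.0
history = [f(x, y)]
for step in range(20):
    x, y = gradient_descent_step(x, y)
    history.append(f(x, y))

print(history[0], history[1])  # 25.0, then about 16.0
print(x, y)                    # each coordinate shrinks by a factor of 0.8 per step
```

For this particular function, every update multiplies each coordinate by (1 - 2·lr) = 0.8, so the iterates approach the minimum at the origin geometrically.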

Knowledge Check Quiz

Question 1: Gradient Direction

In which direction does the gradient vector ∇f point?

A) The direction of steepest descent
B) The direction of steepest ascent
C) A local minimum
D) The origin

Answer: B — The gradient points in the direction of steepest ascent (maximum rate of increase).

Question 2: Partial Derivative

For f(x,y) = x³y², what is ∂f/∂x?

A) 3x²y²
B) x³ · 2y
C) 3x² + 2y
D) 6xy

Answer: A — Treat y as constant: ∂f/∂x = 3x² · y² = 3x²y²

Question 3: Jacobian Dimensions

If f: ℝ⁵ → ℝ³, what are the dimensions of the Jacobian matrix?

A) 5×3
B) 3×5
C) 5×5
D) 3×3

Answer: B — The Jacobian is m×n where m=output dim (3) and n=input dim (5), so 3×5.

Question 4: Gradient Descent Update

Why do we subtract the gradient (not add it) in gradient descent?

A) To move toward lower loss
B) To increase the learning rate
C) To compute the Jacobian
D) It's just a convention

Answer: A — We subtract because the gradient points toward steepest ascent; subtracting moves us toward steepest descent (lower loss).
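The sign of the update can be checked directly. The sketch below reuses f(x, y) = x² + y² from Exercise 2 and assumes a learning rate of 0.1; it compares the function value after subtracting the gradient with the value after adding it.

```python
def f(x, y):
    return x ** 2 + y ** 2

def grad_f(x, y):
    return 2 * x, 2 * y

x, y, lr = 3.0, 4.0, 0.1
gx, gy = grad_f(x, y)

f_start = f(x, y)
f_subtract = f(x - lr * gx, y - lr * gy)  # gradient descent step
f_add = f(x + lr * gx, y + lr * gy)       # step along the gradient instead

print(f_start, f_subtract, f_add)  # roughly 25, 16, 36
```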

Additional Coding Exercises

Exercise 3: Implement the Jacobian

Write a function to compute the Jacobian matrix for a vector-valued function:

import numpy as np

def jacobian(f, x, h=1e-5):
    """
    Compute Jacobian of f at point x using finite differences.
    f: function that takes vector x and returns vector
    x: input vector (numpy array of floats, so the step h can be added in place)
    h: step size for finite difference
    Returns: Jacobian matrix J where J[i,j] = ∂f_i/∂x_j
    """
    n = len(x)
    fx = f(x)
    m = len(fx)
    J = np.zeros((m, n))
    
    for j in range(n):
        x_plus = x.copy()
        x_plus[j] += h
        # Forward difference: column j holds ∂f_i/∂x_j for every output i
        J[:, j] = (f(x_plus) - fx) / h
    
    return J

# Test: f(x,y) = [x²+y, xy]
def test_func(v):
    x, y = v[0], v[1]
    return np.array([x**2 + y, x*y])

# At point (1, 2), expected Jacobian:
# J = [[2x, 1], [y, x]] = [[2, 1], [2, 1]]
print(jacobian(test_func, np.array([1.0, 2.0])))
        

Exercise 4: Chain Rule Implementation

Implement the chain rule for backpropagation through a simple neural network layer:

import numpy as np

def linear_layer_backward(x, w, b, grad_output):
    """
    Backward pass through linear layer: y = x @ w + b
    
    Args:
        x: input (batch_size, in_features)
        w: weights (in_features, out_features)
        b: bias (out_features,)
        grad_output: gradient of the loss w.r.t. y, from the next layer (batch_size, out_features)
    
    Returns:
        grad_x, grad_w, grad_b: gradients of the loss w.r.t. x, w, and b
    """
    # Chain rule: combine grad_output with the local derivative of y
    grad_w = x.T @ grad_output            # (in_features, out_features)
    grad_b = np.sum(grad_output, axis=0)  # bias is shared across the batch, so sum over it
    grad_x = grad_output @ w.T            # (batch_size, in_features)
    
    return grad_x, grad_w, grad_b

# Verify shapes
x = np.random.randn(4, 3)   # batch=4, in=3
w = np.random.randn(3, 2)   # in=3, out=2
b = np.random.randn(2)      # out=2
grad_out = np.random.randn(4, 2)

gx, gw, gb = linear_layer_backward(x, w, b, grad_out)
print(f"grad_x shape: {gx.shape}, expected: (4, 3)")
print(f"grad_w shape: {gw.shape}, expected: (3, 2)")
print(f"grad_b shape: {gb.shape}, expected: (2,)")
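Matching shapes do not prove the values are right. A common follow-up is a numerical gradient check, sketched below. It assumes a hypothetical scalar loss, sum((x @ w + b) * grad_out), chosen so that the gradient of the loss with respect to y equals grad_out; the helper `numerical_grad` and the seed are also illustrative.

```python
import numpy as np

def linear_layer_backward(x, w, b, grad_output):
    # Same formulas as in Exercise 4
    return grad_output @ w.T, x.T @ grad_output, np.sum(grad_output, axis=0)

def scalar_loss(x, w, b, grad_output):
    # Hypothetical loss whose derivative w.r.t. y = x @ w + b is exactly grad_output
    return np.sum((x @ w + b) * grad_output)

def numerical_grad(param, loss_fn, h=1e-6):
    """Central differences over every entry of param (modified in place, then restored)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + h
        loss_plus = loss_fn()
        param[idx] = original - h
        loss_minus = loss_fn()
        param[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)
grad_out = rng.standard_normal((4, 2))

loss_fn = lambda: scalar_loss(x, w, b, grad_out)
gx, gw, gb = linear_layer_backward(x, w, b, grad_out)

ok_x = np.allclose(gx, numerical_grad(x, loss_fn), atol=1e-6)
ok_w = np.allclose(gw, numerical_grad(w, loss_fn), atol=1e-6)
ok_b = np.allclose(gb, numerical_grad(b, loss_fn), atol=1e-6)
print(ok_x, ok_w, ok_b)
```

If any of the three checks fails, the corresponding formula in the backward pass is the first place to look.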