Lesson 2: Activation Functions

Why We Need Non-Linearity

A neural network without activation functions is just a linear transformation:

Without activation: output = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂) = Wx + b

This is still a single affine transformation (a matrix multiplication plus a bias). No matter how many layers you stack,
the result is equivalent to one such layer.
      

        Key Insight: Stacking linear layers gives you... another linear layer. 
        We need non-linear activation functions to create expressive, non-linear mappings.
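
The collapse described above can be checked numerically. The following is a minimal sketch in plain Python; the weights, biases, and input are arbitrary illustrative values, and the helper names (matvec, matmul, add) are made up for this example.

```python
# Two layers without an activation collapse into one affine map.
# All numbers below are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Elementwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[3.0, 1.0], [-2.0, 4.0]], [0.0, -1.0]
x = [2.0, -3.0]

# Layer by layer: W2(W1 x + b1) + b2
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed form: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)  # both are [-6.5, 22.0]
```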
      

With activation functions:

With activation: output = W₂ * activation(W₁x + b₁) + b₂

Now we have non-linearity! The network can learn complex, curved decision boundaries.
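
One way to see the difference is a midpoint test: for any affine function f, f((a+b)/2) equals (f(a)+f(b))/2. The sketch below applies that test to a scalar two-layer network with and without ReLU. The weights and the helper names (net, midpoint_gap) are assumptions chosen for illustration only.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def net(x, activation):
    """Scalar two-layer network: w2 * activation(w1 * x + b1) + b2."""
    w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5  # illustrative values
    return w2 * activation(w1 * x + b1) + b2

def midpoint_gap(f, a, b):
    """f((a+b)/2) - (f(a)+f(b))/2, which is zero for every affine f."""
    return f((a + b) / 2) - (f(a) + f(b)) / 2

linear_gap = midpoint_gap(lambda x: net(x, identity), -2.0, 2.0)
relu_gap = midpoint_gap(lambda x: net(x, relu), -2.0, 2.0)

print(linear_gap)  # 0.0: without an activation the network is affine
print(relu_gap)    # 4.5: with ReLU the network bends between -2 and 2
```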

Common Activation Functions

Sigmoid

σ(x) = 1 / (1 + e^(-x))

Output range: (0, 1). Smooth and differentiable everywhere. Used in gates (e.g., LSTM gates).

❌ Saturates for large |x|, which leads to the vanishing gradient problem
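
A direct translation of the formula overflows in Python for very negative x, because e^(-x) becomes too large. The sketch below is one common way to avoid that by branching on the sign of x; it also evaluates the derivative σ(x)(1 - σ(x)) to show how small it gets once the function saturates. The function names are chosen for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written so that exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) is at most 1
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The derivative is largest at x = 0 and shrinks quickly as |x| grows, which is the saturation behind the vanishing gradient problem.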

Tanh

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Output range: (-1, 1). Zero-centered, which usually makes it easier to optimize than sigmoid.

❌ Still saturates for large |x|, so gradients can still vanish
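
Tanh is a rescaled, shifted sigmoid: tanh(x) = 2σ(2x) - 1. Its derivative is 1 - tanh(x)^2, which peaks at 1 (at x = 0) but still decays toward 0 as |x| grows. A minimal sketch of both facts, with function names chosen for this example:

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

def tanh_via_sigmoid(x):
    """tanh written as a rescaled sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in (0.0, 1.0, 3.0, 5.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_grad(x))
```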

ReLU (Rectified Linear Unit)

ReLU(x) = max(0, x)

Simple and cheap to compute. The gradient is exactly 1 for positive inputs, so it does not vanish there.

❌ "Dying ReLU" problem (neurons can get stuck outputting 0 and stop learning)
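
To make the dying ReLU problem concrete, the sketch below uses a hypothetical single neuron whose bias has drifted strongly negative. Every input in the (made-up) batch lands in the flat region, so both the output and the gradient with respect to the weight are zero, and gradient descent has nothing to update with. Using 0 as the derivative at x = 0 is a convention assumed here.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention assumed here: derivative 0 at x == 0.
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron: z = w * x + b, with a strongly negative bias.
w, b = 0.5, -10.0
inputs = [-2.0, 0.0, 1.0, 3.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]

# d(output)/dw = relu_grad(z) * x for each input.
weight_grads = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

print(outputs)       # every output is zero
print(weight_grads)  # every gradient is zero, so w never changes
```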

Leaky ReLU

f(x) = x if x > 0, else αx (typically α = 0.01)

A small slope for negative inputs keeps the gradient non-zero, which mitigates the dying ReLU problem.

✓ Gradient is never exactly zero, so neurons can recover
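
A minimal sketch of Leaky ReLU and its derivative, using the default α = 0.01 from the formula above. The function names are chosen for this example.

```python
def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope is 1 on the positive side and alpha on the negative side."""
    return 1.0 if x > 0 else alpha

for x in (-5.0, -1.0, 0.5, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Even for strongly negative inputs the gradient is α rather than 0, so a weight update is still possible.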

GELU (Gaussian Error Linear Unit)

GELU(x) = x · Φ(x), where Φ is the CDF of the standard normal distribution

Smooth. Scales each input by the probability that a standard normal variable falls below it. Used in BERT, GPT, and many other Transformer models.

✓ Smooth gradients, performs well
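
The standard normal CDF can be written with the error function as Φ(x) = 0.5 · (1 + erf(x / √2)), so exact GELU is easy to sketch with Python's math.erf. The function name gelu is chosen for this example.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu(x))
```

For large positive x the output approaches x, and for large negative x it approaches 0, so GELU behaves like a smoothed ReLU.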

Swish / SiLU

Swish(x) = x · σ(x) = x / (1 + e^(-x))

Self-gated activation. Smooth, non-monotonic.

✓ Found by automated search, works well
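
"Non-monotonic" means the function dips below zero for negative inputs and then rises back toward zero. The sketch below evaluates Swish at a few points to show the dip. The direct formula is used for readability; it would overflow for extremely negative x, which this illustration does not guard against.

```python
import math

def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, swish(x))

# swish(-1) is lower than both swish(-5) and swish(0):
# the function goes down and then back up on the negative side.
```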

The Vanishing Gradient Problem

Why did ReLU replace sigmoid and tanh?

The Problem

Sigmoid derivative: σ'(x) = σ(x)(1 - σ(x))

Maximum value of σ'(x) is 0.25 (at x = 0)

          Chain Rule Multiplication:

          In a deep network with n layers, backpropagation multiplies n per-layer derivative factors together.

          If each factor is ≤ 0.25, after 10 layers the product is at most 0.25^10 ≈ 0.00000095

          The gradient essentially vanishes!
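
The bound can be checked on a toy example. The sketch below is a deliberately simplified chain of scalar layers h ← σ(w · h), with w = 1 as an assumption, and it accumulates the chain-rule product of local derivatives. Real networks also multiply by weight matrices, so this only illustrates the activation's contribution.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chain_gradient(x, n_layers, w=1.0):
    """d(output)/d(input) for n scalar layers h <- sigmoid(w * h)."""
    h, grad = x, 1.0
    for _ in range(n_layers):
        s = sigmoid(w * h)
        grad *= w * s * (1.0 - s)  # chain rule: multiply by this layer's local derivative
        h = s
    return grad

for depth in (1, 5, 10):
    print(depth, chain_gradient(0.5, depth), 0.25 ** depth)
```

With w = 1 the computed gradient stays below the 0.25^n bound at every depth, and both shrink geometrically.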

ReLU derivative: f'(x) = 1 if x > 0, else 0

For positive inputs, gradient = 1, so the activation itself does not shrink the gradient (no vanishing!)

Which Activation for LLMs?

Modern LLMs mostly use GELU or Swish-based activations:

BERT: GELU
GPT-2/3: GELU
T5: ReLU in the feed-forward layers; T5 v1.1 uses a gated GELU variant (GEGLU)
LLaMA: SwiGLU (a gated linear unit built on Swish)
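
SwiGLU combines two linear projections of the same input: one passes through Swish and acts as a gate, and the other is multiplied by it elementwise. The sketch below shows only that gating step in plain Python, omitting the output projection that follows it in a full feed-forward block. The names W_gate and W_value and the example numbers are assumptions for illustration, not taken from any particular model.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_value):
    """Elementwise Swish(W_gate x) * (W_value x)."""
    gate = [swish(g) for g in matvec(W_gate, x)]
    value = matvec(W_value, x)
    return [g * v for g, v in zip(gate, value)]

# Illustrative 2-dimensional example.
x = [1.0, 2.0]
W_gate = [[0.0, 0.0], [1.0, 0.0]]
W_value = [[1.0, 1.0], [2.0, 0.0]]
out = swiglu(x, W_gate, W_value)
print(out)  # the first component is gated to 0 because Swish(0) = 0
```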

        GELU Approximation:

        GELU(x) ≈ 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])

        This avoids evaluating the error function directly and, like exact GELU, is differentiable everywhere.
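
To get a feel for how close the approximation is, the sketch below compares it against exact GELU (written with math.erf) on a grid from -5 to 5. The grid range and step are arbitrary choices for this example.

```python
import math

def gelu_exact(x):
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute difference on a grid with step 0.01.
max_error = max(
    abs(gelu_exact(i / 100) - gelu_tanh(i / 100)) for i in range(-500, 501)
)
print(max_error)
```

On this grid the two versions differ by less than 1e-3, which is small relative to typical activation magnitudes.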

Exercises

Exercise 1: Derivative Calculation

Calculate the derivative of ReLU(x) = max(0, x). What is it at x = 5? At x = -3? At x = 0?

Exercise 2: Gradient Flow

In a 20-layer network using sigmoid activations, if the average gradient factor through each layer is 0.2, by what factor has the gradient been scaled when it reaches the first layer? What if the network uses ReLU with an average gradient factor of 0.5?

Exercise 3: Activation Choice

Why might GELU be preferred over ReLU in transformers, despite ReLU being simpler?