Why We Need Non-Linearity
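A neural network without activation functions is just a linear transformation. Two stacked layers (biases omitted) collapse into one:
y = W₂(W₁x) = (W₂W₁)x = Wx
With an activation function f between the layers, the composition is no longer linear:
y = W₂ f(W₁x)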
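The collapse of stacked linear layers can be checked numerically. The sketch below is illustrative only: the matrices W1 and W2, the input x, and the helper names matvec and matmul are assumptions chosen for the example, and biases are left out.

```python
# Sketch: two linear layers with no activation behave like one linear layer.
# W1, W2 and x are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))          # W2 (W1 x)
collapsed = matvec(matmul(W2, W1), x)        # (W2 W1) x, a single layer
with_relu = matvec(W2, relu(matvec(W1, x)))  # W2 relu(W1 x)

print(stacked, collapsed, with_relu)
```

The first two results agree, while the ReLU version differs because the negative hidden value is clipped before the second layer sees it.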
Common Activation Functions
Sigmoid
Output range: (0, 1). Smooth gradient. Used in gates (LSTM).
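As a minimal sketch, the sigmoid σ(x) = 1 / (1 + e^(-x)) can be written with only the standard library. The two-branch form is one common way to avoid overflow for large negative inputs; it is a choice made for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)), split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x < 0, so z is in (0, 1)
    return z / (1.0 + z)

print(sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0))
```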
Tanh
Output range: (-1, 1). Zero-centered, which usually makes optimization easier than with sigmoid in hidden layers. It still saturates for large |x|.
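Tanh is a rescaled and shifted sigmoid: tanh(x) = 2σ(2x) - 1. The sketch below checks that identity against math.tanh; the helper name tanh_via_sigmoid is made up for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (simple form, fine for moderate |x|)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid: 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-1.3, 0.0, 0.7):
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```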
ReLU (Rectified Linear Unit)
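Simple, fast, no vanishing gradient for positive values. For negative inputs both the output and the gradient are zero, so a unit whose input stays negative stops learning (the dying ReLU problem).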
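A minimal sketch of ReLU(x) = max(0, x) and its derivative. The derivative is undefined at exactly x = 0; returning 0 there is a convention assumed for this example.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at x = 0 is a convention (0 chosen here)."""
    return 1.0 if x > 0 else 0.0

outputs = [relu(v) for v in (-2.0, 0.0, 3.0)]
grads = [relu_grad(v) for v in (-2.0, 0.0, 3.0)]
print(outputs, grads)
```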
Leaky ReLU
A small negative slope mitigates the dying ReLU problem: negative inputs still receive a nonzero gradient.
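A minimal sketch of Leaky ReLU. The slope alpha = 0.01 is a commonly used default, assumed here for illustration; the text does not specify a value.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha

print(leaky_relu(-3.0), leaky_relu(2.0), leaky_relu_grad(-3.0))
```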
GELU (Gaussian Error Linear Unit)
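Smooth, with a probabilistic interpretation: the input is weighted by the standard normal CDF, GELU(x) = x·Φ(x). Used in BERT, GPT, and many other Transformers.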
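The exact form GELU(x) = x·Φ(x), where Φ is the standard normal CDF, can be sketched with math.erf from the standard library. The function name gelu is chosen for this example.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-1.0, 0.0, 1.0):
    print(x, gelu(x))
```

Unlike ReLU, small negative inputs produce small negative outputs rather than exactly zero.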
Swish / SiLU
Self-gated activation: Swish(x) = x·σ(x), so the input is scaled by its own sigmoid. Smooth and non-monotonic.
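A minimal sketch of Swish / SiLU, x·σ(x), using only the standard library. The function names are chosen for this example. The printed values illustrate the non-monotonic dip for negative inputs: the output at x = -1 is lower than at both x = -5 and x = 0.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (simple form, fine for moderate |x|)."""
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """Swish / SiLU: the input gated by its own sigmoid, x * sigmoid(x)."""
    return x * sigmoid(x)

for x in (-5.0, -1.0, 0.0, 1.0):
    print(x, silu(x))
```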
The Vanishing Gradient Problem
Why did ReLU replace sigmoid and tanh?
The Problem
Sigmoid derivative: σ'(x) = σ(x)(1 - σ(x))
Maximum value of σ'(x) is 0.25 (at x = 0)
In a deep network with n layers, backpropagation multiplies n of these activation derivatives together.
If each factor is ≤ 0.25, after 10 layers the product is at most 0.25^10 ≈ 0.00000095
The gradient essentially vanishes!
ReLU derivative: f'(x) = 1 if x > 0, else 0
For positive activations, gradient = 1 (no vanishing!)
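The arithmetic above can be reproduced in a few lines. This is a simplified sketch: it only multiplies activation derivatives and ignores the weight matrices that also appear in a real backward pass. The helper name gradient_bound is hypothetical.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Sigmoid derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_bound(per_layer_factor, n_layers):
    """Product of n identical per-layer derivative factors."""
    return per_layer_factor ** n_layers

sigmoid_bound = gradient_bound(sigmoid_grad(0.0), 10)  # best case: 0.25 per layer
relu_bound = gradient_bound(1.0, 10)                   # active ReLU units: 1 per layer

print(sigmoid_bound, relu_bound)
```

Even in the best case for sigmoid (every pre-activation at exactly 0), the product shrinks by roughly six orders of magnitude over 10 layers, while the ReLU product stays at 1 along any path of active units.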
Which Activation for LLMs?
Most modern LLMs use GELU, Swish, or gated variants of them:
- BERT: GELU
- GPT-2/3: GELU (GPT-4's architecture has not been published)
- T5: ReLU (in feed-forward); T5 v1.1 uses GEGLU, a gated GELU variant
- LLaMA: SwiGLU (a gated linear unit that uses Swish)
GELU is often computed with a tanh approximation:
GELU(x) ≈ 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])
This is fast to compute, avoids evaluating the Gaussian CDF directly, and is differentiable everywhere.
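To get a feel for how close the tanh approximation is, the sketch below compares it with the exact form x·Φ(x) on a grid over [-5, 5]. The function names, the grid, and the range are assumptions made for this example.

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), using the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation: 0.5x(1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute difference on a grid of 1001 points in [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_abs_error = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

print(max_abs_error)
```

On this grid the two curves stay within about a thousandth of each other.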
Exercises
Exercise 1: Derivative Calculation
Calculate the derivative of ReLU(x) = max(0, x). What is it at x = 5? At x = -3? At x = 0?
Exercise 2: Gradient Flow
In a 20-layer network using sigmoid activations, if the average gradient through each layer is 0.2, what is the gradient at the first layer? What would it be with ReLU and an average per-layer gradient of 0.5?
Exercise 3: Activation Choice
Why might GELU be preferred over ReLU in transformers, despite ReLU being simpler?