The Power of Depth
Why do we want deep networks? Why not just one very wide layer?
Depth allows efficient representation. For some functions of n inputs:
- Shallow network: may need on the order of 2^n neurons
- Deep network: can represent the same function with O(n) neurons
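One way to see this gap concretely is a sawtooth function built by composing a "tent" map with itself. The sketch below is an illustration under assumptions, not something from the lesson itself: `tent`, `deep_sawtooth`, and `count_linear_pieces` are hypothetical names, and the tent map is written with two ReLU units. Stacking it n times uses 2n ReLU units and produces 2^n linear pieces, whereas a single hidden layer of m ReLU units on a scalar input can produce at most m + 1 pieces.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    # One "layer": two ReLU units combined linearly.
    # On [0, 0.5] this is 2x; on [0.5, 1] it is 2 - 2x.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Compose the tent layer `depth` times (2 * depth ReLU units in total).
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(f, grid_size=4096):
    # Count slope sign changes on a dyadic grid over [0, 1].
    # This works for the sawtooth because every breakpoint flips the slope sign.
    xs = [i / grid_size for i in range(grid_size + 1)]
    ys = [f(x) for x in xs]
    slopes = [b - a for a, b in zip(ys, ys[1:])]
    changes = sum(1 for s, t in zip(slopes, slopes[1:]) if (s > 0) != (t > 0))
    return changes + 1

depth = 5
pieces = count_linear_pieces(lambda x: deep_sawtooth(x, depth))
print(f"depth={depth}: {2 * depth} ReLU units, {pieces} linear pieces")
# A single hidden layer would need at least pieces - 1 ReLU units to match.
```

With depth 5 the deep version uses 10 ReLU units; matching its piece count with one hidden layer would take dozens of units, and the gap doubles with every added layer.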
Depth vs Width Tradeoff
- One hidden layer: universal but inefficient
- A few hidden layers: good for many tasks
- Dozens to hundreds of hidden layers: ResNet, computer vision
- Dozens of transformer layers: GPT-3, modern LLMs
Hierarchical Features
Deep networks learn hierarchical representations: each layer builds on the previous one, composing simple features into complex ones. For text, a rough picture is:
- Lower layers: characters → words → syntax
- Middle layers: phrases → sentences → meaning
- Upper layers: paragraphs → documents → context
Training Deep Networks
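Very deep networks used to be hard to train because of:
- Vanishing gradients: gradients shrink as they are propagated back through many layers, so the earliest layers barely learn
- Exploding gradients: gradients grow exponentially with depth, making updates unstable
- Degradation: adding layers raises training error as well as test error, so the cause is not just overfitting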
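The first two problems come from the chain rule: the gradient reaching an early layer is a product of one factor per layer. The toy sketch below assumes a scalar network `h_{l+1} = act(weight * h_l)`; the function name `chain_gradient` and the chosen weights are illustrative assumptions, not part of the lesson.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, weight, activation="sigmoid", x=0.5):
    # Scalar toy network: h_{l+1} = act(weight * h_l).
    # Returns d h_depth / d h_0, accumulated with the chain rule.
    h, grad = x, 1.0
    for _ in range(depth):
        z = weight * h
        if activation == "sigmoid":
            h = sigmoid(z)
            local = h * (1.0 - h)  # sigmoid'(z), never larger than 0.25
        else:                      # "linear": identity activation
            h = z
            local = 1.0
        grad *= weight * local     # one multiplicative factor per layer
    return grad

for depth in (5, 20, 50):
    vanish = chain_gradient(depth, weight=1.0, activation="sigmoid")
    explode = chain_gradient(depth, weight=1.5, activation="linear")
    print(f"depth={depth:2d}  sigmoid w=1.0: {vanish:.3e}  linear w=1.5: {explode:.3e}")
```

Per-layer factors below 1 drive the product toward zero, and factors above 1 make it blow up, which is why both effects get worse as depth increases.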
Solutions
| Problem | Solution |
|---|---|
| Vanishing gradients | ReLU, Residual connections |
| Exploding gradients | Gradient clipping |
| Degradation | Residual (skip) connections |
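Gradient clipping from the table can be sketched as follows. This is one common variant, clipping by global norm, and it is an assumption that this is the variant meant here; the function name and the list-of-lists gradient layout are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # grads: list of per-parameter gradient lists (hypothetical layout).
    # Rescales all gradients together if their combined L2 norm exceeds max_norm.
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    if total_norm <= max_norm:
        return grads, total_norm
    scale = max_norm / total_norm
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm

grads = [[3.0], [4.0]]  # global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Rescaling every gradient by the same factor keeps the update direction unchanged and only limits its length.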
Residual Connections
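The key innovation that enabled very deep networks: instead of learning a full mapping, each block learns a residual F(x) that is added back to its input, giving y = F(x) + x.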
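A minimal sketch of a residual block, assuming a two-layer transformation F with a ReLU in between (the helper names and the plain-list vectors are illustrative choices, not the lesson's own code):

```python
def relu(z):
    return max(0.0, z)

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def residual_block(x, W1, W2):
    # F(x) = W2 @ relu(W1 @ x); the block returns x + F(x).
    hidden = [relu(h) for h in matvec(W1, x)]
    fx = matvec(W2, hidden)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, zeros, zeros))  # F is zero, so the block is the identity
```

Because the block passes its input through unchanged when F outputs zero, an extra block can always fall back to doing nothing, which is the intuition for why residual connections counter degradation. The skip path also gives gradients a direct route back to earlier layers.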
Transformers use residual connections extensively: every attention and feed-forward sublayer adds its output back to its input.
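The pattern can be sketched as below, assuming the pre-norm layout used by GPT-style models (the original Transformer applied the normalization after the addition instead). The `toy_attention` and `toy_feed_forward` functions are hypothetical stand-ins for real sublayers, and the sketch handles a single token vector rather than a sequence.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and unit variance (no learned scale or shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def transformer_block(x, attention, feed_forward):
    # Pre-norm layout: each sublayer sees a normalized input,
    # and its output is added back onto the residual stream.
    x = add(x, attention(layer_norm(x)))
    x = add(x, feed_forward(layer_norm(x)))
    return x

# Hypothetical stand-ins for real attention and feed-forward sublayers.
def toy_attention(h):
    return [0.1 * v for v in h]

def toy_feed_forward(h):
    return [max(0.0, v) for v in h]

x = [1.0, 2.0, 3.0, 4.0]
for _ in range(12):  # stack 12 blocks; the residual stream carries x through all of them
    x = transformer_block(x, toy_attention, toy_feed_forward)
print(x)
```

Each block only adds to the running vector, so information and gradients can pass through the whole stack without being forced through every sublayer.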
Modern Deep Learning
Today's LLMs combine many techniques:
- Depth: 12-96+ transformer layers
- Width: 768-12,288+ hidden dimensions
- Residual connections: make very deep stacks trainable
- LayerNorm: stabilizes training
- Attention: captures long-range dependencies
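To get a feel for how depth and width translate into model size, here is a rough back-of-the-envelope estimate. It rests on assumptions not stated in the lesson: a standard transformer block with four d x d attention projections and a feed-forward network whose hidden width is 4d, ignoring embeddings, biases, and LayerNorm parameters. The function name is hypothetical.

```python
def approx_block_params(n_layers, d_model, ffn_multiplier=4):
    # Per layer: 4 * d^2 for attention (query, key, value, output projections)
    # plus 2 * ffn_multiplier * d^2 for the two feed-forward matrices.
    per_layer = 4 * d_model**2 + 2 * ffn_multiplier * d_model**2
    return n_layers * per_layer

for n_layers, d_model in [(12, 768), (96, 12288)]:
    total = approx_block_params(n_layers, d_model)
    print(f"{n_layers} layers x {d_model} dims: ~{total / 1e9:.2f}B parameters")
```

The two ends of the ranges above come out at roughly 85 million and 174 billion parameters. The larger figure is close to the 175 billion usually quoted for GPT-3, which suggests the estimate is reasonable for this kind of architecture.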
Level 02 Complete!
You now understand neural networks: perceptrons, activation functions, backpropagation, optimization, and deep learning.
Ready for Level 03: Transformers?