🚧 Lesson 8 of 8 in Level 02 — Final Lesson!

Deep Networks

Why depth matters: from shallow networks to the deep learning revolution.

The Power of Depth

Why do we want deep networks? Why not just one very wide layer?

Theoretical Result: A network with a single hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy (Universal Approximation Theorem). But for some functions it needs exponentially many neurons to do so.

Depth allows efficient representation:

Depth vs Width Tradeoff

1 hidden layer: Universal but inefficient
5 hidden layers: Good for many tasks
50 hidden layers: ResNet, computer vision
96 transformer layers: GPT-3, modern LLMs
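
To make the efficiency argument concrete, here is a minimal sketch using a classic construction: a "tent" function built from two ReLU units, composed with itself several times. The function names (`tent`, `deep_sawtooth`, `count_linear_pieces`) are illustrative choices, not part of the lesson. Each extra layer doubles the number of linear pieces, while a single hidden layer of n ReLU units on a 1-D input can produce at most n + 1 pieces.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def tent(x):
    # One small layer with two ReLU units:
    # rises from 0 to 1 on [0, 0.5], then falls back to 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Stack the same two-unit layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(y):
    # On a fine grid, a new linear piece starts wherever the slope changes sign.
    signs = np.sign(np.diff(y))
    return 1 + int(np.sum(signs[1:] != signs[:-1]))

# Grid with spacing 1/4096 so every breakpoint lands exactly on a grid point.
x = np.linspace(0.0, 1.0, 2 ** 12 + 1)
for depth in (1, 2, 3, 5, 8):
    pieces = count_linear_pieces(deep_sawtooth(x, depth))
    print(f"depth={depth}: {2 * depth} ReLU units -> {pieces} linear pieces")
```

With 8 layers (16 ReLU units in total) the function has 256 linear pieces; a one-hidden-layer ReLU network would need at least 255 units to match it.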

Hierarchical Features

Deep networks learn hierarchical representations:

# Image recognition example
Layer 1-2: Edges, corners, simple patterns
Layer 3-5: Textures, simple shapes
Layer 6-10: Object parts (wheels, eyes)
Layer 11+: Complete objects (cars, faces)

Each layer builds on the previous, composing simple features into complex ones.

For Language:
Lower layers: Characters → Words → Syntax
Middle layers: Phrases → Sentences → Meaning
Upper layers: Paragraphs → Documents → Context
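
The "each layer builds on the previous" idea can be sketched as a plain forward pass that keeps every intermediate activation. This is an illustrative toy with random, untrained weights, so the activations carry no meaningful features; the layer sizes and the helper name `forward_with_activations` are assumptions made for the example.

```python
import numpy as np

def forward_with_activations(x, weights):
    # Each layer only sees the output of the layer before it.
    activations = [x]
    for W in weights:
        x = np.maximum(0.0, x @ W)  # linear map followed by ReLU
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
sizes = [16, 32, 32, 32, 4]  # input, three hidden layers, output (arbitrary sizes)
weights = [
    rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
    for m, n in zip(sizes[:-1], sizes[1:])
]

acts = forward_with_activations(rng.normal(size=(1, 16)), weights)
for i, a in enumerate(acts):
    print(f"layer {i}: shape {a.shape}")
```

In a trained network, inspecting `acts[1]`, `acts[2]`, and so on is how one would look for the progression from simple to complex features.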

Training Deep Networks

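Training very deep networks used to be hard because of three problems: vanishing gradients, exploding gradients, and degradation (accuracy getting worse as more layers are added).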

Solutions

Problem | Solution
Vanishing gradients | ReLU, residual connections
Exploding gradients | Gradient clipping
Degradation | Residual connections, skip connections
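
As an illustration of the gradient clipping entry, here is a minimal sketch of clipping by global norm: if the combined norm of all gradients exceeds a threshold, every gradient is scaled down by the same factor. The function name `clip_by_global_norm` and the small epsilon are assumptions for this example; deep learning libraries ship their own versions of this utility.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    # Norm of all gradients treated as one long vector.
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Scale down only if the norm is too large; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.zeros(2)]  # global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

The direction of the update is preserved; only its length is capped, which keeps one exploding step from wrecking the weights.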

Residual Connections

The key innovation that enabled very deep networks:

# Standard layer
output = f(input)

# Residual layer (ResNet)
output = input + f(input)

# If a useful f(input) is hard to learn, the layer can fall back to f(input) ≈ 0
# Then output ≈ input (the identity is easy)

Why it works: Residual connections create shortcuts for gradients to flow. Even if some layers don't learn much, the gradient can still propagate through the skip path. They also make it easy for a layer to learn "do nothing" (the identity).
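
The gradient shortcut can be checked on a toy scalar network. The sketch below assumes each layer is f(x) = 0.5 · tanh(x), a hypothetical choice made only so the derivative is easy to write down, and applies the chain rule through 50 layers with and without the skip connection.

```python
import math

def f(x, w=0.5):
    return w * math.tanh(x)

def f_prime(x, w=0.5):
    # Derivative of w * tanh(x); never larger than w.
    return w * (1.0 - math.tanh(x) ** 2)

def input_gradient(x, depth, residual):
    # d(output) / d(input), accumulated layer by layer with the chain rule.
    grad = 1.0
    for _ in range(depth):
        local = f_prime(x)
        if residual:
            grad *= 1.0 + local  # derivative of x + f(x)
            x = x + f(x)
        else:
            grad *= local        # derivative of f(x)
            x = f(x)
    return grad

plain = input_gradient(1.0, depth=50, residual=False)
skip = input_gradient(1.0, depth=50, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

In the plain stack every factor is at most 0.5, so the product shrinks toward zero. In the residual stack every factor is 1 plus something non-negative, so the gradient cannot vanish.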

Transformers use residual connections extensively:

# In a transformer block
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
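
A runnable version of those two lines, as a rough NumPy sketch. It makes several simplifying assumptions that the lesson does not specify: a single attention head, no causal mask, no biases, and a LayerNorm without learned scale or shift. The parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`) and `init_params` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head scaled dot-product self-attention over a (seq_len, d_model) input.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ Wo

def ffn(x, W1, W2):
    return np.maximum(0.0, x @ W1) @ W2

def transformer_block(x, p):
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x

def init_params(d_model, d_ff, rng, scale=0.02):
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    return {name: rng.normal(0.0, scale, size=s) for name, s in shapes.items()}

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, d_model = 8
params = init_params(d_model=8, d_ff=32, rng=rng)
out = transformer_block(x, params)
print(out.shape)
```

Because both sublayers are added to `x` rather than replacing it, setting the output projections `Wo` and `W2` to zero turns the whole block into the identity, which is the "do nothing" behavior described for residual layers.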

Modern Deep Learning

Today's LLMs combine many techniques:

Depth + Width + Data + Compute: The recipe for modern LLMs. Each component matters. Scale is the secret ingredient.
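
To see how depth and width combine, a common back-of-the-envelope estimate counts about 12 · n_layers · d_model² weights in a transformer (attention projections plus a feed-forward network with 4x expansion, ignoring embeddings and biases), and about 6 FLOPs per parameter per training token. Both are rules of thumb, not exact formulas. The hidden size of 12288 and the roughly 300 billion training tokens are the commonly cited GPT-3 figures and are assumptions here, since the lesson only gives the 96 layers.

```python
def approx_transformer_params(n_layers, d_model):
    # Per layer: 4 * d^2 for attention (Q, K, V, output projections)
    # plus 8 * d^2 for a feed-forward network with 4x expansion.
    return 12 * n_layers * d_model ** 2

def approx_training_flops(n_params, n_tokens):
    # Rule of thumb: ~6 FLOPs per parameter per token (forward + backward pass).
    return 6 * n_params * n_tokens

params = approx_transformer_params(n_layers=96, d_model=12288)
flops = approx_training_flops(params, n_tokens=300e9)
print(f"parameters: about {params / 1e9:.0f}B")  # about 174B
print(f"training compute: about {flops:.2e} FLOPs")
```

Doubling the depth doubles the parameter count, while doubling the width quadruples it, and the compute bill grows with both the model size and the amount of data.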
🎉

Level 02 Complete!

You now understand neural networks: perceptrons, activation functions, backpropagation, optimization, and deep learning.

Ready for Level 03: Transformers?