Lesson 8: Deep Networks

The Power of Depth

Why do we want deep networks? Why not just one very wide layer?

        Theoretical Result: A network with a single hidden layer can approximate any continuous function
        on a bounded input domain to arbitrary accuracy (Universal Approximation Theorem). But for some
        functions it needs exponentially many neurons.
      

Depth allows efficient representation:

Shallow network: For some functions, needs on the order of 2^n neurons
Deep network: Can represent the same function with O(n) neurons spread across many layers
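One concrete family where this gap shows up is the "sawtooth" function. The sketch below is an illustration under stated assumptions, not a general proof: `tent_layer`, `deep_sawtooth`, and `count_linear_pieces` are hypothetical helper names, and the input is a single number in [0, 1]. Each layer uses only 2 ReLU units, yet stacking k of them produces 2^k linear pieces. A single hidden layer with m ReLU units on a scalar input can produce at most m + 1 pieces, so it would need at least 2^k - 1 units to match.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent_layer(x):
    # One hidden layer with 2 ReLU units: rises from 0 to 1 on [0, 0.5],
    # then falls back to 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Apply the same 2-unit layer `depth` times (2 * depth ReLU units in total).
    for _ in range(depth):
        x = tent_layer(x)
    return x

def count_linear_pieces(f, num_intervals=2**12):
    # Evaluate f on a fine grid over [0, 1] and count how often the slope changes.
    x = np.linspace(0.0, 1.0, num_intervals + 1)
    slopes = np.diff(f(x)) / np.diff(x)
    return 1 + int(np.count_nonzero(np.diff(slopes) != 0))

depth = 5
pieces = count_linear_pieces(lambda x: deep_sawtooth(x, depth))
print(f"{2 * depth} ReLU units in {depth} layers -> {pieces} linear pieces")
```

With depth 5 the deep network uses 10 units and yields 32 pieces; a one-hidden-layer network would need at least 31 units for the same shape, and the gap doubles with every extra layer.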

Depth vs Width Tradeoff

Depth	Typical use
One hidden layer	Universal but inefficient
A few hidden layers	Good for many tasks
Dozens of hidden layers or more	ResNet, computer vision
Dozens of transformer layers	GPT-3, modern LLMs

Hierarchical Features

Deep networks learn hierarchical representations:

# Image recognition example (layer ranges are illustrative)

Layer 1-2:   Edges, corners, simple patterns
Layer 3-5:   Textures, simple shapes
Layer 6-10:  Object parts (wheels, eyes)
Layer 11+:   Complete objects (cars, faces)
      

Each layer builds on the previous, composing simple features into complex ones.

        For Language (a rough analogy):

        Lower layers: Characters → Words → Syntax

        Middle layers: Phrases → Sentences → Meaning

        Upper layers: Paragraphs → Documents → Context

Training Deep Networks

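Training very deep networks used to be hard because of three problems:

Vanishing gradients: Gradients shrink as they are propagated backward, so the earliest layers barely learn
Exploding gradients: Gradients grow exponentially as they are propagated backward, making updates unstable
Degradation: Adding layers makes even the training error worse, so the cause is not just overfitting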
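The first two problems both come from the chain rule: the gradient at the input is a product of one factor per layer. The toy sketch below assumes a chain of scalar layers that all share one weight (the function name `input_gradient` and the specific weights are hypothetical choices for illustration). With sigmoid activations each factor is at most 0.25, so the product collapses; with linear layers and a weight above 1, it blows up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(weight, depth, activation="sigmoid", x=0.5):
    """Return d(output)/d(input) for `depth` scalar layers h = act(weight * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        z = weight * h
        if activation == "sigmoid":
            h = sigmoid(z)
            local = h * (1.0 - h)   # sigmoid'(z), never larger than 0.25
        else:                       # "linear": no nonlinearity
            h = z
            local = 1.0
        grad *= weight * local      # chain rule: multiply one factor per layer
    return grad

vanishing = input_gradient(weight=1.0, depth=20, activation="sigmoid")
exploding = input_gradient(weight=1.5, depth=20, activation="linear")
print(f"sigmoid chain: {vanishing:.2e}")
print(f"linear chain:  {exploding:.2e}")
```

After only 20 layers the two gradients differ by more than 15 orders of magnitude, which is why the earliest layers of a deep plain network either stop learning or receive unusably large updates.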

Solutions

Problem	Solution
Vanishing gradients	ReLU activations, residual connections
Exploding gradients	Gradient clipping
Degradation	Residual (skip) connections
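As an illustration of gradient clipping, here is a minimal NumPy sketch of clipping by global norm. The function name `clip_by_global_norm` and the threshold are assumptions made for this example; deep learning frameworks ship their own versions of the same idea.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Only ever shrink: gradients that are already small enough are left unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Two parameter groups whose combined norm is sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Clipping keeps the direction of the update and only limits its length, so one exploding step cannot throw the weights far off course.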

Residual Connections

The key innovation that enabled very deep networks:

# Standard layer
output = f(input)

# Residual layer (ResNet)
output = input + f(input)

# If the best thing a layer can do is nothing, it only needs to learn f(input) ≈ 0
# Then output ≈ input (the identity mapping comes for free)
      

        Why it works: Residual connections create shortcuts for gradients to flow.
        Even if some layers don't learn much, the gradient can still propagate through the skip path.
        They also make it easy for a layer to learn "do nothing" (the identity).
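The gradient shortcut can be seen in a toy calculation. For output = input + f(input), the derivative with respect to the input is 1 + f'(input), so each layer contributes a factor of at least 1 whenever f' is non-negative. The sketch below assumes a chain of scalar layers with f(h) = sigmoid(weight * h); `chain_gradient` is a hypothetical helper, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradient(depth, weight=1.0, residual=False, x=0.5):
    """Return d(output)/d(input) for a stack of scalar layers, with or without skips."""
    h, grad = x, 1.0
    for _ in range(depth):
        s = sigmoid(weight * h)
        local = weight * s * (1.0 - s)   # derivative of f(h) = sigmoid(weight * h)
        if residual:
            h = h + s                    # output = input + f(input)
            grad *= 1.0 + local          # the "1" is the skip path
        else:
            h = s                        # output = f(input)
            grad *= local
    return grad

plain = chain_gradient(depth=30)
skip = chain_gradient(depth=30, residual=True)
print(f"plain stack:    {plain:.2e}")
print(f"residual stack: {skip:.2e}")
```

In the plain stack the gradient is a product of 30 small factors and all but disappears; in the residual stack it never drops below 1 in this setup, because the skip path passes it through unchanged.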
      

Transformers use residual connections extensively:

# In a transformer block (pre-LayerNorm variant, as used in GPT-2 and GPT-3)
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
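The two lines above can be expanded into a small runnable sketch. This is a deliberately simplified version under several assumptions: single-head attention, no causal mask, no biases, no learned LayerNorm scale or shift, and randomly initialized weights. The parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`) and sizes are chosen for this example only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head scaled dot-product self-attention over the whole sequence.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ Wo

def ffn(x, W1, W2):
    # Position-wise feed-forward network with a ReLU in the middle.
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_block(x, p):
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
shapes = {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model),
    "Wv": (d_model, d_model), "Wo": (d_model, d_model),
    "W1": (d_model, d_ff), "W2": (d_ff, d_model),
}
params = {name: rng.normal(scale=0.02, size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(seq_len, d_model))
y = transformer_block(x, params)
print(y.shape)
```

Because both sub-layers only add to `x`, setting the output weights `Wo` and `W2` to zero turns the whole block into the identity, which is the "do nothing" behavior that residual connections make easy.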
      

Modern Deep Learning

Today's LLMs combine many techniques:

Depth: 12-96+ transformer layers
Width: 768-12,288+ hidden dimensions
Residual connections: Enable training of very deep stacks
LayerNorm: Stabilizes training
Attention: Captures long-range dependencies
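To get a feel for how depth and width translate into model size, here is a back-of-the-envelope sketch. It rests on assumptions that are not in the text above: the feed-forward layer is 4 times as wide as the hidden dimension, and embeddings, biases, and LayerNorm parameters are ignored. `approx_block_params` is a hypothetical helper name. The two configurations are the endpoints of the ranges listed above.

```python
def approx_block_params(n_layers, d_model, ffn_multiplier=4):
    """Rough weight count for the transformer blocks only (no embeddings or biases)."""
    attention = 4 * d_model * d_model              # Q, K, V and output projections
    ffn = 2 * ffn_multiplier * d_model * d_model   # up-projection and down-projection
    return n_layers * (attention + ffn)

small = approx_block_params(n_layers=12, d_model=768)
large = approx_block_params(n_layers=96, d_model=12288)
print(f"{small / 1e6:.0f}M, {large / 1e9:.0f}B")   # prints: 85M, 174B
```

The count grows linearly with depth but quadratically with width, which is why the largest models increase both together rather than only stacking more layers.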

        Depth + Width + Data + Compute: The recipe for modern LLMs.
        Each component matters, and scaling all of them together is what has driven recent progress.
      
