🚧 Level 2 • Neural Networks — From Perceptron to Deep Learning

Neural Networks

From the perceptron to deep networks. Understand how neurons learn, why depth matters, and the building blocks of modern AI.

30 Lessons Beginner-Intermediate

The Perceptron: Where It All Began

In 1958, Frank Rosenblatt introduced the perceptron — the simplest neural network. It's a single neuron that takes inputs, applies weights, and produces an output. Everything in modern deep learning builds on this foundation.

🧠 The Perceptron

[Diagram: inputs x₁, x₂, x₃ with weights w₁ = 0.5, w₂ = -0.3, w₃ = 0.8 feed a summation node (Σ + b) that produces the output]

output = σ(w₁x₁ + w₂x₂ + w₃x₃ + b)

Inputs (x) are multiplied by weights (w), summed, have a bias (b) added, then pass through an activation function (σ)
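The formula above can be sketched in a few lines of NumPy. The weights are the ones shown in the diagram; the input values and the bias are hypothetical, chosen only so the arithmetic is easy to follow, and σ is assumed to be the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.5, -0.3, 0.8])   # weights from the diagram
x = np.array([1.0, 2.0, 3.0])    # hypothetical inputs
b = 0.1                          # hypothetical bias

z = np.dot(w, x) + b             # 0.5 - 0.6 + 2.4 + 0.1 = 2.4
output = perceptron(x, w, b)
print(f"z = {z:.2f}, output = {output:.4f}")
```

The negative weight on x₂ pulls the sum down, while the large positive weight on x₃ dominates the result.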

What Does a Neuron Actually Do?

Think of a neuron as a weighted voting system:

  • Positive weights → "This input supports the output"
  • Negative weights → "This input opposes the output"
  • Large magnitude → "This input matters a lot"
  • Near zero → "This input doesn't matter"

Example: A spam detector neuron might have:

  • High positive weight for "viagra"
  • High negative weight for "meeting scheduled"
  • Near-zero weight for "the"

Activation Functions: Adding Non-Linearity

Without activation functions, a neural network would just be a linear function — no matter how many layers you stack, you'd still just have a fancy linear regression. Activation functions introduce the non-linearity that lets networks learn complex patterns.
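A quick numerical check of this claim, as a minimal sketch with randomly chosen (hypothetical) weights: two stacked layers without an activation compute exactly the same function as one linear layer with weights W₂W₁ and bias W₂b₁ + b₂.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

# Two stacked layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

collapses = np.allclose(two_layers, one_layer)
print("stack collapses to one linear layer:", collapses)

# Inserting a ReLU between the layers breaks this equivalence in general
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print("still equal with ReLU in between:", np.allclose(with_relu, one_layer))
```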

Common Activation Functions

Activation Function Comparison

ReLU (Rectified Linear Unit)
f(x) = max(0, x)

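Most popular. Simple and fast. Its gradient is exactly 1 for positive inputs, which mitigates vanishing gradients, although units stuck at negative inputs receive no gradient at all.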

Sigmoid
f(x) = 1 / (1 + e⁻ˣ)

Outputs 0-1. Good for probabilities. Suffers from vanishing gradients.

Tanh
f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

Outputs -1 to 1. Zero-centered. Still has vanishing gradient issues.

GELU (Modern Choice)
f(x) = x · Φ(x) where Φ is Gaussian CDF

Used in GPT, BERT. Smooth, probabilistic interpretation.

Why ReLU is Popular:
1. Computationally cheap (just max(0, x))
2. Doesn't saturate for positive values (no vanishing gradient)
3. Induces sparsity (many neurons output exactly 0)
4. Biologically plausible (neurons don't fire below threshold)
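
The four activation functions above can be written directly from their formulas. This is a minimal sketch: GELU uses the exact form x · Φ(x) with Φ expressed through the error function, and the sample inputs are arbitrary.

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def gelu(x):
    # Exact form x * Phi(x); Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh), ("gelu", gelu)]:
    print(f"{name:8s}", np.round(fn(xs), 3))
```

Note how ReLU outputs exactly 0 for every negative input (the sparsity mentioned above), while GELU lets small negative values through.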

Multi-Layer Networks: Going Deep

A single neuron can only learn linear patterns (a linear decision boundary). To learn complex functions, we stack neurons in layers, and a network with many such layers is called "deep."

Multi-Layer Perceptron (MLP)

[Diagram: input layer (x₁, x₂, x₃) → hidden layer (h₁, h₂, h₃, h₄) → output layer (output)]

Information flows from input → hidden → output. Each connection has a weight.

Why Depth Matters

The key insight: each hidden layer learns increasingly abstract features:

  • Layer 1: Detects simple patterns (edges, colors, basic shapes)
  • Layer 2: Combines simple patterns (corners, textures)
  • Layer 3: Combines into complex features (eyes, wheels, letters)
  • Deeper layers: High-level concepts (faces, cars, words)

Universal Approximation Theorem: A neural network with just one hidden layer and a non-linear activation can approximate any continuous function on a bounded input range to arbitrary accuracy, but it might need exponentially many neurons. Depth allows a more efficient representation of complex functions.

The Forward Pass: How Data Flows

When you input data into a neural network, it flows through each layer in the forward pass. Let's see exactly what happens mathematically.

    import numpy as np

    # A simple 2-layer neural network
    def forward_pass(x, W1, b1, W2, b2):
        # Layer 1: Input (3 features) -> Hidden (4 neurons)
        z1 = np.dot(W1, x) + b1          # Linear transformation
        h = np.maximum(0, z1)            # ReLU activation

        # Layer 2: Hidden (4 neurons) -> Output (1 neuron)
        z2 = np.dot(W2, h) + b2          # Linear transformation
        output = 1 / (1 + np.exp(-z2))   # Sigmoid for probability
        return output

    # Example dimensions
    x = np.random.randn(3, 1)     # Input: 3 features
    W1 = np.random.randn(4, 3)    # Weights: 4 hidden neurons × 3 inputs
    b1 = np.zeros((4, 1))         # Bias: 4 hidden neurons
    W2 = np.random.randn(1, 4)    # Weights: 1 output × 4 hidden
    b2 = np.zeros((1, 1))         # Bias: 1 output

    prediction = forward_pass(x, W1, b1, W2, b2)
    print(f"Output: {prediction[0][0]:.4f}")

Matrix Multiplication View

The forward pass is really just a series of matrix operations:

h = ReLU(W₁x + b₁)

ŷ = σ(W₂h + b₂)

This is why GPUs are so important for deep learning — they're designed for fast matrix multiplication!
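
One practical consequence, sketched below under the assumption that examples are stored as columns (matching the column-vector convention used above): a whole batch goes through the network in the same two matrix multiplications, with no per-example loop. The weights and batch are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

def forward(X):
    H = np.maximum(0, W1 @ X + b1)             # shape (4, batch)
    return 1 / (1 + np.exp(-(W2 @ H + b2)))    # shape (1, batch)

X = rng.standard_normal((3, 5))                # 5 examples stored as columns
batched = forward(X)

# Same result as feeding the examples one at a time
one_by_one = np.hstack([forward(X[:, [i]]) for i in range(5)])
print(batched.shape, np.allclose(batched, one_by_one))
```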

What Neural Networks Can Learn

Neural networks are universal function approximators. Given enough neurons and layers, they can learn:

  • Classification: Is this email spam or not?
  • Regression: What's the predicted house price?
  • Pattern recognition: What digit is in this image?
  • Sequence modeling: What word comes next?
  • Function approximation: Any continuous mathematical function
The Catch: Networks can learn these things in theory, but actually training them requires good initialization, proper optimization, enough data, and careful architecture design. That's what we'll cover in Level 4.

From Neural Networks to LLMs

Traditional neural networks process fixed-size inputs. But language is sequential and variable-length. This is where recurrent neural networks and later transformers come in.

In the next level, we'll see how the transformer architecture revolutionized language modeling by processing entire sequences in parallel while maintaining contextual understanding.

Loss Functions: Measuring Error

To train a neural network, we need a way to measure how wrong its predictions are. The loss function (also called "cost function" or "objective function") quantifies this error. Training means finding parameters that minimize this loss.

Mean Squared Error (MSE)

For regression problems (predicting continuous values), the most common loss is MSE:

L_MSE = (1/n) Σᵢ (ŷᵢ - yᵢ)²

Where ŷᵢ is the prediction and yᵢ is the true value. MSE penalizes large errors heavily (because of the squaring), which makes it sensitive to outliers.
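
A small sketch of the outlier sensitivity described above. The target values and predictions are made up for illustration; the only difference between the two prediction sets is one large error.

```python
import numpy as np

def mse(y_pred, y_true):
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

y_true = [3.0, 5.0, 2.0, 7.0]
close = [2.5, 5.5, 2.0, 7.0]       # errors: -0.5, 0.5, 0, 0
outlier = [2.5, 5.5, 2.0, 17.0]    # same, but the last error is 10

loss_close = mse(close, y_true)       # (0.25 + 0.25) / 4 = 0.125
loss_outlier = mse(outlier, y_true)   # (0.25 + 0.25 + 100) / 4 = 25.125
print(loss_close, loss_outlier)
```

A single error of 10 contributes 100 to the sum, so it dominates the loss completely.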

Cross-Entropy Loss

For classification and language modeling, cross-entropy is the standard:

L_CE = -(1/n) Σᵢ Σⱼ yᵢⱼ · log(ŷᵢⱼ)

For language models, this simplifies because yᵢⱼ is one-hot (only the true next token has value 1):

L = -(1/T) Σₜ log P(xₜ | x₁,...,xₜ₋₁)
Why cross-entropy? Three reasons: (1) It's the negative log-likelihood, so minimizing it maximizes the probability of the training data. (2) When combined with softmax, it produces clean gradients that don't saturate. (3) It directly measures the "surprise" of the model at each prediction — lower surprise means better predictions.
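
As a rough sketch of the formulas above, cross-entropy with one-hot targets reduces to the mean negative log-probability of the correct class. The logits and targets below are arbitrary examples, and passing targets as class indices is a choice made for this sketch. The last lines show the gradient with respect to the logits, which takes the simple form (softmax − one-hot) / n.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """Mean negative log-probability of the target class (targets are class indices)."""
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.mean(np.log(probs[np.arange(n), targets]))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
targets = np.array([0, 2])
loss = cross_entropy(logits, targets)

# Gradient of the loss w.r.t. the logits
one_hot = np.eye(3)[targets]
grad = (softmax(logits) - one_hot) / len(targets)
print(f"loss = {loss:.4f}")
```

A model that assigns equal probability to all 3 classes would score log(3) ≈ 1.0986, which is a useful baseline for "no better than guessing".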

Binary Cross-Entropy

For yes/no decisions (spam or not, etc.):

L_BCE = -(1/n) Σᵢ [yᵢ · log(ŷᵢ) + (1-yᵢ) · log(1-ŷᵢ)]

This is just cross-entropy for 2-class classification. A loss of this form is also used to train the reward model in RLHF, where it is applied to pairwise preference comparisons.
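
To check the "2-class cross-entropy" claim numerically, the sketch below computes binary cross-entropy from the formula above and compares it with general cross-entropy over the two-column distribution [1 − ŷ, ŷ]. The labels and probabilities are made up, and the `eps` clipping is an added safeguard against log(0).

```python
import numpy as np

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true, dtype=float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])          # predicted probability of class 1
bce = binary_cross_entropy(p, y_true)

# The same quantity as 2-class cross-entropy over [P(class 0), P(class 1)]
two_class = np.stack([1 - p, p], axis=1)
ce = -np.mean(np.log(two_class[np.arange(3), y_true]))
print(f"BCE = {bce:.4f}, 2-class CE = {ce:.4f}")
```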

Backpropagation: The Engine of Learning

Backpropagation is the algorithm that makes training neural networks possible. It efficiently computes the gradient of the loss with respect to every parameter in the network, using the chain rule from calculus.

Intuition: The Blame Game

Think of a neural network as a chain of functions:

The Forward and Backward Pass

Forward pass:
x → [W₁,b₁] → z₁ → [ReLU] → h₁ → [W₂,b₂] → z₂ → [ReLU] → h₂ → [W₃,b₃] → ŷ → [Loss] → L

Backward pass (gradients):
∂L/∂ŷ → ∂L/∂h₂ → ∂L/∂z₂ → ∂L/∂h₁ → ∂L/∂z₁
Weight gradients branch off this chain: ∂L/∂W₃ from ∂L/∂ŷ, ∂L/∂W₂ from ∂L/∂z₂, ∂L/∂W₁ from ∂L/∂z₁

The key insight: each layer only needs local information to compute its gradients:

  • The gradient from above: How much does the loss change when this layer's output changes?
  • The local derivative: How does this layer's output change when its input changes?
  • Multiply them: Chain rule → you have the gradient for this layer!

Step-by-Step Example

Let's trace through a tiny 2-layer network computing ŷ = σ(W₂ · ReLU(W₁ · x + b₁) + b₂):

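    import numpy as np

    # Illustrative random values (shapes chosen for this example)
    rng = np.random.default_rng(0)
    x, y = rng.standard_normal((3, 1)), 1.0               # one example with true label 1
    W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
    W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

    # Forward pass
    z1 = W1 @ x + b1                  # Linear transformation, layer 1
    h1 = np.maximum(0, z1)            # ReLU activation
    z2 = W2 @ h1 + b2                 # Linear transformation, layer 2
    y_hat = 1 / (1 + np.exp(-z2))     # Output probability (sigmoid)
    L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # Binary cross-entropy loss

    # Backward pass (computing gradients)
    # Step 1: Gradient of loss w.r.t. output
    dL_dy_hat = (y_hat - y) / (y_hat * (1 - y_hat))   # derivative of cross-entropy alone

    # Step 2: Through the sigmoid and layer 2
    dL_dz2 = dL_dy_hat * y_hat * (1 - y_hat)   # sigmoid'(z2) = y_hat(1 - y_hat), so this equals y_hat - y
    dL_dW2 = dL_dz2 @ h1.T            # outer product
    dL_db2 = dL_dz2                   # bias gradient

    # Step 3: Back to the hidden layer and through ReLU
    dL_dh1 = W2.T @ dL_dz2            # gradient flowing back
    dL_dz1 = dL_dh1 * (z1 > 0)        # ReLU derivative: 1 if active, 0 if not

    # Step 4: Through layer 1
    dL_dW1 = dL_dz1 @ x.T             # outer product
    dL_db1 = dL_dz1                   # bias gradient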
Key Observation: The gradient dL/dz₂ simplifies to just (ŷ - y) when using cross-entropy loss with sigmoid output. This is not a coincidence — it's why cross-entropy is paired with softmax/sigmoid: they produce clean, non-saturating gradients.

Computational Cost

For a network with N parameters, a naive approach such as finite differences perturbs one parameter at a time and needs on the order of N forward passes to estimate all gradients. Backpropagation computes all N gradients in one forward and one backward pass, at a total cost of roughly two to three forward passes. This is what makes training billion-parameter models feasible.
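
The cost difference can be seen on a toy network. The sketch below is an illustration under assumed shapes (3 inputs, 4 hidden units, 1 output, random weights): it computes gradients once with backpropagation and once with central finite differences, counting how many forward passes the latter needs. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
y = 1.0
params = {
    "W1": rng.standard_normal((4, 3)), "b1": rng.standard_normal((4, 1)),
    "W2": rng.standard_normal((1, 4)), "b2": rng.standard_normal((1, 1)),
}

def loss_fn(p):
    h1 = np.maximum(0, p["W1"] @ x + p["b1"])
    y_hat = 1 / (1 + np.exp(-(p["W2"] @ h1 + p["b2"])))
    return (-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))).item()

def backprop(p):
    # One forward pass, then one backward pass
    z1 = p["W1"] @ x + p["b1"]
    h1 = np.maximum(0, z1)
    y_hat = 1 / (1 + np.exp(-(p["W2"] @ h1 + p["b2"])))
    dz2 = y_hat - y
    dz1 = (p["W2"].T @ dz2) * (z1 > 0)
    return {"W1": dz1 @ x.T, "b1": dz1, "W2": dz2 @ h1.T, "b2": dz2}

def finite_differences(p, eps=1e-6):
    # Two forward passes per parameter (central differences)
    grads, n_forward = {}, 0
    for name, value in p.items():
        grads[name] = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + eps
            up = loss_fn(p)
            value[idx] = original - eps
            down = loss_fn(p)
            value[idx] = original          # restore the parameter
            grads[name][idx] = (up - down) / (2 * eps)
            n_forward += 2
    return grads, n_forward

analytic = backprop(params)
numeric, n_forward = finite_differences(params)
max_diff = max(np.abs(analytic[k] - numeric[k]).max() for k in params)
n_params = sum(v.size for v in params.values())
print(f"{n_params} parameters, {n_forward} forward passes, max difference {max_diff:.2e}")
```

Comparing the two results like this is also a common way to test a hand-written backward pass.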

Gradient Descent: Finding the Minimum

Once we have gradients from backpropagation, we use them to update the parameters. Gradient descent is the simplest update rule: move parameters in the direction that reduces loss.

The Basic Update Rule

θ_{t+1} = θ_t - η · ∇L(θ_t)

Where η (learning rate) controls the step size. This is the simplest form, but in practice we use more sophisticated optimizers.
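
A minimal sketch of the update rule on a toy quadratic loss whose minimum is known in advance. The loss, the target point, and both learning rates are hypothetical choices made to show the effect of η.

```python
import numpy as np

target = np.array([3.0, -1.0])   # hypothetical minimum of the toy loss

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    return 2 * (theta - target)

def gradient_descent(eta, steps, theta0=(0.0, 0.0)):
    theta = np.array(theta0)
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * grad L(theta_t)
    return theta

good = gradient_descent(eta=0.1, steps=100)     # converges toward the target
too_big = gradient_descent(eta=1.1, steps=50)   # overshoots further on every step
print("eta=0.1:", good, " loss:", loss(good))
print("eta=1.1: loss:", loss(too_big))
```

On this loss each step multiplies the distance to the minimum by (1 − 2η), so η = 0.1 shrinks it by 0.8 per step while η = 1.1 grows it by a factor of 1.2.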

Variants of Gradient Descent

Gradient Descent Variants

| Method | Update Rule | Pros | Cons |
|---|---|---|---|
| SGD | θ ← θ - η · ∇L(θ) | Simple, generalizes well | Slow, noisy |
| SGD + Momentum | v ← βv + ∇L(θ); θ ← θ - η · v | Accelerates through valleys | Extra hyperparameter β |
| RMSProp | Adapts learning rate per parameter using running average of squared gradients | Handles different scales | Can be unstable |
| Adam | Combines momentum + RMSProp | Fast convergence, robust | May generalize slightly worse |
| AdamW | Adam + decoupled weight decay | Standard for LLMs | More memory: two extra state values per parameter (as with Adam) |
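
To illustrate the "accelerates through valleys" entry, the sketch below compares the plain update with the momentum update from the table on a toy valley-shaped loss. It uses the exact gradient rather than noisy minibatch gradients, and the curvatures, learning rate, and β are hypothetical choices.

```python
import numpy as np

curvature = np.array([1.0, 50.0])   # shallow along axis 0, steep along axis 1

def loss(theta):
    return 0.5 * np.sum(curvature * theta ** 2)

def grad(theta):
    return curvature * theta

def sgd(theta, eta, steps):
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

def sgd_momentum(theta, eta, beta, steps):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + grad(theta)      # v <- beta * v + grad L(theta)
        theta = theta - eta * v         # theta <- theta - eta * v
    return theta

start = np.array([1.0, 1.0])
plain = sgd(start, eta=0.02, steps=200)
momentum = sgd_momentum(start, eta=0.02, beta=0.9, steps=200)
print(f"plain SGD loss: {loss(plain):.2e}")
print(f"with momentum:  {loss(momentum):.2e}")
```

The steep direction limits how large η can be, which leaves plain gradient descent crawling along the shallow direction; the accumulated velocity makes up for that.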

Adam Optimizer — The Details

Adam (Adaptive Moment Estimation) maintains exponential moving averages of both the gradient and the squared gradient:

m_t = β₁ · m_{t-1} + (1-β₁) · g_t (1st moment: running avg of gradient)
v_t = β₂ · v_{t-1} + (1-β₂) · g_t² (2nd moment: running avg of squared gradient)
m̂_t = m_t / (1 - β₁ᵗ) (bias correction)
v̂_t = v_t / (1 - β₂ᵗ) (bias correction)
θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε) (parameter update)

Common default hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. The original Adam paper suggests η = 10⁻³; smaller values around 3×10⁻⁴ are a common choice for pre-training.

Why Adam Works: The first moment (m) provides momentum, smoothing out noisy gradients. The second moment (v) adapts the learning rate for each parameter — parameters with large gradients get smaller steps, and parameters with small gradients get larger steps. This is crucial for LLMs where some parameters see frequent updates (common word embeddings) and others see rare updates (rare word embeddings).
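
The five update equations translate almost line for line into code. This is a sketch rather than a reference implementation: the function name `adam_step` and the learning rate are illustrative choices, and the example gradients are made up to show the per-parameter scaling.

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # 1st moment
    v = beta2 * v + (1 - beta2) * g ** 2     # 2nd moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)             # bias correction
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by a factor of 1000
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
g = np.array([100.0, 0.1])

theta_new, m, v = adam_step(theta, g, m, v, t=1)
step = theta - theta_new
print("step sizes:", step)
```

Despite the 1000× difference in gradient magnitude, both parameters move by about η on the first step, because each gradient is divided by its own running magnitude.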

Training in Practice: Regularization & Tricks

Getting neural networks to train well requires more than just the right architecture and optimizer. Several techniques prevent overfitting and stabilize training.

1. Weight Decay (L2 Regularization)

Weight decay adds a penalty for large weights to the loss function:

L_total = L_original + (λ/2) · Σᵢ wᵢ²

This prevents weights from growing too large, which reduces overfitting. In AdamW, weight decay is decoupled from the gradient-based update, which works better than L2 regularization with Adam.
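
The difference between the two formulations is easiest to see as update rules. In the sketch below (function names and values are hypothetical), the L2 version adds λw to the gradient, while the decoupled version shrinks the weights directly. Under plain SGD the two coincide.

```python
import numpy as np

def sgd_l2_step(w, grad, eta, lam):
    # L2 regularization: the penalty's gradient (lam * w) is added to the loss gradient
    return w - eta * (grad + lam * w)

def sgd_decoupled_step(w, grad, eta, lam):
    # Decoupled weight decay: shrink the weights, then take the gradient step
    return w * (1 - eta * lam) - eta * grad

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
a = sgd_l2_step(w, g, eta=0.1, lam=0.01)
b = sgd_decoupled_step(w, g, eta=0.1, lam=0.01)
print(a, b)
```

With Adam the two are no longer equivalent: an L2 term folded into the gradient gets divided by √v̂ along with everything else, so the effective decay varies per parameter. Decoupling keeps the shrink factor the same for every weight.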

2. Dropout

During training, randomly set some activations to zero with probability p:

h_dropout[i] = h[i] / (1-p)    with probability (1-p)
h_dropout[i] = 0                with probability p

This forces the network to not rely too heavily on any single neuron. During inference, all neurons are active and no extra scaling is needed, because the 1/(1-p) factor applied during training already keeps the expected activation unchanged.
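
A minimal sketch of the formula above. Because kept activations are scaled by 1/(1-p) at training time, the expected value of each activation is already preserved, so this version returns the input untouched at inference. The function signature is an illustrative choice.

```python
import numpy as np

def dropout(h, p, training, rng):
    if not training or p == 0.0:
        return h                            # inference: all units active, no scaling
    keep = rng.random(h.shape) >= p         # keep each unit with probability 1 - p
    return h * keep / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(100_000)
train_out = dropout(h, p=0.5, training=True, rng=rng)
eval_out = dropout(h, p=0.5, training=False, rng=rng)
print("train mean:", train_out.mean(), " eval mean:", eval_out.mean())
```

With p = 0.5, roughly half the training outputs are 0 and the rest are 2, so the mean stays close to 1.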

3. Layer Normalization

LayerNorm(x) = (x - μ) / (σ² + ε)^½ · γ + β

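Where μ and σ² are the mean and variance computed over the features of each token (not over the batch), and γ and β are learned scale and shift parameters. This stabilizes training by keeping activations in a reasonable range. Layer normalization, or a close variant, appears in every transformer block.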
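
The formula can be sketched as follows, assuming activations of shape (batch, tokens, features) so that the statistics are taken over the last axis. The input values and the value of ε are arbitrary.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)     # per-token mean over features
    var = x.var(axis=-1, keepdims=True)     # per-token variance over features
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8)) * 5 + 3  # (batch, tokens, features), shifted and scaled
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print("per-token mean:", np.round(out.mean(axis=-1)[0], 6))
print("per-token std: ", np.round(out.std(axis=-1)[0], 4))
```

Each token's features come out with mean close to 0 and standard deviation close to 1, regardless of the scale of the input or of the other tokens in the batch.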

4. Gradient Clipping

if ‖∇L‖ > threshold:
  ∇L ← ∇L · (threshold / ‖∇L‖)

Prevents exploding gradients by capping the gradient norm. Essential for training transformers — without it, a single bad gradient update can destroy the model's learning.
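
A runnable version of the rule above, as a sketch: the norm is taken over all parameter gradients together (often called the global norm), which is one common convention. The example gradients are chosen so the norm is easy to verify by hand.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]   # same direction, smaller length
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Clipping rescales every gradient by the same factor, so the update direction is unchanged; only its length is capped.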

5. Learning Rate Warmup

Start training with a very small learning rate and gradually increase it:

η_t = (t / T_warmup) · η_max          for t < T_warmup
η_t = schedule(t)                for t ≥ T_warmup

This prevents destructive gradient updates early in training when the model's parameters are random and gradients can be large. Typical warmup: 2000-10000 steps.
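
A sketch of the schedule above. The text leaves schedule(t) unspecified, so cosine decay is assumed here purely for illustration; `total_steps` and `eta_min` are hypothetical parameters of this sketch.

```python
import math

def learning_rate(t, eta_max, warmup_steps, total_steps, eta_min=0.0):
    if t < warmup_steps:
        return eta_max * t / warmup_steps            # linear warmup
    # Assumed post-warmup schedule: cosine decay from eta_max to eta_min
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

for t in [0, 1000, 2000, 6000, 10000]:
    lr = learning_rate(t, eta_max=3e-4, warmup_steps=2000, total_steps=10000)
    print(f"step {t:5d}: lr = {lr:.2e}")
```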

From Neural Networks to Language Models

Traditional feedforward networks (the kind we've been discussing) process fixed-size inputs and produce fixed-size outputs. But language is a sequence — variable length, order matters, and context is crucial.

The Sequence Problem

How do we handle variable-length input? Several approaches have been tried:

Architecture Evolution

| Architecture | How It Handles Sequences | Key Limitation |
|---|---|---|
| Bag of Words | Ignore order, average all word embeddings | Loses all positional information |
| RNN / LSTM | Process one token at a time, maintain hidden state | Sequential (can't parallelize), forgets early tokens |
| CNN | Sliding window over the sequence | Limited receptive field, can't attend globally |
| Transformer | Attention: every token can attend to every token | O(n²) memory/compute for sequence length n |
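
The bag-of-words limitation in the first row is easy to demonstrate. In this sketch the embeddings are random placeholders rather than trained vectors: two sentences with opposite meanings average to the same representation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "bites", "man"]
embeddings = {word: rng.standard_normal(4) for word in vocab}   # hypothetical random embeddings

def bag_of_words(sentence):
    # Average the word vectors; word order plays no role
    return np.mean([embeddings[w] for w in sentence.split()], axis=0)

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(np.allclose(a, b))
```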

Why Transformers Won

  • Parallelizable: All positions are computed simultaneously (not sequentially like RNNs)
  • Global context: Every token can directly attend to every other token
  • Scalable: Works well with GPU hardware — essentially massive matrix multiplications
  • Long-range dependencies: Attention has no "memory distance" — token 1 attends to token 1000 as easily as token 2

In the next level, we'll dive deep into the transformer architecture — how attention works mathematically, multi-head attention, positional encodings, and how these pieces combine to create the architectures behind GPT, BERT, and all modern LLMs.

What You Should Know Going Forward: The key takeaway from this level is that neural networks are compositions of simple functions: linear transformations plus non-linear activations. They learn through backpropagation (the chain rule applied to computational graphs) and gradient descent. Everything in transformers (attention, feed-forward layers, layer normalization) follows these same principles. The transformer just arranges these building blocks differently.