Level 02: Neural Networks

The Perceptron: Where It All Began

In 1958, Frank Rosenblatt introduced the perceptron — the simplest neural network. It's a single neuron that takes inputs, applies weights, and produces an output. Everything in modern deep learning builds on this foundation.

🧠 The Perceptron

[Diagram: inputs x₁, x₂, x₃ with weights w₁ = 0.5, w₂ = -0.3, w₃ = 0.8 feed a summation node (Σ + b), which produces the output.]

output = σ(w₁x₁ + w₂x₂ + w₃x₃ + b)

Inputs (x) are multiplied by weights (w), summed, have a bias (b) added, then pass through an activation function (σ).
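The formula above can be sketched in a few lines of NumPy. The weights are the ones shown in the diagram; the bias value and the example inputs are assumptions chosen only for illustration, and σ is taken to be the sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    """Single neuron: weighted sum of inputs plus bias, then activation."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.5, -0.3, 0.8])   # weights from the diagram
b = 0.0                          # assumed bias (not given in the diagram)
x = np.array([1.0, 2.0, 3.0])    # assumed example inputs

# Weighted sum: 0.5*1 - 0.3*2 + 0.8*3 = 2.3, and sigmoid(2.3) is about 0.909
output = perceptron(x, w, b)
print(f"output = {output:.4f}")
```

Flipping the sign of a weight or shrinking it toward zero changes how much the corresponding input moves the weighted sum, which is the "voting" behaviour this level describes.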

What Does a Neuron Actually Do?

Think of a neuron as a weighted voting system:

Positive weights → "This input supports the output"
Negative weights → "This input opposes the output"
Large magnitude → "This input matters a lot"
Near zero → "This input doesn't matter"

          Example: A spam detector neuron might have:

          • High positive weight for "viagra"

          • High negative weight for "meeting scheduled"

          • Near-zero weight for "the"

Activation Functions: Adding Non-Linearity

Without activation functions, a neural network would just be a linear function — no matter how many layers you stack, you'd still just have a fancy linear regression. Activation functions introduce the non-linearity that lets networks learn complex patterns.
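A small numeric check makes this concrete. The matrices below are hypothetical values chosen by hand so the result is easy to verify; biases are left out to keep the sketch short.

```python
import numpy as np

# Hand-picked toy weights and input (assumptions for illustration)
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([[1.0],
              [2.0]])

# Two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)

# ...equal a single linear layer whose weight matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x

# With a ReLU between the layers, the single collapsed matrix
# no longer reproduces the output
with_relu = W2 @ np.maximum(0, W1 @ x)

print(two_layers.item(), one_layer.item(), with_relu.item())
```

Here `W1 @ x` is `[-1, 1]`; the purely linear stack sums these to 0, while ReLU zeroes the negative entry first and the output becomes 1.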

Common Activation Functions

Activation Function Comparison

ReLU (Rectified Linear Unit)

f(x) = max(0, x)

The most common default. Simple and fast, and its gradient does not shrink for positive inputs, which mitigates vanishing gradients.

Sigmoid

f(x) = 1 / (1 + e⁻ˣ)

Outputs 0-1. Good for probabilities. Suffers from vanishing gradients.

Tanh

f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

Outputs -1 to 1. Zero-centered. Still has vanishing gradient issues.

GELU (Modern Choice)

f(x) = x · Φ(x), where Φ is the standard Gaussian CDF

Used in GPT, BERT. Smooth, probabilistic interpretation.

          Why ReLU is Popular:

          1. Computationally cheap (just max(0, x))

          2. Doesn't saturate for positive values (no vanishing gradient)

          3. Induces sparsity (many neurons output exactly 0)

          4. Biologically plausible (neurons don't fire below threshold)
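The four activation functions compared above can be written directly in NumPy. This is a minimal sketch: GELU is implemented in its exact form x · Φ(x) using the error function from Python's `math` module, and the sample inputs are arbitrary.

```python
import math
import numpy as np

# Vectorized error function, used to build the Gaussian CDF
_erf = np.vectorize(math.erf)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def gelu(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard Gaussian CDF
    return x * 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

xs = np.array([-2.0, 0.0, 2.0])
for name, fn in [("relu", relu), ("sigmoid", sigmoid),
                 ("tanh", tanh), ("gelu", gelu)]:
    print(f"{name:8s}", np.round(fn(xs), 4))
```

Note how ReLU outputs exactly 0 for the negative input, while GELU outputs a small negative value there and approaches x for large positive inputs.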

Multi-Layer Networks: Going Deep

A single neuron can only learn linear patterns (a linear decision boundary). To learn complex functions, we stack neurons in layers, and stacking many layers is what makes a network "deep."

Multi-Layer Perceptron (MLP)

[Diagram: Input Layer (x₁, x₂, x₃) → Hidden Layer (h₁, h₂, h₃, h₄) → Output Layer (output).]

Information flows from input → hidden → output. Each connection has a weight.

Why Depth Matters

The key insight: each hidden layer learns increasingly abstract features:

Layer 1: Detects simple patterns (edges, colors, basic shapes)
Layer 2: Combines simple patterns (corners, textures)
Layer 3: Combines into complex features (eyes, wheels, letters)
Deeper layers: High-level concepts (faces, cars, words)

          Universal Approximation Theorem: A neural network with just one hidden layer 
          can approximate any continuous function on a bounded input range to any desired accuracy, 
          but it might need exponentially many neurons. Depth allows a more efficient representation of complex functions.
        

The Forward Pass: How Data Flows

When you input data into a neural network, it flows through each layer in the forward pass. Let's see exactly what happens mathematically.

import numpy as np

# A simple 2-layer neural network
def forward_pass(x, W1, b1, W2, b2):
    # Layer 1: Input (3 features) -> Hidden (4 neurons)
    z1 = np.dot(W1, x) + b1      # Linear transformation
    h = np.maximum(0, z1)       # ReLU activation
    
    # Layer 2: Hidden (4 neurons) -> Output (1 neuron)
    z2 = np.dot(W2, h) + b2      # Linear transformation
    output = 1 / (1 + np.exp(-z2))  # Sigmoid for probability
    
    return output

# Example dimensions
x = np.random.randn(3, 1)      # Input: 3 features
W1 = np.random.randn(4, 3)     # Weights: 4 hidden neurons × 3 inputs
b1 = np.zeros((4, 1))          # Bias: 4 hidden neurons
W2 = np.random.randn(1, 4)     # Weights: 1 output × 4 hidden
b2 = np.zeros((1, 1))          # Bias: 1 output

prediction = forward_pass(x, W1, b1, W2, b2)
print(f"Output: {prediction[0][0]:.4f}")
        

Matrix Multiplication View

The forward pass is really just a series of matrix operations:

h = ReLU(W₁x + b₁)
ŷ = σ(W₂h + b₂)

This is why GPUs are so important for deep learning — they're designed for fast matrix multiplication!
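One way to see the matrix view pay off is to push several examples through the network at once. The sketch below assumes the same layer sizes as the forward pass in this level (3 inputs, 4 hidden neurons, 1 output) and stacks five hypothetical examples as columns of a matrix `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

X = rng.standard_normal((3, 5))            # 5 examples, one per column

# The same two formulas, applied to the whole batch in one shot
H = np.maximum(0, W1 @ X + b1)             # (4, 5): bias broadcasts across columns
Y_hat = 1 / (1 + np.exp(-(W2 @ H + b2)))   # (1, 5): one probability per example

# Column k matches running the single-example formulas on X[:, k]
k = 2
x_k = X[:, [k]]
y_k = 1 / (1 + np.exp(-(W2 @ np.maximum(0, W1 @ x_k + b1) + b2)))

print(Y_hat.shape, np.allclose(Y_hat[:, [k]], y_k))
```

Processing the batch is still just two matrix multiplications, only with wider matrices, which is the kind of workload GPUs handle well.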

What Neural Networks Can Learn

Neural networks are universal function approximators. Given enough neurons and layers, they can learn:

Classification: Is this email spam or not?
Regression: What's the predicted house price?
Pattern recognition: What digit is in this image?
Sequence modeling: What word comes next?
Function approximation: Any continuous mathematical function

          The Catch: Networks can learn these things in theory, but actually 
          training them requires good initialization, proper optimization, enough data, and careful 
          architecture design. That's what we'll cover in Level 4.
        

From Neural Networks to LLMs

Traditional neural networks process fixed-size inputs. But language is sequential and variable-length. This is where recurrent neural networks and later transformers come in.

In the next level, we'll see how the transformer architecture revolutionized language modeling by processing entire sequences in parallel while maintaining contextual understanding.

Loss Functions: Measuring Error

To train a neural network, we need a way to measure how wrong its predictions are. The loss function (also called "cost function" or "objective function") quantifies this error. Training means finding parameters that minimize this loss.

Mean Squared Error (MSE)

For regression problems (predicting continuous values), the most common loss is MSE:

L_MSE = (1/n) Σᵢ (ŷᵢ - yᵢ)²

Where ŷᵢ is the prediction and yᵢ is the true value. MSE penalizes large errors heavily (because of the squaring), which makes it sensitive to outliers.
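The outlier sensitivity is easy to see with numbers. The targets and predictions below are made-up values for illustration.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error over all examples."""
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([3.0, 5.0, 2.0, 7.0])

# Every prediction is off by 0.5
close = np.array([2.5, 5.5, 2.5, 6.5])

# Three predictions are exact, one is off by 4
outlier = np.array([3.0, 5.0, 2.0, 11.0])

mse_close = mse(close, y_true)      # 4 * 0.25 / 4 = 0.25
mse_outlier = mse(outlier, y_true)  # 16 / 4 = 4.0

print(mse_close, mse_outlier)
```

Both prediction sets have a total absolute error of 2 or 4, yet the single large miss produces a loss 16 times higher because the error is squared before averaging.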

Cross-Entropy Loss

For classification and language modeling, cross-entropy is the standard:

L_CE = -(1/n) Σᵢ Σⱼ yᵢⱼ · log(ŷᵢⱼ)

For language models, this simplifies because yᵢⱼ is one-hot (only the true next token has value 1):

L = -(1/T) Σₜ log P(xₜ | x₁,...,xₜ₋₁)

          Why cross-entropy? Three reasons: (1) It's the negative log-likelihood, 
          so minimizing it maximizes the probability of the training data. (2) When combined with softmax, 
          it produces clean gradients that don't saturate. (3) It directly measures the "surprise" 
          of the model at each prediction — lower surprise means better predictions.
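A minimal sketch of cross-entropy with one-hot targets, assuming the model produces raw scores (logits) that are turned into probabilities with softmax. The toy logits, the vocabulary size of 4, and the target indices are all hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row max improves numerical stability
    # and leaves the resulting probabilities unchanged
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """Mean negative log-probability assigned to the true class or token."""
    probs = softmax(logits)
    true_probs = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(true_probs))

# Toy setup: 3 positions, vocabulary of 4 tokens
logits = np.array([[2.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 0.0]])
targets = np.array([0, 1, 2])   # index of the true token at each position

loss = cross_entropy(logits, targets)

# A model that guesses uniformly over 4 tokens has loss log(4)
uniform_loss = cross_entropy(np.zeros((3, 4)), targets)

print(f"loss = {loss:.3f}, uniform baseline = {uniform_loss:.3f}")
```

Because the targets are one-hot, only the probability of the true token enters the sum, which is exactly the simplification in the language-model formula above.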
        

Binary Cross-Entropy

For yes/no decisions (spam or not, etc.):

L_BCE = -(1/n) Σᵢ [yᵢ · log(ŷᵢ) + (1-yᵢ) · log(1-ŷᵢ)]

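This is just cross-entropy for 2-class classification. A closely related form appears in RLHF: the reward model's pairwise preference loss is a binary cross-entropy on which of two responses was preferred.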
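A short sketch of the binary case, using made-up labels and predicted probabilities. The clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import numpy as np

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    # Clip probabilities away from 0 and 1 so the logs stay finite
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0, 0.0])

confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

loss_right = binary_cross_entropy(confident_right, y_true)  # equals -log(0.9)
loss_wrong = binary_cross_entropy(confident_wrong, y_true)  # equals -log(0.1)

print(f"{loss_right:.4f} vs {loss_wrong:.4f}")
```

Confident wrong answers are punished far more than confident right answers are rewarded, since -log(p) grows without bound as p approaches 0.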

Backpropagation: The Engine of Learning

Backpropagation is the algorithm that makes training neural networks possible. It efficiently computes the gradient of the loss with respect to every parameter in the network, using the chain rule from calculus.

Intuition: The Blame Game

Think of a neural network as a chain of functions:

The Forward and Backward Pass

            Forward pass:

            x → [W₁,b₁] → z₁ → [ReLU] → h₁ → [W₂,b₂] → z₂ → [ReLU] → h₂ → [W₃,b₃] → ŷ → [Loss] → L

            Backward pass (gradients):

            ∂L/∂z₁ ← ∂L/∂h₁ ← ∂L/∂z₂ ← ∂L/∂h₂ ← ∂L/∂ŷ ← L

            Weight gradients branch off at each layer: ∂L/∂W₃ from ∂L/∂ŷ, ∂L/∂W₂ from ∂L/∂z₂, ∂L/∂W₁ from ∂L/∂z₁

The key insight: each layer only needs local information to compute its gradients:

The gradient from above: How much does the loss change when this layer's output changes?
The local derivative: How does this layer's output change when its input changes?
Multiply them: Chain rule → you have the gradient for this layer!

Step-by-Step Example

Let's trace through a tiny 2-layer network computing ŷ = σ(W₂ · ReLU(W₁ · x + b₁) + b₂):

# Forward pass
z₁ = W₁ · x + b₁          # Linear transformation, layer 1
h₁ = ReLU(z₁)               # Activation
z₂ = W₂ · h₁ + b₂          # Linear transformation, layer 2
ŷ  = sigmoid(z₂)            # Output probability
L  = -[y·log(ŷ) + (1-y)·log(1-ŷ)]   # Binary cross-entropy loss

# Backward pass (computing gradients)
# Step 1: Gradient of loss w.r.t. output
dL_dŷ = (ŷ - y) / (ŷ · (1 - ŷ))    # derivative of cross-entropy alone w.r.t. ŷ

# Step 2: Through layer 2
dL_dz₂ = dL_dŷ · σ'(z₂)           # = ŷ - y (beautiful simplification!)
dL_dW₂ = dL_dz₂ · h₁^T            # outer product
dL_db₂ = dL_dz₂                    # bias gradient

# Step 3: Through ReLU
dL_dh₁ = W₂^T · dL_dz₂            # gradient flowing back
dL_dz₁ = dL_dh₁ ⊙ 𝟙(z₁ > 0)     # elementwise: ReLU derivative is 1 if active, 0 if not

# Step 4: Through layer 1
dL_dW₁ = dL_dz₁ · x^T             # outer product
dL_db₁ = dL_dz₁                    # bias gradient
        

          Key Observation: The gradient dL/dz₂ simplifies to just (ŷ - y) when using 
          cross-entropy loss with sigmoid output. This is not a coincidence — it's why cross-entropy 
          is paired with softmax/sigmoid: they produce clean, non-saturating gradients.
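The derivation can be checked numerically. The following is a minimal NumPy sketch of the same 2-layer network, with randomly drawn weights and a single made-up training example; the helper names `forward` and `backward` are hypothetical. It compares the backpropagated gradient for W₁ against a finite-difference estimate.

```python
import numpy as np

def forward(x, y, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h1 = np.maximum(0, z1)                      # ReLU
    z2 = W2 @ h1 + b2
    y_hat = 1 / (1 + np.exp(-z2))               # sigmoid
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).item()
    return loss, (z1, h1, y_hat)

def backward(x, y, W2, cache):
    z1, h1, y_hat = cache
    dz2 = y_hat - y              # sigmoid + cross-entropy simplification
    dW2 = dz2 @ h1.T             # outer product
    db2 = dz2
    dh1 = W2.T @ dz2             # gradient flowing back into the hidden layer
    dz1 = dh1 * (z1 > 0)         # elementwise ReLU mask
    dW1 = dz1 @ x.T              # outer product
    db1 = dz1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x, y = rng.standard_normal((3, 1)), 1.0
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

loss, cache = forward(x, y, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(x, y, W2, cache)

# Finite-difference estimate: perturb one weight at a time (two forward passes each)
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        loss_plus = forward(x, y, W_plus, b1, W2, b2)[0]
        loss_minus = forward(x, y, W_minus, b1, W2, b2)[0]
        numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_error = np.max(np.abs(numeric - dW1))
print(f"max |numeric - backprop| = {max_error:.2e}")
```

The finite-difference loop needs a pair of forward passes per weight, while `backward` produces every gradient in one sweep, which is the efficiency argument for backpropagation.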
        

Computational Cost

For a network with N parameters, a naive finite-difference approach would require on the order of N separate forward passes (one per perturbed parameter) to estimate all gradients. Backpropagation computes all N gradients in just two passes (one forward, one backward), at a total cost that is a small constant multiple of a single forward pass. This is what makes training billion-parameter models feasible.

Gradient Descent: Finding the Minimum

Once we have gradients from backpropagation, we use them to update the parameters. Gradient descent is the simplest update rule: move parameters in the direction that reduces loss.

The Basic Update Rule

θ_{t+1} = θ_t - η · ∇L(θ_t)

Where η (learning rate) controls the step size. This is the simplest form, but in practice we use more sophisticated optimizers.
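To see the update rule in action, here is a sketch on a one-parameter toy loss L(θ) = (θ - 3)², whose minimum is at θ = 3. The loss, the starting point, and both learning rates are assumptions chosen for illustration.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def run_gradient_descent(theta, eta, steps):
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * grad
    return theta

# A moderate learning rate shrinks the distance to the minimum by 0.8x per step
converged = run_gradient_descent(0.0, eta=0.1, steps=50)

# Too large a learning rate overshoots further on every step and diverges
diverged = run_gradient_descent(0.0, eta=1.1, steps=50)

print(f"eta=0.1 -> {converged:.5f}, eta=1.1 -> {diverged:.1f}")
```

For this loss each step multiplies the error (θ - 3) by (1 - 2η), so any η above 1 makes the error grow instead of shrink.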

Variants of Gradient Descent

Gradient Descent Variants

Method	Update Rule	Pros	Cons
SGD	θ ← θ - η · ∇L(θ)	Simple, generalizes well	Slow, noisy
SGD + Momentum	v ← βv + ∇L(θ); θ ← θ - η · v	Accelerates through valleys	Extra hyperparameter β
RMSProp	Adapts learning rate per parameter using running average of squared gradients	Handles different scales	Can be unstable
Adam	Combines momentum + RMSProp	Fast convergence, robust	May generalize slightly worse
AdamW	Adam + decoupled weight decay	Standard for LLMs	More memory than SGD (two extra state values per parameter, same as Adam)
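The first two rows of the table can be compared on a toy problem. This sketch assumes a hypothetical "valley" loss, 0.5 · Σ cᵢθᵢ², that is 25 times steeper in one direction than the other, and uses full gradients rather than mini-batches; the learning rate and β are illustrative choices.

```python
import numpy as np

curvature = np.array([1.0, 25.0])   # assumed toy loss: 0.5 * sum(curvature * theta**2)

def grad(theta):
    return curvature * theta

def run_sgd(theta, lr, steps):
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def run_momentum(theta, lr, beta, steps):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + grad(theta)    # accumulate a running direction
        theta = theta - lr * v
    return theta

start = np.array([1.0, 1.0])
sgd_dist = np.linalg.norm(run_sgd(start, lr=0.02, steps=300))
momentum_dist = np.linalg.norm(run_momentum(start, lr=0.02, beta=0.9, steps=300))

print(f"distance to minimum: sgd={sgd_dist:.2e}, momentum={momentum_dist:.2e}")
```

With the same learning rate, plain gradient descent crawls along the shallow direction, while the accumulated velocity lets momentum cover that distance much faster on this particular problem.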

Adam Optimizer — The Details

Adam (Adaptive Moment Estimation) maintains exponential moving averages of both the gradient and the squared gradient:

m_t = β₁ · m_{t-1} + (1-β₁) · g_t        (1st moment: running avg of gradient)
v_t = β₂ · v_{t-1} + (1-β₂) · g_t²       (2nd moment: running avg of squared gradient)
m̂_t = m_t / (1 - β₁ᵗ)                    (bias correction)
v̂_t = v_t / (1 - β₂ᵗ)                    (bias correction)
θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε)     (parameter update)

Common hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ (the defaults from the original Adam paper, which also suggests η = 10⁻³). For pre-training, η = 3×10⁻⁴ is a common choice.

          Why Adam Works: The first moment (m) provides momentum, smoothing out noisy gradients. 
          The second moment (v) adapts the learning rate for each parameter — parameters with large gradients 
          get smaller steps, and parameters with small gradients get larger steps. This is crucial for LLMs 
          where some parameters see frequent updates (common word embeddings) and others see rare updates 
          (rare word embeddings).
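The Adam update equations translate almost line for line into NumPy. This is a minimal sketch of a single step, not a full optimizer; the function name `adam_step` and the two example gradients are hypothetical, and the default values are the ones quoted in this level.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # 1st moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2nd moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)                # bias correction
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

# Two parameters whose gradients differ by a factor of 10,000
grad = np.array([100.0, 0.01])

theta_new, m, v = adam_step(theta, grad, m, v, t=1)
step = theta - theta_new

print(step)   # both entries are close to the learning rate, 3e-4
```

Despite the huge difference in gradient magnitude, both parameters move by roughly the same amount, because each gradient is divided by its own running scale. Plain SGD would have moved the first parameter 10,000 times further than the second.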
        

Training in Practice: Regularization & Tricks

Getting neural networks to train well requires more than just the right architecture and optimizer. Several techniques prevent overfitting and stabilize training.

1. Weight Decay (L2 Regularization)

Weight decay adds a penalty for large weights to the loss function:

L_total = L_original + (λ/2) · Σᵢ wᵢ²

This prevents weights from growing too large, which reduces overfitting. In AdamW, weight decay is decoupled from the gradient-based update, which works better than L2 regularization with Adam.
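A sketch of the two formulations under plain SGD, where they coincide. The function names, weights, gradient, and hyperparameters are hypothetical values for illustration.

```python
import numpy as np

def l2_penalty(w, lam):
    """The extra loss term (lambda / 2) * sum(w_i^2)."""
    return 0.5 * lam * np.sum(w ** 2)

def sgd_step_l2(w, grad, lr, lam):
    # L2 regularization: the penalty's gradient (lam * w) is added to the loss gradient
    return w - lr * (grad + lam * w)

def sgd_step_decoupled(w, grad, lr, lam):
    # Decoupled decay: shrink the weights directly, then apply the loss gradient
    return w * (1 - lr * lam) - lr * grad

w = np.array([2.0, -4.0])
grad = np.array([0.5, 0.5])    # gradient of the original loss (made up)
lr, lam = 0.1, 0.01

penalty = l2_penalty(w, lam)               # 0.5 * 0.01 * (4 + 16) = 0.1
w_l2 = sgd_step_l2(w, grad, lr, lam)
w_decoupled = sgd_step_decoupled(w, grad, lr, lam)

print(penalty, w_l2, w_decoupled)
```

With plain SGD the two updates are algebraically identical. They diverge under Adam, because a penalty gradient added to `grad` gets rescaled by the adaptive per-parameter term, whereas the decoupled shrink is applied to the weights directly.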

2. Dropout

During training, randomly set some activations to zero with probability p:

h_dropout[i] = h[i] / (1-p)   with probability (1-p)
h_dropout[i] = 0              with probability p

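This forces the network to not rely too heavily on any single neuron. During inference, all neurons are active and no extra scaling is needed, because the 1/(1-p) factor applied during training already keeps the expected activation unchanged. (The alternative convention skips the training-time scaling and instead multiplies outputs by (1-p) at inference.)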
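A minimal sketch of dropout using the training-time 1/(1-p) scaling from the formula above. The function signature, the `training` flag, and the all-ones activations are assumptions for illustration.

```python
import numpy as np

def dropout(h, p, training, rng):
    """Zero each activation with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return h                          # inference: pass activations through
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(100_000)

train_out = dropout(h, p=0.2, training=True, rng=rng)
eval_out = dropout(h, p=0.2, training=False, rng=rng)

print(f"train mean = {train_out.mean():.3f}, eval mean = {eval_out.mean():.3f}")
print("distinct training values:", np.unique(train_out))
```

During training each activation is either 0 or 1/(1-p) = 1.25 times its original value, so the average stays close to 1, which matches what the pass-through path produces at inference.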

3. Layer Normalization

LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β

Where μ and σ² are the mean and variance computed over the features of each token (not the batch), and γ and β are learned scale and shift parameters. This stabilizes training by keeping activations in a reasonable range. Used in every transformer block.
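A sketch of layer normalization over the last (feature) axis. The two "tokens" below are made-up rows with very different scales, and γ and β are set to their usual starting values of 1 and 0; the `eps` default is an assumed typical value.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per row (per token), across its features
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])   # same pattern, 10x the scale

out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))

print(np.round(out, 3))
print("row means:", np.round(out.mean(axis=-1), 6))
print("row stds: ", np.round(out.std(axis=-1), 3))
```

Both rows come out nearly identical, with mean about 0 and standard deviation about 1, so the following layer sees inputs in the same range regardless of the original scale.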

4. Gradient Clipping

if ‖∇L‖ > threshold: ∇L ← ∇L · (threshold / ‖∇L‖)

Prevents exploding gradients by capping the gradient norm. Essential for training transformers — without it, a single bad gradient update can destroy the model's learning.
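A sketch of clipping by the global norm, where the norm is taken over all parameter gradients together. The function name and the two gradient arrays are hypothetical; they are chosen so the combined norm is exactly 5.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients together if their combined norm exceeds threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Gradients for two parameter tensors; global norm = sqrt(3^2 + 4^2) = 5
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]

clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(norm_before, norm_after, clipped)
```

Every gradient is multiplied by the same factor, so the update keeps its direction and only its length is capped. Gradients already below the threshold are returned unchanged.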

5. Learning Rate Warmup

Start training with a very small learning rate and gradually increase it:

η_t = (t / T_warmup) · η_max   for t < T_warmup
η_t = schedule(t)              for t ≥ T_warmup

This prevents destructive gradient updates early in training when the model's parameters are random and gradients can be large. Typical warmup: 2000-10000 steps.
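A sketch of a warmup schedule. The text leaves `schedule(t)` unspecified, so this example assumes a cosine decay to zero after warmup, which is one common choice; the step counts and peak learning rate are illustrative.

```python
import math

def learning_rate(t, eta_max, warmup_steps, total_steps):
    if t < warmup_steps:
        # Linear warmup: ramp from 0 up to eta_max
        return (t / warmup_steps) * eta_max
    # Assumed schedule(t): cosine decay from eta_max down to 0
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * eta_max * (1.0 + math.cos(math.pi * progress))

eta_max, warmup_steps, total_steps = 3e-4, 2000, 100_000

samples = {
    t: learning_rate(t, eta_max, warmup_steps, total_steps)
    for t in (0, 1000, 2000, 51_000, 100_000)
}

for t, lr in samples.items():
    print(f"step {t:>7}: lr = {lr:.2e}")
```

The rate is zero at step 0, reaches half of the peak midway through warmup, hits the peak exactly when warmup ends, and then decays smoothly under the assumed cosine schedule.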

From Neural Networks to Language Models

Traditional feedforward networks (the kind we've been discussing) process fixed-size inputs and produce fixed-size outputs. But language is a sequence — variable length, order matters, and context is crucial.

The Sequence Problem

How do we handle variable-length input? Several approaches have been tried:

Architecture Evolution

Architecture	How It Handles Sequences	Key Limitation
Bag of Words	Ignore order, average all word embeddings	Loses all positional information
RNN / LSTM	Process one token at a time, maintain hidden state	Sequential (can't parallelize), forgets early tokens
CNN	Sliding window over the sequence	Limited receptive field, can't attend globally
Transformer	Attention: every token can attend to every token	O(n²) memory/compute for sequence length n

Why Transformers Won

Parallelizable: All positions are computed simultaneously (not sequentially like RNNs)
Global context: Every token can directly attend to every other token
Scalable: Works well with GPU hardware — essentially massive matrix multiplications
Long-range dependencies: Attention has no "memory distance" — token 1 attends to token 1000 as easily as token 2

In the next level, we'll dive deep into the transformer architecture — how attention works mathematically, multi-head attention, positional encodings, and how these pieces combine to create the architectures behind GPT, BERT, and all modern LLMs.

          What You Should Know Going Forward: The key takeaway from this level is that 
          neural networks are compositions of simple functions: linear transformations + non-linear activations. 
          They learn through backpropagation (chain rule applied to computational graphs) and gradient descent. 
          Everything in transformers — attention, feed-forward layers, layer normalization — follows these 
          same principles. The transformer just arranges these building blocks differently.