The Perceptron: Where It All Began
In 1958, Frank Rosenblatt introduced the perceptron — the simplest neural network. It's a single neuron that takes inputs, applies weights, and produces an output. Everything in modern deep learning builds on this foundation.
🧠 The Perceptron
Inputs (x) are multiplied by weights (w), summed, have a bias (b) added, then pass through an activation function (σ)
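The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not Rosenblatt's original formulation: his perceptron used a hard threshold, whereas this sketch assumes a sigmoid for σ. The function name `neuron` and all numeric values are made up for the example.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Illustrative values only; they do not come from a trained model.
x = [1.0, 0.0, 2.0]
w = [0.5, -1.0, 0.25]
b = -1.0

# z = 0.5*1 + (-1.0)*0 + 0.25*2 - 1.0 = 0.0, and sigmoid(0) = 0.5
output = neuron(x, w, b)
print(output)
```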
What Does a Neuron Actually Do?
Think of a neuron as a weighted voting system:
- Positive weights → "This input supports the output"
- Negative weights → "This input opposes the output"
- Large magnitude → "This input matters a lot"
- Near zero → "This input doesn't matter"
For example, a neuron trained to detect spam might learn:
- High positive weight for "viagra"
- High negative weight for "meeting scheduled"
- Near-zero weight for "the"
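As a rough sketch of the voting idea, the snippet below scores a message by adding up the weights of the features it contains. The weight values, the `score` helper, and the choice to treat each phrase as a single feature are all assumptions made for illustration.

```python
# Hypothetical weights: the sign says which way a feature votes,
# the magnitude says how much that vote counts.
weights = {"viagra": 3.0, "meeting scheduled": -2.5, "the": 0.01}

def score(features, weights, bias=0.0):
    """Add up the weights of the features present in a message, plus a bias."""
    return bias + sum(weights.get(f, 0.0) for f in features)

spam_like = score({"viagra", "the"}, weights)
work_like = score({"meeting scheduled", "the"}, weights)

# A positive score leans toward "spam", a negative one toward "not spam".
print(spam_like, work_like)
```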
Activation Functions: Adding Non-Linearity
Without activation functions, a neural network would just be a linear function — no matter how many layers you stack, you'd still just have a fancy linear regression. Activation functions introduce the non-linearity that lets networks learn complex patterns.
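A small numeric check makes this concrete. The sketch below stacks two linear layers (biases omitted for brevity, which is an assumption of the example) and shows that they give the same output as a single layer whose weight matrix is the product W₂W₁. Putting a ReLU between them breaks that equivalence. The matrices are arbitrary.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def relu_vec(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 3x2
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]     # 2x3
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))           # layer after layer, no activation
one_layer = matvec(matmul(W2, W1), x)            # one layer with the merged matrix
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # non-linearity in between

print(two_layers, one_layer, with_relu)
```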
Common Activation Functions
Activation Function Comparison
ReLU (Rectified Linear Unit)
The most common default. Simple and fast, and its gradient does not vanish for positive inputs.
Sigmoid
Outputs 0-1. Good for probabilities. Suffers from vanishing gradients.
Tanh
Outputs -1 to 1. Zero-centered. Still has vanishing gradient issues.
GELU (Modern Choice)
Used in GPT, BERT. Smooth, probabilistic interpretation.
Why ReLU became the default:
1. Computationally cheap (just max(0, x))
2. Doesn't saturate for positive values (the gradient stays at 1, so it doesn't vanish)
3. Induces sparsity (many neurons output exactly 0)
4. Loosely biologically plausible (neurons don't fire below a threshold)
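For reference, here is one way the four activation functions above might be written with only the standard library. The GELU uses the exact form x·Φ(x) via `math.erf`; many implementations use a tanh approximation instead. The `sigmoid_grad` helper is added only to illustrate saturation.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid_grad(x):
    """Derivative of the sigmoid; it shrinks toward 0 as |x| grows."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), sigmoid(x), tanh(x), gelu(x))

# Saturation: the sigmoid's slope is 0.25 at x = 0 but tiny at x = 10.
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
```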
Multi-Layer Networks: Going Deep
A single neuron can only learn linear patterns. To learn complex functions, we stack neurons in layers. This is what makes a network "deep."
Multi-Layer Perceptron (MLP)
Information flows from input → hidden → output. Each connection has a weight.
Why Depth Matters
The key insight: each hidden layer learns increasingly abstract features:
- Layer 1: Detects simple patterns (edges, colors, basic shapes)
- Layer 2: Combines simple patterns (corners, textures)
- Layer 3: Combines into complex features (eyes, wheels, letters)
- Deeper layers: High-level concepts (faces, cars, words)
The Forward Pass: How Data Flows
When you input data into a neural network, it flows through each layer in the forward pass. Let's see exactly what happens mathematically.
Matrix Multiplication View
The forward pass is really just a series of matrix operations:
h = σ(W₁x + b₁)
ŷ = σ(W₂h + b₂)
This is why GPUs are so important for deep learning — they're designed for fast matrix multiplication!
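A minimal sketch of a two-layer forward pass, written with plain lists so that each matrix-vector product is visible. It assumes a sigmoid for σ in both layers; the helper name `dense` and the weights are illustrative, chosen so the result is easy to verify by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, x, activation):
    """One layer: activation(W x + b), with W stored as a list of rows."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    h = dense(W1, b1, x, sigmoid)     # hidden layer
    return dense(W2, b2, h, sigmoid)  # output layer

# With these weights both hidden pre-activations are 0, so h = [0.5, 0.5],
# and the output pre-activation is 0.5 - 0.5 = 0.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0]]
b2 = [0.0]

y_hat = forward([2.0, 2.0], W1, b1, W2, b2)
print(y_hat)
```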
What Neural Networks Can Learn
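Neural networks are universal function approximators: with enough hidden units, a network can approximate any continuous function on a bounded domain to arbitrary accuracy. In practice, they are used to learn: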
- Classification: Is this email spam or not?
- Regression: What's the predicted house price?
- Pattern recognition: What digit is in this image?
- Sequence modeling: What word comes next?
- Function approximation: Any continuous mathematical function
From Neural Networks to LLMs
Traditional neural networks process fixed-size inputs. But language is sequential and variable-length. This is where recurrent neural networks and later transformers come in.
In the next level, we'll see how the transformer architecture revolutionized language modeling by processing entire sequences in parallel while maintaining contextual understanding.
Loss Functions: Measuring Error
To train a neural network, we need a way to measure how wrong its predictions are. The loss function (also called "cost function" or "objective function") quantifies this error. Training means finding parameters that minimize this loss.
Mean Squared Error (MSE)
For regression problems (predicting continuous values), the most common loss is MSE:
MSE = (1/n) · Σᵢ (ŷᵢ - yᵢ)²
Where ŷᵢ is the prediction, yᵢ is the true value, and n is the number of examples. MSE penalizes large errors heavily (because of the squaring), which makes it sensitive to outliers.
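A short sketch of MSE as the mean of squared differences. The numbers are invented to show how a single large error dominates the total.

```python
def mse(y_pred, y_true):
    """Mean of the squared differences between predictions and targets."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n

# Errors are 0, -1, and 4. After squaring they contribute 0, 1, and 16,
# so the one outlier accounts for almost all of the loss.
loss = mse([2.0, 4.0, 10.0], [2.0, 5.0, 6.0])
print(loss)
```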
Cross-Entropy Loss
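For classification and language modeling, cross-entropy is the standard:
L = -(1/N) · Σᵢ Σⱼ yᵢⱼ · log(ŷᵢⱼ)
Here i indexes the N examples (or token positions), j indexes the classes, yᵢⱼ is the true label, and ŷᵢⱼ is the predicted probability. For language models, this simplifies because yᵢⱼ is one-hot (only the true next token has value 1):
L = -(1/N) · Σᵢ log(ŷᵢₜ), where t is the true next token at position i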
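With a one-hot target, the loss for one position is just the negative log of the probability assigned to the correct class. The sketch below assumes the model outputs raw scores (logits) that are turned into probabilities with a softmax; the function names are illustrative.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability of the true class for a single example."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# A uniform guess over 4 classes costs log(4); a confident correct guess costs less.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0], 2)
print(uniform_loss, confident_loss)
```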
Binary Cross-Entropy
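For yes/no decisions (spam or not, etc.):
L = -[y · log(ŷ) + (1 - y) · log(1 - ŷ)]
Where y ∈ {0, 1} is the true label and ŷ is the predicted probability of the positive class. This is just cross-entropy for 2-class classification. A pairwise variant of it is commonly used to train the reward model in RLHF.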
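A minimal sketch of binary cross-entropy for a single prediction. The clamping constant `eps` is an assumption added to avoid taking log(0); it is not part of the definition.

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Loss for predicted probability p of the positive class and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

coin_flip = binary_cross_entropy(0.5, 1)     # log(2): the model is unsure
good_guess = binary_cross_entropy(0.9, 1)    # small loss: confident and right
bad_guess = binary_cross_entropy(0.1, 1)     # large loss: confident and wrong
print(coin_flip, good_guess, bad_guess)
```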
Backpropagation: The Engine of Learning
Backpropagation is the algorithm that makes training neural networks possible. It efficiently computes the gradient of the loss with respect to every parameter in the network, using the chain rule from calculus.
Intuition: The Blame Game
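Think of a neural network as a chain of functions, each feeding the next: L = Loss(f₃(f₂(f₁(x)))). When the loss is high, backpropagation works backward through the chain and assigns each function its share of the blame.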
The Forward and Backward Pass
Forward pass:
x → [W₁,b₁] → z₁ → [ReLU] → h₁ → [W₂,b₂] → z₂ → [ReLU] → h₂ → [W₃,b₃] → ŷ → [Loss] → L
Backward pass (gradients):
∂L/∂ŷ → (∂L/∂W₃, ∂L/∂h₂) → ∂L/∂z₂ → (∂L/∂W₂, ∂L/∂h₁) → ∂L/∂z₁ → ∂L/∂W₁
The key insight: each layer only needs local information to compute its gradients:
- The gradient from above: How much does the loss change when this layer's output changes?
- The local derivative: How does this layer's output change when its input changes?
- Multiply them: Chain rule → you have the gradient for this layer!
Step-by-Step Example
Let's trace through a tiny 2-layer network computing y = σ(W₂ · ReLU(W₁ · x)):
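One way to trace it is with scalars standing in for the matrices. The sketch below assumes scalar weights w1 and w2, a squared-error loss ½(y - t)², and arbitrary input values, none of which come from the text. Each backward line is one application of the chain rule, and a finite-difference check confirms the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    z1 = w1 * x
    h = max(0.0, z1)        # ReLU
    z2 = w2 * h
    y = sigmoid(z2)
    return z1, h, z2, y

def loss_fn(y, t):
    return 0.5 * (y - t) ** 2

def backward(w1, w2, x, t):
    z1, h, z2, y = forward(w1, w2, x)
    dL_dy = y - t                               # derivative of the loss
    dL_dz2 = dL_dy * y * (1.0 - y)              # through the sigmoid
    dL_dw2 = dL_dz2 * h                         # z2 = w2 * h
    dL_dh = dL_dz2 * w2
    dL_dz1 = dL_dh * (1.0 if z1 > 0 else 0.0)   # through the ReLU
    dL_dw1 = dL_dz1 * x                         # z1 = w1 * x
    return dL_dw1, dL_dw2

def numeric_grad(f, w, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, t = 0.5, -1.0, 2.0, 1.0
g1, g2 = backward(w1, w2, x, t)
n1 = numeric_grad(lambda w: loss_fn(forward(w, w2, x)[3], t), w1)
n2 = numeric_grad(lambda w: loss_fn(forward(w1, w, x)[3], t), w2)
print(g1, n1)
print(g2, n2)
```

The two numbers on each printed line should agree to several decimal places, which is a common sanity check when writing gradients by hand.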
Computational Cost
For a network with N parameters, a naive finite-difference approach would require about N separate forward passes, one per parameter. Backpropagation computes all N gradients in just 2 passes (one forward, one backward), at a total cost of roughly two to three forward passes. This is what makes training billion-parameter models feasible.
Gradient Descent: Finding the Minimum
Once we have gradients from backpropagation, we use them to update the parameters. Gradient descent is the simplest update rule: move parameters in the direction that reduces loss.
The Basic Update Rule
θ ← θ - η · ∇L(θ)
Where η (learning rate) controls the step size. This is the simplest form, but in practice we use more sophisticated optimizers.
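A toy illustration of the update rule on a one-parameter loss, L(θ) = (θ - 3)², whose gradient is 2(θ - 3). The loss, starting point, and learning rate are chosen only for the example.

```python
def gradient_descent(grad, theta, lr, steps):
    """Repeatedly step against the gradient: theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Each step shrinks the distance to the minimum at 3 by a factor of 0.8.
theta = gradient_descent(lambda th: 2.0 * (th - 3.0), theta=0.0, lr=0.1, steps=100)
print(theta)
```

With a learning rate above 1.0 on this particular loss, each step would overshoot by more than it corrects and the iterates would diverge, which is why the step size matters.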
Variants of Gradient Descent
Gradient Descent Variants
| Method | Update Rule | Pros | Cons |
|---|---|---|---|
| SGD | θ ← θ - η · ∇L(θ) | Simple, generalizes well | Slow, noisy |
| SGD + Momentum | v ← βv + ∇L(θ); θ ← θ - η · v | Accelerates through valleys | Extra hyperparameter β |
| RMSProp | Adapts learning rate per parameter using running average of squared gradients | Handles different scales | Can be unstable |
| Adam | Combines momentum + RMSProp | Fast convergence, robust | May generalize slightly worse |
| AdamW | Adam + decoupled weight decay | Standard for LLMs | More memory (2× optimizer state) |
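As a sketch of the momentum row in the table, the loop below applies v ← βv + ∇L(θ) followed by θ ← θ - η · v to the toy loss L(θ) = (θ - 3)². The loss and the hyperparameter values are assumptions chosen for illustration.

```python
def sgd_momentum(grad, theta, lr, beta, steps):
    """SGD with momentum: the velocity v accumulates past gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(theta)
        theta = theta - lr * v
    return theta

theta = sgd_momentum(lambda th: 2.0 * (th - 3.0),
                     theta=0.0, lr=0.05, beta=0.9, steps=500)
print(theta)
```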
Adam Optimizer — The Details
Adam (Adaptive Moment Estimation) maintains exponential moving averages of both the gradient and the squared gradient:
m_t = β₁ · m_{t-1} + (1-β₁) · g_t (1st moment: running avg of gradient)
v_t = β₂ · v_{t-1} + (1-β₂) · g_t² (2nd moment: running avg of squared gradient)
m̂_t = m_t / (1 - β₁ᵗ) (bias correction)
v̂_t = v_t / (1 - β₂ᵗ) (bias correction)
θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε) (parameter update)
Default hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. The learning rate η is tuned per model; η = 3×10⁻⁴ is a common choice for pre-training.
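The update equations translate almost line for line into code. The sketch below handles a single scalar parameter; the function name `adam_step` and the example gradient are illustrative. On the first step the bias correction makes m̂ equal to g and v̂ equal to g², so the parameter moves by almost exactly η regardless of the gradient's size.

```python
import math

def adam_step(theta, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * g          # 1st moment
    v = beta2 * v + (1 - beta2) * g * g      # 2nd moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=4.0, m=m, v=v, t=1)
print(theta)  # close to 1.0 - 3e-4
```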
Training in Practice: Regularization & Tricks
Getting neural networks to train well requires more than just the right architecture and optimizer. Several techniques prevent overfitting and stabilize training.
1. Weight Decay (L2 Regularization)
Weight decay adds a penalty for large weights to the loss function:
L_total = L + (λ/2) · ‖θ‖²
Where λ controls the strength of the penalty.
This prevents weights from growing too large, which reduces overfitting. In AdamW, weight decay is decoupled from the gradient-based update, which works better than L2 regularization with Adam.
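The difference between the two formulations is easiest to see next to plain SGD, where they coincide. The sketch assumes the penalty is written as (λ/2)·θ², so its gradient is λθ; the function names are illustrative. With Adam the two are no longer equivalent, because the penalty's gradient would be rescaled by the adaptive denominator while the decoupled term is not.

```python
def sgd_step_l2(theta, grad, lr, lam):
    """SGD on loss + (lam/2) * theta^2: the penalty adds lam * theta to the gradient."""
    return theta - lr * (grad + lam * theta)

def sgd_step_decoupled(theta, grad, lr, lam):
    """Decoupled weight decay: shrink the weight directly, separately from the gradient step."""
    return theta - lr * grad - lr * lam * theta

a = sgd_step_l2(theta=2.0, grad=0.5, lr=0.1, lam=0.01)
b = sgd_step_decoupled(theta=2.0, grad=0.5, lr=0.1, lam=0.01)
print(a, b)  # identical up to rounding for plain SGD
```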
2. Dropout
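During training, randomly set some activations to zero with probability p:
h_dropout[i] = 0 with probability p
h_dropout[i] = h[i] with probability 1 - p
This forces the network to not rely too heavily on any single neuron. During inference, all neurons are active but outputs are scaled by (1-p) to compensate. Most frameworks implement the equivalent "inverted" form instead: kept activations are scaled by 1/(1-p) during training, so inference needs no scaling.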
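A sketch of the classic formulation described above: zero activations at random during training, then scale by (1 - p) at inference so the expected magnitude matches. The function names and the seeded random generator are assumptions made for a reproducible example.

```python
import random

def dropout_train(h, p, rng):
    """Zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else x for x in h]

def dropout_inference(h, p):
    """Keep every activation but scale by (1 - p) to match the training-time average."""
    return [x * (1.0 - p) for x in h]

rng = random.Random(0)
h = [1.0] * 10000
dropped = dropout_train(h, p=0.3, rng=rng)
kept_fraction = sum(dropped) / len(dropped)   # close to 1 - p = 0.7
scaled = dropout_inference([2.0], p=0.5)
print(kept_fraction, scaled)
```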
3. Layer Normalization
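LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β
Where μ and σ are the mean and standard deviation computed over the features of each token (not the batch), γ and β are learned scale and shift parameters, and ε is a small constant for numerical stability. This stabilizes training by ensuring activations stay in a reasonable range. Used in every transformer block.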
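A minimal sketch of layer normalization for one token's feature vector. It assumes a learned per-feature scale `gamma` and shift `beta` and a small `eps` for numerical stability; these names follow common convention rather than anything stated in the text.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    return [g * (xi - mu) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

out = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
mean_out = sum(out) / len(out)
var_out = sum((o - mean_out) ** 2 for o in out) / len(out)
print(out, mean_out, var_out)
```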
4. Gradient Clipping
If ‖∇L‖ > threshold:
∇L ← ∇L · (threshold / ‖∇L‖)
Prevents exploding gradients by capping the gradient norm. Widely used when training transformers: without it, a single bad gradient update can destabilize training.
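A sketch of clipping by global norm: the gradient is rescaled only when its norm exceeds the threshold, so its direction is preserved. The function name and the flat-list representation of the gradient are assumptions for the example.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale the gradient so its L2 norm is at most the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)    # norm 5 is scaled down to 1
untouched = clip_by_norm([0.3, 0.4], threshold=1.0)  # norm 0.5 is left alone
print(clipped, untouched)
```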
5. Learning Rate Warmup
Start training with a very small learning rate and gradually increase it:
η_t = η_max · (t / T_warmup) for t < T_warmup
η_t = schedule(t) for t ≥ T_warmup
This prevents destructive gradient updates early in training when the model's parameters are random and gradients can be large. Typical warmup: 2000-10000 steps.
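One possible schedule, sketched under two assumptions the text does not specify: the warmup is linear, and the schedule after warmup is a cosine decay to zero. The function name and the step counts are illustrative.

```python
import math

def learning_rate(step, max_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay to zero (an assumed schedule)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

for step in (0, 1000, 2000, 6000, 10000):
    print(step, learning_rate(step, max_lr=3e-4, warmup_steps=2000, total_steps=10000))
```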
From Neural Networks to Language Models
Traditional feedforward networks (the kind we've been discussing) process fixed-size inputs and produce fixed-size outputs. But language is a sequence — variable length, order matters, and context is crucial.
The Sequence Problem
How do we handle variable-length input? Several approaches have been tried:
Architecture Evolution
| Architecture | How It Handles Sequences | Key Limitation |
|---|---|---|
| Bag of Words | Ignore order, average all word embeddings | Loses all positional information |
| RNN / LSTM | Process one token at a time, maintain hidden state | Sequential (can't parallelize), forgets early tokens |
| CNN | Sliding window over the sequence | Limited receptive field, can't attend globally |
| Transformer | Attention: every token can attend to every token | O(n²) memory/compute for sequence length n |
Why Transformers Won
- Parallelizable: All positions are computed simultaneously (not sequentially like RNNs)
- Global context: Every token can directly attend to every other token
- Scalable: Works well with GPU hardware — essentially massive matrix multiplications
- Long-range dependencies: Attention has no "memory distance" — token 1 attends to token 1000 as easily as token 2
In the next level, we'll dive deep into the transformer architecture — how attention works mathematically, multi-head attention, positional encodings, and how these pieces combine to create the architectures behind GPT, BERT, and all modern LLMs.