The Four Pillars
Understanding LLMs deeply requires four areas of mathematics. Don't worry — we'll build intuition first, then rigor. Each topic connects directly to how models work.
Mathematical Foundations
Linear Algebra
- Vectors & matrices
- Matrix multiplication
- Eigenvalues & eigenvectors
- SVD & low-rank approx
- Attention as matrix ops
Calculus
- Derivatives & gradients
- Chain rule
- Partial derivatives
- Backpropagation
- Optimization
Probability
- Random variables
- Distributions
- Expectation & variance
- Maximum likelihood
- Bayesian inference
Information Theory
- Entropy
- Cross-entropy loss
- KL divergence
- Perplexity
- Compression
Linear Algebra: The Language of Deep Learning
Neural networks are fundamentally about linear transformations (matrix multiplication) followed by non-linear activations. Understanding linear algebra is understanding how information flows through networks.
Matrix Multiplication as Composition
When we multiply matrices, we're composing transformations:
This says: transform vector x using transformation W. In neural networks:
- x is the input (e.g., token embedding)
- W contains the learned weights
- y is the transformed representation
Matrix Multiplication Visualization
Each output yᵢ = Σⱼ wᵢⱼ · xⱼ (dot product of row i with x)
Attention as Matrix Operations
The attention mechanism is elegantly expressed as matrix operations:
• Q, K, V each have shape (seq_len × d_model)
• QK^T has shape (seq_len × seq_len) — the attention scores
• Softmax output · V has shape (seq_len × d_model) — attended values
This single matrix operation computes all pairwise attention in parallel!
Calculus: How Networks Learn
Training neural networks is optimization — finding weights that minimize loss. This requires computing how changing each weight affects the loss: the gradient.
The Derivative
The derivative tells us the rate of change:
In neural networks, we need partial derivatives — how does the loss change with respect to each individual weight?
The Chain Rule
Neural networks are compositions of functions. The chain rule tells us how to differentiate them:
Chain Rule Derivation
Given: L = f(g(h(w)))
w → h(w) → g(h) → f(g) → L
dL/dw = dL/df · df/dg · dg/dh · dh/dw
∂L/∂w = ∂L/∂output · ∂output/∂activation · ∂activation/∂z · ∂z/∂w
We compute gradients backward from output to input, reusing intermediate values.
Probability: Modeling Uncertainty
Language is inherently uncertain. Probability gives us the tools to model this uncertainty and make predictions under uncertainty.
Maximum Likelihood Estimation
Language models are trained with maximum likelihood: find parameters that make the training data most probable.
For language modeling, this becomes predicting the next token:
Conditional Probability
The chain rule of probability lets us decompose joint distributions:
This is exactly what autoregressive language models do! They model P(xᵢ | x<ᵢ) for each position.
Sampling from Models
To generate text, we sample from the predicted distribution:
Different sampling strategies (greedy, temperature, top-k, top-p) modify this distribution to control randomness vs. coherence.
Information Theory: Measuring Information
Information theory, founded by Claude Shannon, gives us precise ways to measure information and uncertainty. It's fundamental to understanding loss functions.
Entropy
Entropy measures uncertainty in a distribution:
Higher entropy = more uncertainty. A uniform distribution has maximum entropy; a deterministic distribution has zero entropy.
Cross-Entropy
Cross-entropy measures how well distribution q approximates distribution p:
In language modeling:
- p is the true distribution (one-hot: the actual next token)
- q is the model's predicted distribution
- Minimizing cross-entropy = making model predictions match reality
Perplexity
Perplexity is the standard metric for language models:
KL Divergence
KL divergence measures how one distribution differs from another:
Used in RLHF to keep the fine-tuned model close to the base model (preventing catastrophic forgetting).
Linear Algebra: Worked Examples
Let's work through the key linear algebra operations that appear in transformers, step by step.
Matrix Multiplication in Attention
Consider computing QK^T for a 3-token sequence with d_k = 2:
Q = | 1.0 0.5 | (token "The")
| 0.3 0.8 | (token "cat")
| 0.1 0.9 | (token "sat")
Keys (3 × 2):
K = | 0.9 0.2 | (token "The")
| 0.4 0.7 | (token "cat")
| 0.2 0.8 | (token "sat")
Step 1: K^T (2 × 3):
K^T = | 0.9 0.4 0.2 |
| 0.2 0.7 0.8 |
Step 2: QK^T (3 × 3):
QK^T[0][0] = 1.0×0.9 + 0.5×0.2 = 1.0 (The → The)
QK^T[0][1] = 1.0×0.4 + 0.5×0.7 = 0.75 (The → cat)
QK^T[0][2] = 1.0×0.2 + 0.5×0.8 = 0.6 (The → sat)
...
Step 3: Scale by 1/√d_k = 1/√2 ≈ 0.707
scaled[0][0] = 1.0 × 0.707 = 0.707
scaled[0][1] = 0.75 × 0.707 = 0.530
scaled[0][2] = 0.6 × 0.707 = 0.424
Step 4: Softmax (row 0)
exp(0.707) = 2.028, exp(0.530) = 1.699, exp(0.424) = 1.528
sum = 5.255
attention[0] = [0.386, 0.324, 0.291]
Interpretation: "The" attends 38.6% to itself, 32.4% to "cat", 29.1% to "sat"
Eigenvalues and Attention
The attention matrix A = softmax(QK^T/√d_k) is a row-stochastic matrix (rows sum to 1). Its eigenvalues reveal important properties:
- Largest eigenvalue λ₁ = 1: Corresponds to the stationary distribution — which positions receive the most attention overall
- Other eigenvalues |λᵢ| < 1: Control how quickly attention patterns converge under repeated application
- Spectral gap = 1 - |λ₂|: A larger spectral gap means the attention is more "focused"
Low-Rank Approximation & LoRA
LoRA exploits the fact that weight updates during fine-tuning are typically low-rank. If ΔW ∈ ℝ^(d×d) has rank r ≪ d, we can factorize:
Parameters: d×r + r×d = 2dr instead of d². For d=4096, r=8: 65,536 instead of 16,777,216 — a 256× reduction!
Calculus: Derivations for Neural Networks
Derivative of the Softmax
Softmax is used everywhere in transformers. Let's derive its Jacobian:
The partial derivative ∂σ_i/∂z_j has two cases:
∂σ_i/∂z_i = σ_i(1 - σ_i)
Case 2 (i ≠ j):
∂σ_i/∂z_j = -σ_i · σ_j
Combined:
∂σ_i/∂z_j = σ_i(δ_ij - σ_j)
where δ_ij is the Kronecker delta (1 if i=j, 0 otherwise)
This is a crucial result: the softmax gradient connects the output probabilities directly to the input logits. When we combine this with cross-entropy loss, we get a beautifully simple gradient.
Cross-Entropy + Softmax Gradient Derivation
This is the most important gradient in LLM training. Let's derive it step by step.
For the correct class (i = c): = -y_c · (1 - ŷ_c) = -(1 - ŷ_c) = ŷ_c - 1
For wrong classes (i ≠ c): = -(-1) · ŷ_i = ŷ_i
The gradient of cross-entropy + softmax is simply (predicted - true)!
The Gradient of ReLU
∂ReLU(x)/∂x = 0 if x ≤ 0
ReLU's gradient is either 0 or 1. This is what makes deep networks trainable — the gradient doesn't shrink as it passes through ReLU (unlike sigmoid, where the max gradient is 0.25, causing vanishing gradients in deep networks).
Gradient of Layer Normalization
LayerNorm is more complex than it appears. Given y = γ · (x - μ)/√(σ² + ε) + β:
∂L/∂β = Σᵢ ∂L/∂yᵢ
∂L/∂xᵢ = (γ/√(σ² + ε)) · [∂L/∂yᵢ - (1/d)Σⱼ ∂L/∂yⱼ - (xᵢ - μ)/(d·σ²) Σⱼ ∂L/∂yⱼ · (xⱼ - μ)]
The gradient w.r.t. x has three terms: the direct gradient, a correction for the mean, and a correction for the variance. Thankfully, autograd handles this automatically.
Probability: Language Modeling as Probability
The Chain Rule of Probability
The most fundamental equation in autoregressive language modeling. For any joint distribution, we can factorize:
= Πᵢ₌₁ⁿ P(xᵢ | x<ᵢ)
This is an identity, not an approximation! Every joint distribution can be decomposed this way. Autoregressive language models parameterize each conditional:
where f_θ is the transformer network that produces logits for every token in the vocabulary.
Maximum Likelihood Estimation — Derivation
We want to find θ that maximizes the probability of the training data. For a dataset D = {document₁, document₂, ...}:
θ* = argmax_θ Π_{d ∈ D} P_θ(d)
Step 2: Convert product to sum (log is monotonic, so same optimum)
θ* = argmax_θ Σ_{d ∈ D} log P_θ(d)
Step 3: Factorize using chain rule
θ* = argmax_θ Σ_{d ∈ D} Σᵢ log P_θ(xᵢ^d | x<ᵢ^d)
Step 4: Flip sign (minimizing negative = maximizing positive)
θ* = argmin_θ -Σ_{d ∈ D} Σᵢ log P_θ(xᵢ^d | x<ᵢ^d)
Step 5: Average over tokens
θ* = argmin_θ -(1/N) Σᵢ log P_θ(xᵢ | x<ᵢ)
This is the cross-entropy loss! MLE and cross-entropy minimization are equivalent.
Conditional Independence and the Markov Property
Computing P(xᵢ | x<ᵢ) requires conditioning on all previous tokens. This is expensive for long sequences. If we make a Markov assumption:
This approximation is used in n-gram models (k = n-1). Transformer models make no Markov assumption — they condition on the full context. This is what gives them their power but also their O(T²) computational cost.
Bayes' Theorem in LLMs
Bayesian thinking appears in several places in LLM training:
- Weight decay as a prior: L2 regularization is equivalent to placing a Gaussian prior on weights
- Dropout as Bayesian approximation: Monte Carlo dropout approximates Bayesian inference
- RLHF reward model: P(human_preference | model_A, model_B) uses Bayesian ranking
- In-context learning: The model implicitly performs Bayesian inference over task hypotheses given the prompt
Information Theory: Proofs and Connections
Why Cross-Entropy?
Why is cross-entropy the "right" loss function? It's not arbitrary — it's the unique loss function with certain desirable properties.
Cross-entropy decomposes as:
Where H(p) is the entropy of the true distribution (a constant during training) and D_KL(p || q) is the KL divergence between the true and predicted distributions. Since H(p) doesn't depend on θ, minimizing H(p, q) is equivalent to minimizing D_KL(p || q).
Proof:
= -Σ_x p(x) log q(x) + Σ_x p(x) log p(x) - Σ_x p(x) log p(x)
= [-Σ_x p(x) log p(x)] + [Σ_x p(x) log p(x) - Σ_x p(x) log q(x)]
= H(p) + D_KL(p || q) ■
Perplexity: Information-Theoretic Interpretation
Perplexity PPL = 2^H(p, q) (for log base 2) has a clean interpretation:
- PPL = 1: Perfect prediction (every token gets probability 1). The model is certain about every next token.
- PPL = |V|: Uniform distribution. The model is maximally uncertain.
- PPL = 10: The model is, on average, as uncertain as if choosing uniformly among 10 tokens.
Perplexity in Practice
| Model | Perplexity (Wiki) | Effective Choices |
|---|---|---|
| Random (V=50K) | 50,000 | Every token equally likely |
| Trigram model | ~200 | ~200 reasonable next words |
| GPT-2 | ~30-40 | ~30-40 words on average |
| GPT-3 | ~15-20 | ~15-20 words on average |
| GPT-4 (estimated) | ~10-15 | ~10-15 words on average |
Mutual Information and Attention
There's a fascinating connection between mutual information and attention:
Attention weights can be viewed as estimates of how much mutual information one token shares with another. Tokens that share high mutual information (i.e., knowing one helps predict the other) should have high attention scores.
Optimization Theory: Why Training Works
Convex vs. Non-Convex Landscape
The loss landscape of a neural network is non-convex — it has many local minima, saddle points, and flat regions. Yet neural networks train successfully. Why?
- Most local minima are good: In high dimensions, local minima tend to have similar loss values. "Bad" local minima (with much higher loss) are rare.
- Saddle points are the real enemy: Most stationary points in high-dimensional spaces are saddle points (not minima). SGD with momentum escapes them effectively.
- Gradient noise helps: The stochasticity of mini-batch gradients acts as implicit regularization, helping escape sharp minima and find flat ones.
- Overparameterization: Modern networks have far more parameters than needed. This creates many connected "valleys" in the loss landscape, almost all with similar loss.
SGD and Generalization
A surprising empirical finding: SGD with small batches generalizes better than SGD with large batches (even when they see the same total data). Why?
Var[∇L_B] = σ²/|B| (variance decreases with batch size)
The noise in small-batch SGD acts as a form of implicit regularization:
- It pushes the optimization toward flat minima (wide valleys in loss landscape)
- Flat minima generalize better because small perturbations in parameters don't cause large loss increases
- Large batches find sharper minima that don't generalize as well
Convergence Guarantees
For convex functions, SGD with learning rate η_t converges if:
Σ_{t=1}^∞ η_t² < ∞ (steps decrease fast enough)
Common choices: η_t = 1/t (converges but slow) or η_t = 1/√t (converges with rate O(1/√t)). For non-convex functions (neural networks), convergence to a global minimum is not guaranteed, but SGD typically converges to a "good enough" local minimum.
Putting It All Together
These mathematical tools combine to create modern LLMs:
Mathematics in Action
Matrix operations process the entire sequence in parallel
Chain rule computes gradients for billions of parameters
Model learns conditional distributions for next-token prediction
Measures how well model predicts true next tokens
Stochastic gradient descent with adaptive learning rates
Complete Training Loop — Mathematical View
1. Sample batch B = {x₁, ..., x_b} from training data
2. Forward pass: Compute P_θ(xᵢ | x<ᵢ) for each position in each sequence
3. Compute loss: L = -(1/|B|) Σ_{b∈B} Σᵢ log P_θ(xᵢᵇ | x<ᵢᵇ)
4. Backward pass: Compute ∂L/∂θ for all parameters using chain rule
5. Update parameters:
m_t = β₁m_{t-1} + (1-β₁)∂L/∂θ
v_t = β₂v_{t-1} + (1-β₂)(∂L/∂θ)²
θ ← θ - η · m̂_t/(√v̂_t + ε)
6. Repeat for millions of steps