Level 05: The Mathematics

The Four Pillars

Understanding LLMs deeply requires four areas of mathematics. Don't worry — we'll build intuition first, then rigor. Each topic connects directly to how models work.

Mathematical Foundations

Linear Algebra

Vectors & matrices
Matrix multiplication
Eigenvalues & eigenvectors
SVD & low-rank approx
Attention as matrix ops

Calculus

Derivatives & gradients
Chain rule
Partial derivatives
Backpropagation
Optimization

Probability

Random variables
Distributions
Expectation & variance
Maximum likelihood
Bayesian inference

Information Theory

Entropy
Cross-entropy loss
KL divergence
Perplexity
Compression

Linear Algebra: The Language of Deep Learning

Neural networks are fundamentally about linear transformations (matrix multiplication) followed by non-linear activations. Understanding linear algebra is understanding how information flows through networks.

Matrix Multiplication as Composition

When we multiply matrices, we're composing transformations:

y = Wx

This says: transform vector x using transformation W. In neural networks:

x is the input (e.g., token embedding)
W contains the learned weights
y is the transformed representation

Matrix Multiplication Visualization

W

w₁₁

w₁₂

w₁₃

w₂₁

w₂₂

w₂₃

w₃₁

w₃₂

w₃₃

w₄₁

w₄₂

w₄₃

(4×3)

×

x

x₁

x₂

x₃

(3×1)

=

y

y₁

y₂

y₃

y₄

(4×1)

Each output yᵢ = Σⱼ wᵢⱼ · xⱼ (dot product of row i with x)

Attention as Matrix Operations

The attention mechanism is elegantly expressed as matrix operations:

Attention(Q, K, V) = softmax(QK^T / \sqrtdₖ) \cdot V

Breaking Down the Dimensions

• Q, K, V each have shape (seq_len × d_model)
• QK^T has shape (seq_len × seq_len) — the attention scores
• Softmax output · V has shape (seq_len × d_model) — attended values

This single matrix operation computes all pairwise attention in parallel!

Calculus: How Networks Learn

Training neural networks is optimization — finding weights that minimize loss. This requires computing how changing each weight affects the loss: the gradient.

The Derivative

The derivative tells us the rate of change:

f'(x) = df/dx = lim(h\to0) [f(x+h) - f(x)] / h

In neural networks, we need partial derivatives — how does the loss change with respect to each individual weight?

\partialL/\partialwᵢⱼ = "How much does changing wᵢⱼ change the loss?"

The Chain Rule

Neural networks are compositions of functions. The chain rule tells us how to differentiate them:

Chain Rule Derivation

Given: L = f(g(h(w)))

1

Identify the composition
w → h(w) → g(h) → f(g) → L

2

Apply chain rule
dL/dw = dL/df · df/dg · dg/dh · dh/dw

3

In neural network terms
∂L/∂w = ∂L/∂output · ∂output/∂activation · ∂activation/∂z · ∂z/∂w

4

This is backpropagation!
We compute gradients backward from output to input, reusing intermediate values.

          Why the chain rule matters: Without it, computing gradients for billion-parameter 
          networks would be impossible. The chain rule lets us break the problem into local computations 
          that compose together.
        

Probability: Modeling Uncertainty

Language is inherently uncertain. Probability gives us the tools to model this uncertainty and make predictions under uncertainty.

Maximum Likelihood Estimation

Language models are trained with maximum likelihood: find parameters that make the training data most probable.

θ* = argmax_θ P(data | θ)

For language modeling, this becomes predicting the next token:

L = Σᵢ log P(xᵢ | x₁, ..., xᵢ₋₁; θ)

Conditional Probability

The chain rule of probability lets us decompose joint distributions:

P(x₁, x₂, ..., xₙ) = P(x₁) \cdot P(x₂|x₁) \cdot P(x₃|x₁,x₂) \cdot ... \cdot P(xₙ|x₁,...,xₙ₋₁)

This is exactly what autoregressive language models do! They model P(xᵢ | x<ᵢ) for each position.

Sampling from Models

To generate text, we sample from the predicted distribution:

xₜ₊₁ ~ P(x | x₁, ..., xₜ)

Different sampling strategies (greedy, temperature, top-k, top-p) modify this distribution to control randomness vs. coherence.

Information Theory: Measuring Information

Information theory, founded by Claude Shannon, gives us precise ways to measure information and uncertainty. It's fundamental to understanding loss functions.

Entropy

Entropy measures uncertainty in a distribution:

H(X) = -Σᵢ P(xᵢ) \cdot log P(xᵢ)

Higher entropy = more uncertainty. A uniform distribution has maximum entropy; a deterministic distribution has zero entropy.

Cross-Entropy

Cross-entropy measures how well distribution q approximates distribution p:

H(p, q) = -Σᵢ p(xᵢ) \cdot log q(xᵢ)

In language modeling:

p is the true distribution (one-hot: the actual next token)
q is the model's predicted distribution
Minimizing cross-entropy = making model predictions match reality

Perplexity

Perplexity is the standard metric for language models:

Perplexity = exp(H(p, q)) = exp(-(1/N) \cdot Σᵢ log P(xᵢ))

          Interpreting Perplexity: Perplexity can be understood as the "effective vocabulary size" 
          — the model is as uncertain as if choosing uniformly among that many tokens. Lower is better. 
          GPT-3 achieves ~20 on web text (vs. ~50,000 vocabulary size).
        

KL Divergence

KL divergence measures how one distribution differs from another:

D_KL(p || q) = Σᵢ p(xᵢ) \cdot log(p(xᵢ)/q(xᵢ))

Used in RLHF to keep the fine-tuned model close to the base model (preventing catastrophic forgetting).

Linear Algebra: Worked Examples

Let's work through the key linear algebra operations that appear in transformers, step by step.

Matrix Multiplication in Attention

Consider computing QK^T for a 3-token sequence with d_k = 2:

          Queries (3 × 2):

          Q = | 1.0  0.5 |   (token "The")

             | 0.3  0.8 |   (token "cat")

             | 0.1  0.9 |   (token "sat")

          Keys (3 × 2):

          K = | 0.9  0.2 |   (token "The")

             | 0.4  0.7 |   (token "cat")

             | 0.2  0.8 |   (token "sat")

          Step 1: K^T (2 × 3):

          K^T = | 0.9  0.4  0.2 |

                | 0.2  0.7  0.8 |

          Step 2: QK^T (3 × 3):

          QK^T[0][0] = 1.0×0.9 + 0.5×0.2 = 1.0   (The → The)

          QK^T[0][1] = 1.0×0.4 + 0.5×0.7 = 0.75  (The → cat)

          QK^T[0][2] = 1.0×0.2 + 0.5×0.8 = 0.6   (The → sat)

          ...

          Step 3: Scale by 1/√d_k = 1/√2 ≈ 0.707

          scaled[0][0] = 1.0 × 0.707 = 0.707

          scaled[0][1] = 0.75 × 0.707 = 0.530

          scaled[0][2] = 0.6 × 0.707 = 0.424

          Step 4: Softmax (row 0)

          exp(0.707) = 2.028, exp(0.530) = 1.699, exp(0.424) = 1.528

          sum = 5.255

          attention[0] = [0.386, 0.324, 0.291]

          Interpretation: "The" attends 38.6% to itself, 32.4% to "cat", 29.1% to "sat"

Eigenvalues and Attention

The attention matrix A = softmax(QK^T/√d_k) is a row-stochastic matrix (rows sum to 1). Its eigenvalues reveal important properties:

Largest eigenvalue λ₁ = 1: Corresponds to the stationary distribution — which positions receive the most attention overall
Other eigenvalues |λᵢ| < 1: Control how quickly attention patterns converge under repeated application
Spectral gap = 1 - |λ₂|: A larger spectral gap means the attention is more "focused"

Low-Rank Approximation & LoRA

LoRA exploits the fact that weight updates during fine-tuning are typically low-rank. If ΔW ∈ ℝ^(d×d) has rank r ≪ d, we can factorize:

ΔW = B \cdot A where B \in ℝ^(d\timesr), A \in ℝ^(r\timesd)

Parameters: d×r + r×d = 2dr instead of d². For d=4096, r=8: 65,536 instead of 16,777,216 — a 256× reduction!

Calculus: Derivations for Neural Networks

Derivative of the Softmax

Softmax is used everywhere in transformers. Let's derive its Jacobian:

σ(z)_i = exp(z_i) / Σ_j exp(z_j)

The partial derivative ∂σ_i/∂z_j has two cases:

          Case 1 (i = j):

          ∂σ_i/∂z_i = σ_i(1 - σ_i)

          Case 2 (i ≠ j):

          ∂σ_i/∂z_j = -σ_i · σ_j

          Combined:

          ∂σ_i/∂z_j = σ_i(δ_ij - σ_j)

          where δ_ij is the Kronecker delta (1 if i=j, 0 otherwise)

This is a crucial result: the softmax gradient connects the output probabilities directly to the input logits. When we combine this with cross-entropy loss, we get a beautifully simple gradient.

Cross-Entropy + Softmax Gradient Derivation

This is the most important gradient in LLM training. Let's derive it step by step.

1

Setup: Given logits z, softmax probabilities ŷ = σ(z), and true label y (one-hot)

2

Loss function: L = -Σⱼ yⱼ · log(ŷⱼ) = -log(ŷ_c) where c is the correct class

3

Chain rule: ∂L/∂z_i = ∂L/∂ŷ · ∂ŷ/∂z_i

4

Compute ∂L/∂ŷⱼ: ∂L/∂ŷⱼ = -yⱼ/ŷⱼ

5

Combine: ∂L/∂z_i = Σⱼ (-yⱼ/ŷⱼ) · ŷⱼ(δ_ij - ŷ_i)
For the correct class (i = c): = -y_c · (1 - ŷ_c) = -(1 - ŷ_c) = ŷ_c - 1
For wrong classes (i ≠ c): = -(-1) · ŷ_i = ŷ_i

6

Final result: ∂L/∂z = ŷ - y
The gradient of cross-entropy + softmax is simply (predicted - true)!

          This is why cross-entropy is paired with softmax: The gradient is (prediction - target), 
          which is bounded, interpretable, and requires no softmax derivative computation during backprop. 
          This is one of the most elegant results in deep learning.
        

The Gradient of ReLU

\partialReLU(x)/\partialx = 1 if x > 0 \partialReLU(x)/\partialx = 0 if x \leq 0

ReLU's gradient is either 0 or 1. This is what makes deep networks trainable — the gradient doesn't shrink as it passes through ReLU (unlike sigmoid, where the max gradient is 0.25, causing vanishing gradients in deep networks).

Gradient of Layer Normalization

LayerNorm is more complex than it appears. Given y = γ · (x - μ)/√(σ² + ε) + β:

\partialL/\partialγ = Σᵢ \partialL/\partialyᵢ \cdot (xᵢ - μ)/\sqrt(σ² + ε) \partialL/\partialβ = Σᵢ \partialL/\partialyᵢ \partialL/\partialxᵢ = (γ/\sqrt(σ² + ε)) \cdot [\partialL/\partialyᵢ - (1/d)Σⱼ \partialL/\partialyⱼ - (xᵢ - μ)/(d\cdotσ²) Σⱼ \partialL/\partialyⱼ \cdot (xⱼ - μ)]

The gradient w.r.t. x has three terms: the direct gradient, a correction for the mean, and a correction for the variance. Thankfully, autograd handles this automatically.

Probability: Language Modeling as Probability

The Chain Rule of Probability

The most fundamental equation in autoregressive language modeling. For any joint distribution, we can factorize:

P(x₁, x₂, ..., xₙ) = P(x₁) \cdot P(x₂|x₁) \cdot P(x₃|x₁,x₂) \cdot ... \cdot P(xₙ|x₁,...,xₙ₋₁) = Πᵢ₌₁ⁿ P(xᵢ | x<ᵢ)

This is an identity, not an approximation! Every joint distribution can be decomposed this way. Autoregressive language models parameterize each conditional:

P_θ(xᵢ | x<ᵢ) = softmax(f_θ(x<ᵢ))_xᵢ

where f_θ is the transformer network that produces logits for every token in the vocabulary.

Maximum Likelihood Estimation — Derivation

We want to find θ that maximizes the probability of the training data. For a dataset D = {document₁, document₂, ...}:

          Step 1: Maximize the likelihood of the data

          θ* = argmax_θ Π_{d ∈ D} P_θ(d)

          Step 2: Convert product to sum (log is monotonic, so same optimum)

          θ* = argmax_θ Σ_{d ∈ D} log P_θ(d)

          Step 3: Factorize using chain rule

          θ* = argmax_θ Σ_{d ∈ D} Σᵢ log P_θ(xᵢ^d | x<ᵢ^d)

          Step 4: Flip sign (minimizing negative = maximizing positive)

          θ* = argmin_θ -Σ_{d ∈ D} Σᵢ log P_θ(xᵢ^d | x<ᵢ^d)

          Step 5: Average over tokens

          θ* = argmin_θ -(1/N) Σᵢ log P_θ(xᵢ | x<ᵢ)

          This is the cross-entropy loss! MLE and cross-entropy minimization are equivalent.

Conditional Independence and the Markov Property

Computing P(xᵢ | x<ᵢ) requires conditioning on all previous tokens. This is expensive for long sequences. If we make a Markov assumption:

P(xᵢ | x<ᵢ) \approx P(xᵢ | xᵢ₋ₖ, ..., xᵢ₋₁) (k-th order Markov)

This approximation is used in n-gram models (k = n-1). Transformer models make no Markov assumption — they condition on the full context. This is what gives them their power but also their O(T²) computational cost.

Bayes' Theorem in LLMs

P(hypothesis | data) = P(data | hypothesis) \cdot P(hypothesis) / P(data)

Bayesian thinking appears in several places in LLM training:

Weight decay as a prior: L2 regularization is equivalent to placing a Gaussian prior on weights
Dropout as Bayesian approximation: Monte Carlo dropout approximates Bayesian inference
RLHF reward model: P(human_preference | model_A, model_B) uses Bayesian ranking
In-context learning: The model implicitly performs Bayesian inference over task hypotheses given the prompt

Information Theory: Proofs and Connections

Why Cross-Entropy?

Why is cross-entropy the "right" loss function? It's not arbitrary — it's the unique loss function with certain desirable properties.

Theorem: Cross-Entropy Minimizes KL Divergence

Cross-entropy decomposes as:

H(p, q) = H(p) + D_KL(p || q)

Where H(p) is the entropy of the true distribution (a constant during training) and D_KL(p || q) is the KL divergence between the true and predicted distributions. Since H(p) doesn't depend on θ, minimizing H(p, q) is equivalent to minimizing D_KL(p || q).

Proof:

            H(p, q) = -Σ_x p(x) log q(x)        (definition of cross-entropy)

                  = -Σ_x p(x) log q(x) + Σ_x p(x) log p(x) - Σ_x p(x) log p(x)

                  = [-Σ_x p(x) log p(x)] + [Σ_x p(x) log p(x) - Σ_x p(x) log q(x)]

                  = H(p) + D_KL(p || q)                          ■

Perplexity: Information-Theoretic Interpretation

Perplexity PPL = 2^H(p, q) (for log base 2) has a clean interpretation:

PPL = 1: Perfect prediction (every token gets probability 1). The model is certain about every next token.
PPL = |V|: Uniform distribution. The model is maximally uncertain.
PPL = 10: The model is, on average, as uncertain as if choosing uniformly among 10 tokens.

Perplexity in Practice

Model	Perplexity (Wiki)	Effective Choices
Random (V=50K)	50,000	Every token equally likely
Trigram model	~200	~200 reasonable next words
GPT-2	~30-40	~30-40 words on average
GPT-3	~15-20	~15-20 words on average
GPT-4 (estimated)	~10-15	~10-15 words on average

Mutual Information and Attention

There's a fascinating connection between mutual information and attention:

I(X; Y) = H(X) - H(X|Y) = Σ_{x,y} P(x,y) log(P(x,y) / (P(x)P(y)))

Attention weights can be viewed as estimates of how much mutual information one token shares with another. Tokens that share high mutual information (i.e., knowing one helps predict the other) should have high attention scores.

Optimization Theory: Why Training Works

Convex vs. Non-Convex Landscape

The loss landscape of a neural network is non-convex — it has many local minima, saddle points, and flat regions. Yet neural networks train successfully. Why?

Most local minima are good: In high dimensions, local minima tend to have similar loss values. "Bad" local minima (with much higher loss) are rare.
Saddle points are the real enemy: Most stationary points in high-dimensional spaces are saddle points (not minima). SGD with momentum escapes them effectively.
Gradient noise helps: The stochasticity of mini-batch gradients acts as implicit regularization, helping escape sharp minima and find flat ones.
Overparameterization: Modern networks have far more parameters than needed. This creates many connected "valleys" in the loss landscape, almost all with similar loss.

SGD and Generalization

A surprising empirical finding: SGD with small batches generalizes better than SGD with large batches (even when they see the same total data). Why?

E[\nablaL_B] = \nablaL (unbiased estimate of true gradient) Var[\nablaL_B] = σ²/|B| (variance decreases with batch size)

The noise in small-batch SGD acts as a form of implicit regularization:

It pushes the optimization toward flat minima (wide valleys in loss landscape)
Flat minima generalize better because small perturbations in parameters don't cause large loss increases
Large batches find sharper minima that don't generalize as well

          The Flat Minima Hypothesis: A minimum θ* is "flat" if the loss doesn't change much 
          when parameters are perturbed: L(θ* + ε) ≈ L(θ*) for small ε. Flat minima generalize better because 
          they're robust to the difference between training and test distributions.
        

Convergence Guarantees

For convex functions, SGD with learning rate η_t converges if:

Σ_{t=1}^∞ η_t = ∞      (steps don't decrease too fast) Σ_{t=1}^∞ η_t² < ∞     (steps decrease fast enough)

Common choices: η_t = 1/t (converges but slow) or η_t = 1/√t (converges with rate O(1/√t)). For non-convex functions (neural networks), convergence to a global minimum is not guaranteed, but SGD typically converges to a "good enough" local minimum.

Putting It All Together

These mathematical tools combine to create modern LLMs:

Mathematics in Action

1

Linear Algebra: Attention(Q,K,V) = softmax(QK^T/√d)V
Matrix operations process the entire sequence in parallel

2

Calculus: ∇L via backpropagation
Chain rule computes gradients for billions of parameters

3

Probability: P(xₜ₊₁ | x≤ₜ)
Model learns conditional distributions for next-token prediction

4

Information Theory: Cross-entropy loss
Measures how well model predicts true next tokens

5

Optimization: Adam minimizes loss
Stochastic gradient descent with adaptive learning rates

Complete Training Loop — Mathematical View

          For each training step t:

          1. Sample batch B = {x₁, ..., x_b} from training data

          2. Forward pass: Compute P_θ(xᵢ | x<ᵢ) for each position in each sequence

          3. Compute loss: L = -(1/|B|) Σ_{b∈B} Σᵢ log P_θ(xᵢᵇ | x<ᵢᵇ)

          4. Backward pass: Compute ∂L/∂θ for all parameters using chain rule

          5. Update parameters:

             m_t = β₁m_{t-1} + (1-β₁)∂L/∂θ

             v_t = β₂v_{t-1} + (1-β₂)(∂L/∂θ)²

             θ ← θ - η · m̂_t/(√v̂_t + ε)

          6. Repeat for millions of steps

          The Beautiful Unity: These aren't separate topics — they're different perspectives 
          on the same system. Linear algebra describes the computation; calculus enables learning; 
          probability models uncertainty; information theory measures success.