🚧 Lesson 10 of 25 in Level 05

The Complete Picture

Mathematical derivation of transformer training.

Complete Transformer Math

Putting it all together:

# 1. Embedding
X = Embed(tokens) + PositionalEncoding(pos)

# 2. Self-Attention
Q, K, V = XW_Q, XW_K, XW_V
A = softmax(QK^T/√d_k)V

# 3. FFN
FFN(x) = W_2 GELU(W_1 x + b_1) + b_2

# 4. Layer Norm + Residual
output = LayerNorm(x + Sublayer(x))

# 5. Output projection
logits = W_out @ final_hidden

# 6. Loss (cross-entropy)
L = -log softmax(logits)[target_token]
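The six steps can be traced end to end in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: a single attention head (so d_k equals the model width), a single block, LayerNorm without learned gain and bias, a random table standing in for PositionalEncoding, a causal mask added because the loss is next-token prediction, and row-vector convention (`x @ W` rather than `W x`). All sizes, parameter values, and names such as `E`, `P`, and `attn_weights` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_ff, n = 50, 16, 64, 6  # toy sizes, chosen arbitrarily

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # LayerNorm without the learned gain and bias (simplification)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Randomly initialised parameters (hypothetical; a real model learns these)
s = d_model ** -0.5
E = rng.normal(0.0, 1.0, (vocab, d_model))   # token embedding table
P = rng.normal(0.0, 1.0, (n, d_model))       # stand-in for PositionalEncoding
W_Q, W_K, W_V = (rng.normal(0.0, s, (d_model, d_model)) for _ in range(3))
W_1, b_1 = rng.normal(0.0, s, (d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(0.0, d_ff ** -0.5, (d_ff, d_model)), np.zeros(d_model)
W_out = rng.normal(0.0, s, (d_model, vocab))

tokens = rng.integers(0, vocab, size=n)
targets = rng.integers(0, vocab, size=n)     # made-up next-token targets

# 1. Embedding
X = E[tokens] + P

# 2. Self-attention (single head, so d_k = d_model)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_model)
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)   # causal mask (an assumption here)
attn_weights = softmax(scores)
A = attn_weights @ V

# 4. Residual + LayerNorm around attention
h = layer_norm(X + A)

# 3 and 4. FFN, then residual + LayerNorm
h = layer_norm(h + gelu(h @ W_1 + b_1) @ W_2 + b_2)

# 5. Output projection
logits = h @ W_out

# 6. Cross-entropy, averaged over positions
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(n), targets].mean()

print("logits shape:", logits.shape)
print("loss:", float(loss))
```

With untrained random weights the loss should land in the neighbourhood of log(vocab), since the model has no information about the targets.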

Training Objective

# Minimize negative log-likelihood
min_θ -sum_t log P(x_t | x_<t; θ)
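To make the negative log-likelihood concrete, here is a small worked example. The probabilities in `token_probs` are invented: they stand for what a hypothetical model assigned to the correct token at each of three positions. Perplexity, the exponential of the mean per-token loss, is included because it is a common way to report the same quantity.

```python
import math

# Probability the (hypothetical) model gave to the correct token at each step
token_probs = [0.5, 0.25, 0.125]

# Negative log-likelihood of the sequence: -sum_t log P(x_t | earlier tokens)
nll = -sum(math.log(p) for p in token_probs)

# Per-token average and the corresponding perplexity
mean_nll = nll / len(token_probs)
perplexity = math.exp(mean_nll)

print(f"sequence NLL:   {nll:.4f}")         # 6 * ln 2 = 4.1589
print(f"mean NLL/token: {mean_nll:.4f}")    # 2 * ln 2 = 1.3863
print(f"perplexity:     {perplexity:.4f}")  # 4.0000
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.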

Gradients Flow

Backpropagation computes ∂L/∂θ for all parameters using the chain rule through every operation.
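The chain rule can be checked by hand on the last two steps of the model. The sketch below assumes a single hidden vector `h`, an output matrix `W_out`, and one target token (all values random and purely illustrative). It computes ∂L/∂W_out analytically as (∂L/∂logits)·(∂logits/∂W_out) and compares the result with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, target = 4, 5, 2          # toy sizes and a made-up target ID
h = rng.normal(size=d_model)              # final hidden state
W_out = rng.normal(size=(vocab, d_model))

def loss_fn(W):
    """Cross-entropy of softmax(W @ h) against the target token."""
    logits = W @ h
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

# Analytic gradient via the chain rule
logits = W_out @ h
p = np.exp(logits - logits.max())
p /= p.sum()
grad_logits = p - np.eye(vocab)[target]   # dL/dlogits = softmax - one_hot
grad_W = np.outer(grad_logits, h)         # dL/dW = dL/dlogits * dlogits/dW

# Numerical gradient via central differences
eps = 1e-6
num_grad = np.zeros_like(W_out)
for i in range(vocab):
    for j in range(d_model):
        W_plus, W_minus = W_out.copy(), W_out.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        num_grad[i, j] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)

max_err = np.abs(grad_W - num_grad).max()
print("max abs difference:", max_err)
```

An autograd framework such as PyTorch applies the same rule automatically to every operation in the network, so nobody has to derive these products by hand.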

Congratulations! You now understand the complete mathematical foundation of LLMs.

📝 Knowledge Check

Question 1: What is the purpose of dividing by √d_k in the attention formula?

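Answer: To keep the dot products at a scale where the softmax still has useful gradients. If the components of q and k are independent with zero mean and unit variance, q·k has variance d_k, so its typical magnitude grows like √d_k. Without scaling, these large scores would push the softmax into saturation, where it is nearly one-hot and its gradients vanish.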
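The effect of the scaling is easy to check empirically. The sketch below assumes queries and keys with independent standard-normal components (an idealisation, not something a trained model guarantees) and measures the variance of their dot products before and after dividing by √d_k.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000
stats = {}

for d_k in (4, 64, 256):
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    raw = (q * k).sum(axis=-1)        # one dot product per sample
    scaled = raw / np.sqrt(d_k)
    stats[d_k] = (raw.var(), scaled.var())
    print(f"d_k = {d_k:>3}: var(raw) = {raw.var():8.2f}, "
          f"var(scaled) = {scaled.var():.2f}")
```

The raw variance should come out close to d_k, while the scaled variance stays near 1 regardless of the dimension.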

Question 2: In the training objective, what does θ represent?

Answer: θ represents all the trainable parameters of the model (weights and biases in embeddings, attention, FFN, and output layers). The optimization process seeks to find θ that minimizes the negative log-likelihood.

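Question 3: Why is LayerNorm applied before each sublayer (pre-normalization) in many modern transformers?

Answer: In the pre-norm arrangement, output = x + Sublayer(LayerNorm(x)), so the sublayer always sees normalized input and the residual path carries x through unchanged. Gradients can then flow directly through the residual connections, which tends to stabilize training and makes deeper models easier to train. The post-norm form, LayerNorm(x + Sublayer(x)), is the original design and is typically more sensitive to learning-rate warm-up.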
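The two arrangements differ only in where LayerNorm sits relative to the residual sum. The sketch below uses a LayerNorm without learned parameters and a deliberately trivial sublayer that outputs zeros, both assumptions made only for illustration. It shows that the pre-norm block leaves the residual path untouched, while the post-norm block rescales it.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without the learned gain and bias (simplification)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Normalize after adding the residual
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Normalize only the sublayer's input; the residual path is an identity
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = 5.0 * rng.normal(size=(3, 8)) + 2.0     # activations with large scale
zero_sublayer = lambda h: np.zeros_like(h)  # hypothetical "do nothing" sublayer

pre_out = pre_norm_block(x, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer)

print("pre-norm passes x through unchanged: ", np.allclose(pre_out, x))   # True
print("post-norm passes x through unchanged:", np.allclose(post_out, x))  # False
```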

Question 4: What activation function is typically used in the transformer FFN?

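Answer: GELU (Gaussian Error Linear Unit) is a common choice, for example in BERT and GPT-2; the original Transformer used ReLU. GELU(x) = x·Φ(x), where Φ is the standard normal CDF. Unlike ReLU, which clamps every negative input to exactly zero and has a kink at the origin, GELU is smooth and produces small negative outputs for moderately negative inputs. This gives non-zero gradients there and has been empirically shown to improve transformer performance.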
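A side-by-side evaluation makes the difference visible. This sketch uses the exact form GELU(x) = x·Φ(x), with Φ written via the error function; the sample inputs are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x = {x:+.1f}   ReLU = {relu(x):.4f}   GELU = {gelu(x):+.4f}")
```

At x = -1, ReLU outputs exactly 0, whereas GELU outputs roughly -0.159. For large positive inputs the two nearly coincide.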

Question 5: How does backpropagation compute gradients through the transformer?

Answer: Using the chain rule. Gradients flow backward from the loss through: output projection → final LayerNorm → all transformer layers (in reverse) → embeddings. Each operation's local Jacobian is multiplied to compute ∂L/∂θ.

🎯 Additional Quiz

Question 6: What is the computational complexity of self-attention with respect to sequence length?

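Answer: O(n²) in the sequence length n. Each token attends to every other token, so the score matrix QK^T has n × n entries, and computing it costs O(n²·d) time for model dimension d. This quadratic growth in time and memory is why standard transformers become expensive on very long sequences, whereas the cost of an RNN grows linearly in n (although it must process tokens one after another).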
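A quick back-of-the-envelope calculation shows how fast the n × n score matrix grows. The helper `attention_scores_bytes` is hypothetical, and the estimate assumes 32-bit floats and a fully materialized matrix for one head and one sequence; memory-efficient attention kernels avoid storing it in full.

```python
def attention_scores_bytes(n, bytes_per_value=4):
    """Memory for one n x n attention score matrix (one head, one sequence)."""
    return n * n * bytes_per_value

for n in (1024, 4096, 16384):
    mib = attention_scores_bytes(n) / 2**20
    print(f"n = {n:>6}: {mib:>6.0f} MiB")
# n =   1024:      4 MiB
# n =   4096:     64 MiB
# n =  16384:   1024 MiB
```

Each fourfold increase in sequence length multiplies the memory by sixteen.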

Question 7: Why do we use positional encoding instead of just feeding token positions directly to the model?

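Answer: Self-attention by itself is order-agnostic, so position information has to be injected somehow. Raw integer indices are a poor way to do it: they are unbounded, their scale grows with sequence length and can swamp the token embeddings, and positions beyond those seen in training generalize badly. Sinusoidal encodings stay in [-1, 1] and represent each position as a pattern of frequencies in which the encoding at position pos + k is a linear function of the encoding at pos, which makes relative positional relationships easy for the model to learn.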
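The sinusoidal scheme fits in a few lines. This sketch follows the usual formulation, PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). The helper `shift` is a hypothetical function added here to verify the relative-position property: moving k steps forward corresponds to rotating each (sin, cos) pair by a fixed angle.

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Sinusoidal positional encodings of shape (n_pos, d_model); d_model even."""
    pos = np.arange(n_pos)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

def shift(pe_row, k, d_model):
    """Compute PE[pos + k] from PE[pos] using only the offset k."""
    w = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    s, c = pe_row[0::2], pe_row[1::2]
    out = np.empty_like(pe_row)
    out[0::2] = s * np.cos(k * w) + c * np.sin(k * w)   # sin(a + b)
    out[1::2] = c * np.cos(k * w) - s * np.sin(k * w)   # cos(a + b)
    return out

d_model = 16
pe = sinusoidal_pe(50, d_model)

print("bounded in [-1, 1]:", bool(np.abs(pe).max() <= 1.0))
print("PE[8] recovered from PE[3] and offset 5:",
      np.allclose(shift(pe[3], 5, d_model), pe[8]))
```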

Question 8: What happens during inference that differs from training in terms of attention?

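Answer: The causal (autoregressive) mask is the same in both cases: each position can attend only to itself and earlier positions. What differs is how the sequence is processed. During training, the whole sequence is fed in at once with teacher forcing, and the mask lets the predictions for all positions be computed in parallel. During inference, tokens are generated one at a time, each new token is appended to the input, and the keys and values of earlier tokens are usually cached rather than recomputed.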
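The equivalence between the two modes can be verified numerically. The sketch below uses random Q, K, V for a single head with toy sizes (all assumptions for illustration) and compares one masked pass over the full sequence with a loop that handles one position at a time and sees only the keys and values available so far.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    """Training-style pass: all positions at once, future positions masked."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

parallel = causal_attention(Q, K, V)

# Inference-style loop: position t sees only keys/values 0..t
# (this prefix is what a key/value cache would hold)
stepwise = np.stack([
    softmax(Q[t] @ K[: t + 1].T / np.sqrt(d_k)) @ V[: t + 1]
    for t in range(n)
])

print("same outputs:", np.allclose(parallel, stepwise))  # True
```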

💻 Coding Exercises

Exercise 1: Implement Scaled Dot-Product Attention

Complete the attention function that computes Q·K^T, scales by √d_k, applies softmax, and multiplies by V.

import numpy as np
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: tensors of shape (batch, seq_len, d_k)
    mask: optional tensor broadcastable to (batch, seq_len, seq_len);
          positions where mask == 0 are not attended to
    Returns: attention output and attention weights
    """
    d_k = Q.size(-1)

    # TODO: Compute Q·K^T
    scores = ...

    # TODO: Scale by sqrt(d_k)
    scores = ...

    # Apply mask if provided (fill masked positions with a large negative value)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # TODO: Apply softmax
    attn_weights = ...

    # TODO: Multiply by V
    output = ...

    return output, attn_weights

# Test your implementation
batch, seq_len, d_k = 2, 4, 8
Q = torch.randn(batch, seq_len, d_k)
K = torch.randn(batch, seq_len, d_k)
V = torch.randn(batch, seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")    # Should be (2, 4, 8)
print(f"Weights shape: {weights.shape}")  # Should be (2, 4, 4)
print(f"Weights sum to 1: {torch.allclose(weights.sum(dim=-1), torch.ones(batch, seq_len))}")

Solution:

scores = torch.matmul(Q, K.transpose(-2, -1))
scores = scores / np.sqrt(d_k)
attn_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attn_weights, V)

Exercise 2: Compute Cross-Entropy Loss for Language Modeling

Implement the training loss calculation for a transformer language model given logits and target tokens.

def compute_lm_loss(logits, targets, ignore_index=-100):
    """
    logits: tensor of shape (batch, seq_len, vocab_size)
    targets: tensor of shape (batch, seq_len)
    ignore_index: token ID to ignore in loss calculation
    Returns: scalar loss value
    """
    batch_size, seq_len, vocab_size = logits.shape

    # TODO: Reshape logits to (-1, vocab_size) and targets to (-1)
    logits_flat = ...
    targets_flat = ...

    # TODO: Compute cross-entropy loss using F.cross_entropy
    loss = ...

    return loss

# Test your implementation
batch, seq_len, vocab_size = 2, 10, 1000
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch, seq_len))

loss = compute_lm_loss(logits, targets)
print(f"Loss value: {loss.item():.4f}")

# Test with padding tokens
targets_with_pad = targets.clone()
targets_with_pad[0, 5:] = -100  # Mask out positions 5-9 in first sequence
loss_with_pad = compute_lm_loss(logits, targets_with_pad, ignore_index=-100)
print(f"Loss with padding: {loss_with_pad.item():.4f}")

Solution:

logits_flat = logits.view(-1, vocab_size)
targets_flat = targets.view(-1)
loss = F.cross_entropy(logits_flat, targets_flat, ignore_index=ignore_index)
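The exercise assumes that targets is already aligned with logits, meaning targets[b, t] is the token the model should predict at position t. When inputs and targets come from the same token sequence, a common way to get this alignment is to shift by one position. The sketch below shows the slicing on a made-up sequence; the token IDs and variable names are hypothetical.

```python
import numpy as np

tokens = np.array([[11, 12, 13, 14, 15]])  # one made-up sequence of token IDs

inputs = tokens[:, :-1]    # the model reads     [11, 12, 13, 14]
targets = tokens[:, 1:]    # and should predict  [12, 13, 14, 15]

# With causal masking, position t sees inputs[:t+1] and is scored on targets[t]
for t in range(inputs.shape[1]):
    context = inputs[0, : t + 1].tolist()
    print(f"context {context} -> target {targets[0, t]}")
```

The logits produced for inputs then have the same (batch, seq_len) layout as targets, which is the shape compute_lm_loss expects.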