Complete Transformer Math
Putting it all together:
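One way to write out the full forward pass, as a hedged sketch built only from the components the quiz below covers (scaled dot-product attention, pre-normalization, a GELU feed-forward network, a final LayerNorm, and an output projection). The symbols $E$, $P$, $H^{(\ell)}$, $A^{(\ell)}$, $Z$, $M$, $W_Q$, $W_K$, $W_V$, $W_O$, $W_1$, $W_2$, and $W_{\mathrm{out}}$ are notation assumed for illustration, and multi-head splitting is omitted for brevity.

```latex
\[
\begin{aligned}
H^{(0)} &= E[x] + P
  && \text{token embeddings plus positional encodings} \\
\mathrm{Attn}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V
  && \text{$M$ is the causal mask ($0$ or $-\infty$)} \\
A^{(\ell)} &= H^{(\ell-1)} + \mathrm{Attn}\!\left(Z W_Q,\; Z W_K,\; Z W_V\right) W_O,
  && Z = \mathrm{LN}\!\left(H^{(\ell-1)}\right) \\
H^{(\ell)} &= A^{(\ell)} + \mathrm{GELU}\!\left(\mathrm{LN}(A^{(\ell)})\, W_1 + b_1\right) W_2 + b_2
  && \ell = 1, \dots, L \\
p_\theta(x_{t+1} = \cdot \mid x_{\le t}) &= \mathrm{softmax}\!\left(\mathrm{LN}(H^{(L)})_t\, W_{\mathrm{out}}\right)
  && \text{next-token distribution}
\end{aligned}
\]
```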
Training Objective
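As a sketch of the objective in the standard autoregressive form: the model is trained to minimize the negative log-likelihood of each token given the tokens before it. The sequence length $T$ and the notation $x_{<t}$ for the preceding tokens are assumptions introduced here.

```latex
\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta)
\]
```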
Gradient Flow
Backpropagation computes ∂L/∂θ for all parameters using the chain rule through every operation.
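The chain rule can be illustrated on a deliberately tiny stand-in for a network: one linear unit, a tanh, and a squared-error loss. This is a hand-written sketch, not how a transformer is differentiated in practice (frameworks do this automatically), and the function names and numeric values are illustrative assumptions. The analytic gradients are compared against central finite differences as a sanity check.

```python
import math


def forward(w, b, x, target):
    """Tiny 'network': linear unit -> tanh -> squared error."""
    y = w * x + b
    h = math.tanh(y)
    loss = (h - target) ** 2
    return y, h, loss


def backward(w, b, x, target):
    """Multiply local derivatives from the loss back to the parameters."""
    _, h, _ = forward(w, b, x, target)
    dL_dh = 2.0 * (h - target)   # derivative of the squared error
    dh_dy = 1.0 - h ** 2         # derivative of tanh
    dL_dy = dL_dh * dh_dy        # chain rule: loss -> pre-activation
    dL_dw = dL_dy * x            # dy/dw = x
    dL_db = dL_dy * 1.0          # dy/db = 1
    return dL_dw, dL_db


def numerical_grad(f, value, eps=1e-6):
    """Central finite-difference estimate of df/dvalue."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


# Illustrative values, chosen arbitrarily.
w, b, x, target = 0.5, -0.1, 2.0, 0.3
dL_dw, dL_db = backward(w, b, x, target)
num_dw = numerical_grad(lambda v: forward(v, b, x, target)[2], w)
num_db = numerical_grad(lambda v: forward(w, v, x, target)[2], b)

print("analytic:", dL_dw, dL_db)
print("numeric: ", num_dw, num_db)
```

The two pairs of numbers should agree to several decimal places; the same check (often called gradient checking) is a common way to validate hand-written backward passes.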
📝 Knowledge Check
Question 1: What is the purpose of dividing by √d_k in the attention formula?
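Answer: To keep the dot products in a range where softmax still has useful gradients. If the components of the query and key vectors are roughly independent with unit variance, their dot product has variance of about d_k, so its typical magnitude grows like √d_k. Without scaling, these large values push softmax into saturation (nearly all weight on one key), which causes vanishing gradients.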
Question 2: In the training objective, what does θ represent?
Answer: θ represents all the trainable parameters of the model (weights and biases in embeddings, attention, FFN, and output layers). The optimization process seeks to find θ that minimizes the negative log-likelihood.
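Question 3: Why is LayerNorm applied before each sublayer (pre-normalization) in modern transformers?
Answer: Pre-normalization applies LayerNorm to the input of each sublayer (attention or FFN), inside the residual branch, so the residual path itself remains an unnormalized identity connection. This keeps the inputs to attention and FFN in a consistent range and lets gradients pass directly through the skip connections, which stabilizes training and makes deeper models easier to train.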
Question 4: What activation function is typically used in the transformer FFN?
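Answer: GELU (Gaussian Error Linear Unit) in many modern transformers such as BERT and GPT; the original Transformer used ReLU. Unlike ReLU, which cuts all negative inputs to exactly zero, GELU is smooth: it attenuates negative inputs gradually and keeps a nonzero gradient for moderately negative values. This has been empirically shown to improve transformer performance.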
Question 5: How does backpropagation compute gradients through the transformer?
Answer: Using the chain rule. Gradients flow backward from the loss through: output projection → final LayerNorm → all transformer layers (in reverse) → embeddings. Each operation's local Jacobian is multiplied to compute ∂L/∂θ.
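To make the answer to Question 1 concrete, the following sketch compares softmax over raw and scaled dot products for random vectors. It uses only the standard library; the dimension `d_k = 512`, the number of keys, and the random seed are arbitrary assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
d_k = 512      # key/query dimension (assumed for the demo)
n_keys = 8     # number of keys attended over (assumed for the demo)

# Query and keys with independent, unit-variance components.
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw = [dot(q, k) for k in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

p_raw = softmax(raw)
p_scaled = softmax(scaled)

print("largest weight without scaling:", max(p_raw))
print("largest weight with scaling:   ", max(p_scaled))
```

Dividing the scores by a constant greater than one always flattens the softmax, so the largest attention weight is smaller in the scaled case. A flatter distribution keeps the factors p_i(1 - p_i) in the softmax Jacobian away from zero, which is the "non-saturated" regime the answer refers to.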
🎯 Additional Quiz
Question 6: What is the computational complexity of self-attention with respect to sequence length?
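Answer: O(n²) where n is the sequence length (more precisely, O(n²·d) time for model dimension d, with an n × n attention matrix per head). Each token attends to every other token, creating a quadratic cost. This is why standard transformers become expensive on very long sequences, whereas the cost of an RNN grows linearly in n, although the RNN must process tokens one after another.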
Question 7: Why do we use positional encoding instead of just feeding token positions directly to the model?
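Answer: Raw position indices are unbounded integers: their magnitude grows with sequence length, can swamp the token embeddings, and positions beyond those seen in training are out of distribution. Sinusoidal encodings keep every component in [-1, 1] and represent position with sines and cosines at multiple frequencies. The encoding at a fixed offset is then a linear function of the encoding at the current position, which makes relative positions easy for the model to learn.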
Question 8: What happens during inference that differs from training in terms of attention?
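Answer: The causal (autoregressive) mask applies in both cases, so each position attends only to itself and earlier positions. What differs is how the sequence is processed. During training, teacher forcing feeds the full ground-truth sequence and all positions are computed in parallel in a single pass. During inference, tokens are generated one at a time, each conditioned on the model's own previous outputs, and implementations typically cache the keys and values of earlier tokens instead of recomputing them.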
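As an illustration of Question 7, here is a minimal sketch of the sinusoidal encoding from the original Transformer, where even dimensions use a sine and odd dimensions a cosine at geometrically spaced frequencies. The function name, the default `base=10000.0` argument, and the assumption that `d_model` is even are choices made for this example.

```python
import math


def sinusoidal_encoding(position, d_model, base=10000.0):
    """Encoding vector for one position; assumes d_model is even."""
    encoding = []
    for i in range(0, d_model, 2):
        # i runs over the even indices 0, 2, 4, ...
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))  # dimension i
        encoding.append(math.cos(angle))  # dimension i + 1
    return encoding


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


d_model = 16
pe = [sinusoidal_encoding(pos, d_model) for pos in range(20)]

# Every component is bounded, regardless of how large the position is.
print(min(min(row) for row in pe), max(max(row) for row in pe))

# The dot product between two encodings depends only on their offset
# (here 5), not on the absolute positions.
print(dot(pe[3], pe[8]), dot(pe[10], pe[15]))
```

The second print shows two equal values because sin(a)sin(b) + cos(a)cos(b) = cos(a - b) for each frequency, which is one way to see that relative offsets are encoded directly.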
💻 Coding Exercises
Exercise 1: Implement Scaled Dot-Product Attention
Complete the attention function that computes Q·K^T, divides by √d_k, applies softmax row by row, and multiplies the result by V.
Solution:
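One possible solution, sketched in plain Python with nested lists so that it runs without dependencies (a practical implementation would use NumPy or PyTorch tensors). The helper names `softmax`, `matmul`, and `transpose` are illustrative choices, and masking and multi-head splitting are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                         # Q · K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]               # each row sums to 1
    output = matmul(weights, V)                              # weighted sum of values
    return output, weights


# Small illustrative inputs: 2 queries, 2 keys/values, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights)
print(output)
```

Each output row is a convex combination of the rows of V. In this example the first query matches the first key more closely, so the first output row lies closer to the first value row than to the second.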
Exercise 2: Compute Cross-Entropy Loss for Language Modeling
Implement the training loss calculation for a transformer language model given logits and target tokens.
Solution:
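One possible solution, again as a dependency-free sketch. It assumes `logits` is a list of per-position score lists (sequence length × vocabulary size) and `targets` is a list of integer token ids of the same length; in language modeling the targets are normally the input tokens shifted by one position. The function names are illustrative, and a framework implementation would vectorize this and handle batching and padding.

```python
import math


def log_softmax(logits):
    """Stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy_loss(logits, targets):
    """Mean negative log-likelihood of the target token at each position."""
    total = 0.0
    for row, target in zip(logits, targets):
        total -= log_softmax(row)[target]
    return total / len(targets)


# Uniform logits over a vocabulary of 4: the loss equals log(4).
uniform_loss = cross_entropy_loss([[0.0] * 4, [1.0] * 4], [2, 0])
print(uniform_loss, math.log(4))

# A confident, correct prediction gives a loss close to zero.
confident_loss = cross_entropy_loss([[10.0, 0.0, 0.0]], [0])
print(confident_loss)
```

The uniform case is a useful sanity check when training: an untrained model with vocabulary size V should start near a loss of log(V).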