Level 03: The Transformer

The Attention Revolution

In 2017, Google researchers published "Attention Is All You Need." This paper introduced the Transformer architecture and changed AI forever. Before transformers, sequence models used recurrence (RNNs, LSTMs) — processing words one at a time. Transformers process the entire sequence at once using attention.

          Key Innovation: Attention allows every token to directly "look at" every other token. 
          No more processing words sequentially — the entire context is available simultaneously.
        

The Problem with Recurrence

RNNs process sequences like this:

          h₁ = f(x₁, h₀)

          h₂ = f(x₂, h₁)

          h₃ = f(x₃, h₂)

          ...

          h₁₀₀₀ = f(x₁₀₀₀, h₉₉₉) ← Must wait for all previous steps!

This is slow (can't parallelize) and forgetful (information from early steps gets diluted). Attention solves both problems.
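The sequential dependency can be made concrete with a minimal sketch. The scalar state, the `tanh` update, and the weights `w_x` and `w_h` are illustrative assumptions, not a specific RNN from the text.

```python
import math

def rnn_forward(xs, h0=0.0, w_x=0.5, w_h=0.9):
    """Toy scalar RNN (assumed form): h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in xs:  # step t cannot start until step t-1 has produced h
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# One informative input followed by zeros: its trace in the state fades step by step.
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
print(states)
```

The loop body reads the previous `h`, so the iterations cannot run in parallel, and the contribution of the first input shrinks at every later step.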

Attention: The Core Idea

Attention answers the question: "When processing this word, which other words should I pay attention to?"

Attention Example

Consider the sentence: "The animal didn't cross the street because it was too tired."

What does "it" refer to? The model needs to attend to "animal":

Attention weights for "it" (highest attention on "animal", 75%):

The: 0.02
animal: 0.75
didn't: 0.05
cross: 0.03
it: 0.15

In traditional models, "it" would only have access to the immediately previous hidden state. With attention, it can directly "look at" "animal" even though they're 5 words apart.

Query, Key, Value

Attention is implemented using three projections of each input: Query, Key, and Value. Think of it like a database lookup:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I provide?"

QKV Intuition

Query: "I'm 'it'. What am I?"
Key: Every token's identity
Value: Information to retrieve

Query matches with Keys → determines which Values to retrieve

The Attention Formula

Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) \cdot V

Breaking this down:

QK^T: Compute similarity between every Query and every Key (dot product)
/ √dₖ: Scale by square root of key dimension (prevents softmax saturation)
softmax: Convert to probabilities (sum to 1)
· V: Weighted sum of Values based on attention scores
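The four pieces of the formula can be sketched in plain Python. This is a minimal, dependency-free illustration on lists of row vectors, not an efficient implementation; the function names and the toy `Q`, `K`, `V` values are assumptions made for the example.

```python
import math

def softmax(row):
    top = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # QK^T / sqrt(d_k): one score per key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # weighted sum of the value vectors
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs (illustrative values only)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Each row of `weights` sums to 1, and each query puts more weight on the key it matches.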

Multi-Head Attention

Different words might relate to each other in different ways. "It" might relate to "animal" grammatically, but also relate to "tired" semantically. Multi-head attention runs multiple attention operations in parallel, each learning different types of relationships.

8 Attention Heads (as in the original Transformer)

Each head can learn a different type of relationship. The labels below are illustrative, since real heads are rarely this cleanly interpretable:

Head 1: Subject-verb
Head 2: Coreference
Head 3: Modifier-head
Head 4: Positional
Head 5: Syntactic
Head 6: Semantic
Head 7: Rare patterns
Head 8: Rare patterns

Heads specialize during training without explicit supervision: some track grammar, others track meaning.

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) \cdot W^O

where headᵢ = Attention(Q \cdot Wᵢ^Q, K \cdot Wᵢ^K, V \cdot Wᵢ^V)
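A rough sketch of the multi-head formula for self-attention (Q = K = V = X) follows. The helper names and the toy projection matrices are assumptions chosen for readability: each hypothetical "projection" simply selects half of the input features, whereas a real model would learn these weights.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """heads: one (W_Q, W_K, W_V) triple per head; self-attention over X."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concat: join the head outputs along the feature axis, token by token
    concat = [sum((head[t] for head in outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)

# Two heads of size 2 over d_model = 4 (toy slice-selecting projections)
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]
W_O = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]  # identity
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
Y = multi_head(X, heads, W_O)
print(Y)
```

Each head sees only its own projection of the input, and the output has the same shape as `X` (3 tokens by 4 features).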

The Transformer Block

A transformer is built by stacking identical blocks. Each block contains:

Transformer Block Structure

Layer Norm → Multi-Head Attention → + Residual Connection → Layer Norm → Feed-Forward Network → + Residual Connection

Key Components

Multi-Head Attention: Allows tokens to communicate with each other
Feed-Forward Network: Processes each token independently (applies non-linearity)
Layer Norm: Normalizes activations for stable training
Residual Connections: Skip connections that help gradients flow

          GPT Architecture: Decoder-only transformer with causal (left-to-right) attention. 
          GPT-3 has 96 layers, each with 96 attention heads, processing sequences up to 2048 tokens.
        

Positional Encoding

Attention itself is position-agnostic — it doesn't know where words are in the sentence. "Dog bites man" and "Man bites dog" would look the same! We need to inject position information.

Sinusoidal Position Encoding

The original transformer uses sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This creates a unique "fingerprint" for each position that the model can learn to interpret. GPT-2 and GPT-3 instead use learned position embeddings (just another embedding layer indexed by position), and many newer models such as LLaMA use rotary position embeddings (RoPE).
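The sinusoidal formulas translate almost directly into code. The sketch below assumes an even `d_model` and returns a plain list of lists; the function name is an illustrative choice.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal encodings (d_model assumed even)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)      # even feature index
            pe[pos][2 * i + 1] = math.cos(angle)  # odd feature index
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Position 0 encodes to alternating 0 and 1 (sin 0 and cos 0), and each later position gets a different row, which is added to the token embedding.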

Scaled Dot-Product Attention: The Math

Let's derive the attention formula step by step. Understanding each component is essential for understanding how transformers work.

Step 1: Computing Similarity Scores

First, we compute how much each Query matches each Key using a dot product:

scores = QK^T (shape: seq_len \times seq_len)

Each element scores[i][j] tells us how much token i should attend to token j. But raw dot products can be very large (especially in high dimensions), causing softmax to saturate.

Step 2: Scaling by √d_k

We divide by the square root of the key dimension:

scaled_scores = scores / \sqrt{d_k}

          Why scale? If the components of Q and K are independent with zero mean and 
          unit variance, the dot product of two d_k-dimensional vectors has variance d_k. Without scaling, 
          for d_k = 64, the variance would be 64 (standard deviation 8), producing large scores that push softmax 
          into regions with near-zero gradients. Dividing by √d_k brings the variance back to 1.
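The variance claim can be checked empirically with a small simulation. This is a sketch under the stated assumption of independent standard-normal components; the trial count and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)   # arbitrary seed so the run is repeatable
d_k = 64
trials = 5000    # arbitrary sample size

def random_vector(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

dots = []
for _ in range(trials):
    q, k = random_vector(d_k), random_vector(d_k)
    dots.append(sum(a * b for a, b in zip(q, k)))

raw_var = statistics.pvariance(dots)
scaled_var = statistics.pvariance([d / d_k ** 0.5 for d in dots])
print(f"variance of q.k: {raw_var:.1f}")
print(f"variance of q.k / sqrt(d_k): {scaled_var:.2f}")
```

The first number should land near d_k and the second near 1, up to sampling noise.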
        

Step 3: Softmax — Converting to Probabilities

attention_weights = softmax(scaled_scores, dim=-1)

Softmax converts raw scores to probabilities that sum to 1 for each query position:

softmax(z)_i = exp(z_i) / Σⱼ exp(z_j)

Each row of attention_weights tells us the probability distribution over all positions for a given query. Position i distributes its "attention budget" across positions 1 through seq_len.

Step 4: Weighted Aggregation

output = attention_weights \cdot V (shape: seq_len \times d_v)

Finally, we multiply the attention weights by the Value matrix. Each output position is a weighted combination of all values, where the weights come from the attention scores.

Numerical Example

Let's trace attention through a tiny example with 3 tokens and d_k = 4:

          Query for token 2: q₂ = [0.5, -0.3, 0.8, 0.1]

          Keys:

            k₁ = [0.7, -0.2, 0.4, 0.3] (for "The")

            k₂ = [0.1, 0.6, -0.5, 0.2] (for "cat")

            k₃ = [0.3, -0.4, 0.9, 0.7] (for "sat")

          Step 1: Dot products

            q₂·k₁ = 0.5×0.7 + (-0.3)×(-0.2) + 0.8×0.4 + 0.1×0.3 = 0.76

            q₂·k₂ = 0.5×0.1 + (-0.3)×0.6 + 0.8×(-0.5) + 0.1×0.2 = -0.51

            q₂·k₃ = 0.5×0.3 + (-0.3)×(-0.4) + 0.8×0.9 + 0.1×0.7 = 1.06

          Step 2: Scale by √4 = 2

            scaled = [0.38, -0.255, 0.53]

          Step 3: Softmax

            exp(0.38) = 1.4623, exp(-0.255) = 0.7749, exp(0.53) = 1.6989

            sum = 3.9361

            attention = [0.3715, 0.1969, 0.4316]

          Interpretation: Token 2 ("cat") attends about 43% to "sat", 37% to "The", and 20% to itself.
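The trace can be recomputed with a few lines of Python, which is a handy way to check hand arithmetic like this. The query and key values are taken from the example; the variable names are illustrative.

```python
import math

q2 = [0.5, -0.3, 0.8, 0.1]
keys = {
    "The": [0.7, -0.2, 0.4, 0.3],
    "cat": [0.1, 0.6, -0.5, 0.2],
    "sat": [0.3, -0.4, 0.9, 0.7],
}

d_k = len(q2)
# Step 1: dot products
scores = {tok: sum(a * b for a, b in zip(q2, k)) for tok, k in keys.items()}
# Step 2: scale by sqrt(d_k)
scaled = {tok: s / math.sqrt(d_k) for tok, s in scores.items()}
# Step 3: softmax
exps = {tok: math.exp(s) for tok, s in scaled.items()}
total = sum(exps.values())
attention = {tok: e / total for tok, e in exps.items()}

for tok in keys:
    print(f"{tok}: score={scores[tok]:.2f} scaled={scaled[tok]:.3f} weight={attention[tok]:.4f}")
```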

Causal Masking: Preventing Cheating

In autoregressive language models (like GPT), we must prevent tokens from "seeing the future." During training, all positions are computed simultaneously for efficiency, but token i should only attend to tokens 1 through i — not tokens after it.

The Causal Mask

Before applying softmax, we set all "future" attention scores to negative infinity:

masked_scores[i][j] = scores[i][j] if j \leq i

masked_scores[i][j] = -\infty if j > i

This ensures that after softmax, positions after i receive 0 attention:

Causal Mask (5 tokens; rows are query positions, columns are key positions)

	T₁	T₂	T₃	T₄	T₅
T₁	✓	✗	✗	✗	✗
T₂	✓	✓	✗	✗	✗
T₃	✓	✓	✓	✗	✗
T₄	✓	✓	✓	✓	✗
T₅	✓	✓	✓	✓	✓

✓ = can attend to    ✗ = masked out (-∞)
              

This lower-triangular mask is called the causal mask or autoregressive mask. Without it, the model could "cheat" by looking at future tokens during training.

          In Code: In PyTorch, this is typically implemented as:
          
            mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
          
          This creates an upper-triangular mask where positions above the diagonal are True (to be masked).
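To see the effect of the mask on the softmax output without any framework, here is a dependency-free sketch. The function names are assumptions, and the uniform scores are chosen only so the result is easy to read.

```python
import math

def causal_mask(T):
    """mask[i][j] is True where j > i, i.e. where token i must not look."""
    return [[j > i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    weights = []
    for row, mask_row in zip(scores, mask):
        masked = [-math.inf if m else s for s, m in zip(row, mask_row)]
        top = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # uniform scores, for illustration
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

With uniform scores, row i spreads its attention evenly over positions up to and including i and assigns exactly zero to every later position.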

Layer Normalization & Residual Connections

Two critical components that make deep transformers trainable: layer normalization stabilizes activations, and residual connections allow gradients to flow.

Layer Normalization (LayerNorm)

LayerNorm normalizes the activations of each token independently:

LayerNorm(x) = γ ⊙ ((x - μ) / \sqrt{σ² + ε}) + β

Where μ and σ² are the mean and variance computed over the d_model dimensions of each token. γ and β are learnable parameters that allow the network to undo the normalization if needed.

          LayerNorm vs. BatchNorm: BatchNorm computes statistics over the batch dimension, 
          which creates dependency between samples and doesn't work well for variable-length sequences. 
          LayerNorm computes statistics over the feature dimension for each token independently — 
          perfect for sequences. RMSNorm (used in LLaMA) is a simplified variant 
          that removes the mean-centering and learnable bias.
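Both normalizations are short enough to sketch for a single token vector. The default `eps` value and the function names are assumptions; `gamma` and `beta` stand for the learnable parameters.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    # No mean-centering and no bias: divide by the root mean square only
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y = layer_norm(x, ones, zeros)
z = rms_norm(x, ones)
print([round(v, 3) for v in y])
print([round(v, 3) for v in z])
```

With `gamma` set to ones and `beta` to zeros, the LayerNorm output has mean 0 and variance close to 1, while the RMSNorm output keeps its nonzero mean but has a mean square close to 1.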
        

Residual (Skip) Connections

Residual connections add the input directly to the output of a sublayer:

output = x + Sublayer(x)

This means the sublayer only needs to learn the residual — the difference between the output and input. This has two crucial benefits:

Gradient flow: Gradients can skip the sublayer entirely, ensuring they reach earlier layers even when the sublayer's gradient is small
Identity initialization: With small initial weights, Sublayer(x) ≈ 0 at the start of training, so output ≈ x. The network starts close to an identity function and gradually learns transformations

Full Transformer Block

Combining everything, a complete transformer block (GPT-style) looks like this:

x₁ = x + MultiHeadAttention(LayerNorm(x))

x₂ = x₁ + FeedForward(LayerNorm(x₁))

This is called Pre-LayerNorm (LayerNorm before the sublayer). The original transformer paper used Post-LayerNorm (after the sublayer), but Pre-LN trains more stably.
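The Pre-LN wiring can be sketched with the sublayers passed in as plain functions. This is a structural illustration only: it handles a single vector, and the stand-in callables are hypothetical placeholders for real attention, feed-forward, and normalization layers.

```python
def pre_ln_block(x, attention, feed_forward, norm1, norm2):
    """Pre-LN block on one vector; the four callables are stand-ins for real sublayers."""
    # x1 = x + Attention(Norm(x))
    a = [xi + yi for xi, yi in zip(x, attention(norm1(x)))]
    # x2 = x1 + FFN(Norm(x1))
    return [ai + yi for ai, yi in zip(a, feed_forward(norm2(a)))]

# Stand-in sublayers (assumptions for illustration)
zero = lambda v: [0.0] * len(v)   # a sublayer that outputs zeros
identity = lambda v: v            # a "norm" that does nothing

out = pre_ln_block([1.0, -2.0, 3.0], zero, zero, identity, identity)
print(out)
```

When both sublayers output zeros, the block returns its input unchanged, which is the identity behaviour that residual connections provide at the start of training.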

The Feed-Forward Network (FFN)

After attention allows tokens to communicate, each token is processed independently by the feed-forward network (also called the MLP):

FFN(x) = GELU(x \cdot W₁ + b₁) \cdot W₂ + b₂

Where W₁ ∈ ℝ^(d_model × d_ff), W₂ ∈ ℝ^(d_ff × d_model). The FFN has two linear transformations with a GELU activation in between.
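A minimal sketch of the FFN for one token follows, using the stated shapes (W₁ is d_model × d_ff, W₂ is d_ff × d_model) with x treated as a row vector. The exact erf-based GELU is used here; the toy weights are arbitrary illustrative values.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn(x, W1, b1, W2, b2):
    """x: d_model values; W1: d_model x d_ff; W2: d_ff x d_model."""
    hidden = [gelu(sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

d_model, d_ff = 2, 8  # d_ff = 4 * d_model
W1 = [[0.1 * (i + j) for j in range(d_ff)] for i in range(d_model)]  # toy weights
W2 = [[0.05 * (j - k) for k in range(d_model)] for j in range(d_ff)]  # toy weights
y = ffn([1.0, -1.0], W1, [0.0] * d_ff, W2, [0.0] * d_model)
print(y)
```

The vector expands from d_model to d_ff values, passes through the non-linearity, and is projected back to d_model values.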

Why d_ff = 4 × d_model?

The inner dimension d_ff is typically 4× the model dimension. This means:

GPT-2 Small (d=768): d_ff = 3072
GPT-3 (d=12288): d_ff = 49152
LLaMA 2 70B (d=8192): d_ff = 28672 (3.5×, with SwiGLU variant)

Most of each block's parameters are in the FFN: its two matrices hold 8d² weights versus 4d² for the attention projections, so roughly 2/3 of the block. Attention allows tokens to share information; the FFN is where knowledge is stored and processed.

SwiGLU: The Modern FFN

Modern transformers (like LLaMA) use SwiGLU instead of the standard FFN:

SwiGLU(x) = (x \cdot W₁) ⊙ SiLU(x \cdot V)

FFN(x) = SwiGLU(x) \cdot W₂

Where SiLU(x) = x · σ(x) is the sigmoid linear unit. This gating mechanism (similar to LSTM gates) allows the network to selectively pass information through, which has been shown to improve performance over ReLU and GELU.
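The gating can be sketched for a single token as follows, with x treated as a row vector and the SiLU applied to the x · V branch as in the formula above. Shapes and names are assumptions: `W1` and `V` are d_model × d_ff, `W2` is d_ff × d_model, and biases are omitted.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))  # same as x * sigmoid(x)

def swiglu_ffn(x, W1, V, W2):
    """x: d_model values; W1 and V: d_model x d_ff; W2: d_ff x d_model."""
    d_ff = len(W1[0])
    linear = [sum(xi * W1[i][j] for i, xi in enumerate(x)) for j in range(d_ff)]
    gate = [silu(sum(xi * V[i][j] for i, xi in enumerate(x))) for j in range(d_ff)]
    hidden = [a * g for a, g in zip(linear, gate)]  # element-wise gating
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden))
            for k in range(len(W2[0]))]

# One-dimensional toy case: the gate decides how much of the linear branch passes
closed = swiglu_ffn([2.0], [[1.0]], [[0.0]], [[1.0]])
opened = swiglu_ffn([2.0], [[1.0]], [[1.0]], [[1.0]])
print(closed, opened)
```

When the gate branch is zero the output is zero, and as the gate input grows the linear branch passes through more strongly.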

Inference: The KV Cache

During text generation, the model produces one token at a time autoregressively. Without optimization, generating token N means recomputing the keys and values for all N-1 previous tokens, which is extremely wasteful.

The Key Insight

In causal attention, the Key and Value matrices for tokens 1 through N-1 don't change when we add token N. They've already been computed! We can cache them.

KV Cache Comparison

Without KV Cache

                Token 1: Compute K₁, V₁, Q₁

                Token 2: Compute K₁, V₁, K₂, V₂, Q₂

                Token 3: Compute K₁, V₁, K₂, V₂, K₃, V₃, Q₃

                Token N: Compute all K, V, Q

                Total: O(N²) recomputations

With KV Cache

                Token 1: Compute K₁, V₁, Q₁

                Token 2: Load K₁, V₁, Compute K₂, V₂, Q₂

                Token 3: Load K₁, V₁, K₂, V₂, Compute K₃, V₃, Q₃

                Token N: Load K₁..K_{N-1}, V₁..V_{N-1}, Compute K_N, V_N, Q_N

                Total: O(N) new computations
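The comparison can be simulated with a toy single-head decoder. Everything here is a simplified assumption: the hypothetical `project` function stands in for learned projections, and `kv_projections` counts how many tokens have their K and V computed.

```python
import math

def project(x, scale):
    """Stand-in for a learned projection: just scales the input vector."""
    return [scale * v for v in x]

def attend(q, keys, values):
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

def decode(tokens, use_cache):
    k_cache, v_cache, outputs, kv_projections = [], [], [], 0
    for n in range(1, len(tokens) + 1):
        if use_cache:
            new = tokens[n - 1:n]      # only the newest token needs K and V
        else:
            k_cache, v_cache = [], []  # discard and recompute everything
            new = tokens[:n]
        for x in new:
            k_cache.append(project(x, 0.5))
            v_cache.append(project(x, 2.0))
            kv_projections += 1
        q = project(tokens[n - 1], 1.0)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs, kv_projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_slow, n_slow = decode(tokens, use_cache=False)
out_fast, n_fast = decode(tokens, use_cache=True)
print(n_slow, n_fast, out_slow == out_fast)
```

Both runs produce the same outputs, but the uncached run computes K and V 1 + 2 + ... + N times while the cached run computes them N times. Attention itself still reads all cached keys and values at every step.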

Memory Cost

The KV cache stores 2 matrices per layer, each of size (seq_len × d_head × n_heads):

KV Cache Size = 2 \times n_layers \times n_heads \times d_head \times seq_len \times 2 bytes (FP16)

For GPT-3 (96 layers, 96 heads, 128 dim, 2048 seq len):

KV Cache = 2 \times 96 \times 96 \times 128 \times 2048 \times 2 \approx 9.66 \times 10^9 bytes, about 9.7 GB (9.0 GiB) per sequence
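The size formula is easy to wrap in a helper for trying other configurations. The function name and the `bytes_per_value` parameter are illustrative assumptions; 2 bytes corresponds to FP16.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_value=2):
    # The leading 2 accounts for storing both K and V
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value

gpt3 = kv_cache_bytes(n_layers=96, n_heads=96, d_head=128, seq_len=2048)
print(f"{gpt3} bytes = {gpt3 / 1e9:.2f} GB = {gpt3 / 2**30:.2f} GiB")
```

Because the result is linear in `seq_len`, doubling the context length doubles the cache.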

          Practical Impact: The KV cache is often the main memory bottleneck during long-context inference. 
          At GPT-3 scale (about 4.7 MB of cache per token), a 128K-token context would need roughly 600 GB for a single request. 
          This is why efficient KV cache management (paging, quantization, compression) is an active 
          area of research and engineering.
        

The Full GPT Architecture

Putting all the pieces together, here's the complete forward pass through a GPT model:

GPT Forward Pass

1. Token Embedding
x = Embedding[token_ids] (table V × d, output T × d; V = vocabulary size, T = sequence length)

+

2. Position Embedding
x = x + PositionEmbed[pos] (T × d → T × d)

↓ × L layers

3. Transformer Block

                a = x + CausalMultiHeadAttention(LayerNorm(x))

                x = a + FeedForward(LayerNorm(a))

↓

4. Final LayerNorm
x = LayerNorm(x)

↓

5. Language Model Head
logits = x · W_out (T × d → T × V)

↓

6. Predict Next Token
P(next_token) = softmax(logits[-1])

Parameter Count

Let's count the parameters for a transformer block with d_model = d, n_heads = h, and inner dimension d_ff = 4d:

Component	Parameters	Example (d=768)
Q, K, V projections (each)	d² + d	768² + 768 = 590,592
Output projection	d² + d	590,592
FFN: W₁	d × 4d + 4d	768 × 3072 + 3072 = 2,362,368
FFN: W₂	4d × d + d	3072 × 768 + 768 = 2,360,064
LayerNorm × 2	2 × 2d	3,072
Per Block Total	~12d²	7,087,872

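For GPT-2 Small (12 layers, d=768): ~85M in transformer blocks + ~38.6M in token embeddings (50,257 × 768) + ~0.8M in position embeddings (1,024 × 768) ≈ 124M total parameters. The output head shares its weights with the token embedding, so it adds no parameters of its own. GPT-3 uses 96 layers with d=12,288 for ~175B parameters.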
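The per-block count can be reproduced by summing the table rows in code. This sketch assumes biases on every projection and two LayerNorms per block, as in the table; the function name is illustrative.

```python
def block_params(d, d_ff=None):
    d_ff = 4 * d if d_ff is None else d_ff
    attention = 4 * (d * d + d)               # Q, K, V and output projections, with biases
    ffn = (d * d_ff + d_ff) + (d_ff * d + d)  # W1 + b1 and W2 + b2
    layer_norms = 2 * 2 * d                   # gain and bias for each of two LayerNorms
    return attention + ffn + layer_norms

per_block = block_params(768)
print(per_block, 12 * per_block)
```

Multiplying the per-block figure by the number of layers gives the transformer-block share of the model; embeddings are counted separately.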
