The Transformer

The architecture that changed everything. Master attention, self-attention, multi-head attention, and understand how GPT, BERT, and modern LLMs work.

35 Lessons Intermediate

The Attention Revolution

In 2017, Google researchers published "Attention Is All You Need." This paper introduced the Transformer architecture and changed AI forever. Before transformers, sequence models used recurrence (RNNs, LSTMs), processing words one at a time. Transformers process the entire sequence at once using attention.

Key Innovation: Attention allows every token to directly "look at" every other token. No more processing words sequentially: the entire context is available simultaneously.

The Problem with Recurrence

RNNs process sequences like this:

h₁ = f(x₁, h₀)
h₂ = f(x₂, h₁)
h₃ = f(x₃, h₂)
...
h₁₀₀₀ = f(x₁₀₀₀, h₉₉₉) ← Must wait for all previous steps!

This is slow (can't parallelize) and forgetful (information from early steps gets diluted). Attention solves both problems.

Attention: The Core Idea

Attention answers the question: "When processing this word, which other words should I pay attention to?"

Attention Example

Consider the sentence: "The animal didn't cross the street because it was too tired."

What does "it" refer to? The model needs to attend to "animal":

Token:    The    animal   didn't   cross   it
Weight:   0.02   0.75     0.05     0.03    0.15

Attention weights for "it": highest attention on "animal" (75%)

In a recurrent model, "it" reaches earlier words only through the previous hidden state. With attention, it can directly "look at" "animal" even though five other words sit between them.

Query, Key, Value

Attention is implemented using three projections of each input: Query, Key, and Value. Think of it like a database lookup:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

QKV Intuition

Query: "I'm 'it'. What am I?"
Key: Every token's identity
Value: Information to retrieve

Query matches with Keys → determines which Values to retrieve

The Attention Formula

Attention(Q, K, V) = softmax(QK^T / √dₖ) · V

Breaking this down:

  1. QK^T: Compute similarity between every Query and every Key (dot product)
  2. / √dₖ: Scale by square root of key dimension (prevents softmax saturation)
  3. softmax: Convert to probabilities (sum to 1)
  4. · V: Weighted sum of Values based on attention scores
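
The four steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the helper names `softmax` and `attention` and the toy matrices are assumptions made for the example, and real models use batched tensor operations instead of nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Steps 1-2: dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1
        w = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
        weights.append(w)
    return out, weights

# Toy input (hypothetical values): two tokens, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention(Q, K, V)
print(weights)
print(output)
```

Each query here matches its own key best, so each output row leans toward that token's value vector.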

Multi-Head Attention

Different words might relate to each other in different ways. "It" might relate to "animal" grammatically, but also relate to "tired" semantically. Multi-head attention runs multiple attention operations in parallel, each learning different types of relationships.

8 Attention Heads (as in the original Transformer; labels are illustrative)

Each head can learn a different relationship type:

Head 1: Subject-verb
Head 2: Coreference
Head 3: Modifier-head
Head 4: Positional
Head 5: Syntactic
Head 6: Semantic
Head 7: Rare patterns
Head 8: Rare patterns

Heads specialize organically during training: some track grammar, others track meaning

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W^O

where headᵢ = Attention(Q · Wᵢ^Q, K · Wᵢ^K, V · Wᵢ^V)
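
As a rough sketch of the formula, the code below runs self-attention with several heads in plain Python. It assumes the per-head projection matrices are stored as column slices of one larger matrix per role (a common layout, but an assumption here), and the identity weights and input `X` are made up for the demo.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns one output row per query."""
    scale = math.sqrt(len(K[0]))
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K]) for q in Q]
    return matmul(weights, V)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Self-attention with n_heads heads over the token matrix X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_head = len(Q[0]) // n_heads
    heads = []
    for h in range(n_heads):
        # Head h sees only its own slice of the projected features
        cols = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention([r[cols] for r in Q],
                               [r[cols] for r in K],
                               [r[cols] for r in V]))
    # Concatenate the heads token by token, then mix them with W^O
    concat = [[v for head in heads for v in head[t]] for t in range(len(X))]
    return matmul(concat, Wo)

# Hypothetical demo: 3 tokens, d_model = 4, identity projections, 2 heads
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
X = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [0.3, 0.3, 0.3, 0.3]]
out = multi_head_attention(X, identity, identity, identity, identity, n_heads=2)
print(out)
```

The output keeps the input shape (tokens × d_model), which is what lets blocks be stacked.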

The Transformer Block

A transformer is built by stacking identical blocks. Each block contains:

Transformer Block Structure

Layer Norm
↓
Multi-Head Attention
↓
+ Residual Connection
↓
Layer Norm
↓
Feed-Forward Network
↓
+ Residual Connection

Key Components

  • Multi-Head Attention: Allows tokens to communicate with each other
  • Feed-Forward Network: Processes each token independently (applies non-linearity)
  • Layer Norm: Normalizes activations for stable training
  • Residual Connections: Skip connections that help gradients flow
GPT Architecture: Decoder-only transformer with causal (left-to-right) attention. GPT-3 has 96 layers, each with 96 attention heads, processing sequences up to 2048 tokens.

Positional Encoding

Attention itself is position-agnostic: it doesn't know where words are in the sentence. "Dog bites man" and "Man bites dog" would look the same! We need to inject position information.

Sinusoidal Position Encoding

The original transformer uses sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This creates a unique "fingerprint" for each position that the model can learn to interpret. Many later models, including GPT-2, use learned position embeddings instead: just another embedding layer for position indices.
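
To make the two formulas concrete, here is a small sketch that builds the sinusoidal table. The function name `positional_encoding` and the sizes (50 positions, d_model = 8) are choices made for illustration only.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # `even` plays the role of 2i in the formulas above
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
print(pe[0])  # position 0: every sine is 0 and every cosine is 1
print(pe[1])
```

Low dimensions oscillate quickly and high dimensions slowly, so every position in the table gets a distinct row.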

Scaled Dot-Product Attention: The Math

Let's derive the attention formula step by step. Understanding each component is essential for understanding how transformers work.

Step 1: Computing Similarity Scores

First, we compute how much each Query matches each Key using a dot product:

scores = QK^T   (shape: seq_len × seq_len)

Each element scores[i][j] tells us how much token i should attend to token j. But raw dot products can be very large (especially in high dimensions), causing softmax to saturate.

Step 2: Scaling by √d_k

We divide by the square root of the key dimension:

scaled_scores = scores / √d_k
Why scale? If Q and K have dimension d_k, their dot product has variance approximately d_k (assuming the components are independent with zero mean and unit variance). Without scaling, for d_k = 64, the variance would be 64 (a standard deviation of 8), leading to large values that push softmax into regions with near-zero gradients. Dividing by √d_k brings the variance back to 1.
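
The variance claim can be checked empirically. The sketch below assumes query and key components are independent standard normal samples (an idealization, not how trained weights behave) and uses a fixed seed and sample count picked for the demo.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same estimate

def dot_product_variance(d_k, n_samples=2000, scale=False):
    """Estimate the variance of q.k for random unit-variance vectors."""
    samples = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        score = sum(a * b for a, b in zip(q, k))
        samples.append(score / math.sqrt(d_k) if scale else score)
    return statistics.pvariance(samples)

raw_var = dot_product_variance(64)
scaled_var = dot_product_variance(64, scale=True)
print(f"raw variance:    {raw_var:.2f}")     # should land near d_k = 64
print(f"scaled variance: {scaled_var:.2f}")  # should land near 1
```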

Step 3: Softmax (Converting to Probabilities)

attention_weights = softmax(scaled_scores, dim=-1)

Softmax converts raw scores to probabilities that sum to 1 for each query position:

softmax(z)_i = exp(z_i) / Σⱼ exp(z_j)

Each row of attention_weights tells us the probability distribution over all positions for a given query. Position i distributes its "attention budget" across positions 1 through seq_len.

Step 4: Weighted Aggregation

output = attention_weights · V   (shape: seq_len × d_v)

Finally, we multiply the attention weights by the Value matrix. Each output position is a weighted combination of all values, where the weights come from the attention scores.

Numerical Example

Let's trace attention through a tiny example with 3 tokens and d_k = 4:

Query for token 2: q₂ = [0.5, -0.3, 0.8, 0.1]
Keys:
  k₁ = [0.7, -0.2, 0.4, 0.3] (for "The")
  k₂ = [0.1, 0.6, -0.5, 0.2] (for "cat")
  k₃ = [0.3, -0.4, 0.9, 0.7] (for "sat")

Step 1: Dot products
  q₂·k₁ = 0.5×0.7 + (-0.3)×(-0.2) + 0.8×0.4 + 0.1×0.3 = 0.76
  q₂·k₂ = 0.5×0.1 + (-0.3)×0.6 + 0.8×(-0.5) + 0.1×0.2 = -0.51
  q₂·k₃ = 0.5×0.3 + (-0.3)×(-0.4) + 0.8×0.9 + 0.1×0.7 = 1.06

Step 2: Scale by √4 = 2
  scaled = [0.38, -0.255, 0.53]

Step 3: Softmax
  exp(0.38) ≈ 1.462, exp(-0.255) ≈ 0.775, exp(0.53) ≈ 1.699
  sum ≈ 3.936
  attention ≈ [0.3715, 0.1969, 0.4316]

Interpretation: Token 2 ("cat") attends about 43% to "sat", 37% to "The", and 20% to itself.
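
The trace can be reproduced in a few lines, which is a handy way to check each intermediate value. The query and key vectors are the ones from the example; the variable names are just for this sketch.

```python
import math

q2 = [0.5, -0.3, 0.8, 0.1]
keys = {
    "The": [0.7, -0.2, 0.4, 0.3],
    "cat": [0.1, 0.6, -0.5, 0.2],
    "sat": [0.3, -0.4, 0.9, 0.7],
}
d_k = len(q2)

# Step 1: dot product of the query with each key
scores = {tok: sum(a * b for a, b in zip(q2, k)) for tok, k in keys.items()}
# Step 2: scale by sqrt(d_k)
scaled = {tok: s / math.sqrt(d_k) for tok, s in scores.items()}
# Step 3: softmax over the scaled scores
exps = {tok: math.exp(s) for tok, s in scaled.items()}
total = sum(exps.values())
attention = {tok: e / total for tok, e in exps.items()}

for tok in keys:
    print(f"{tok}: score={scores[tok]:.2f} scaled={scaled[tok]:.3f} weight={attention[tok]:.4f}")
```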

Causal Masking: Preventing Cheating

In autoregressive language models (like GPT), we must prevent tokens from "seeing the future." During training, all positions are computed simultaneously for efficiency, but token i should only attend to tokens 1 through i, not tokens after it.

The Causal Mask

Before applying softmax, we set all "future" attention scores to negative infinity:

masked_scores[i][j] = scores[i][j]    if j ≤ i
masked_scores[i][j] = -∞              if j > i

This ensures that after softmax, positions after i receive 0 attention:

Causal Mask (5 tokens)

(rows: query position, columns: key position)

      T₁   T₂   T₃   T₄   T₅
T₁    ✓    ✗    ✗    ✗    ✗
T₂    ✓    ✓    ✗    ✗    ✗
T₃    ✓    ✓    ✓    ✗    ✗
T₄    ✓    ✓    ✓    ✓    ✗
T₅    ✓    ✓    ✓    ✓    ✓

✓ = can attend to    ✗ = masked out (-∞)

This lower-triangular mask is called the causal mask or autoregressive mask. Without it, the model could "cheat" by looking at future tokens during training.

In Code: In PyTorch, this is typically implemented as:

mask = torch.triu(torch.ones(T, T), diagonal=1).bool()

This creates an upper-triangular mask where positions above the diagonal are True (to be masked).
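
For readers without PyTorch at hand, the same idea can be sketched with the standard library alone. The helper names `causal_mask` and `masked_softmax` are made up for this example, and the all-zero score matrix is chosen only so the resulting weights are easy to read.

```python
import math

def causal_mask(T):
    """True where key position j lies in the future of query position i (j > i)."""
    return [[j > i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax after setting masked entries to -infinity."""
    out = []
    for row, mask_row in zip(scores, mask):
        masked = [-math.inf if m else s for s, m in zip(row, mask_row)]
        peak = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

mask = causal_mask(3)
scores = [[0.0] * 3 for _ in range(3)]  # equal scores everywhere
weights = masked_softmax(scores, mask)
for row in weights:
    print(row)
```

With equal scores, each row spreads its attention evenly over the positions it is allowed to see, and future positions get exactly zero.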

Layer Normalization & Residual Connections

Two critical components that make deep transformers trainable: layer normalization stabilizes activations, and residual connections allow gradients to flow.

Layer Normalization (LayerNorm)

LayerNorm normalizes the activations of each token independently:

LayerNorm(x) = γ ⊙ ((x - μ) / √(σ² + ε)) + β

Where μ and σ² are the mean and variance computed over the d_model dimensions of each token. γ and β are learnable parameters that allow the network to undo the normalization if needed.

LayerNorm vs. BatchNorm: BatchNorm computes statistics over the batch dimension, which creates dependency between samples and doesn't work well for variable-length sequences. LayerNorm computes statistics over the feature dimension for each token independently, which suits sequences well. RMSNorm (used in LLaMA) is a simplified variant that removes the mean-centering and learnable bias.
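
Here is a minimal sketch of both normalizations applied to a single token vector. The default `eps` and the example vector are assumptions for the demo; the RMSNorm shown follows the usual definition (divide by the root mean square, scale by γ), which the text above only describes in words.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean-centering and no bias, only division by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
rn = rms_norm(x, gamma=[1.0] * 4)
print(ln)
print(rn)
```

Note that the LayerNorm output is centered around zero, while the RMSNorm output keeps the sign and relative size of the inputs.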

Residual (Skip) Connections

Residual connections add the input directly to the output of a sublayer:

output = x + Sublayer(x)

This means the sublayer only needs to learn the residual: the difference between the output and input. This has two crucial benefits:

  • Gradient flow: Gradients can skip the sublayer entirely, ensuring they reach earlier layers even when the sublayer's gradient is small
  • Identity initialization: At the start of training, with small initial weights, Sublayer(x) ≈ 0, so output ≈ x. The network starts close to an identity function and gradually learns transformations

Full Transformer Block

Combining everything, a complete transformer block (GPT-style) looks like this:

x₁ = x + MultiHeadAttention(LayerNorm(x))
x₂ = x₁ + FeedForward(LayerNorm(x₁))

This is called Pre-LayerNorm (LayerNorm before the sublayer). The original transformer paper used Post-LayerNorm (after the sublayer), but Pre-LN trains more stably.
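
The difference between the two orderings is easy to see in code. In this sketch the sublayers are passed in as plain functions; `zero_sublayer` is a hypothetical stand-in for an untrained attention or feed-forward layer whose output is close to zero, and the simplified `layer_norm` omits the learnable γ and β.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable parameters, for illustration only."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def pre_ln_block(x, attn, ffn):
    """Pre-LN: normalize the sublayer input, keep the residual path untouched."""
    x = add(x, attn(layer_norm(x)))
    return add(x, ffn(layer_norm(x)))

def post_ln_block(x, attn, ffn):
    """Post-LN: normalize after adding the residual."""
    x = layer_norm(add(x, attn(x)))
    return layer_norm(add(x, ffn(x)))

def zero_sublayer(v):
    # Hypothetical stand-in for a sublayer that outputs (nearly) zero early in training
    return [0.0] * len(v)

x = [1.0, -2.0, 3.0]
pre_out = pre_ln_block(x, zero_sublayer, zero_sublayer)
post_out = post_ln_block(x, zero_sublayer, zero_sublayer)
print(pre_out)   # the input passes through unchanged
print(post_out)  # the input has been renormalized
```

With Pre-LN the residual path carries the input through untouched, which is one intuition for why deep stacks of such blocks tend to train more stably.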

The Feed-Forward Network (FFN)

After attention allows tokens to communicate, each token is processed independently by the feed-forward network (also called the MLP):

FFN(x) = GELU(x · W₁ + b₁) · W₂ + b₂

Where W₁ ∈ ℝ^(d_model × d_ff), W₂ ∈ ℝ^(d_ff × d_model), and x is a row vector of length d_model. The FFN has two linear transformations with a GELU activation in between.
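
A small sketch of this feed-forward network for a single token follows. It uses the exact GELU definition via the error function; the tiny dimensions and the weight values are invented purely to have something to run.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN; W1 is d_model x d_ff and W2 is d_ff x d_model."""
    hidden = [gelu(sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

# Hypothetical sizes: d_model = 2, d_ff = 4 * d_model = 8
d_model, d_ff = 2, 8
W1 = [[0.1 * (i + j) for j in range(d_ff)] for i in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[0.05 * (j - k) for k in range(d_model)] for j in range(d_ff)]
b2 = [0.0] * d_model

y = ffn([1.0, -0.5], W1, b1, W2, b2)
print(y)
```

The vector expands to d_ff values, passes through the non-linearity, and is projected back to d_model, so the output has the same size as the input.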

Why d_ff = 4 × d_model?

The inner dimension d_ff is typically 4× the model dimension. This means:

  • GPT-2 Small (d=768): d_ff = 3072
  • GPT-3 (d=12288): d_ff = 49152
  • LLaMA 2 70B (d=8192): d_ff = 28672 (3.5×, with SwiGLU variant)

Most of a transformer block's parameters are in the FFN: with d_ff = 4d, roughly 8d² of about 12d² weights, or 2/3. Attention allows tokens to share information; the FFN is widely thought to be where much of the model's knowledge is stored and processed.

SwiGLU: The Modern FFN

Modern transformers (like LLaMA) use SwiGLU instead of the standard FFN:

SwiGLU(x) = (x · W₁) ⊙ SiLU(x · V)   then   output = SwiGLU(x) · W₂

Where SiLU(x) = x · σ(x) is the sigmoid linear unit. This gating mechanism (similar to LSTM gates) allows the network to selectively pass information through, which has been shown to improve performance over ReLU and GELU.
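
The gating can be sketched for a single token as below. The matrix names follow the formula (W1 for the linear branch, V for the gate, W2 for the output projection); the helper `vecmat` and all numeric values are assumptions for illustration, and bias terms are left out.

```python
import math

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def vecmat(x, W):
    """Row vector times matrix (W stored as a list of rows)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, V, W2):
    linear = vecmat(x, W1)                    # x . W1
    gate = [silu(v) for v in vecmat(x, V)]    # SiLU(x . V)
    hidden = [a * g for a, g in zip(linear, gate)]  # elementwise gating
    return vecmat(hidden, W2)                 # project back to d_model

# Hypothetical sizes: d_model = 2, inner dimension = 3
W1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
V = [[0.5, -0.5, 0.0], [0.0, 0.5, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

y = swiglu_ffn([1.0, 2.0], W1, V, W2)
print(y)
```

Where the gate value is near zero, the matching entry of the linear branch is suppressed, which is the selective pass-through described above.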

Inference: The KV Cache

During text generation, the model produces one token at a time autoregressively. Without optimization, generating token N means recomputing the Keys and Values for all N-1 previous tokens at every step. This is extremely wasteful!

The Key Insight

In causal attention, the Key and Value matrices for tokens 1 through N-1 don't change when we add token N. They've already been computed! We can cache them.

KV Cache Comparison

Without KV Cache
  Token 1: Compute K₁, V₁, Q₁
  Token 2: Compute K₁, V₁, K₂, V₂, Q₂
  Token 3: Compute K₁, V₁, K₂, V₂, K₃, V₃, Q₃
  Token N: Compute all K, V, Q
  Total: O(N²) recomputations

With KV Cache
  Token 1: Compute K₁, V₁, Q₁
  Token 2: Load K₁, V₁; Compute K₂, V₂, Q₂
  Token 3: Load K₁..K₂, V₁..V₂; Compute K₃, V₃, Q₃
  Token N: Load K₁..K_{N-1}, V₁..V_{N-1}; Compute K_N, V_N, Q_N
  Total: O(N) new computations
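
A toy count makes the gap tangible. This sketch only counts how many tokens have their Key/Value projections computed in each scheme; it does not run a model, and the function names and string placeholders in the cache are illustrative assumptions.

```python
def projections_without_cache(n_tokens):
    """Count K/V projections when every step recomputes the whole prefix."""
    total = 0
    for step in range(1, n_tokens + 1):
        total += step  # K and V recomputed for tokens 1..step
    return total

def projections_with_cache(n_tokens):
    """Count K/V projections when earlier keys and values are cached."""
    cache = []  # one (key, value) placeholder per token seen so far
    total = 0
    for step in range(1, n_tokens + 1):
        cache.append((f"K{step}", f"V{step}"))  # only the new token is projected
        total += 1
    return total, len(cache)

without = projections_without_cache(100)
with_cache, cached_entries = projections_with_cache(100)
print(without, with_cache, cached_entries)
```

The saving in compute is paid for in memory: the cache holds one entry per token, which leads to the cost discussed next.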

Memory Cost

The KV cache stores 2 matrices per layer, each of size (seq_len × d_head × n_heads):

KV Cache Size = 2 × n_layers × n_heads × d_head × seq_len × 2 bytes (FP16)

For GPT-3 (96 layers, 96 heads, 128 dim, 2048 seq len):

KV Cache = 2 × 96 × 96 × 128 × 2048 × 2 bytes = 9,663,676,416 bytes ≈ 9.7 GB per sequence

Practical Impact: The KV cache is often the primary memory bottleneck during inference. At this size (about 4.7 MB per token), a 128K-token context would need roughly 600 GB for a single request. This is why efficient KV cache management (paging, quantization, compression) is an active area of research and engineering.
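
The size formula is simple enough to wrap in a helper and try with different settings. The function name `kv_cache_bytes` is made up for this sketch, and it assumes standard multi-head attention where every head keeps its own keys and values.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_value=2):
    """KV cache size in bytes: keys plus values, for every layer, head, and position."""
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value

gpt3_2k = kv_cache_bytes(n_layers=96, n_heads=96, d_head=128, seq_len=2048)
gpt3_128k = kv_cache_bytes(n_layers=96, n_heads=96, d_head=128, seq_len=128 * 1024)

print(f"2,048 tokens:   {gpt3_2k / 1e9:.1f} GB")
print(f"131,072 tokens: {gpt3_128k / 1e9:.1f} GB")
```

Because the size grows linearly with sequence length, a 64× longer context costs 64× the memory.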

The Full GPT Architecture

Putting all the pieces together, here's the complete forward pass through a GPT model:

GPT Forward Pass

1. Token Embedding
x = Embedding[token_ids]   (V × d → T × d)
+
2. Position Embedding
x = x + PositionEmbed[pos]   (T × d → T × d)
↓ × L layers
3. Transformer Block
a = x + CausalMultiHeadAttention(LayerNorm(x))
x = a + FeedForward(LayerNorm(a))
↓
4. Final LayerNorm
x = LayerNorm(x)
↓
5. Language Model Head
logits = x · W_out   (T × d → T × V)
↓
6. Predict Next Token
P(next_token) = softmax(logits[-1])

Parameter Count

Let's count the parameters for a transformer block with d_model = d, n_heads = h, and inner dimension d_ff = 4d:

Component                     Parameters     Example (d=768)
Q, K, V projections (each)    d² + d         768² + 768 = 590,592
Output projection             d² + d         590,592
FFN: W₁                       d × 4d + 4d    768 × 3072 + 3072 = 2,362,368
FFN: W₂                       4d × d + d     3072 × 768 + 768 = 2,360,064
LayerNorm × 2                 2 × 2d         3,072
Per Block Total               ~12d²          7,087,872

For GPT-2 Small (12 layers, d=768): ~85M in transformer blocks + ~38.6M in token embeddings (50,257 × 768) + ~0.8M in position embeddings (1,024 × 768) = ~124M total parameters; the output head reuses the token embedding matrix, so it adds none. GPT-3 uses 96 layers with d=12,288 for ~175B parameters.
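
The table can be turned into a short calculation. The per-block formula follows the rows above; the GPT-2 Small vocabulary size (50,257), context length (1,024), final LayerNorm, and the output head sharing weights with the token embedding are assumptions brought in from the public GPT-2 configuration rather than from this lesson.

```python
def block_params(d, d_ff=None):
    """Parameters in one transformer block (weights and biases)."""
    d_ff = 4 * d if d_ff is None else d_ff
    attention = 4 * (d * d + d)                   # Q, K, V, and output projections
    ffn = (d * d_ff + d_ff) + (d_ff * d + d)      # W1 + b1, W2 + b2
    layer_norms = 2 * 2 * d                       # two LayerNorms, gamma and beta each
    return attention + ffn + layer_norms

def gpt2_small_params(vocab=50257, n_ctx=1024, d=768, n_layers=12):
    """Total parameters, assuming the output head is tied to the token embedding."""
    embeddings = vocab * d + n_ctx * d
    final_layer_norm = 2 * d
    return n_layers * block_params(d) + embeddings + final_layer_norm

per_block = block_params(768)
total = gpt2_small_params()
print(f"per block: {per_block:,}")
print(f"total:     {total:,}")
```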

Supplementary Materials

Deep dives into key transformer concepts with interactive examples and code: