The Complete Block
A transformer block combines attention and feed-forward layers with residual connections and normalization:
Transformer Block Architecture
Input
↓
Layer Norm
↓
Multi-Head Attention
↓
+ Residual
↓
Layer Norm
↓
Feed-Forward Network
↓
+ Residual → Output
Two Main Components:
1. Attention sub-layer: Mixes information across the sequence (communication)
2. FFN sub-layer: Processes each position independently (computation)
Attention Sub-Layer
# Attention sub-layer (with pre-norm)
def attention_sublayer(x):
    # Layer normalization
    normed = layer_norm(x)
    # Multi-head self-attention (query, key, and value all come from normed)
    attn_output = multi_head_attention(normed, normed, normed)
    # Residual connection
    output = x + attn_output
    return output
What It Does
- Each token looks at all other tokens
- Gathers relevant information from the context
- Updates representation based on relationships
- Residual connection preserves original information
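To make the bullets above concrete, here is a minimal NumPy sketch of self-attention. It is a simplification, not the full sub-layer: it uses a single head, skips the layer norm, and the random projection matrices `w_q`, `w_k`, `w_v` and the toy sizes are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; x is (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len): every token vs every token
    weights = softmax(scores)                # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # toy sizes (assumption)
x = rng.normal(size=(seq_len, d_model))
# Random projections scaled so the scores stay in a moderate range (assumption)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3))

attn_output, weights = self_attention(x, w_q, w_k, w_v)
output = x + attn_output  # residual connection: the original x stays in the sum

# Change only token 0 and recompute: the other tokens' outputs move too,
# because they attend to token 0.
x_changed = x.copy()
x_changed[0] += 1.0
attn_changed, _ = self_attention(x_changed, w_q, w_k, w_v)
```

Row `i` of `weights` says how much token `i` draws from each token in the sequence, itself included. Comparing `attn_output[1:]` with `attn_changed[1:]` shows the "communication" role: editing one token changes what the others receive.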
Feed-Forward Sub-Layer
# Feed-forward sub-layer (with pre-norm)
def ffn_sublayer(x):
    # Layer normalization
    normed = layer_norm(x)
    # Feed-forward network
    # Expands to 4× dimension, then projects back
    ff_output = linear2(gelu(linear1(normed)))
    # Residual connection
    output = x + ff_output
    return output
Structure
FFN(x) = GELU(x · W_1 + b_1) · W_2 + b_2
Where:
- W_1: (d_model, 4×d_model) # Expansion
- W_2: (4×d_model, d_model) # Projection back
- GELU: Activation function
Why 4× expansion? The wider inner layer gives each position more capacity for
computation before the result is projected back to d_model. This is the opposite
of a bottleneck: the network widens, then narrows. The factor of 4 is a convention
inherited from the original Transformer, not a requirement.
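The shapes above can be checked with a small NumPy sketch. The sizes (`d_model = 512`, `d_ff = 2048`), the random initialization, and the tanh approximation of GELU are assumptions made for illustration; the layer norm and residual from the sub-layer are left out so only the FFN itself is shown.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_model = 512
d_ff = 4 * d_model  # 2048: the 4x expansion
rng = np.random.default_rng(0)

w1 = rng.normal(scale=0.02, size=(d_model, d_ff))  # expansion
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))  # projection back
b2 = np.zeros(d_model)

def ffn(x):
    hidden = gelu(x @ w1 + b1)  # (seq_len, d_ff)
    return hidden @ w2 + b2     # (seq_len, d_model)

x = rng.normal(size=(6, d_model))  # 6 positions
y = ffn(x)

# Weights plus biases: 2 * 512 * 2048 + 2048 + 512 = 2,099,712
num_params = w1.size + b1.size + w2.size + b2.size

# Change only position 0: the outputs at the other positions do not move
x_changed = x.copy()
x_changed[0] += 1.0
y_changed = ffn(x_changed)
```

The output has the same shape as the input, and `y[1:]` matches `y_changed[1:]`: each position is processed on its own, which is the "computation" half of the block.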
Pre-Norm vs Post-Norm
Two Variants
Post-Norm (Original)
x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))
LayerNorm after residual
Harder to train in deep stacks
Pre-Norm (Modern)
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
LayerNorm before sub-layer
More stable, standard now
Pre-norm is now standard in GPT-2 and later GPT models, LLaMA (which uses RMSNorm in place of LayerNorm), and most modern transformers.
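The difference between "LayerNorm before sub-layer" and "LayerNorm after residual" is easy to see in code. The sketch below is a simplification under stated assumptions: `layer_norm` has no learnable gain or bias, and `sublayer` is a hypothetical stand-in (one random linear map) for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position to zero mean and unit variance (no learnable gain/bias)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm(x, sublayer):
    # LayerNorm before the sub-layer; the residual path carries x untouched
    return x + sublayer(layer_norm(x))

def post_norm(x, sublayer):
    # LayerNorm after the residual; the sum itself is normalized
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(4, 16))  # deliberately not zero-mean
w = rng.normal(scale=0.1, size=(16, 16))

def sublayer(h):
    # Hypothetical stand-in for attention or the FFN
    return h @ w

def zero_sublayer(h):
    # A sub-layer that contributes nothing
    return np.zeros_like(h)

pre_out = pre_norm(x, sublayer)
post_out = post_norm(x, sublayer)

pre_identity = pre_norm(x, zero_sublayer)    # equals x exactly
post_identity = post_norm(x, zero_sublayer)  # x is still rescaled
```

With a sub-layer that contributes nothing, pre-norm returns `x` unchanged, while post-norm still rescales it. That untouched identity path is one common explanation for why pre-norm is easier to train in deep stacks.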
Complete Transformer Block
# LayerNorm, MultiHeadAttention, and FeedForward are assumed to be defined elsewhere
class TransformerBlock:
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)

    def forward(self, x):
        # Attention sub-layer (pre-norm)
        attn_out = self.attention(self.ln1(x))
        x = x + attn_out  # Residual
        # FFN sub-layer (pre-norm)
        ffn_out = self.ffn(self.ln2(x))
        x = x + ffn_out  # Residual
        return x
Stacking Blocks
Transformers stack many structurally identical blocks, each with its own weights:
# Full transformer encoder
class TransformerEncoder:
    def __init__(self, num_layers=6, d_model=512, num_heads=8, d_ff=2048):
        self.blocks = [TransformerBlock(d_model, num_heads, d_ff)
                       for _ in range(num_layers)]

    def forward(self, x):
        for block in self.blocks:
            # The blocks are plain classes here, so call forward directly
            x = block.forward(x)
        return x
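Stacking works because every block maps a `(seq_len, d_model)` array to another array of the same shape. The sketch below is a runnable NumPy illustration under several assumptions: `TinyBlock` and `TinyEncoder` are hypothetical names, attention is single-head, layer norm has no learnable parameters, biases are omitted, and the weights are random.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class TinyBlock:
    """Simplified pre-norm block: single-head attention, then an FFN."""

    def __init__(self, d_model, d_ff, rng):
        def init(rows, cols):
            return rng.normal(scale=rows ** -0.5, size=(rows, cols))
        self.w_q, self.w_k, self.w_v, self.w_o = (init(d_model, d_model) for _ in range(4))
        self.w1, self.w2 = init(d_model, d_ff), init(d_ff, d_model)

    def __call__(self, x):
        # Attention sub-layer with residual
        h = layer_norm(x)
        q, k, v = h @ self.w_q, h @ self.w_k, h @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v @ self.w_o
        x = x + attn
        # FFN sub-layer with residual
        h = layer_norm(x)
        return x + gelu(h @ self.w1) @ self.w2

class TinyEncoder:
    def __init__(self, num_layers, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        # Same architecture in every layer, but each block draws its own weights
        self.blocks = [TinyBlock(d_model, d_ff, rng) for _ in range(num_layers)]

    def __call__(self, x):
        for block in self.blocks:
            x = block(x)
        return x

encoder = TinyEncoder(num_layers=6, d_model=32, d_ff=128)
x = np.random.default_rng(1).normal(size=(10, 32))  # 10 tokens
y = encoder(x)
```

`y` has the same shape as `x`, so the number of layers can be changed freely without touching anything else in the model.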
Typical Configurations
| Model | Layers | d_model | Heads |
|---|---|---|---|
| BERT-Base | 12 | 768 | 12 |
| GPT-2 (small) | 12 | 768 | 12 |
| GPT-3 (175B) | 96 | 12288 | 96 |
| LLaMA-2 (7B) | 32 | 4096 | 32 |
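A rough way to relate these configurations to model size is to count the weight matrices in each block. The helper below, `approx_block_params`, is a hypothetical name and a back-of-the-envelope estimate: it assumes `d_ff = 4 * d_model` and ignores biases, layer norms, and embeddings.

```python
def approx_block_params(num_layers, d_model):
    """Approximate weight count of the stacked blocks (no biases, norms, or embeddings)."""
    attention = 4 * d_model * d_model  # W_Q, W_K, W_V, and the output projection
    ffn = 2 * d_model * (4 * d_model)  # W_1 and W_2, assuming d_ff = 4 * d_model
    return num_layers * (attention + ffn)  # 12 * d_model^2 per layer

configs = {
    "BERT-Base": (12, 768),
    "GPT-2": (12, 768),
    "GPT-3": (96, 12288),
    "LLaMA-2": (32, 4096),
}

estimates = {name: approx_block_params(layers, d_model)
             for name, (layers, d_model) in configs.items()}

for name, count in estimates.items():
    print(f"{name}: ~{count / 1e9:.2f}B block parameters")
```

For the GPT-3 row this gives about 174 billion, close to the model's quoted size, since the blocks dominate the parameter count at that scale. For the smaller models, embeddings make up a larger share of the total. LLaMA-2 uses a gated FFN with a different hidden size, so its figure is only a rough guide.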
What Each Layer Learns
Progressive Abstraction
Lower layers → Higher layers:
- Early layers: Syntax, part-of-speech, local patterns
- Middle layers: Phrases, semantic relationships
- Late layers: Global meaning, discourse, reasoning
Research Finding: Pruning studies suggest that some late layers can often be
removed with little loss on simple tasks, while complex reasoning tends to
depend on them.
Exercises
Exercise 1: Block Components
List the components of a transformer block in order. Which ones have residual connections?
Exercise 2: FFN Dimensions
If d_model = 768 and d_ff = 3072, how many parameters are in the FFN layer?
Exercise 3: Why Residuals?
Why are residual connections crucial for training deep transformers?