🚧 Lesson 5 of 10 in Level 03

The Transformer Block

Putting it all together: the complete transformer encoder/decoder block.

The Complete Block

A transformer block combines attention and feed-forward layers with residual connections and normalization:

Transformer Block Architecture

Input
↓
Layer Norm
↓
Multi-Head Attention
↓
+ Residual
↓
Layer Norm
↓
Feed-Forward Network
↓
+ Residual → Output
Two Main Components:
1. Attention sub-layer: Mixes information across sequence (communication)
2. FFN sub-layer: Processes each position independently (computation)
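
The communication/computation split can be made concrete with a small sketch in plain Python. It assumes a toy single-head attention with identity query/key/value projections and a toy position-wise function standing in for the FFN; `toy_attention` and `toy_ffn` are illustrative names, not part of the lesson's code. Changing one token changes the attention output at other positions, while the position-wise function only changes its output at that token.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all value vectors: information flows across positions
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def toy_ffn(tokens):
    """Position-wise map: the same function applied to each token separately."""
    return [[max(0.0, 2.0 * t) for t in tok] for tok in tokens]

seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]  # only the last token differs

attn_a, attn_b = toy_attention(seq_a), toy_attention(seq_b)
ffn_a, ffn_b = toy_ffn(seq_a), toy_ffn(seq_b)

print(attn_a[0] != attn_b[0])  # True: position 0 "sees" the changed token
print(ffn_a[0] == ffn_b[0])    # True: position 0 is unaffected
```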

Attention Sub-Layer

# Attention sub-layer (with pre-norm)
def attention_sublayer(x):
    # Layer normalization
    normed = layer_norm(x)

    # Multi-head self-attention (query, key, and value are all the normed input)
    attn_output = multi_head_attention(normed, normed, normed)

    # Residual connection
    output = x + attn_output
    return output

What It Does

Feed-Forward Sub-Layer

# Feed-forward sub-layer (with pre-norm)
def ffn_sublayer(x):
    # Layer normalization
    normed = layer_norm(x)

    # Feed-forward network
    # Expands to 4× dimension, then projects back
    ff_output = linear2(gelu(linear1(normed)))

    # Residual connection
    output = x + ff_output
    return output
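
The pseudocode above leaves `layer_norm`, `gelu`, `linear1`, and `linear2` undefined. Below is a minimal runnable sketch for a single token vector in plain Python. It assumes small random weights, a layer norm without learnable scale and shift, and a hypothetical `linear` helper; none of these come from the lesson, so treat it as an illustration rather than the lesson's implementation.

```python
import math
import random

def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector x; weight has shape (in_dim, out_dim)."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Toy sizes (assumption): d_ff is 4x d_model, as in the lesson
d_model, d_ff = 4, 16
rng = random.Random(0)
w1 = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

def ffn_sublayer(x):
    normed = layer_norm(x)                                 # pre-norm
    hidden = [gelu(h) for h in linear(normed, w1, b1)]     # d_model -> d_ff
    ff_output = linear(hidden, w2, b2)                     # d_ff -> d_model
    return [xi + fi for xi, fi in zip(x, ff_output)]       # residual connection

x = [1.0, -2.0, 0.5, 3.0]
y = ffn_sublayer(x)
print(len(y))  # same dimension as the input, so sub-layers can be chained
```

Because the output has the same dimension as the input, the residual addition is well defined and the sub-layer can be followed by another block.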

Structure

FFN(x) = GELU(x · W_1 + b_1) · W_2 + b_2

Where:
- W_1: (d_model, 4×d_model)  # Expansion
- W_2: (4×d_model, d_model)  # Projection back
- b_1: (4×d_model), b_2: (d_model)  # Biases
- GELU: Activation function
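Why 4× expansion? The hidden layer is wider than the model dimension, so this is the opposite of a bottleneck: the FFN projects each position up to 4×d_model, applies the nonlinearity there, and projects back down to d_model. The wider inner layer gives the block more capacity for per-position computation. The factor of 4 is a convention going back to the original Transformer (d_model = 512, d_ff = 2048), not a hard requirement.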
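
The weight shapes listed above translate directly into a parameter count. The helper below is a small sketch; `ffn_param_count` is an illustrative name, and it assumes both linear layers carry a bias term.

```python
def ffn_param_count(d_model, d_ff=None):
    """Count FFN parameters: two weight matrices plus two bias vectors."""
    if d_ff is None:
        d_ff = 4 * d_model          # default 4x expansion
    w1 = d_model * d_ff             # expansion weights
    b1 = d_ff                       # expansion bias
    w2 = d_ff * d_model             # projection weights
    b2 = d_model                    # projection bias
    return w1 + b1 + w2 + b2

print(ffn_param_count(512, 2048))  # 2099712
```

The two weight matrices dominate: the count grows as roughly 8 × d_model² when d_ff = 4 × d_model.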

Pre-Norm vs Post-Norm

Two Variants

Post-Norm (Original)
x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))

LayerNorm after residual

Harder to train deep

Pre-Norm (Modern)
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

LayerNorm before sub-layer

More stable, standard now

Pre-norm is now standard in GPT-2 and later GPT models, LLaMA, and most modern transformers.
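
One way to see why pre-norm tends to be easier to train is to look at what happens to the residual path. In pre-norm, LayerNorm is applied only to the sub-layer's input, so the skip connection carries x through unchanged. In post-norm, LayerNorm is applied after the residual addition, so even the skip path is rescaled at every block. The sketch below uses a sub-layer that outputs zeros to isolate this effect; `pre_norm_step`, `post_norm_step`, and `zero_sublayer` are illustrative names, and the layer norm has no learnable parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_step(x, sublayer):
    # LayerNorm before the sub-layer; the residual path is untouched
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

def post_norm_step(x, sublayer):
    # LayerNorm after the residual addition; the residual path is normalized too
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

def zero_sublayer(v):
    # Stand-in for a sub-layer that contributes nothing
    return [0.0] * len(v)

x = [1.0, 2.0, 3.0, 10.0]
pre = pre_norm_step(x, zero_sublayer)
post = post_norm_step(x, zero_sublayer)

print(pre == x)   # True: the input passes through unchanged
print(post == x)  # False: the input has been re-normalized
```

With many stacked blocks, the pre-norm skip path gives gradients a direct route from the output back to the input, which is the usual explanation for its better stability in deep models.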

Complete Transformer Block

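import torch.nn as nn

# Sketch using PyTorch built-ins (nn.LayerNorm, nn.MultiheadAttention, nn.Linear)
# as stand-ins for the LayerNorm, MultiHeadAttention, and FeedForward helpers.
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # batch_first=True: inputs are (batch, seq_len, d_model)
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Attention sub-layer (pre-norm): query, key, and value are the normed input
        normed = self.ln1(x)
        attn_out, _ = self.attention(normed, normed, normed, need_weights=False)
        x = x + attn_out  # Residual

        # FFN sub-layer (pre-norm)
        ffn_out = self.ffn(self.ln2(x))
        x = x + ffn_out  # Residual
        return x

Stacking Blocks

Transformers stack many identical blocks:

# Full transformer encoder
class TransformerEncoder(nn.Module):
    def __init__(self, num_layers=6, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        # nn.ModuleList registers each block's parameters with the encoder
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )

    def forward(self, x):
        # Each block maps (batch, seq_len, d_model) to the same shape
        for block in self.blocks:
            x = block(x)
        return x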

Typical Configurations

Model            Layers   d_model   Heads
BERT-Base        12       768       12
GPT-2 (small)    12       768       12
GPT-3 (175B)     96       12288     96
LLaMA-2 (7B)     32       4096      32
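
These configurations determine most of a model's size. As a rough sketch, the function below estimates the parameters in one block and in a full stack for the 12-layer, d_model = 768 configuration from the table. It assumes a standard block with biases, d_ff = 4 × d_model, and two LayerNorms with scale and shift; `block_param_count` is an illustrative name. Embeddings are excluded, and the estimate does not carry over directly to architectures that use a different FFN design, such as LLaMA-style gated feed-forward layers.

```python
def block_param_count(d_model, d_ff=None):
    """Estimate parameters in one transformer block (assumes biases, d_ff = 4 * d_model)."""
    if d_ff is None:
        d_ff = 4 * d_model
    # Q, K, V, and output projections: four (d_model x d_model) matrices plus biases
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: expansion and projection back
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    return attention + ffn + layer_norms

per_block = block_param_count(768)
total = 12 * per_block

print(per_block)  # 7087872
print(total)      # 85054464
```

Note that the number of heads does not appear in the count: the heads split d_model between them, so changing the head count leaves the total unchanged.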

What Each Layer Learns

Progressive Abstraction

Lower layers → Higher layers:

  • Early layers: Syntax, part-of-speech, local patterns
  • Middle layers: Phrases, semantic relationships
  • Late layers: Global meaning, discourse, reasoning
Research Finding: You can often remove or prune late layers with minimal impact on simple tasks, but you need them for complex reasoning.

Exercises

Exercise 1: Block Components

List the components of a transformer block in order. Which ones have residual connections?

Exercise 2: FFN Dimensions

If d_model = 768 and d_ff = 3072, how many parameters are in the FFN layer?

Exercise 3: Why Residuals?

Why are residual connections crucial for training deep transformers?