Lesson 5: The Transformer Block

The Complete Block

A transformer block combines attention and feed-forward layers with residual connections and normalization:

Transformer Block Architecture

Input → Layer Norm → Multi-Head Attention → + Residual → Layer Norm → Feed-Forward Network → + Residual → Output

Two Main Components:

1. Attention sub-layer: Mixes information across the sequence (communication)

2. FFN sub-layer: Processes each position independently (computation)

Attention Sub-Layer

# Attention sub-layer (with pre-norm)
def attention_sublayer(x):
    # Layer normalization
    normed = layer_norm(x)
    
    # Multi-head self-attention
    attn_output = multi_head_attention(normed, normed, normed)
    
    # Residual connection
    output = x + attn_output
    
    return output
      

What It Does

Each token attends to the other tokens in the sequence (only earlier tokens in causally masked decoders such as GPT)
Gathers relevant information from the context
Updates its representation based on those relationships
The residual connection preserves the original information
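
The points above can be illustrated with a minimal sketch. It assumes a single head, no learned query/key/value projections, no layer norm, and no causal mask, so it is a simplification of the sub-layer and not the full thing. The function name self_attention_with_residual and the toy inputs are made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_with_residual(x):
    """x is a list of token vectors, each a list of d floats."""
    d = len(x[0])
    out = []
    for q in x:
        # Score this token against every token in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted sum over all token vectors (values = inputs in this sketch)
        mixed = [sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)]
        # Residual connection: add the mixed context to the original vector
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # only the last token differs

out_a = self_attention_with_residual(tokens)
out_b = self_attention_with_residual(changed)

# The first token's output changes even though the first token did not
print(out_a[0])
print(out_b[0])
```

Changing only the last token changes the first token's output, which is the "communication" role of attention: every position's result depends on the rest of the sequence.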

Feed-Forward Sub-Layer

# Feed-forward sub-layer (with pre-norm)
def ffn_sublayer(x):
    # Layer normalization
    normed = layer_norm(x)
    
    # Feed-forward network
    # Expands to 4× dimension, then projects back
    ff_output = linear2(gelu(linear1(normed)))
    
    # Residual connection
    output = x + ff_output
    
    return output
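
To make "processes each position independently" concrete, here is a small sketch of the feed-forward network applied token by token. The sizes (d_model = 2, inner width 8) and the weight values are arbitrary assumptions chosen to keep the example readable; layer norm and the residual are left out so only the FFN itself is shown.

```python
import math

def gelu(v):
    # Exact GELU: 0.5 * v * (1 + erf(v / sqrt(2)))
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, W, b):
    # x: (d_in,), W: (d_in, d_out), b: (d_out,)
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    # Expand, apply the non-linearity, project back
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

# Illustrative weights: d_model = 2, inner width = 8 (4x expansion)
W1 = [[0.1 * (i + j) for j in range(8)] for i in range(2)]
b1 = [0.0] * 8
W2 = [[0.05 * (i - j) for j in range(2)] for i in range(8)]
b2 = [0.0] * 2

# Positions 0 and 2 hold the same vector
sequence = [[1.0, 2.0], [0.5, -1.0], [1.0, 2.0]]
outputs = [ffn(tok, W1, b1, W2, b2) for tok in sequence]

print(outputs[0] == outputs[2])
```

Because the same weights are applied to each token separately, identical input vectors give identical outputs regardless of their position or their neighbours.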
      

Structure

FFN(x) = GELU(x · W_1 + b_1) · W_2 + b_2

Where:
- W_1: (d_model, 4×d_model)  # Expansion
- b_1: (4×d_model)
- W_2: (4×d_model, d_model)  # Projection back
- b_2: (d_model)
- GELU: Activation function
      

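Why 4× expansion? The wider inner layer gives each position more capacity for computation before the result is projected back to d_model. This is the opposite of a bottleneck: the representation is expanded, not compressed. The factor of 4 is a convention from the original Transformer (d_model = 512, d_ff = 2048), not a requirement.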
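
As a worked example of the shapes above, the parameter count of one FFN can be computed directly. This sketch assumes both linear layers have bias terms, as in the formula; the helper name ffn_param_count is made up for illustration.

```python
def ffn_param_count(d_model, d_ff):
    w1 = d_model * d_ff   # expansion weights
    b1 = d_ff             # expansion bias
    w2 = d_ff * d_model   # projection weights
    b2 = d_model          # projection bias
    return w1 + b1 + w2 + b2

# Original Transformer sizes: d_model = 512, d_ff = 4 * 512 = 2048
print(ffn_param_count(512, 2048))  # 2099712
```

With a 4× expansion the weights contribute 8 × d_model² parameters, so the FFN is usually the largest part of a block.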
      

Pre-Norm vs Post-Norm

Two Variants

Post-Norm (Original)

x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))

LayerNorm after the residual addition

Harder to train at depth

Pre-Norm (Modern)

x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

LayerNorm before each sub-layer

More stable to train, now the common default

Pre-norm is now standard in GPT-2 and later GPT models, LLaMA (which uses RMSNorm in place of LayerNorm), and most modern transformers.
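
The difference between the two orderings is easiest to see on a toy vector. In this sketch, layer_norm has no learned gain or bias, and sublayer is a hypothetical stand-in for attention or the FFN. Both simplifications are assumptions made to keep the example short.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (no learned scale or shift)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Hypothetical stand-in for attention or the FFN
    return [2.0 * v + 1.0 for v in x]

def pre_norm_step(x):
    # x + Sublayer(LayerNorm(x)): the residual path carries x through unchanged
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_step(x):
    # LayerNorm(x + Sublayer(x)): the residual sum itself is normalized
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [10.0, 20.0, 30.0, 40.0]
pre = pre_norm_step(x)
post = post_norm_step(x)

print(sum(pre) / len(pre))    # mean stays close to the input's scale
print(sum(post) / len(post))  # mean is pulled back to roughly zero
```

In the pre-norm step the input reaches the output through a plain addition, which gives gradients a direct path through a deep stack. In the post-norm step every block renormalizes that path.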

Complete Transformer Block

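import torch.nn as nn

# Uses PyTorch's built-in layers (nn.LayerNorm, nn.MultiheadAttention, nn.Linear);
# hand-written LayerNorm, MultiHeadAttention, and FeedForward modules would slot in the same way.
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # batch_first=True: inputs have shape (batch, seq_len, d_model)
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Expand to d_ff, apply GELU, project back to d_model
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Attention sub-layer (pre-norm): query, key, and value are all the normed input
        normed = self.ln1(x)
        attn_out, _ = self.attention(normed, normed, normed, need_weights=False)
        x = x + attn_out  # Residual

        # FFN sub-layer (pre-norm)
        ffn_out = self.ffn(self.ln2(x))
        x = x + ffn_out  # Residual

        return x


Stacking Blocks

Transformers stack many blocks with the same structure; each block has its own parameters:

# Full transformer encoder
class TransformerEncoder(nn.Module):
    def __init__(self, num_layers=6, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        # nn.ModuleList registers every block's parameters with the encoder
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )

    def forward(self, x):
        # Each block maps (batch, seq_len, d_model) to the same shape
        for block in self.blocks:
            x = block(x)
        return x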
      

Typical Configurations

Model	Layers	d_model	Heads
BERT-Base	12	768	12
GPT-2 (small)	12	768	12
GPT-3 (175B)	96	12288	96
LLaMA-2 (7B)	32	4096	32
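
The table rows can be turned into a rough size estimate. The sketch below assumes each block has four d_model × d_model attention projections and an FFN with 4× expansion, and it ignores biases, layer norms, and embeddings. That gives about 12 × d_model² parameters per block. The 4× assumption is only approximate for LLaMA-2, which uses a different FFN layout, so these are ballpark figures and not exact counts.

```python
def approx_block_params(d_model):
    # Attention: four d_model x d_model projections (Q, K, V, output)
    attention = 4 * d_model * d_model
    # FFN: d_model -> 4*d_model -> d_model (weights only)
    ffn = 2 * d_model * (4 * d_model)
    return attention + ffn  # equals 12 * d_model**2

# (layers, d_model) taken from the table above
configs = {
    "BERT-Base": (12, 768),
    "GPT-2": (12, 768),
    "GPT-3": (96, 12288),
    "LLaMA-2": (32, 4096),
}

estimates = {
    name: layers * approx_block_params(d_model)
    for name, (layers, d_model) in configs.items()
}

for name, total in estimates.items():
    print(f"{name}: ~{total / 1e9:.2f}B parameters in the blocks")
```

The GPT-3 estimate comes out near 174 billion, close to the model's published 175 billion, which shows that most of a large transformer's parameters sit in its stacked blocks.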

What Each Layer Learns

Progressive Abstraction

Lower layers → Higher layers (a rough trend, not a strict division):

Early layers: Syntax, part-of-speech, local patterns
Middle layers: Phrases, semantic relationships
Late layers: Global meaning, discourse, reasoning

Research Finding: Late layers can often be removed or pruned with minimal impact on simple tasks, but complex reasoning tends to depend on them.
        

Exercises

Exercise 1: Block Components

List the components of a transformer block in order. Which ones have residual connections?

Exercise 2: FFN Dimensions

If d_model = 768 and d_ff = 3072, how many parameters are in the FFN layer? Count both weight matrices and both bias vectors.

Exercise 3: Why Residuals?

Why are residual connections crucial for training deep transformers?