🚧 Lesson 10 of 10 in Level 03

Build a Transformer

Complete implementation from scratch. Full code walkthrough.

Complete Implementation

import math

import torch
import torch.nn as nn


class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, d_ff=2048, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)  # see Exercise 2
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, nhead, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.output = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # x: (batch, seq_len) token ids
        x = self.embedding(x)  # (batch, seq_len, d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)
        for layer in self.layers:
            x = layer(x, mask)
        return self.output(x)  # (batch, seq_len, vocab_size) logits
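Try it yourself! The model depends on two classes defined separately: PositionalEncoding (Exercise 2 below) and TransformerBlock, which this lesson does not show. With both in place, you can train it on your favorite text dataset.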
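The Transformer class above calls a TransformerBlock that the lesson does not define. As a minimal sketch, not the lesson's own implementation, one hypothetical version could use a post-norm layout with PyTorch's built-in nn.MultiheadAttention standing in for the Exercise 1 class. The mask convention here (shape (seq_len, seq_len), 1 = attend, 0 = block) is an assumption chosen to match the masked_fill(mask == 0, ...) call in Exercise 1.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Hypothetical post-norm block: self-attention, then a position-wise FFN."""

    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        # Assumption: the built-in attention module replaces the lesson's own class.
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Assumption: mask is (seq_len, seq_len) with 1 = attend, 0 = block.
        # nn.MultiheadAttention expects True where attention is NOT allowed.
        attn_mask = None if mask is None else (mask == 0)
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))       # residual + layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))    # residual + layer norm
        return x


block = TransformerBlock(d_model=32, nhead=4, d_ff=64)
x = torch.randn(2, 5, 32)
causal = torch.tril(torch.ones(5, 5))
print(block(x, causal).shape)  # torch.Size([2, 5, 32])
```

The block keeps the input shape, which is what lets the Transformer class stack it num_layers times.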

Training Loop

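import torch.nn.functional as F

model = Transformer(vocab_size=10000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for batch in dataloader:  # each batch is expected to expose .input and .target token ids
    optimizer.zero_grad()
    output = model(batch.input)  # (batch, seq_len, vocab_size)
    # cross_entropy expects (N, vocab_size) logits and (N,) class indices
    loss = F.cross_entropy(output.reshape(-1, output.size(-1)), batch.target.reshape(-1))
    loss.backward()
    optimizer.step()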
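The loop above leaves open how batch.input and batch.target are built. For next-token prediction, a common arrangement (assumed here, not stated in the lesson) is to shift the token sequence by one position and pass a causal mask. The helper name make_lm_batch is hypothetical, and random logits stand in for the model output.

```python
import torch
import torch.nn.functional as F


def make_lm_batch(tokens):
    """Split (batch, seq_len + 1) token ids into inputs, targets, and a causal mask."""
    inputs = tokens[:, :-1]   # every token except the last
    targets = tokens[:, 1:]   # the same tokens shifted left by one
    seq_len = inputs.size(1)
    # 1 = may attend, 0 = blocked (matches masked_fill(mask == 0, ...) in Exercise 1)
    causal_mask = torch.tril(torch.ones(seq_len, seq_len))
    return inputs, targets, causal_mask


tokens = torch.tensor([[5, 7, 9, 2], [4, 4, 8, 1]])
inputs, targets, causal_mask = make_lm_batch(tokens)

# Stand-in for model(inputs, causal_mask): random logits over a 10-token vocabulary.
logits = torch.randn(inputs.size(0), inputs.size(1), 10)
loss = F.cross_entropy(logits.reshape(-1, 10), targets.reshape(-1))
```

Each position is then trained to predict the token that follows it, and the lower-triangular mask keeps a position from attending to later ones.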

✏️ Coding Exercises

Exercise 1: Implement Multi-Head Attention

Complete the missing parts of the MultiHeadAttention class:

import torch.nn.functional as F  # needed for F.softmax below


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.d_model = d_model
        self.nhead = nhead
        self.d_k = d_model // nhead  # assumes d_model is divisible by nhead

        # TODO: Define linear projections
        self.W_q = nn.Linear(____, ____)
        self.W_k = nn.Linear(____, ____)
        self.W_v = nn.Linear(____, ____)
        self.W_o = nn.Linear(____, ____)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # TODO: Apply linear projections
        Q = self.W_q(query).view(batch_size, -1, self.nhead, self.d_k).transpose(1, 2)
        K = ____
        V = ____

        # TODO: Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)

        # TODO: Reshape and apply output projection
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, ____)
        return self.W_o(context)
Hint: The linear projections should map from d_model to d_model. Remember to transpose and reshape the context tensor before the final linear layer.
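The reshape steps in the hint are easier to follow with concrete shapes. The sketch below uses small, arbitrary sizes and no learned projections; it only traces how a (batch, seq_len, d_model) tensor is split into heads and merged back.

```python
import torch

batch, seq_len, d_model, nhead = 2, 5, 16, 4
d_k = d_model // nhead

x = torch.randn(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, nhead, seq_len, d_k)
heads = x.view(batch, seq_len, nhead, d_k).transpose(1, 2)

# matmul(attn, V) returns a fresh contiguous tensor; .contiguous() mimics that here.
context = heads.contiguous()

# Merge: transpose gives a non-contiguous view, so .contiguous() must precede .view()
merged = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

print(heads.shape)             # torch.Size([2, 4, 5, 4])
print(torch.equal(merged, x))  # True
```

Splitting and merging are exact inverses, so the only learned parts of multi-head attention are the four linear projections.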

Exercise 2: Build a Positional Encoding

Implement sinusoidal positional encoding:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        # TODO: Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

        # TODO: Calculate div_term using geometric progression
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(____ / ____))

        # TODO: Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(____)
        pe[:, 1::2] = torch.cos(____)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # TODO: Add positional encoding to input
        return x + self.pe[:, :x.size(____)]
Answer Key:
  • Exercise 1: All four linear projections are nn.Linear(d_model, d_model); K and V follow the same pattern as Q, using self.W_k(key) and self.W_v(value); the final view uses self.d_model
  • Exercise 2: div_term uses math.log(10000.0) / d_model; sin and cos both take position * div_term; x.size(1) gives the sequence length
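To check a solution to Exercise 2 against known values, the same formula can be evaluated for a single position in plain Python. This is a sketch for verification only; the function name positional_encoding is hypothetical.

```python
import math


def positional_encoding(pos, d_model):
    """Return the sinusoidal encoding for one position as a list of floats."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # Same value as div_term in the exercise: exp(-i * ln(10000) / d_model)
        div_term = math.exp(-i * math.log(10000.0) / d_model)
        pe[i] = math.sin(pos * div_term)          # even index: sine
        if i + 1 < d_model:
            pe[i + 1] = math.cos(pos * div_term)  # odd index: cosine
    return pe


print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1, which makes a quick sanity check for row 0 of the pe buffer.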

📝 Knowledge Check Quiz

Test your understanding of transformer implementation with these questions:

Question 1

What is the purpose of positional encoding in a transformer?

A) To reduce the model size
B) To add sequence order information, since self-attention by itself ignores token order
C) To speed up training
D) To prevent overfitting

Answer: B) Self-attention by itself is order-agnostic (permuting the input tokens only permutes the outputs), so positional encoding injects information about token positions.
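The order-agnostic behavior can be checked directly. The sketch below uses a bare scaled dot-product self-attention with no learned projections and no positional encoding (a simplification, not the lesson's MultiHeadAttention) and compares shuffling the tokens before and after attention.

```python
import math

import torch


def self_attention(x):
    """Plain scaled dot-product self-attention with no learned projections."""
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.size(-1))
    return torch.softmax(scores, dim=-1) @ x


torch.manual_seed(0)
x = torch.randn(6, 16)        # 6 tokens, no positional information
perm = torch.randperm(6)

out_then_shuffle = self_attention(x)[perm]
shuffle_then_out = self_attention(x[perm])
same = torch.allclose(out_then_shuffle, shuffle_then_out, atol=1e-5)
print(same)  # True
```

Because both orders give the same result, the model cannot tell one token ordering from another unless position information is added to the inputs.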

Question 2

Why do we divide attention scores by √d_k in the scaled dot-product attention?

A) To normalize the output to [0, 1]
B) To prevent gradients from becoming too small
C) To prevent dot products from growing too large in high dimensions, which would push softmax into regions with small gradients
D) To make the computation faster

Answer: C) For large d_k, dot products can become large, pushing softmax into regions with extremely small gradients.
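A small numerical experiment illustrates the effect. It assumes query and key components drawn from a standard normal distribution, in which case the dot product has variance close to d_k before scaling and close to 1 after.

```python
import math

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw = (q * k).sum(dim=-1)          # unscaled dot products
scaled = raw / math.sqrt(d_k)      # scaled as in the attention formula
raw_var = raw.var().item()         # close to d_k
scaled_var = scaled.var().item()   # close to 1

# Softmax on the same scores at the two magnitudes
scores = torch.tensor([2.0, 1.0, 0.0, -1.0])
smooth = torch.softmax(scores, dim=-1)                    # weight spread over entries
peaked = torch.softmax(scores * math.sqrt(d_k), dim=-1)   # nearly one-hot
```

When the softmax output is nearly one-hot, its gradient with respect to the scores is close to zero, which is the saturation the scaling avoids.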

Question 3

In multi-head attention, what does each "head" learn?

A) A different layer of the network
B) A different projection of queries, keys, and values, allowing the model to attend to information from different representation subspaces
C) A different training objective
D) A different vocabulary

Answer: B) Each head learns different attention patterns by projecting Q, K, V into different subspaces.

Question 4

What is the role of the feed-forward network (FFN) in each transformer layer?

A) To replace the attention mechanism
B) To process each position independently and add non-linearity, increasing model capacity
C) To encode positional information
D) To generate the final output vocabulary

Answer: B) The FFN applies the same transformation to each position independently, adding non-linear capacity to the model.
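"Each position independently" can be verified with a short check. The sketch assumes the usual two-layer form Linear, ReLU, Linear with small arbitrary sizes; the lesson does not show its FFN, so the layout is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)   # (batch, seq_len, d_model)

whole = ffn(x)                   # applied to the full sequence at once
# Apply the same network to one position at a time, then restack along seq_len.
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

same = torch.allclose(whole, per_position, atol=1e-5)
print(same)  # True
```

No information moves between positions inside the FFN; mixing across positions happens only in the attention sub-layer.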

Question 5

Why are residual connections and layer normalization important in transformers?

A) They make the model smaller
B) They help gradients flow through deep networks and stabilize training
C) They reduce the vocabulary size
D) They eliminate the need for attention

Answer: B) Residual connections help gradients flow, and layer normalization stabilizes training by normalizing inputs to each sub-layer.
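The normalizing effect is easy to observe. The sketch below assumes the post-norm arrangement LayerNorm(x + Sublayer(x)) and uses a random tensor as a stand-in for the sub-layer output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model) * 10 + 3     # activations with a large scale and offset
sublayer_out = torch.randn(2, 5, d_model)   # stand-in for attention or FFN output

y = norm(x + sublayer_out)                  # residual connection, then layer norm

per_token_mean = y.mean(dim=-1)                  # approximately 0 for every token
per_token_var = y.var(dim=-1, unbiased=False)    # approximately 1 for every token
```

Whatever the scale of the incoming activations, each token vector leaves the sub-layer with roughly zero mean and unit variance, while the residual path keeps a direct route for gradients back to x.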

🎯 Key Takeaways

  • Transformer Architecture: Consists of embedding + positional encoding, a stack of transformer blocks (each with multi-head attention and a feed-forward network), followed by output projection.
  • Multi-Head Attention: Projects queries, keys, and values into multiple subspaces, allowing the model to attend to different representation aspects simultaneously. Attention scores are divided by √d_k to prevent softmax saturation.
  • Positional Encoding: Injects sequence position information using sinusoidal functions, enabling the model to understand token order despite parallel processing.
  • Residual Connections & Layer Norm: Essential for training deep networks: residual connections help gradient flow, while layer normalization stabilizes activations.
  • Training Essentials: Use AdamW optimizer with learning rate ~1e-4, apply dropout for regularization, and mask padding tokens so they receive no attention weight.
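Padding masks can be sketched in a few lines. The example assumes token id 0 is reserved for padding and uses the same convention as Exercise 1 (masked_fill where mask == 0); the scores and logits are dummy values, and the token ids are reused as dummy targets.

```python
import torch
import torch.nn.functional as F

pad_id = 0  # assumption: token id 0 is reserved for padding
tokens = torch.tensor([[5, 7, 9, 0, 0],
                       [4, 8, 1, 3, 0]])

# (batch, 1, 1, seq_len): broadcasts over heads and query positions
pad_mask = (tokens != pad_id).unsqueeze(1).unsqueeze(2)

scores = torch.zeros(2, 1, 5, 5)                  # dummy attention scores, one head
scores = scores.masked_fill(pad_mask == 0, -1e9)
attn = F.softmax(scores, dim=-1)                  # padded key positions get weight 0

# Padding positions can also be skipped in the loss via ignore_index.
logits = torch.randn(2, 5, 10)
loss = F.cross_entropy(logits.reshape(-1, 10), tokens.reshape(-1), ignore_index=pad_id)
```

In the first sequence each query spreads its attention over the three real tokens only, and the two padded positions contribute nothing to either the attention output or the loss.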