Causal Language Modeling (CLM)
Used by GPT and other decoder-only models: each token is predicted from the tokens that precede it.
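As a minimal sketch of how CLM training pairs are formed, the snippet below shifts a token sequence by one position so that each input token is paired with the token that follows it. The function name `clm_pairs` and the token ids are hypothetical and only serve as illustration.

```python
def clm_pairs(token_ids):
    """Return (inputs, targets) for next-token prediction.

    targets[i] is the token that follows inputs[i] in the original sequence.
    """
    inputs = token_ids[:-1]   # every token except the last
    targets = token_ids[1:]   # the same sequence shifted left by one
    return inputs, targets


tokens = [101, 7, 42, 9, 102]  # made-up token ids
inputs, targets = clm_pairs(tokens)
print(inputs)   # [101, 7, 42, 9]
print(targets)  # [7, 42, 9, 102]
```

In an actual model, a causal attention mask is what lets all of these predictions be trained in one forward pass without any position seeing its own target.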
Masked Language Modeling (MLM)
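Used by BERT and other encoder-only models: a random subset of input tokens (about 15% in BERT) is hidden, and the model predicts the originals using context on both sides.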
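To make the shape of an MLM training example concrete, here is a small illustration with hand-picked positions rather than random sampling. The helper `apply_mask` is hypothetical, and using `None` to mark positions that carry no prediction target is a choice made for this sketch.

```python
def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Replace the tokens at `positions` with `mask_token`.

    Returns (corrupted, labels). labels holds the original token at masked
    positions and None elsewhere, since only masked positions are predicted.
    """
    corrupted = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return corrupted, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = apply_mask(tokens, {1, 4})
print(corrupted)  # ['the', '[MASK]', 'sat', 'on', '[MASK]', 'mat']
print(labels)     # [None, 'cat', None, None, 'the', None]
```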
Span Corruption
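Used by T5: contiguous spans of the input are replaced with sentinel tokens, and the model generates the dropped spans.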
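The following sketch shows one way the input and target sequences could be built for span corruption. The spans are passed in explicitly instead of being sampled, the function name `span_corrupt` is hypothetical, and the sentinel naming `<extra_id_N>` follows the convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Build (inputs, targets) for span corruption.

    `spans` is a sorted list of non-overlapping (start, end) index pairs,
    with `end` exclusive. Each span is replaced by one sentinel in the
    input; the target lists each sentinel followed by the tokens it hid.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # stand-in for the dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```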
Knowledge Check Quiz
Q1: Which pre-training objective is used by GPT models?
A) Masked Language Modeling (MLM)
B) Causal Language Modeling (CLM)
C) Span Corruption
D) Denoising Autoencoder
Answer: B) CLM - GPT models predict the next token given all previous tokens.
Q2: What percentage of tokens are typically masked in BERT's MLM objective?
A) 5%
B) 10%
C) 15%
D) 25%
Answer: C) 15% - BERT selects about 15% of input tokens at random as prediction targets.
Q3: Which model architecture uses span corruption as its pre-training objective?
A) GPT-3
B) BERT
C) T5
D) RoBERTa
Answer: C) T5 - the Text-to-Text Transfer Transformer uses span corruption with sentinel tokens.
Q4: What is the main advantage of CLM over MLM for text generation?
A) It requires less compute
B) It naturally learns to predict future tokens left-to-right
C) It can see the entire context bidirectionally
D) It works better for classification tasks
Answer: B) It naturally learns to predict future tokens left-to-right - the CLM training objective matches how text is generated at inference time, one token after another.
Hands-On Exercises
Exercise 1: Implement MLM Masking
Write a function that applies Masked Language Modeling to a token sequence:
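One possible starting point is sketched below. It masks each token independently with probability `mask_prob` and always substitutes the mask id. The names `mask_tokens`, `IGNORE`, and the ids in the example are assumptions for this sketch; the value -100 mirrors the default `ignore_index` of PyTorch's cross-entropy loss, but any sentinel would do.

```python
import random

IGNORE = -100  # label for positions that should not contribute to the loss


def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Apply simple MLM masking to a list of token ids.

    Returns (masked, labels): masked positions hold `mask_id` in `masked`
    and the original id in `labels`; all other labels are IGNORE.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            masked.append(mask_id)   # hide the token from the model
            labels.append(tok)       # the model must recover this id
        else:
            masked.append(tok)       # leave the token visible
            labels.append(IGNORE)    # no prediction target here
    return masked, labels


# Hypothetical ids; a seeded generator keeps the run reproducible.
masked, labels = mask_tokens([12, 48, 7, 93, 5, 61], mask_id=0, rng=random.Random(0))
print(masked)
print(labels)
```

Because each token is sampled independently, the fraction actually masked varies from sequence to sequence and only averages out to `mask_prob`.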
Challenge: Extend this to implement the BERT strategy: for each token selected for prediction, replace it with [MASK] 80% of the time, replace it with a random token 10% of the time, and keep the original 10% of the time.
Exercise 2: Causal Mask Creation
Implement a function that creates a causal (look-ahead) mask for self-attention:
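A minimal pure-Python sketch follows, assuming the mask is represented as a square boolean matrix where `True` means "may attend". The function name `causal_mask` is hypothetical; tensor libraries usually build the same lower-triangular pattern with a single call.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len boolean matrix.

    mask[i][j] is True when query position i may attend to key position j,
    which is the case exactly when j <= i.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

In an attention layer, the scores at the disallowed positions are typically set to negative infinity before the softmax so that they receive zero weight.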
Key Insight: The causal mask ensures position i can only attend to positions ≤ i, preventing the model from "cheating" by looking at future tokens.