🚧 Lesson 2 of 35 in Level 04

Pre-training Objectives

This lesson covers three ways to pre-train language models: causal language modeling (CLM), masked language modeling (MLM), and span corruption.

Causal Language Modeling (CLM)

Used by GPT and decoder-only models:

    Input:  "The cat sat on the"
    Target: "cat sat on the mat"
    Loss:   -log P("cat" | "The") - log P("sat" | "The cat") - ...

Advantage: Natural for generation, because the model learns to predict what comes next.
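
The shifted input/target pairing and the summed negative log-likelihood above can be sketched in plain Python. This is an illustrative sketch only: the helper names `make_clm_pairs` and `clm_loss` are hypothetical, and the probabilities in the usage lines are made up rather than produced by a model.

```python
import math

def make_clm_pairs(tokens):
    """Return (context, next_token) pairs for causal language modeling."""
    # The context for position i is every token that comes before it.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def clm_loss(next_token_probs):
    """Sum of -log P(correct next token | context) over all positions."""
    return -sum(math.log(p) for p in next_token_probs)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
for context, target in make_clm_pairs(tokens):
    print(" ".join(context), "->", target)

# Made-up probabilities a model might assign to each correct next token.
probs = [0.2, 0.5, 0.9, 0.8, 0.4]
print("loss:", clm_loss(probs))
```

Every position in the sequence yields a training signal, which is one reason CLM makes efficient use of each training sequence.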

Masked Language Modeling (MLM)

Used by BERT and encoder-only models:

    Input:  "The [MASK] sat on the mat"
    Target: "cat"

Mask ~15% of tokens randomly.

Span Corruption

Used by T5:

    Input:  "The <extra_id_0> sat on the <extra_id_1>"
    Target: "<extra_id_0> cat <extra_id_1> mat <extra_id_2>"

Replace consecutive spans with special (sentinel) tokens.
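
The span replacement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `span_corrupt` is hypothetical, the spans are chosen by hand instead of being sampled randomly as they would be in real pre-training, and the `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans, sentinel="<extra_id_{}>"):
    """Replace each (start, end) span with a sentinel; return (input, target).

    spans: sorted, non-overlapping (start, end) pairs, with end exclusive.
    """
    corrupted, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel.format(k))    # one sentinel per dropped span
        target.append(sentinel.format(k))       # same sentinel marks the span in the target
        target.extend(tokens[start:end])        # followed by the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    target.append(sentinel.format(len(spans)))  # final sentinel closes the target
    return corrupted, target

tokens = ["The", "cat", "sat", "on", "the", "mat"]
inp, tgt = span_corrupt(tokens, [(1, 2), (5, 6)])
print(" ".join(inp))  # The <extra_id_0> sat on the <extra_id_1>
print(" ".join(tgt))  # <extra_id_0> cat <extra_id_1> mat <extra_id_2>
```

Because the target contains only the dropped spans plus sentinels, it is much shorter than the input.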

Knowledge Check Quiz

Q1: Which pre-training objective is used by GPT models?

A) Masked Language Modeling (MLM)
B) Causal Language Modeling (CLM)
C) Span Corruption
D) Denoising Autoencoder

Answer: B) CLM - GPT models predict the next token given all previous tokens.

Q2: What percentage of tokens are typically masked in BERT's MLM objective?

A) 5%
B) 10%
C) 15%
D) 25%

Answer: C) 15% - BERT masks approximately 15% of input tokens randomly.

Q3: Which model architecture uses span corruption as its pre-training objective?

A) GPT-3
B) BERT
C) T5
D) RoBERTa

Answer: C) T5 - Text-to-Text Transfer Transformer uses span corruption with sentinel tokens.

Q4: What is the main advantage of CLM over MLM for text generation?

A) It requires less compute
B) It naturally learns to predict future tokens left-to-right
C) It can see the entire context bidirectionally
D) It works better for classification tasks

Answer: B) CLM naturally learns left-to-right generation, matching how we generate text.

Hands-On Exercises

Exercise 1: Implement MLM Masking

Write a function that applies Masked Language Modeling to a token sequence:

    import random

    def apply_mlm_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """
        Randomly mask tokens for MLM training.

        Args:
            tokens: List of tokens (e.g., ['The', 'cat', 'sat', 'on', 'the', 'mat'])
            mask_prob: Probability of masking each token (default 0.15)
            mask_token: Token to use for masking

        Returns:
            masked_tokens: List with some tokens replaced by [MASK]
            target_indices: Dict mapping positions to original tokens

        Example:
            >>> tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat']
            >>> masked, targets = apply_mlm_mask(tokens)
            >>> # masked might be: ['The', '[MASK]', 'sat', 'on', 'the', '[MASK]']
        """
        masked_tokens = tokens.copy()
        target_indices = {}
        for i, token in enumerate(tokens):
            # Each token is masked independently with probability mask_prob.
            if random.random() < mask_prob:
                target_indices[i] = token
                masked_tokens[i] = mask_token
        return masked_tokens, target_indices

Challenge: Extend this to implement the BERT strategy: 80% of the time replace with [MASK], 10% with a random token, 10% keep original.
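
One possible solution to the challenge is sketched below. It is not a reference implementation: the function name `apply_bert_mask`, the `vocab` argument used as the pool of random replacements, and the optional `rng` argument for reproducibility are all assumptions made for this example.

```python
import random

def apply_bert_mask(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Select tokens with probability mask_prob, then apply the 80/10/10 rule."""
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; it contributes no loss
        targets[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the original token in place
    return masked, targets

tokens = ["The", "cat", "sat", "on", "the", "mat"]
vocab = ["dog", "ran", "under", "a", "tree"]  # toy vocabulary for illustration
masked, targets = apply_bert_mask(tokens, vocab, rng=random.Random(0))
print(masked)
print(targets)
```

Note that the random replacement is drawn from the whole vocabulary, so it can occasionally coincide with the original token. Selected positions are recorded in `targets` in all three cases, including the one where the token is left unchanged.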

Exercise 2: Causal Mask Creation

Implement a function that creates a causal (look-ahead) mask for self-attention:

    import numpy as np

    def create_causal_mask(seq_len):
        """
        Create a causal mask for decoder self-attention.

        Args:
            seq_len: Length of the sequence

        Returns:
            mask: 2D array where mask[i][j] = True if position i can attend to j

        Example:
            >>> mask = create_causal_mask(4)
            >>> # mask should be:
            >>> # [[True, False, False, False],
            >>> #  [True, True, False, False],
            >>> #  [True, True, True, False],
            >>> #  [True, True, True, True]]
        """
        # Lower-triangular matrix: row i is True for columns 0..i.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        return mask

    # Visualize the causal mask
    mask = create_causal_mask(6)
    print("Causal Mask (6x6):")
    print(mask.astype(int))

Key Insight: The causal mask ensures position i can only attend to positions ≤ i, preventing the model from "cheating" by looking at future tokens.
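
To show where such a mask takes effect, the sketch below applies it to a matrix of raw attention scores before the softmax. The function name `causal_attention_weights` is hypothetical, and the all-zero score matrix is a stand-in for the query-key scores a real model would compute.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over attention scores with future positions masked out."""
    seq_len = scores.shape[-1]
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Disallowed (future) positions get -inf, so they receive zero weight.
    masked = np.where(allowed, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is
    # always allowed, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # stand-in scores: every allowed position ties
print(causal_attention_weights(scores).round(2))
```

With equal scores, row i spreads its weight uniformly over positions 0 through i, and every entry above the diagonal is exactly zero.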