Lesson 2: Pre-training Objectives

Causal Language Modeling (CLM)

Used by GPT and decoder-only models:

Input:  "The cat sat on the"
Target: "cat sat on the mat"

Loss: -log P("cat" | "The") - log P("sat" | "The cat") - ...
      

Advantage: A natural fit for generation, because the model is trained to predict the next token from the tokens that precede it.
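The shift between input and target, and the way the loss sums over positions, can be sketched in a few lines. The helper names `make_clm_pairs` and `clm_loss` are hypothetical, and the probabilities are made-up stand-ins for what a trained model would output.

```python
import math

def make_clm_pairs(tokens):
    """Return (context, next_token) pairs for causal language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def clm_loss(next_token_probs):
    """Sum of negative log-probabilities assigned to each correct next token."""
    return -sum(math.log(p) for p in next_token_probs)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = make_clm_pairs(tokens)
for context, target in pairs:
    print(" ".join(context), "->", target)

# Assumed probabilities a model might assign to the five correct next tokens.
probs = [0.2, 0.5, 0.4, 0.9, 0.3]
print(round(clm_loss(probs), 3))
```

Every position in the sequence contributes one term to the loss, so a single sequence yields as many training signals as it has next-token predictions.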

Masked Language Modeling (MLM)

Used by BERT and encoder-only models:

Input:  "The [MASK] sat on the mat"
Target: "cat"

About 15% of tokens are selected at random for masking, and the model predicts the original token at each masked position.
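Unlike CLM, the MLM loss is computed only at the masked positions. A minimal sketch of that idea follows; the function name `mlm_loss` and the probability value are hypothetical and stand in for a real model's output.

```python
import math

tokens = ["The", "cat", "sat", "on", "the", "mat"]
masked = ["The", "[MASK]", "sat", "on", "the", "mat"]

# Positions where the input was masked; only these contribute to the loss.
masked_positions = [i for i, tok in enumerate(masked) if tok == "[MASK]"]

# The training targets are the original tokens at those positions.
targets = {i: tokens[i] for i in masked_positions}
print(targets)  # {1: 'cat'}

def mlm_loss(positions, prob_of_original):
    """Mean negative log-probability of the original tokens at masked positions."""
    return -sum(math.log(prob_of_original[i]) for i in positions) / len(positions)

# Assumed probability the model assigns to "cat" at position 1.
prob_of_original = {1: 0.6}
print(round(mlm_loss(masked_positions, prob_of_original), 3))
```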
      

Span Corruption

Used by T5:

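Input:  "The <X> sat on the <Y>"
Target: "<X> cat <Y> mat <Z>"

Each consecutive span of tokens is replaced with a single sentinel token (<X>, <Y>). The target lists each sentinel followed by the tokens it replaced, and ends with a final sentinel (<Z>).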
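A minimal sketch of span corruption, assuming the spans to corrupt are already chosen. The function name `corrupt_spans` and the sentinel strings `<X>`, `<Y>`, `<Z>` are illustrative placeholders; T5's actual vocabulary reserves its own sentinel tokens, and real pipelines sample the spans randomly.

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """
    Replace each (start, end) span with one sentinel and build the target.

    spans: sorted, non-overlapping (start, end) pairs, end exclusive.
    sentinels: needs one entry per span plus one closing sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinels[k])          # one sentinel stands in for the span
        targets.append(sentinels[k])         # target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinels[len(spans)])    # closing sentinel ends the target
    return inputs, targets

tokens = ["The", "cat", "sat", "on", "the", "mat"]
inputs, targets = corrupt_spans(tokens, [(1, 2), (5, 6)])
print(" ".join(inputs))   # The <X> sat on the <Y>
print(" ".join(targets))  # <X> cat <Y> mat <Z>
```

Because a whole span collapses into one sentinel, both the corrupted input and the target are shorter than the original sequence.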
      

Knowledge Check Quiz

Q1: Which pre-training objective is used by GPT models?

A) Masked Language Modeling (MLM)
B) Causal Language Modeling (CLM)
C) Span Corruption
D) Denoising Autoencoder

Answer: B) CLM - GPT models predict the next token given all previous tokens.

Q2: What percentage of tokens are typically masked in BERT's MLM objective?

A) 5%
B) 10%
C) 15%
D) 25%

Answer: C) 15% - BERT masks approximately 15% of input tokens randomly.

Q3: Which model uses span corruption as its pre-training objective?

A) GPT-3
B) BERT
C) T5
D) RoBERTa

Answer: C) T5 - T5 (Text-to-Text Transfer Transformer) uses span corruption with sentinel tokens.

Q4: What is the main advantage of CLM over MLM for text generation?

A) It requires less compute
B) It naturally learns to predict future tokens left-to-right
C) It can see the entire context bidirectionally
D) It works better for classification tasks

Answer: B) CLM is trained to predict tokens left to right, which matches how text is generated one token at a time.

Hands-On Exercises

Exercise 1: Implement MLM Masking

Write a function that applies Masked Language Modeling to a token sequence:

import random

def apply_mlm_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """
    Randomly mask tokens for MLM training.

    Args:
        tokens: List of tokens (e.g., ['The', 'cat', 'sat', 'on', 'the', 'mat'])
        mask_prob: Probability of masking each token (default 0.15)
        mask_token: Token to use for masking

    Returns:
        masked_tokens: List with some tokens replaced by mask_token
        target_indices: Dict mapping positions to original tokens

    Example:
        >>> tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat']
        >>> masked, targets = apply_mlm_mask(tokens)
        >>> # masked might be: ['The', '[MASK]', 'sat', 'on', 'the', '[MASK]']
    """
    masked_tokens = tokens.copy()
    target_indices = {}

    for i, token in enumerate(tokens):
        # Each token is masked independently with probability mask_prob
        if random.random() < mask_prob:
            target_indices[i] = token      # remember what the model must predict
            masked_tokens[i] = mask_token

    return masked_tokens, target_indices
        

Challenge: Extend this to implement the BERT strategy. Of the tokens selected for masking, replace 80% with [MASK], replace 10% with a random token, and leave 10% unchanged.
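One possible approach to the challenge is sketched here, not a reference solution. The function name `apply_bert_mask`, the `vocab` argument, and the optional `rng` parameter are assumptions introduced for illustration; the random replacement is drawn uniformly from `vocab` and may occasionally pick the original token.

```python
import random

def apply_bert_mask(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Select tokens with probability mask_prob, then apply the 80/10/10 rule."""
    rng = rng or random.Random()
    masked_tokens = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                  # position not selected for prediction
        targets[i] = token            # selected positions always keep their target
        roll = rng.random()
        if roll < 0.8:
            masked_tokens[i] = mask_token         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked_tokens[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return masked_tokens, targets

tokens = ["The", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))
masked, targets = apply_bert_mask(tokens, vocab, mask_prob=0.5, rng=random.Random(0))
print(masked)
print(targets)
```

Passing a seeded `random.Random` makes the masking reproducible, which helps when testing. The model is still asked to predict the original token at every selected position, including those left unchanged.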

Exercise 2: Causal Mask Creation

Implement a function that creates a causal (look-ahead) mask for self-attention:

import numpy as np

def create_causal_mask(seq_len):
    """
    Create a causal mask for decoder self-attention.

    Args:
        seq_len: Length of the sequence

    Returns:
        mask: 2D array where mask[i][j] = True if position i can attend to j

    Example:
        >>> mask = create_causal_mask(4)
        >>> # mask should be:
        >>> # [[True, False, False, False],
        >>> #  [True, True, False, False],
        >>> #  [True, True, True, False],
        >>> #  [True, True, True, True]]
    """
    # Lower-triangular matrix: row i is True for columns 0..i
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return mask

# Visualize the causal mask
mask = create_causal_mask(6)
print("Causal Mask (6x6):")
print(mask.astype(int))
        

Key Insight: The causal mask ensures position i can only attend to positions ≤ i, preventing the model from "cheating" by looking at future tokens.
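To see the mask take effect, it can be applied to attention scores before the softmax. This is a minimal sketch: the helper name `masked_softmax` is hypothetical, and the random `scores` matrix is a stand-in for the scaled query-key products a real attention layer would compute.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, with disallowed positions set to -inf first."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                              # exp(-inf) becomes 0
    return weights / weights.sum(axis=-1, keepdims=True)

seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # same mask as in Exercise 2
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # assumed attention scores
weights = masked_softmax(scores, mask)
print(np.round(weights, 2))
```

Every entry above the diagonal comes out as exactly zero, and each row still sums to 1, so position i distributes all of its attention over positions 0 through i.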