Lesson 9: Transformer Variants

BERT (2018)

Encoder-only, pre-trained with masked language modeling (MLM):

Bidirectional context
[MASK] token prediction
Next sentence prediction (NSP)
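
As a minimal sketch of how [MASK] prediction targets can be built, the hypothetical helper below corrupts a token list and records the originals as labels. The 15% selection rate and the 80/10/10 split (mask, random token, unchanged) follow the recipe described in the BERT paper. The function name, the toy vocabulary, and the use of None for unselected positions are assumptions made for illustration; real pipelines operate on subword IDs rather than whole words.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels) for one masked-language-modeling example.

    labels[i] is the original token at selected positions and None elsewhere,
    so a loss would be computed only on the selected positions.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
print(corrupted)
print(labels)
```

Because the model sees the whole corrupted sequence at once, it can use context on both sides of each selected position when predicting the original token.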

GPT Series

Decoder-only, autoregressive:

GPT-1: 117M params
GPT-2: 1.5B params
GPT-3: 175B params
GPT-4: Parameter count not publicly disclosed (unofficial estimates: 1T+ params)
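
The counts above can be roughly reproduced from each model's depth and width. The sketch below uses the common approximation of about 12 × d_model² weights per transformer block plus a token-embedding matrix. approx_params is a hypothetical helper, and the layer counts, widths, and vocabulary sizes are the published configurations as best recalled here, so treat them as assumptions. GPT-4 is left out because its configuration has not been published.

```python
def approx_params(n_layers, d_model, vocab_size=0):
    """Rough decoder-only parameter count.

    Each block has about 4*d^2 attention weights (Q, K, V, output projections)
    and about 8*d^2 feed-forward weights (two d x 4d matrices), so ~12*d^2.
    Biases, layer norms, and position embeddings are ignored.
    """
    return 12 * n_layers * d_model ** 2 + vocab_size * d_model

# Assumed configurations (largest variant of each generation).
configs = {
    "GPT-1": dict(n_layers=12, d_model=768, vocab_size=40478),
    "GPT-2": dict(n_layers=48, d_model=1600, vocab_size=50257),
    "GPT-3": dict(n_layers=96, d_model=12288, vocab_size=50257),
}

for name, cfg in configs.items():
    print(f"{name}: ~{approx_params(**cfg) / 1e9:.2f}B parameters")
```

Under these assumptions the estimates land within a few percent of the 117M, 1.5B, and 175B figures listed above. The per-block term grows with the square of the width, which is why widening the model drives most of the growth.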

T5

Encoder-decoder, text-to-text:

# Every task is text-to-text
Translation: "translate English to German: The cat"
Classification: "classify: This movie was great"
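
The prefix convention shown above can be sketched as a small formatting step. to_text_to_text is a hypothetical helper, not part of any T5 library, and the "classify" prefix is taken from this lesson's example; which prefixes a given checkpoint actually recognizes depends on how it was trained.

```python
def to_text_to_text(task_prefix, text):
    """Build a T5-style input string: '<task prefix>: <text>'."""
    return f"{task_prefix.strip()}: {text.strip()}"

examples = [
    ("translate English to German", "The cat"),
    ("classify", "This movie was great"),
]

for prefix, text in examples:
    print(to_text_to_text(prefix, text))
```

The model's target is also plain text (a German sentence, a label word), so the same encoder-decoder and the same loss can serve every task.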

Modern Variants

LLaMA: Open weights, efficient architecture
PaLM: Google's large model
Claude: Constitutional AI training

Knowledge Check

Q1: What type of architecture does BERT use, and what is its primary training objective?

A: Encoder-only architecture. Its primary objective is Masked Language Modeling (MLM), paired with an auxiliary Next Sentence Prediction (NSP) task.

Q2: Why can't GPT models use bidirectional attention during pre-training?

A: GPT is autoregressive (decoder-only) and predicts tokens left-to-right. Bidirectional attention would allow the model to "see" future tokens, breaking the causal structure needed for generation.
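
To make the left-to-right constraint concrete, the sketch below lists the (context, target) pairs that autoregressive training derives from one sentence. next_token_pairs is a hypothetical helper, and whole words stand in for the subword tokens a real model would use.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for autoregressive training.

    The context for each target contains only the tokens to its left.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in next_token_pairs("The cat sat on the mat".split()):
    print(context, "->", target)
```

If the context for a target included the target itself or anything after it, the prediction would be trivial during training and impossible to reproduce at generation time, when later tokens do not exist yet.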

Q3: What makes T5's approach to NLP tasks unique compared to BERT or GPT?

A: T5 frames every task as text-to-text using an encoder-decoder architecture, allowing unified training across translation, classification, QA, and more with the same objective.

Q4: Which architecture would you choose for: (a) text classification, (b) text generation, (c) machine translation?

A: (a) BERT/encoder-only for classification, (b) GPT/decoder-only for generation, (c) T5/encoder-decoder for translation.

Q5: What is the key advantage of models like LLaMA in the open-source ecosystem?

A: LLaMA provides open weights with an efficient architecture, enabling researchers and developers to run and fine-tune powerful models locally without depending on a hosted API.

Q6: Why does T5 use an encoder-decoder architecture instead of encoder-only or decoder-only?

A: The encoder-decoder design allows T5 to process input sequences bidirectionally (encoder) while generating output sequences autoregressively (decoder), making it ideal for sequence-to-sequence tasks like translation and summarization.

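Q7: How does the parameter count progression in GPT models (117M → 1.5B → 175B → an estimated 1T+) relate to scaling laws?

A: Each jump of roughly one to two orders of magnitude in parameters came with better performance and new capabilities, such as GPT-3's few-shot learning. Scaling laws describe this trend: loss falls smoothly and predictably as parameters, data, and compute grow together. The GPT-4 figure is an unofficial estimate, since its size has not been disclosed.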

Key Takeaways

Architecture Matters: Encoder-only (BERT) excels at understanding tasks like classification; decoder-only (GPT) dominates generation; encoder-decoder (T5) handles sequence-to-sequence tasks like translation.
Training Objectives Shape Capabilities: MLM enables bidirectional understanding but does not train the model to generate text; autoregressive training enables generation but limits each token's context to the tokens on its left.
Scale Improves Capability: Across GPT generations (117M → 1.5B → 175B, with GPT-4 unofficially estimated at 1T+), performance improved predictably as parameters, data, and compute grew.
Unified Frameworks: T5's text-to-text approach unifies all NLP tasks under one paradigm, simplifying training and deployment across diverse applications.
Open Weights Democratize Access: Models like LLaMA provide researchers and developers with powerful architectures they can run locally, accelerating innovation outside major AI labs.

Practical Exercises

Exercise 1: Architecture Selection

For each task below, identify which transformer architecture (BERT/encoder-only, GPT/decoder-only, or T5/encoder-decoder) would be most appropriate and explain why:

Sentiment analysis of product reviews
Writing a story continuation
Translating English to French
Named entity recognition
Summarizing a news article

Solution:

Sentiment analysis: BERT (encoder-only) — bidirectional context captures nuanced sentiment from the full review.
Story continuation: GPT (decoder-only) — autoregressive generation excels at creative text production.
Translation: T5 (encoder-decoder) — sequence-to-sequence design maps input language to output language.
NER: BERT (encoder-only) — token-level classification benefits from bidirectional context.
Summarization: T5 (encoder-decoder) — encoder processes full article, decoder generates condensed version.

Exercise 2: Attention Mask Analysis

Given the sentence "The cat sat on the mat", write out the attention mask for:

A) BERT's bidirectional self-attention (which positions can attend to which?)

B) GPT's causal self-attention (which positions can attend to which?)

Solution:

# Tokens: [The] [cat] [sat] [on] [the] [mat]
# Position:  0     1     2     3     4     5

# A) BERT bidirectional attention - all positions can attend to all others:
# Each token sees: [The, cat, sat, on, the, mat] (full visibility)

# B) GPT causal attention - each position can attend only to itself and earlier positions:
# [The]  -> can see: [The]
# [cat]  -> can see: [The, cat]
# [sat]  -> can see: [The, cat, sat]
# [on]   -> can see: [The, cat, sat, on]
# [the]  -> can see: [The, cat, sat, on, the]
# [mat]  -> can see: [The, cat, sat, on, the, mat]

This causal masking is why GPT can generate text token-by-token without "cheating" by looking ahead.
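
The two visibility patterns in the solution can also be written as 0/1 matrices, which is closer to how masks are passed to an attention layer. This is a minimal sketch in plain Python; attention_mask is a hypothetical helper, and the convention that 1 means "may attend" is an assumption (some libraries use the opposite convention or additive masks).

```python
def attention_mask(n, causal):
    """Return an n x n matrix where mask[i][j] == 1 if position i may attend to j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]

tokens = ["The", "cat", "sat", "on", "the", "mat"]

for label, causal in (("BERT (bidirectional)", False), ("GPT (causal)", True)):
    print(label)
    for tok, row in zip(tokens, attention_mask(len(tokens), causal)):
        print(f"  {tok:>4} {row}")
```

The bidirectional mask is all ones, while the causal mask is lower-triangular: row i has ones only in columns 0 through i.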

Quick Quiz

Test your understanding of transformer variants with these multiple-choice questions.

Q1: Which transformer architecture uses bidirectional self-attention during pre-training?

A) GPT (decoder-only)
B) BERT (encoder-only)
C) Both A and B
D) Neither

✓ Answer: B) BERT uses bidirectional attention, while GPT uses causal (left-to-right) attention.

Q2: What is the primary training objective for GPT models?

A) Masked Language Modeling (MLM)
B) Next Sentence Prediction (NSP)
C) Autoregressive next-token prediction
D) Denoising autoencoding

✓ Answer: C) GPT predicts the next token given all previous tokens (autoregressive).

Q3: T5's encoder-decoder architecture is best suited for which type of task?

A) Text classification only
B) Sequence-to-sequence tasks like translation
C) Image generation
D) Token-level classification only

✓ Answer: B) Encoder-decoder excels at sequence-to-sequence tasks where input and output lengths differ.

Q4: Which model introduced the concept of framing ALL NLP tasks as text-to-text?

A) BERT
B) GPT-3
C) T5
D) LLaMA

✓ Answer: C) T5 unified all tasks under a single text-to-text framework.

Q5: For a task requiring deep bidirectional understanding of context (like sentiment analysis), which architecture is preferred?

A) Decoder-only (GPT-style)
B) Encoder-only (BERT-style)
C) Encoder-decoder (T5-style)
D) All are equally suitable

✓ Answer: B) Encoder-only with bidirectional attention captures full context for understanding tasks.