🚧 Lesson 9 of 10 in Level 03

Transformer Variants

BERT, GPT, T5, and other architectures. Design choices explained.

BERT (2018)

Encoder-only, trained with MLM:
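As a rough illustration of the MLM objective, the sketch below corrupts a token list following the recipe described for the original BERT: about 15% of positions are selected for prediction, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the fixed seed are assumptions made for this example and do not come from any library.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one masked-language-modeling step.

    labels[i] holds the original token at positions the model must predict
    and None at every other position.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = "the cat sat on the mat".split()
toy_vocab = ["dog", "ran", "under", "a", "rug"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(labels)
```

Because the model sees the tokens on both sides of each masked position, it can use the full sentence to recover the original token.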

GPT Series

Decoder-only, autoregressive:
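A minimal sketch of what "autoregressive" means in practice: during training every prefix of the sequence is paired with the token that follows it, and during generation the model appends one token at a time using only what it has produced so far. The helper names and `toy_next_token` (a placeholder for a trained model) are hypothetical and exist only for this illustration.

```python
def next_token_pairs(tokens):
    """Return (context, target) training pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def generate(prompt, next_token, max_new_tokens=3):
    """Append one token at a time; each step sees only the tokens so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def toy_next_token(tokens):
    # Hypothetical stand-in for a trained model: emits a placeholder token
    # that records how many context tokens it was given.
    return f"<tok{len(tokens)}>"


for context, target in next_token_pairs(["The", "cat", "sat"]):
    print(context, "->", target)

print(generate(["The", "cat"], toy_next_token))
```

A real model would replace `toy_next_token` with a sampled or greedy choice from its predicted distribution over the vocabulary.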

T5

Encoder-decoder, text-to-text:

# Every task is text-to-text
Translation: "translate English to German: The cat"
Classification: "classify: This movie was great"
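The text-to-text idea can be sketched as a small formatting step: each task is turned into an input string with a task prefix, and the label is itself a string. The prefixes below mirror the lesson's examples; the `PREFIXES` table, the function name, and the target strings are illustrative assumptions, not the exact prefixes or data used to train T5.

```python
# Hypothetical prefix table based on the lesson's two examples.
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "classify": "classify: ",
}


def to_text_to_text(task, text, target):
    """Format one example as an (input_string, target_string) pair."""
    return PREFIXES[task] + text, target


examples = [
    to_text_to_text("translate_en_de", "The cat", "Die Katze"),
    to_text_to_text("classify", "This movie was great", "positive"),
]

for source, target in examples:
    print(repr(source), "->", repr(target))
```

Since both inputs and outputs are plain strings, one model and one training loss can cover every task in the table.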

Modern Variants

Knowledge Check

Q1: What type of architecture does BERT use, and what is its primary training objective?

A: Encoder-only architecture. Its primary objective is Masked Language Modeling (MLM); the original BERT also used Next Sentence Prediction (NSP) as a secondary objective.

Q2: Why can't GPT models use bidirectional attention during pre-training?

A: GPT is autoregressive (decoder-only) and is trained to predict each token from the tokens to its left. Bidirectional attention would let the model "see" the very tokens it is asked to predict, which makes the training objective trivial and breaks the causal structure needed for generation.

Q3: What makes T5's approach to NLP tasks unique compared to BERT or GPT?

A: T5 frames every task as text-to-text using an encoder-decoder architecture, allowing unified training across translation, classification, QA, and more with the same objective.

Q4: Which architecture would you choose for: (a) text classification, (b) text generation, (c) machine translation?

A: (a) BERT/encoder-only for classification, (b) GPT/decoder-only for generation, (c) T5/encoder-decoder for translation.

Q5: What is the key advantage of models like LLaMA in the open-source ecosystem?

A: LLaMA provides openly available weights and an efficient architecture, enabling researchers and developers to run and fine-tune powerful models locally without API dependencies.

Q6: Why does T5 use an encoder-decoder architecture instead of encoder-only or decoder-only?

A: The encoder-decoder design allows T5 to process input sequences bidirectionally (encoder) while generating output sequences autoregressively (decoder), making it ideal for sequence-to-sequence tasks like translation and summarization.
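To make the split concrete, here is a minimal sketch of the three attention patterns in a standard encoder-decoder transformer, written as 0/1 visibility masks (1 means "may attend"). The function names are assumptions for this example, and real implementations usually build these masks as tensors and also account for padding.

```python
def full_mask(rows, cols):
    """Every query position may attend to every key position."""
    return [[1] * cols for _ in range(rows)]


def causal_mask(n):
    """Position i may attend to positions 0..i (itself and earlier)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def encoder_decoder_masks(src_len, tgt_len):
    return {
        # Encoder self-attention: bidirectional over the input.
        "encoder_self": full_mask(src_len, src_len),
        # Decoder self-attention: causal over the output generated so far.
        "decoder_self": causal_mask(tgt_len),
        # Cross-attention: each output position sees the whole encoded input.
        "cross": full_mask(tgt_len, src_len),
    }


masks = encoder_decoder_masks(src_len=4, tgt_len=3)
for name, mask in masks.items():
    print(name)
    for row in mask:
        print("  ", row)
```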

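Q7: How does the parameter count progression in GPT models (117M → 1.5B → 175B → a reported but unconfirmed 1T+) relate to scaling laws?

A: Each jump (roughly 13× from GPT-1 to GPT-2, then roughly 117× from GPT-2 to GPT-3) brought better performance, consistent with scaling laws in which loss falls smoothly as parameters, data, and compute grow together. Some abilities, such as GPT-3's few-shot learning, only became apparent at larger scale. Parameter counts for later models have not been officially published, so the 1T+ figure should be treated as an estimate.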
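The size of each jump is easy to check with a little arithmetic. The sketch below uses only the three published parameter counts (117M, 1.5B, 175B) and leaves out the 1T+ figure, which has not been officially confirmed; the variable and function names are assumptions for this example.

```python
import math

# Published parameter counts for the first three GPT generations.
SIZES = [("GPT-1", 117e6), ("GPT-2", 1.5e9), ("GPT-3", 175e9)]


def growth_factors(sizes):
    """Return (smaller, larger, ratio, orders_of_magnitude) per generation."""
    steps = []
    for (name_a, n_a), (name_b, n_b) in zip(sizes, sizes[1:]):
        ratio = n_b / n_a
        steps.append((name_a, name_b, ratio, math.log10(ratio)))
    return steps


for name_a, name_b, ratio, orders in growth_factors(SIZES):
    print(f"{name_a} -> {name_b}: {ratio:.1f}x ({orders:.2f} orders of magnitude)")
```

The jumps are about 12.8× and 116.7×, so the second step alone spans roughly two orders of magnitude rather than one.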

Key Takeaways

  • Architecture Matters: Encoder-only (BERT) excels at understanding tasks like classification; decoder-only (GPT) dominates generation; encoder-decoder (T5) handles sequence-to-sequence tasks like translation.
  • Training Objectives Shape Capabilities: MLM enables bidirectional understanding but is not directly suited to generation; autoregressive training enables generation but limits each token's context to the tokens on its left.
  • Scale Brings Emergence: Across GPT generations (117M → 1.5B → 175B, with an unconfirmed 1T+ reported for later models), performance improved as parameters, data, and compute grew together.
  • Unified Frameworks: T5's text-to-text approach unifies all NLP tasks under one paradigm, simplifying training and deployment across diverse applications.
  • Open Weights Democratize Access: Models like LLaMA provide researchers and developers with powerful architectures they can run locally, accelerating innovation outside major AI labs.

Practical Exercises

Exercise 1: Architecture Selection

For each task below, identify which transformer architecture (BERT/encoder-only, GPT/decoder-only, or T5/encoder-decoder) would be most appropriate and explain why:

  • Sentiment analysis of product reviews
  • Writing a story continuation
  • Translating English to French
  • Named entity recognition
  • Summarizing a news article

Solution:

  • Sentiment analysis: BERT (encoder-only) — bidirectional context captures nuanced sentiment from the full review.
  • Story continuation: GPT (decoder-only) — autoregressive generation excels at creative text production.
  • Translation: T5 (encoder-decoder) — sequence-to-sequence design maps input language to output language.
  • NER: BERT (encoder-only) — token-level classification benefits from bidirectional context.
  • Summarization: T5 (encoder-decoder) — encoder processes full article, decoder generates condensed version.

Exercise 2: Attention Mask Analysis

Given the sentence "The cat sat on the mat", write out the attention mask for:

A) BERT's bidirectional self-attention (which positions can attend to which?)

B) GPT's causal self-attention (which positions can attend to which?)

Solution:

# Tokens:   [The] [cat] [sat] [on] [the] [mat]
# Position:   0     1     2    3     4     5

# A) BERT bidirectional attention - all positions can attend to all others:
# Each token sees: [The, cat, sat, on, the, mat] (full visibility)

# B) GPT causal attention - each position can attend only to itself and earlier positions:
# [The] -> can see: [The]
# [cat] -> can see: [The, cat]
# [sat] -> can see: [The, cat, sat]
# [on]  -> can see: [The, cat, sat, on]
# [the] -> can see: [The, cat, sat, on, the]
# [mat] -> can see: [The, cat, sat, on, the, mat]

This causal masking is what lets GPT train on next-token prediction without "cheating" by looking ahead, and it matches how the model generates text token by token at inference time.
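The two masks from this exercise can also be built programmatically. The following is a minimal pure-Python sketch that uses 1 for "may attend" and 0 for "blocked"; the function names are assumptions for this example, and production code would normally build the same pattern as a tensor.

```python
def bidirectional_mask(n):
    """BERT-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """GPT-style: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def visible_tokens(tokens, mask):
    """For each query token, list the tokens its mask row allows it to see."""
    return [[t for t, m in zip(tokens, row) if m] for row in mask]


tokens = ["The", "cat", "sat", "on", "the", "mat"]

print("Causal mask:")
for row in causal_mask(len(tokens)):
    print(row)

print("Visible tokens under the causal mask:")
for tok, seen in zip(tokens, visible_tokens(tokens, causal_mask(len(tokens)))):
    print(f"{tok:>4} -> {seen}")
```

The causal mask is lower-triangular, which is the matrix form of the per-token lists written out in the solution.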

Quick Quiz

Test your understanding of transformer variants with these multiple-choice questions.

Q1: Which transformer architecture uses bidirectional self-attention during pre-training?

  • A) GPT (decoder-only)
  • B) BERT (encoder-only)
  • C) Both A and B
  • D) Neither

✓ Answer: B) BERT uses bidirectional attention, while GPT uses causal (left-to-right) attention.

Q2: What is the primary training objective for GPT models?

  • A) Masked Language Modeling (MLM)
  • B) Next Sentence Prediction (NSP)
  • C) Autoregressive next-token prediction
  • D) Denoising autoencoding

✓ Answer: C) GPT predicts the next token given all previous tokens (autoregressive).

Q3: T5's encoder-decoder architecture is best suited for which type of task?

  • A) Text classification only
  • B) Sequence-to-sequence tasks like translation
  • C) Image generation
  • D) Token-level classification only

✓ Answer: B) Encoder-decoder excels at sequence-to-sequence tasks where input and output lengths differ.

Q4: Which model introduced the concept of framing ALL NLP tasks as text-to-text?

  • A) BERT
  • B) GPT-3
  • C) T5
  • D) LLaMA

✓ Answer: C) T5 unified all tasks under a single text-to-text framework.

Q5: For a task requiring deep bidirectional understanding of context (like sentiment analysis), which architecture is preferred?

  • A) Decoder-only (GPT-style)
  • B) Encoder-only (BERT-style)
  • C) Encoder-decoder (T5-style)
  • D) All are equally suitable

✓ Answer: B) Encoder-only with bidirectional attention captures full context for understanding tasks.