BERT (2018)
Encoder-only, trained with MLM:
- Bidirectional context
- [MASK] token prediction
- Next sentence prediction (NSP)
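The masking step behind MLM can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the `vocab` argument, and the whitespace tokenization are assumptions made for the example. The 15% selection rate and the 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) follow the recipe described for the original BERT.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) in the style of BERT's MLM.

    labels[i] is the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)    # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)     # 10%: keep the original token
        else:
            labels.append(None)           # not part of the loss
            corrupted.append(tok)
    return corrupted, labels

if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(mask_tokens(sentence, vocab=sentence, mask_prob=0.5))
```

Because the encoder sees the whole corrupted sequence at once, each prediction can use context from both sides of the masked position.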
GPT Series
Decoder-only, autoregressive:
- GPT-1: 117M params
- GPT-2: 1.5B params
- GPT-3: 175B params
- GPT-4: Parameter count undisclosed; unofficial estimates put it above 1T
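"Autoregressive" means the model produces one token at a time, each conditioned on what came before. The loop below is a toy sketch of greedy decoding under a strong simplifying assumption: the hypothetical `BIGRAM_SCORES` table conditions only on the previous token, whereas a real GPT conditions on the entire prefix.

```python
# Hypothetical toy "model": next-token scores given only the previous token.
BIGRAM_SCORES = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def greedy_generate(scores, start="<s>", end="</s>", max_len=10):
    """Generate left to right, always picking the highest-scoring next token."""
    out = [start]
    while len(out) < max_len:
        candidates = scores[out[-1]]          # look only at what is already generated
        nxt = max(candidates, key=candidates.get)
        out.append(nxt)
        if nxt == end:                        # stop at the end-of-sequence marker
            break
    return out

if __name__ == "__main__":
    print(greedy_generate(BIGRAM_SCORES))
```

The loop never consults a token it has not yet produced, which is the property causal attention enforces during training.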
T5
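Encoder-decoder, text-to-text:
- Every task cast as text in, text out (translation, classification, QA)
- Bidirectional encoder, autoregressive decoder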
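A short sketch of what "text-to-text" looks like in practice: the task is named in the input string itself, and the answer is always a string. The prefixes below follow the style of examples published with T5, but treat the exact strings, the dictionary keys, and the helper name `to_text_to_text` as illustrative assumptions.

```python
# Task prefixes in the style used by T5; the exact strings are illustrative.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "acceptability": "cola sentence: ",
}

def to_text_to_text(task, text):
    """Turn a (task, input) pair into one input string for a text-to-text model."""
    return TASK_PREFIXES[task] + text

if __name__ == "__main__":
    # Even a classification task gets a text target, e.g. "acceptable".
    print(to_text_to_text("translate_en_de", "That is good."))
    print(to_text_to_text("acceptability", "The course is jumping well."))
```

Since inputs and targets are both plain strings, one model, one loss, and one decoding procedure can serve every task.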
Modern Variants
- LLaMA: Open weights, efficient architecture
- PaLM: Google's 540B-parameter decoder-only model
- Claude: Constitutional AI training
Knowledge Check
Q1: What type of architecture does BERT use, and what is its primary training objective?
A: Encoder-only architecture. The primary objective is Masked Language Modeling (MLM), with Next Sentence Prediction (NSP) as a secondary objective.
Q2: Why can't GPT models use bidirectional attention during pre-training?
A: GPT is autoregressive (decoder-only) and predicts tokens left-to-right. Bidirectional attention would allow the model to "see" future tokens, breaking the causal structure needed for generation.
Q3: What makes T5's approach to NLP tasks unique compared to BERT or GPT?
A: T5 frames every task as text-to-text using an encoder-decoder architecture, allowing unified training across translation, classification, QA, and more with the same objective.
Q4: Which architecture would you choose for: (a) text classification, (b) text generation, (c) machine translation?
A: (a) BERT/encoder-only for classification, (b) GPT/decoder-only for generation, (c) T5/encoder-decoder for translation.
Q5: What is the key advantage of models like LLaMA in the open-source ecosystem?
A: LLaMA provides open weights with efficient architecture, enabling researchers and developers to run and fine-tune powerful models locally without API dependencies.
Q6: Why does T5 use an encoder-decoder architecture instead of encoder-only or decoder-only?
A: The encoder-decoder design allows T5 to process input sequences bidirectionally (encoder) while generating output sequences autoregressively (decoder), making it ideal for sequence-to-sequence tasks like translation and summarization.
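Q7: How does the parameter count progression in GPT models (117M → 1.5B → 175B, with GPT-4 unofficially estimated at 1T+) relate to scaling laws?
A: Each jump of roughly one to two orders of magnitude in parameters brought better performance and new capabilities, such as few-shot learning in GPT-3. Scaling laws describe this trend: loss falls smoothly and predictably as parameters, data, and compute grow together. It is the loss that scales predictably; specific emergent capabilities are harder to forecast, and the GPT-4 figure is an unofficial estimate rather than a published number.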
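To make the size of each jump concrete, the arithmetic can be checked directly. This sketch uses only the published parameter counts listed in this section; GPT-4 is left out because its size has not been officially disclosed. The helper name `growth_factors` is made up for the example.

```python
# Published parameter counts; GPT-4 is omitted because its size
# has not been officially disclosed.
PARAMS = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9}

def growth_factors(params):
    """Return the multiplicative growth between consecutive models."""
    names = list(params)
    return {
        f"{a} -> {b}": params[b] / params[a]
        for a, b in zip(names, names[1:])
    }

if __name__ == "__main__":
    for step, factor in growth_factors(PARAMS).items():
        print(f"{step}: about {factor:.1f}x")
```

GPT-1 to GPT-2 is roughly a 13x increase and GPT-2 to GPT-3 roughly 117x, so the steps are closer to one and two orders of magnitude respectively than to a uniform 10x.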
Key Takeaways
- Architecture Matters: Encoder-only (BERT) excels at understanding tasks like classification; decoder-only (GPT) dominates generation; encoder-decoder (T5) handles sequence-to-sequence tasks like translation.
- Training Objectives Shape Capabilities: MLM enables bidirectional understanding but does not directly train the model to generate text; autoregressive training enables generation but limits each token's context to the tokens on its left.
- Scale Brings Emergence: Across GPT generations (117M → 1.5B → 175B, with GPT-4 unofficially estimated at 1T+), performance improved predictably with parameters, data, and compute, and new capabilities appeared along the way.
- Unified Frameworks: T5's text-to-text approach unifies all NLP tasks under one paradigm, simplifying training and deployment across diverse applications.
- Open Weights Democratize Access: Models like LLaMA provide researchers and developers with powerful architectures they can run locally, accelerating innovation outside major AI labs.
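The difference between the two training objectives can also be written down compactly. The notation here is an assumption introduced for this note: x_1, ..., x_T is the token sequence, M is the set of masked positions, x̃ is the corrupted input, and θ denotes the model parameters.

```latex
% Autoregressive (GPT-style): predict every token from the tokens to its left.
\mathcal{L}_{\mathrm{AR}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)

% Masked language modeling (BERT-style): predict only the masked tokens,
% conditioning on the whole corrupted sequence.
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{t \in M} \log p_\theta\left(x_t \mid \tilde{x}\right)
```

The autoregressive loss conditions on x_{<t} only, which is what makes sampling left to right possible; the MLM loss conditions on both sides of each masked position, which is what makes the representation bidirectional.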
Practical Exercises
Exercise 1: Architecture Selection
For each task below, identify which transformer architecture (BERT/encoder-only, GPT/decoder-only, or T5/encoder-decoder) would be most appropriate and explain why:
- Sentiment analysis of product reviews
- Writing a story continuation
- Translating English to French
- Named entity recognition
- Summarizing a news article
Solution:
- Sentiment analysis: BERT (encoder-only) — bidirectional context captures nuanced sentiment from the full review.
- Story continuation: GPT (decoder-only) — autoregressive generation excels at creative text production.
- Translation: T5 (encoder-decoder) — sequence-to-sequence design maps input language to output language.
- NER: BERT (encoder-only) — token-level classification benefits from bidirectional context.
- Summarization: T5 (encoder-decoder) — encoder processes full article, decoder generates condensed version.
Exercise 2: Attention Mask Analysis
Given the sentence "The cat sat on the mat", write out the attention mask for:
A) BERT's bidirectional self-attention (which positions can attend to which?)
B) GPT's causal self-attention (which positions can attend to which?)
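Solution:
Treating each word as one token (ignoring special tokens and subword splitting), the sentence has six positions.
A) BERT: every position can attend to all six positions, including itself and the positions to its right. The mask is a 6×6 matrix of all 1s.
B) GPT: position i can attend only to positions 1 through i. "The" sees only itself, while "mat" sees all six tokens. The mask is lower-triangular.
This causal masking is why GPT can generate text token by token without "cheating" by looking ahead.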
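The two masks asked for in this exercise can be built and printed with plain Python lists. This is a sketch under simplifying assumptions: one token per word, no special tokens such as [CLS] or [SEP], and 1 meaning "may attend" while 0 means "masked". The function names are chosen for the example.

```python
TOKENS = ["The", "cat", "sat", "on", "the", "mat"]

def bidirectional_mask(n):
    """Every position may attend to every position (encoder-style)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Position i may attend to positions 0..i only (decoder-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def show(mask, tokens):
    """Print one row per query token; columns follow the same token order."""
    for tok, row in zip(tokens, mask):
        print(f"{tok:>4}  " + " ".join(str(v) for v in row))

if __name__ == "__main__":
    print("Bidirectional (BERT):")
    show(bidirectional_mask(len(TOKENS)), TOKENS)
    print("Causal (GPT):")
    show(causal_mask(len(TOKENS)), TOKENS)
```

Row i of the causal mask has exactly i + 1 ones (counting from zero), so the first token attends only to itself and the last token attends to the whole sentence.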
Quick Quiz
Test your understanding of transformer variants with these multiple-choice questions.
Q1: Which transformer architecture uses bidirectional self-attention during pre-training?
- A) GPT (decoder-only)
- B) BERT (encoder-only)
- C) Both A and B
- D) Neither
✓ Answer: B) BERT uses bidirectional attention, while GPT uses causal (left-to-right) attention.
Q2: What is the primary training objective for GPT models?
- A) Masked Language Modeling (MLM)
- B) Next Sentence Prediction (NSP)
- C) Autoregressive next-token prediction
- D) Denoising autoencoding
✓ Answer: C) GPT predicts the next token given all previous tokens (autoregressive).
Q3: T5's encoder-decoder architecture is best suited for which type of task?
- A) Text classification only
- B) Sequence-to-sequence tasks like translation
- C) Image generation
- D) Token-level classification only
✓ Answer: B) Encoder-decoder excels at sequence-to-sequence tasks where input and output lengths differ.
Q4: Which model introduced the concept of framing ALL NLP tasks as text-to-text?
- A) BERT
- B) GPT-3
- C) T5
- D) LLaMA
✓ Answer: C) T5 unified all tasks under a single text-to-text framework.
Q5: For a task requiring deep bidirectional understanding of context (like sentiment analysis), which architecture is preferred?
- A) Decoder-only (GPT-style)
- B) Encoder-only (BERT-style)
- C) Encoder-decoder (T5-style)
- D) All are equally suitable
✓ Answer: B) Encoder-only with bidirectional attention captures full context for understanding tasks.