Three Variants
The original "Attention Is All You Need" paper introduced an encoder-decoder architecture. Since then, two other variants have become popular:
Encoder-Only
Bidirectional attention. Can see entire input at once.
Examples: BERT, RoBERTa
Best for: Understanding tasks (classification, NER)
Decoder-Only
Causal (left-to-right) attention. Autoregressive generation.
Examples: GPT, LLaMA, Claude
Best for: Generation (text completion, chat)
Encoder-Decoder
Separate encoder and decoder. Cross-attention between them.
Examples: T5, BART, original Transformer
Best for: Translation, summarization
Encoder-Only (BERT-style)
Use Cases
- Text classification (sentiment, topic)
- Named Entity Recognition (NER)
- Question answering (extractive)
- Sentence embeddings
Training
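Typically trained with Masked Language Modeling (MLM): a fraction of the input tokens (about 15% in BERT) is hidden, and the model predicts the hidden tokens from the context on both sides.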
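The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's exact recipe: it always substitutes a `[MASK]` token, whereas BERT sometimes keeps the original token or swaps in a random one. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so the loss is only computed where a token was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must recover it
        else:
            masked.append(tok)
            labels.append(None)   # not a prediction target
    return masked, labels

tokens = "the cat sat on the mat".split()
# A higher mask_prob than usual, just so a short sentence is likely to get masks.
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because attention is bidirectional, the prediction for a masked position can draw on tokens both before and after it.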
Decoder-Only (GPT-style)
Why Causal?
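At generation time, future tokens don't exist yet, so each position can only attend to itself and earlier positions. Training applies the same restriction with a causal mask; otherwise the model could simply read the token it is asked to predict.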
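A causal mask is just a lower-triangular pattern. The sketch below builds one as nested lists; the helper name `causal_mask` is hypothetical, and real implementations usually build the mask as a tensor and apply it to the attention scores before the softmax.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Rows are query positions, columns are key positions.
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
# x . . .
# x x . .
# x x x .
# x x x x
```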
Use Cases
- Text generation/completion
- Chatbots and conversational AI
- Code generation
- Creative writing
Training
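Trained with Causal Language Modeling (next-token prediction): at every position, the model predicts the token that follows, using only the tokens up to that position.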
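Next-token prediction amounts to shifting the sequence by one position and scoring the model's probability for each true next token. The following is a minimal sketch with hypothetical helper names (`next_token_pairs`, `cross_entropy`); the "model" here is a stand-in that assigns uniform probabilities.

```python
import math

def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same tokens shifted left by one."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(probs_per_step, targets):
    """Average negative log-probability assigned to each correct next token."""
    return -sum(math.log(p[t]) for p, t in zip(probs_per_step, targets)) / len(targets)

token_ids = [0, 1, 2, 3]          # e.g. "the cat sat down" as vocabulary indices
inputs, targets = next_token_pairs(token_ids)
print(inputs, targets)            # [0, 1, 2] [1, 2, 3]

# Stand-in for model output: a uniform distribution over a 4-token vocabulary.
uniform = [[0.25] * 4 for _ in inputs]
loss = cross_entropy(uniform, targets)
print(loss)                       # equals log(4) for a uniform guess
```

A single sequence yields one training signal per position, which is part of why this objective is efficient to train.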
Encoder-Decoder (T5-style)
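Two separate transformer stacks, an encoder and a decoder, connected by cross-attention. The encoder reads the full input bidirectionally; the decoder generates the output autoregressively.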
Cross-Attention
The decoder has an extra attention layer that attends to encoder outputs: queries come from the decoder, while keys and values come from the encoder.
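The sketch below shows the shape of that computation in plain Python. It is deliberately simplified: a single head, and no learned query/key/value projections, so the decoder states are used directly as queries and the encoder outputs as both keys and values. The function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Scaled dot-product attention with queries from the decoder
    and keys/values from the encoder (projections omitted)."""
    d = len(encoder_outputs[0])
    contexts = []
    for q in decoder_states:
        # One score per source position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in encoder_outputs]
        weights = softmax(scores)
        # Weighted average of encoder outputs.
        contexts.append([sum(w * v[c] for w, v in zip(weights, encoder_outputs))
                         for c in range(d)])
    return contexts

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 source positions
decoder_states = [[1.0, 0.0], [0.0, 1.0]]                # 2 target positions
context = cross_attention(decoder_states, encoder_outputs)
print(len(context), len(context[0]))                     # 2 2
```

Note that no causal mask is applied here: every decoder position may look at every source position, since the whole input is already known.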
Use Cases
- Machine translation
- Summarization
- Question answering (generative)
- Text-to-text tasks
Comparison
| Feature | Encoder-only | Decoder-only | Encoder-decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal | Bidirectional (encoder), causal + cross-attention (decoder) |
| Can generate? | Not natively | Yes | Yes |
| Training | MLM | CLM | Seq2seq |
| Example size | ~110M (BERT-base) | 175B (GPT-3) | up to 11B (T5) |
Why Decoder-Only Dominates
Most modern LLMs (GPT, Claude, LLaMA) are decoder-only. Why?
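- Simplicity: A single stack is easier to implement and scale
- Efficiency: KV caching reuses the keys and values of earlier tokens, so each generation step only processes the new token
- Universality: Can do most tasks via prompting
- Scaling: Empirically, decoder-only models have performed well at large scale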
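The KV caching point in the list above can be illustrated with a toy example. This is a sketch under simplifying assumptions: one head, no learned projections (each token vector serves as its own query, key, and value), and a hypothetical `KVCache` class. It checks that attending step by step with a cache gives the same result as recomputing causal attention from scratch.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for a single query."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    d = len(values[0])
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(d)]

class KVCache:
    """Stores keys and values of past tokens so they are computed only once."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy token vectors

# Incremental generation: one new token per step, past keys/values reused.
cache = KVCache()
cached = [cache.step(x, x, x) for x in xs]

# Reference: recompute attention over the full prefix at every step.
full = [attend(xs[i], xs[:i + 1], xs[:i + 1]) for i in range(len(xs))]

same = all(abs(a - b) < 1e-12 for ca, fa in zip(cached, full) for a, b in zip(ca, fa))
print(same)   # True
```

The cache works because causal attention never changes what earlier positions see, so their keys and values stay valid as the sequence grows.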
Exercises
Exercise 1: Architecture Choice
For each task, which architecture is best?
a) Sentiment classification
b) Machine translation
c) Code completion
d) Named entity recognition
Exercise 2: Causal Masking
Why can't decoder-only models use bidirectional attention during training?
Exercise 3: Cross-Attention
What does cross-attention allow the decoder to do that self-attention doesn't?
📝 Knowledge Check
Question 1: Which architecture uses bidirectional attention?
- A) Decoder-only models like GPT
- B) Encoder-only models like BERT ✓
- C) Neither uses bidirectional attention
- D) Both use bidirectional attention equally
Explanation: Encoder-only models like BERT use bidirectional self-attention, meaning every token can attend to every other token in the input. This is why they're excellent for understanding tasks like classification and NER.
Question 2: Why do decoder-only models use causal (left-to-right) attention?
- A) To reduce memory usage
- B) To make training faster
- C) They can't "see" future tokens they haven't generated yet ✓
- D) Causal attention is more mathematically elegant
Explanation: During text generation, the model produces tokens one at a time, so at each step it can only attend to previously generated tokens; future tokens don't exist yet. Causal masking enforces the same constraint during training, so the model cannot look ahead at the tokens it is learning to predict.
Question 3: What is the purpose of cross-attention in encoder-decoder models?
- A) To reduce the number of parameters
- B) To allow the decoder to attend to encoder outputs ✓
- C) To enable bidirectional attention in the decoder
- D) To speed up inference
Explanation: Cross-attention layers allow the decoder to "look at" the encoder's representation of the input while generating output. This is crucial for tasks like translation where the output must be conditioned on the entire input sequence.
Question 4: Which training objective is used for encoder-only models like BERT?
- A) Causal Language Modeling (CLM)
- B) Sequence-to-sequence (Seq2seq)
- C) Masked Language Modeling (MLM) ✓
- D) Reinforcement Learning from Human Feedback (RLHF)
Explanation: BERT and similar encoder-only models are trained with Masked Language Modeling (MLM), where random tokens are masked and the model must predict them from surrounding context.