🚧 Lesson 6 of 10 in Level 03

Encoder vs Decoder

The three transformer variants: encoder-only, decoder-only, and encoder-decoder architectures.

Three Variants

The original "Attention Is All You Need" paper introduced an encoder-decoder architecture. Since then, two other variants have become popular:

Encoder-Only

Bidirectional attention. Can see entire input at once.

Examples: BERT, RoBERTa

Best for: Understanding tasks (classification, NER)

Decoder-Only

Causal (left-to-right) attention. Autoregressive generation.

Examples: GPT, LLaMA, Claude

Best for: Generation (text completion, chat)

Encoder-Decoder

Separate encoder and decoder. Cross-attention between them.

Examples: T5, BART, original Transformer

Best for: Translation, summarization

Encoder-Only (BERT-style)

Key Characteristic: Bidirectional self-attention. Every token can attend to every other token.
# Encoder block
for each token:
    attend to ALL tokens (past and future)
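To make "every token can attend to every other token" concrete, here is a minimal sketch in plain Python. The score matrix is made up for illustration, and `bidirectional_attention_weights` is a hypothetical helper name, not part of any library. With no mask applied, every row of the softmax output puts nonzero weight on every position, including later ones.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bidirectional_attention_weights(scores):
    """Row i holds token i's attention weights over ALL positions (no mask)."""
    return [softmax(row) for row in scores]

# Hypothetical raw attention scores for a 3-token input (row = query, column = key).
scores = [
    [1.0, 0.5, 0.2],
    [0.3, 1.2, 0.4],
    [0.1, 0.6, 0.9],
]

weights = bidirectional_attention_weights(scores)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Token 0 already assigns weight to tokens 1 and 2, which is exactly what a causal model is not allowed to do.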

Use Cases

Training

Typically trained with Masked Language Modeling (MLM):

Input: "The [MASK] sat on the mat"
Target: "cat"
Mask ~15% of tokens, predict them from context
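The masking step can be sketched as follows. This is a simplified illustration, not BERT's exact recipe: BERT replaces only about 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged. The function name `mask_tokens` and its parameters are assumptions made for this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability mask_prob.

    Returns the masked sequence and a dict {position: original token}
    holding the prediction targets.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets

tokens = "The cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(targets)
```

The loss is computed only at the positions stored in `targets`; unmasked positions serve purely as context.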

Decoder-Only (GPT-style)

Key Characteristic: Causal (autoregressive) attention. Each token attends only to itself and earlier tokens.
# Decoder block with causal mask
for token i:
    attend only to tokens 0...i (not future)
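A causal mask can be sketched in a few lines of plain Python. The helper names `causal_mask` and `masked_softmax` and the score values are hypothetical, chosen for illustration. Position `i` may attend to positions `0..i`, and every later position receives exactly zero weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(row, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    m = max(s for s, ok in zip(row, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [
    [1.0, 0.5, 0.2],
    [0.3, 1.2, 0.4],
    [0.1, 0.6, 0.9],
]

mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

The weight matrix is lower-triangular: the first token attends only to itself, and the last token sees the whole prefix.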

Why Causal?

For generation, we can't "see" the future tokens we haven't generated yet!

Generating "The cat sat":
Step 1: Input "The" → Output "cat"
Step 2: Input "The cat" → Output "sat"
Step 3: Input "The cat sat" → Output "."
At each step, the model can only see what came before.
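The step-by-step process above can be written as a loop. In this sketch, `toy_next_token` is a hypothetical stand-in for a real model: it looks up the next word in a hard-coded table instead of computing probabilities. The loop structure (predict, append, repeat) is the part that carries over to real decoders.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for a model: maps a prefix to its next token."""
    table = {
        ("The",): "cat",
        ("The", "cat"): "sat",
        ("The", "cat", "sat"): ".",
    }
    return table.get(tuple(prefix), "<eos>")

def generate(prompt, max_new_tokens=5, stop="<eos>"):
    """Autoregressive loop: each new token is predicted from the prefix so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == stop:
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

print(generate(["The"]))  # ['The', 'cat', 'sat', '.']
```

Each prediction depends only on tokens already in the list, which is why the attention pattern used in training has to be causal as well.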

Use Cases

Training

Trained with Causal Language Modeling (next token prediction):

Sequence: "The cat sat on"
Input: "The cat sat"
Target: "cat sat on" (the input shifted by one token)
Each position predicts the next token given all previous tokens.
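The shift-by-one relationship between inputs and targets is easy to see in code. This is a minimal sketch on word-level tokens; `make_clm_pairs` is a hypothetical helper name, and real pipelines operate on token IDs rather than strings.

```python
def make_clm_pairs(tokens):
    """Split one sequence into inputs and next-token targets (shifted by one)."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets

tokens = ["The", "cat", "sat", "on"]
inputs, targets = make_clm_pairs(tokens)

# With a causal mask, position i sees inputs[0..i] and must predict targets[i].
for i, target in enumerate(targets):
    print(inputs[: i + 1], "->", target)
```

A single sequence therefore yields one training signal per position, all computed in one forward pass.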

Encoder-Decoder (T5-style)

Two separate transformers with cross-attention:

# Encoder: processes input
encoder_output = encoder(input_tokens)

# Decoder: generates output, attending to encoder
decoder_output = decoder(output_tokens, encoder_output)

Cross-Attention

The decoder has an extra attention layer that attends to encoder outputs:

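# Self-attention (causal): queries, keys, and values all come from the decoder
q = decoder_state @ W_q_self
k = decoder_state @ W_k_self
v = decoder_state @ W_v_self

# Cross-attention (to encoder): queries come from the decoder,
# keys and values come from the encoder. It uses its own weight matrices.
q = decoder_state @ W_q_cross
k = encoder_output @ W_k_cross
v = encoder_output @ W_v_cross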
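The difference between the two attention types comes down to where the keys and values originate. The following single-head sketch in plain Python shows this with one shared `attention` function. All names, shapes, and values are assumptions for illustration, and identity matrices stand in for learned projection weights.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query_source, key_value_source, W_q, W_k, W_v, causal=False):
    """Scaled dot-product attention. Queries come from query_source,
    keys and values from key_value_source."""
    q = matmul(query_source, W_q)
    k = matmul(key_value_source, W_k)
    v = matmul(key_value_source, W_v)
    scale = math.sqrt(len(k[0]))
    weights = []
    for i, row in enumerate(matmul(q, transpose(k))):
        scaled = [s / scale for s in row]
        if causal:  # hide positions to the right of the query
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]          # placeholder for learned weights
decoder_state = [[1.0, 0.0], [0.0, 1.0]]     # 2 target tokens, dimension 2
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens

# Self-attention: decoder attends to itself, causally.
self_out, self_w = attention(decoder_state, decoder_state,
                             identity, identity, identity, causal=True)
# Cross-attention: decoder queries, encoder keys and values, no causal mask.
cross_out, cross_w = attention(decoder_state, encoder_output,
                               identity, identity, identity)

print(len(self_w), "x", len(self_w[0]))    # 2 x 2
print(len(cross_w), "x", len(cross_w[0]))  # 2 x 3
```

The cross-attention weight matrix has one row per decoder token and one column per encoder token, so every output position can draw on the entire input sequence.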

Use Cases

Comparison

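Feature | Encoder-only | Decoder-only | Encoder-decoder
Attention | Bidirectional | Causal | Both (bidirectional encoder, causal decoder)
Can generate? | No | Yes | Yes
Training objective | MLM | CLM | Seq2seq
Example size (parameters) | ~110M (BERT-base) | 175B (GPT-3) | 11B (largest T5)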

Why Decoder-Only Dominates

Most modern LLMs (GPT, Claude, LLaMA) are decoder-only. Why?

Trade-off: Decoder-only models never see tokens to the right of the current position, so they lack the bidirectional context that encoders get by design. In practice, they are large enough that they learn to compensate.

Exercises

Exercise 1: Architecture Choice

For each task, which architecture is best?
a) Sentiment classification
b) Machine translation
c) Code completion
d) Named entity recognition

Exercise 2: Causal Masking

Why can't decoder-only models use bidirectional attention during training?

Exercise 3: Cross-Attention

What does cross-attention allow the decoder to do that self-attention doesn't?

📝 Knowledge Check

Question 1: Which architecture uses bidirectional attention?

  • A) Decoder-only models like GPT
  • B) Encoder-only models like BERT ✓
  • C) Neither uses bidirectional attention
  • D) Both use bidirectional attention equally

Explanation: Encoder-only models like BERT use bidirectional self-attention, meaning every token can attend to every other token in the input. This is why they're excellent for understanding tasks like classification and NER.

Question 2: Why do decoder-only models use causal (left-to-right) attention?

  • A) To reduce memory usage
  • B) To make training faster
  • C) They can't "see" future tokens they haven't generated yet ✓
  • D) Causal attention is more mathematically elegant

Explanation: During text generation, the model generates tokens one at a time. At each step, it can only attend to previously generated tokens because future tokens don't exist yet. Causal masking enforces this constraint.

Question 3: What is the purpose of cross-attention in encoder-decoder models?

  • A) To reduce the number of parameters
  • B) To allow the decoder to attend to encoder outputs ✓
  • C) To enable bidirectional attention in the decoder
  • D) To speed up inference

Explanation: Cross-attention layers allow the decoder to "look at" the encoder's representation of the input while generating output. This is crucial for tasks like translation where the output must be conditioned on the entire input sequence.

Question 4: Which training objective is used for encoder-only models like BERT?

  • A) Causal Language Modeling (CLM)
  • B) Sequence-to-sequence (Seq2seq)
  • C) Masked Language Modeling (MLM) ✓
  • D) Reinforcement Learning from Human Feedback (RLHF)

Explanation: BERT and similar encoder-only models are trained with Masked Language Modeling (MLM), where random tokens are masked and the model must predict them from surrounding context.