Lesson 6: Encoder vs Decoder

Three Variants

The original "Attention Is All You Need" paper introduced an encoder-decoder architecture. Since then, two other variants have become popular:

Encoder-Only

Bidirectional attention. Can see entire input at once.

Examples: BERT, RoBERTa

Best for: Understanding tasks (classification, NER)

Decoder-Only

Causal (left-to-right) attention. Autoregressive generation.

Examples: GPT, LLaMA, Claude

Best for: Generation (text completion, chat)

Encoder-Decoder

Separate encoder and decoder. Cross-attention between them.

Examples: T5, BART, original Transformer

Best for: Translation, summarization

Encoder-Only (BERT-style)

Key Characteristic: Bidirectional self-attention. Every token can attend to every other token.

# Encoder block
for each token:
    attend to ALL tokens (past and future)
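
The pseudocode above can be made concrete with a minimal NumPy sketch of single-head self-attention with no mask. The function name, the random weights, and the tiny dimensions are assumptions chosen for illustration; this is not BERT's actual implementation, which uses multiple heads, learned weights, and additional layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(x, W_q, W_k, W_v):
    """Single-head self-attention with no mask: every token sees every token."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

# Illustrative random inputs and weights (not trained values).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = bidirectional_attention(x, W_q, W_k, W_v)
```

Because no mask is applied, every entry of `weights` is positive: the first token places some weight on the last token and vice versa.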
      

Use Cases

Text classification (sentiment, topic)
Named Entity Recognition (NER)
Question answering (extractive)
Sentence embeddings

Training

Typically trained with Masked Language Modeling (MLM):

Input:  "The [MASK] sat on the mat"
Target: "cat"

Mask ~15% of tokens, predict them from context
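
A small sketch of the masking step, using only the standard library. The helper `mask_tokens` and its parameters are hypothetical names for illustration. It also simplifies the procedure: BERT replaces only most of the selected tokens with [MASK] (some become random tokens and some are left unchanged), whereas this sketch always uses [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, targets); a target is None where nothing is predicted."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)      # the model must recover the original token
        else:
            masked.append(tok)
            targets.append(None)     # this position is ignored by the loss
    return masked, targets

tokens = "The cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

With only six tokens and a 15% rate, a given run may mask no tokens at all; real training operates on much longer sequences, so the expected fraction is close to the chosen rate.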
      

Decoder-Only (GPT-style)

Key Characteristic: Causal (autoregressive) attention. Each token only attends to previous tokens.

# Decoder block with causal mask
for token i:
    attend only to tokens 0...i (not future)
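
One common way to implement the rule above is to set the scores for future positions to negative infinity before the softmax, so they receive zero weight. The sketch below assumes NumPy and a hypothetical helper name, `causal_attention_weights`; it operates on raw scores only and omits the query, key, and value projections.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then softmax each row."""
    seq_len = scores.shape[0]
    # True above the diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row max is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                       # exp(-inf) is 0 for future positions
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                    # uniform scores make the mask easy to read
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and assigns exactly zero to everything after it.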
      

Why Causal?

For generation, we can't "see" the future tokens we haven't generated yet!

Generating "The cat sat":
Step 1: Input "The" → Output "cat"
Step 2: Input "The cat" → Output "sat"
Step 3: Input "The cat sat" → Output "."

At each step, the model can only see what came before
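
The steps above amount to a loop in which each output is appended to the input for the next step. The sketch below uses a hypothetical lookup table, `toy_next_token`, as a stand-in for a trained model; a real model would return a probability distribution over the vocabulary and sample or pick from it.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a model: picks the next token from the last one."""
    table = {"The": "cat", "cat": "sat", "sat": "."}
    return table.get(context[-1], "<eos>")

def generate(prompt, max_new_tokens=5, eos="<eos>"):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)   # sees only the tokens produced so far
        if nxt == eos:
            break
        tokens.append(nxt)             # the output becomes part of the next input
    return tokens

result = generate(["The"])
print(result)
```

Starting from "The", the loop produces "cat", then "sat", then ".", and stops when the stand-in model returns the end-of-sequence marker.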
      

Use Cases

Text generation/completion
Chatbots and conversational AI
Code generation
Creative writing

Training

Trained with Causal Language Modeling (next token prediction):

Input:  "The cat sat"
Target: "cat sat on" (shifted by 1)

Predict next token given all previous tokens
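
The "shifted by 1" relationship can be shown with plain list slicing. This sketch works on word strings for readability; in practice the same slicing is applied to integer token IDs, and all positions are trained in parallel under the causal mask.

```python
tokens = ["The", "cat", "sat", "on"]

inputs = tokens[:-1]    # everything except the last token
targets = tokens[1:]    # the same sequence shifted left by one

# Each position i is trained to predict targets[i] from inputs[0..i].
pairs = [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]
for context, target in pairs:
    print(context, "->", target)
```

A single sequence of four tokens therefore yields three training examples, one per position.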
      

Encoder-Decoder (T5-style)

Two separate stacks, an encoder and a decoder, connected by cross-attention:

# Encoder: processes input
encoder_output = encoder(input_tokens)

# Decoder: generates output, attending to encoder
decoder_output = decoder(output_tokens, encoder_output)
      

Cross-Attention

The decoder has an extra attention layer that attends to encoder outputs:

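# Self-attention (causal): queries, keys, and values all come from the decoder
q = decoder_state @ W_q_self
k = decoder_state @ W_k_self
v = decoder_state @ W_v_self

# Cross-attention: queries come from the decoder, keys and values from the encoder
# (this layer has its own projection weights, separate from self-attention)
q = decoder_state @ W_q_cross
k = encoder_output @ W_k_cross
v = encoder_output @ W_v_cross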
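
A minimal single-head sketch of the cross-attention step, assuming NumPy. The function name, random weights, and sequence lengths are illustrative assumptions. The weights passed in belong to the cross-attention layer only; the decoder's self-attention layer would have its own separate set.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_state, encoder_output, W_q, W_k, W_v):
    """Queries come from the decoder; keys and values come from the encoder."""
    q = decoder_state @ W_q                    # (tgt_len, d_model)
    k = encoder_output @ W_k                   # (src_len, d_model)
    v = encoder_output @ W_v                   # (src_len, d_model)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)         # no causal mask over the source
    return weights @ v, weights

# Illustrative random inputs and weights (not trained values).
rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 5, 3, 8
encoder_output = rng.normal(size=(src_len, d_model))
decoder_state = rng.normal(size=(tgt_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(decoder_state, encoder_output, W_q, W_k, W_v)
```

The weight matrix has one row per output position and one column per input position, so each decoder position can draw on the whole encoded input.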
      

Use Cases

Machine translation
Summarization
Question answering (generative)
Text-to-text tasks

Comparison

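Feature	Encoder-only	Decoder-only	Encoder-decoder
Attention	Bidirectional	Causal	Bidirectional encoder; causal decoder with cross-attention
Can generate?	No	Yes	Yes
Training objective	MLM	CLM	Seq2seq
Example model size	~110M (BERT-base)	175B (GPT-3)	11B (largest T5)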

Why Decoder-Only Dominates

Most modern LLMs (GPT, Claude, LLaMA) are decoder-only. Why?

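Simplicity: A single stack with one attention pattern is easier to implement and scale
Efficiency: Causal attention lets the keys and values of earlier tokens be cached (KV caching) and reused at every generation step
Universality: Most tasks, including classification and translation, can be framed as text completion via prompting
Scaling: Empirically, decoder-only models have performed well at large scale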
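
The KV caching point can be illustrated with a single attention layer. Under a causal mask, the keys and values of earlier tokens never change when a new token is added, so they can be stored and reused instead of recomputed. The sketch below assumes NumPy; the `KVCache` class and its `step` method are hypothetical names for illustration, not a library API.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, W_q, W_k, W_v):
    """Recompute attention for the whole sequence with a causal mask."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

class KVCache:
    """Stores keys and values of earlier tokens for reuse (illustrative only)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, W_q, W_k, W_v):
        # Only the new token is projected; earlier keys and values are reused.
        self.keys.append(x_t @ W_k)
        self.values.append(x_t @ W_v)
        k, v = np.stack(self.keys), np.stack(self.values)
        q = x_t @ W_q
        scores = q @ k.T / np.sqrt(k.shape[-1])   # one row: new token vs. all so far
        return softmax(scores) @ v

# Illustrative random inputs and weights (not trained values).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_q, W_k, W_v = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))

cache = KVCache()
incremental = np.stack([cache.step(x_t, W_q, W_k, W_v) for x_t in x])
full = full_causal_attention(x, W_q, W_k, W_v)
print(np.allclose(incremental, full))
```

Processing one token at a time with the cache gives the same outputs as recomputing full causal attention, while projecting each token only once.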

Trade-off: Each token in a decoder-only model sees only the tokens before it, so the model cannot use bidirectional context the way an encoder can. In practice, large decoder-only models appear to make up for much of this through scale.

Exercises

Exercise 1: Architecture Choice

For each task, which architecture is best?
a) Sentiment classification
b) Machine translation
c) Code completion
d) Named entity recognition

Exercise 2: Causal Masking

Why can't decoder-only models use bidirectional attention during training?

Exercise 3: Cross-Attention

What does cross-attention allow the decoder to do that self-attention doesn't?

📝 Knowledge Check

Question 1: Which architecture uses bidirectional attention?

A) Decoder-only models like GPT
B) Encoder-only models like BERT ✓
C) Neither uses bidirectional attention
D) Both use bidirectional attention equally

Explanation: Encoder-only models like BERT use bidirectional self-attention, meaning every token can attend to every other token in the input. This is why they're excellent for understanding tasks like classification and NER.

Question 2: Why do decoder-only models use causal (left-to-right) attention?

A) To reduce memory usage
B) To make training faster
C) They can't "see" future tokens they haven't generated yet ✓
D) Causal attention is more mathematically elegant

Explanation: During text generation, the model produces tokens one at a time. At each step, it can only attend to the prompt and the tokens generated so far, because future tokens don't exist yet. Causal masking enforces the same constraint during training, so the model cannot look at the tokens it is being asked to predict.

Question 3: What is the purpose of cross-attention in encoder-decoder models?

A) To reduce the number of parameters
B) To allow the decoder to attend to encoder outputs ✓
C) To enable bidirectional attention in the decoder
D) To speed up inference

Explanation: Cross-attention layers allow the decoder to "look at" the encoder's representation of the input while generating output. This is crucial for tasks like translation where the output must be conditioned on the entire input sequence.

Question 4: Which training objective is used for encoder-only models like BERT?

A) Causal Language Modeling (CLM)
B) Sequence-to-sequence (Seq2seq)
C) Masked Language Modeling (MLM) ✓
D) Reinforcement Learning from Human Feedback (RLHF)

Explanation: BERT and similar encoder-only models are trained with Masked Language Modeling (MLM), where random tokens are masked and the model must predict them from surrounding context.