The Problem
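Self-attention is permutation-equivariant: it has no built-in notion of token order. If you shuffle the input tokens, you get the same outputs, just shuffled the same way. Without extra position information, the model cannot tell "dog bites man" from "man bites dog".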
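The shuffling behavior can be checked numerically. The following is a minimal NumPy sketch, assuming a single attention head with random projection matrices and no positional information; the names `self_attention`, `w_q`, `w_k`, and `w_v` are illustrative and not taken from any library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax, shifted by the row max for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                      # toy sizes, chosen for illustration
x = rng.normal(size=(seq_len, d_model))      # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)              # a random reordering of the tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input rows only shuffles the output rows in the same way.
print(np.allclose(out[perm], out_shuffled))
```

The check prints `True`: each token's output vector is the same regardless of where the token sits in the sequence, which is exactly why position has to be injected separately.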
Solution 1: Sinusoidal Positional Encoding
The original Transformer uses fixed sinusoidal functions:
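For reference, this is the definition given in "Attention Is All You Need" (Vaswani et al., 2017). Here pos is the token position and i indexes the sine/cosine pairs, so i runs from 0 to d_model/2 - 1:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Even dimensions hold sines and odd dimensions hold cosines of the same angle.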
Intuition
Different dimensions get different frequencies:
- Low dimensions: rapidly varying (capture fine-grained position)
- High dimensions: slowly varying (capture long-range patterns)
Visualizing Sinusoidal Encoding
For d_model = 512, compare dimensions 0 and 100 (dimensions 1 and 101 are the matching cosines):
Dimension 0 (high freq): sin(pos/10000^0) = sin(pos)
Values: [0, 0.84, 0.91, 0.14, -0.76, ...] (rapidly changing)
Dimension 100 (lower freq): sin(pos/10000^(100/512)) ≈ sin(pos/6.04)
Values: [0, 0.16, 0.32, 0.48, 0.61, ...] (slowly changing)
Each position gets a unique "fingerprint" pattern!
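The values for individual dimensions can be reproduced with a few lines of NumPy. This is a minimal sketch that assumes an even `d_model`; the function name `sinusoidal_encoding` and the `base` parameter are illustrative choices, not part of any library.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) table: sines in even columns, cosines in odd columns."""
    positions = np.arange(max_len)[:, None]           # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # the values 2i = 0, 2, 4, ...
    angles = positions / base ** (even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=512)
print(np.round(pe[:5, 0], 2))      # dimension 0: sin(pos)
print(np.round(pe[:5, 100], 2))    # dimension 100: sin(pos / 10000**(100/512))
```

Printing the first few positions shows dimension 0 jumping around from one position to the next, while dimension 100 climbs much more gradually. Each row of `pe` is the fingerprint for one position.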
Why Sinusoids?
• Unique encoding: Each position has a distinct pattern
• Bounded values: sin/cos always in [-1, 1]
• Relative positions: PE(pos+k) can be expressed as a linear function of PE(pos)
• Extrapolation: Defined for any position, which may help the model handle sequence lengths not seen in training
Linear Relationship for Relative Positions
For any fixed offset k:
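The relationship follows from the angle-addition identities. Writing ω_i = 1/10000^(2i/d_model) for the frequency of the i-th sine/cosine pair (a shorthand introduced here), each pair at position pos+k is a rotation of the same pair at position pos:

$$
\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}
$$

The matrix depends only on the offset k and the frequency ω_i, not on pos, so the same linear map shifts every position by k.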
Solution 2: Learned Positional Embeddings
Alternative: Just learn a separate embedding for each position:
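As a rough illustration, a learned positional embedding is just a trainable lookup table indexed by position. The NumPy sketch below omits the training loop; the class name `LearnedPositionalEmbedding` and the initialization scale of 0.02 are assumptions made for this example, not a framework API.

```python
import numpy as np

class LearnedPositionalEmbedding:
    """A lookup table with one row per position (training loop omitted)."""

    def __init__(self, max_seq_len, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init; in a real model an optimizer would update these weights.
        self.weight = rng.normal(scale=0.02, size=(max_seq_len, d_model))

    def __call__(self, seq_len):
        if seq_len > self.weight.shape[0]:
            raise ValueError("sequence is longer than max_seq_len; no embedding exists")
        return self.weight[:seq_len]

pos_emb = LearnedPositionalEmbedding(max_seq_len=128, d_model=64)
print(pos_emb(10).shape)       # (10, 64)
print(pos_emb.weight.size)     # 8192 parameters = max_seq_len * d_model
```

Requesting more than `max_seq_len` positions raises an error here, because no row was ever allocated or trained for those positions.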
Comparison
| Aspect | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (fixed) | max_seq_len × d_model |
| Longer sequences | ✓ Defined for any length | ✗ Limited to max_seq_len |
| Performance | Good | Comparable, sometimes slightly better |
| Used in | Original Transformer | BERT, GPT |
Models such as GPT and BERT use learned embeddings: they are simple to implement, perform comparably or slightly better in practice, and the maximum sequence length is fixed in advance.
Adding Position to Input
Positional encoding is added to token embeddings:
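A minimal NumPy sketch of this step follows. The toy sizes and the random `token_table` are stand-ins for a trained embedding layer, and the scaling by sqrt(d_model) follows the original Transformer paper; other models may skip it.

```python
import numpy as np

vocab_size, d_model, max_len = 1000, 64, 32    # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
# Stand-in for a trained token embedding matrix.
token_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

# Fixed sinusoidal table: sines in even columns, cosines in odd columns.
positions = np.arange(max_len)[:, None]
even_dims = np.arange(0, d_model, 2)[None, :]
angles = positions / 10000.0 ** (even_dims / d_model)
pos_table = np.zeros((max_len, d_model))
pos_table[:, 0::2] = np.sin(angles)
pos_table[:, 1::2] = np.cos(angles)

token_ids = np.array([5, 42, 7, 42])           # token 42 appears twice
x = token_table[token_ids] * np.sqrt(d_model) + pos_table[: len(token_ids)]

print(x.shape)                                 # (4, 64)
# The same token at positions 1 and 3 now has different input vectors.
print(np.allclose(x[1], x[3]))                 # False
```

Because the two tensors are summed element-wise, the positional table must have the same width `d_model` as the token embeddings.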
Modern Alternatives
Rotary Position Embedding (RoPE)
Used in LLaMA and PaLM. RoPE rotates query and key vectors by position-dependent angles, so their dot product depends only on the relative position:
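The rotation can be sketched in NumPy as follows. This version rotates adjacent feature pairs (x[2i], x[2i+1]) by the angle position × θ_i with θ_i = 10000^(-2i/d); the function name `rope` is illustrative, and real implementations may pair dimensions differently.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs (x[2i], x[2i+1]) by angle position * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)       # one frequency per pair
    angles = positions[:, None] * theta[None, :]    # shape (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(1, 16))

# Same relative offset (3) at two different absolute positions.
score_a = rope(q, np.array([5])) @ rope(k, np.array([2])).T
score_b = rope(q, np.array([105])) @ rope(k, np.array([102])).T
print(np.allclose(score_a, score_b))
```

The check prints `True`: moving both the query and the key by the same amount leaves the attention score unchanged, so the score depends only on their relative offset.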
ALiBi (Attention with Linear Biases)
Adds a penalty to attention scores that grows linearly with the distance between query and key:
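A minimal NumPy sketch of the bias matrix for causal attention is shown below. The function name `alibi_bias` is illustrative, and the geometric slope schedule 2^(-8/n), 2^(-16/n), ... is the one described for head counts that are powers of two; treat other head counts as outside the scope of this sketch.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Return (num_heads, seq_len, seq_len) biases: -slope * distance to each earlier key."""
    # One slope per head, forming a geometric sequence.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    query_pos = np.arange(seq_len)[:, None]
    key_pos = np.arange(seq_len)[None, :]
    distance = query_pos - key_pos                  # positive for keys in the past
    bias = -slopes[:, None, None] * distance[None, :, :]
    # Causal mask: future keys (distance < 0) are excluded entirely.
    return np.where(distance[None, :, :] >= 0, bias, -np.inf)

bias = alibi_bias(seq_len=4, num_heads=8)
print(bias[0])    # head 0 uses slope 0.5
```

The bias for head h would be added to that head's scaled query-key scores before the softmax. No positional vector is added to the token embeddings in this scheme.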
ALiBi was designed so that models can extrapolate to sequences much longer than those seen in training.
Exercises
Exercise 1: Position Calculation
Calculate PE(pos=1, i=0) with d_model=512. What is the value?
Exercise 2: Sequence Length
Why might learned positional embeddings struggle with sequences longer than max_seq_len?
Exercise 3: Relative Position
Why is it useful that PE(pos+k) can be written as a linear function of PE(pos)?