Lesson 4: Positional Encoding

The Problem

Self-attention is permutation-equivariant: if you shuffle the input tokens, each token gets the same output vector as before, just in the shuffled order.

# Without positional information:
"The cat sat" → [embedding(The), embedding(cat), embedding(sat)]
"Sat cat The" → [embedding(sat), embedding(cat), embedding(The)]

# After attention, each token gets the same output vector in both cases;
# only the order of the outputs differs.
# The model has no idea about word order.

Critical Issue: Word order matters! "The dog bit the man" ≠ "The man bit the dog"
We need to inject position information into the model.
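
To see the problem concretely, here is a minimal pure-Python sketch of single-head self-attention without any positional information. The identity query/key/value projections and the 2-dimensional embeddings are assumptions made up for illustration; a real model learns its projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical 2-d embeddings, chosen only for this example
emb = {"The": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}

original = self_attention([emb[w] for w in ["The", "cat", "sat"]])
shuffled = self_attention([emb[w] for w in ["sat", "cat", "The"]])

# Undo the shuffle (here: a reversal) and compare token by token
same_up_to_order = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(original, reversed(shuffled))
    for a, b in zip(row_a, row_b)
)
print(same_up_to_order)  # True
```

Each token's output vector is identical in both runs, so nothing downstream can tell the two word orders apart.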
      

Solution 1: Sinusoidal Positional Encoding

The original transformer uses fixed sinusoidal functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
- pos = position in sequence (0, 1, 2, ...)
- i = index of the sine/cosine pair (0, 1, ..., d_model/2 - 1); dimensions 2i and 2i+1 share one frequency
- d_model = model dimension
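
The formula translates almost line by line into code. The following is a minimal sketch in plain Python; the function name `sinusoidal_pe` is hypothetical, and an even `d_model` is assumed.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Return the d_model-dimensional encoding of one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

# Position 0 is always [0, 1, 0, 1, ...] because sin(0) = 0 and cos(0) = 1
print(sinusoidal_pe(0, 8))

# Every other position gets a different vector
print([round(v, 3) for v in sinusoidal_pe(1, 8)])
```

In practice the encodings for all positions are usually precomputed once as a matrix of shape (max_seq_len, d_model) instead of being rebuilt per token.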
      

Intuition

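Different dimensions get different frequencies. The wavelength grows geometrically from 2π in dimension 0 to almost 10000 · 2π in the last dimensions:

Low dimensions: rapidly varying (capture fine-grained position)
High dimensions: slowly varying (capture long-range patterns)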
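
A quick way to check which dimensions vary quickly and which vary slowly is to print the wavelength, that is, how many positions one full sine cycle takes. This sketch assumes d_model = 512; the helper name `wavelength` is hypothetical.

```python
import math

d_model = 512

def wavelength(dim):
    """Number of positions for one full sine/cosine cycle in dimension `dim`."""
    i = dim // 2  # dimensions 2i and 2i+1 share a frequency
    return 2 * math.pi * 10000 ** (2 * i / d_model)

for dim in (0, 100, 256, 510):
    print(f"dimension {dim:3d}: wavelength ≈ {wavelength(dim):10.1f} positions")
```

Dimension 0 completes a cycle roughly every 6 positions, while dimension 510 needs tens of thousands of positions, so the frequency drops as the dimension index grows.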

Visualizing Sinusoidal Encoding

For d_model = 512, the sine dimensions 0 and 100 look like this (dimensions 1 and 101 are the matching cosines):

Dimension 0 (high freq): sin(pos/10000^0) = sin(pos)
Values: [0, 0.84, 0.91, 0.14, -0.76, ...] (rapidly changing)
Dimension 100 (lower freq): sin(pos/10000^(100/512)) ≈ sin(pos/6.04)
Values: [0, 0.16, 0.32, 0.48, 0.61, ...] (slowly changing)

Each position gets a unique "fingerprint" pattern!
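
The values for individual dimensions can be recomputed in a few lines. This is a small sketch assuming d_model = 512 and the sine formula from above; `pe_sin` is a hypothetical helper that handles even dimension indices only.

```python
import math

d_model = 512

def pe_sin(pos, dim):
    """Sine component PE(pos, dim) for an even dimension index `dim`."""
    return math.sin(pos / 10000 ** (dim / d_model))

for dim in (0, 100):
    values = [round(pe_sin(pos, dim), 2) for pos in range(5)]
    print(f"dimension {dim}: {values}")
```

Reading the two rows column by column gives the start of each position's fingerprint: position 1 pairs one value from dimension 0 with one from dimension 100, position 2 pairs the next two, and so on.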

Why Sinusoids?

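Advantages:

• Unique encoding: Each position has a distinct pattern
• Bounded values: sin/cos always lie in [-1, 1]
• Relative positions: For a fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos)
• Extrapolation: The encoding is defined for any position, so it may let the model handle sequence lengths not seen in training (this is not guaranteed in practice)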

Linear Relationship for Relative Positions

For any fixed offset k:

PE(pos + k) = M_k · PE(pos)

# Where M_k is a matrix that depends only on the offset k, not on pos
# (one 2×2 rotation per sine/cosine pair)
# This allows the model to easily learn relative position relationships
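
The linear relationship follows from the angle-addition identities sin(a+b) = sin a cos b + cos a sin b and cos(a+b) = cos a cos b - sin a sin b. The sketch below checks it numerically for one sine/cosine pair; the names `pe_pair` and `shift` and the chosen frequency are assumptions for illustration.

```python
import math

def pe_pair(pos, omega):
    """The (sin, cos) pair for one frequency omega = 1 / 10000^(2i/d_model)."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def shift(pair, k, omega):
    """Apply the 2x2 matrix for offset k; it depends on k and omega, not on pos."""
    s, c = pair
    ck, sk = math.cos(omega * k), math.sin(omega * k)
    return (s * ck + c * sk, -s * sk + c * ck)

omega = 1 / 10000 ** (100 / 512)  # dimensions 100 and 101 when d_model = 512
k = 7

for pos in (0, 3, 50):
    direct = pe_pair(pos + k, omega)
    shifted = shift(pe_pair(pos, omega), k, omega)
    matches = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(direct, shifted))
    print(pos, matches)
```

The same matrix works for every starting position, which is what makes a fixed offset a linear operation on the encoding.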
      

Solution 2: Learned Positional Embeddings

Alternative: Just learn a separate embedding for each position:

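import torch
import torch.nn as nn

# Learned positional embeddings (example sizes)
max_seq_len, d_model = 512, 768
pos_embed = nn.Embedding(max_seq_len, d_model)

# For position i, look up row i (nn.Embedding expects an integer tensor)
i = 5
position_i_embedding = pos_embed(torch.tensor(i))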
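
To make the table-lookup idea explicit without any framework, here is a toy stand-in written in plain Python. The class `LearnedPositionTable` is hypothetical and only mimics the lookup; it does no training.

```python
import random

class LearnedPositionTable:
    """Toy stand-in for a learned position embedding: one vector per position."""

    def __init__(self, max_seq_len, d_model, seed=0):
        rng = random.Random(seed)
        # Randomly initialised here; in a real model these values are trained
        self.rows = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                     for _ in range(max_seq_len)]

    def num_parameters(self):
        return len(self.rows) * len(self.rows[0])

    def lookup(self, position):
        if not 0 <= position < len(self.rows):
            raise IndexError(f"position {position} has no learned embedding")
        return self.rows[position]

table = LearnedPositionTable(max_seq_len=512, d_model=64)
print(table.num_parameters())   # 512 * 64 = 32768
print(len(table.lookup(511)))   # last valid position, a 64-dimensional vector

try:
    table.lookup(512)           # one past the end of the table
except IndexError as err:
    print("IndexError:", err)
```

The table holds max_seq_len × d_model trainable values and has no row at all for positions beyond max_seq_len - 1.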
      

Comparison

Aspect	Sinusoidal	Learned
Parameters	0 (fixed)	max_seq_len × d_model
Longer sequences	✓ Extrapolates	✗ Can't extrapolate
Performance	Good	Comparable (nearly identical in the original Transformer paper)
Used in	Original Transformer	BERT, GPT-2

Models such as BERT and GPT-2 use learned embeddings: they are simple to implement, and the maximum sequence length is fixed in advance.

Adding Position to Input

Positional encoding is added to token embeddings:

# Input embedding
X = token_embedding(tokens) + positional_encoding(positions)

# Or with learned embeddings:
X = token_embedding(tokens) + pos_embedding(positions)

# Then feed into transformer
output = transformer(X)
      

Why add? Both token identity and position matter. Adding keeps the input at width d_model and lets the model draw on both kinds of information from the same vector.
Alternative: concatenate (less common, because it widens the input).
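
The addition is element-wise, so the result keeps the shape of the token embeddings. A minimal sketch follows, assuming sinusoidal encodings and a made-up 4-dimensional embedding table; `token_embedding` and `embed` are hypothetical names.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding of one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Hypothetical 4-d token embeddings, chosen only for this example
token_embedding = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "dog": [0.9, -0.5, 0.2, 0.0],
    "bit": [-0.3, 0.8, -0.6, 0.1],
    "man": [0.4, 0.4, -0.9, 0.7],
}

def embed(tokens, d_model=4):
    """Token embedding plus positional encoding, one vector per token."""
    return [
        [t + p for t, p in zip(token_embedding[tok], sinusoidal_pe(pos, d_model))]
        for pos, tok in enumerate(tokens)
    ]

a = embed(["the", "dog", "bit", "the", "man"])
b = embed(["the", "man", "bit", "the", "dog"])

print(len(a), len(a[0]))  # 5 4 (same shape as the token embeddings)
print(a[0] == a[3])       # False: "the" at positions 0 and 3 now differs
print(a == b)             # False: the two word orders are distinguishable
```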
      

Modern Alternatives

Rotary Position Embedding (RoPE)

Used in LLaMA, PaLM. Rotates query/key vectors by position-dependent angles:

# Instead of adding position, rotate the vectors
q_rotated = rotate(q, pos_q)  # pos_q = position of the query token
k_rotated = rotate(k, pos_k)  # pos_k = position of the key token

# Attention naturally incorporates relative position:
# the dot product q_rotated · k_rotated depends only on the offset pos_q - pos_k
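
The following sketch shows one way `rotate` could look, assuming adjacent dimensions are paired and pair i is rotated by position · 10000^(-2i/d). The base of 10000, the pairing scheme, and the example vectors are assumptions for illustration; real implementations differ in layout.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle position * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        angle = position * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset (3) at two different absolute positions
near_start = dot(rotate(q, 5), rotate(k, 2))
far_along = dot(rotate(q, 105), rotate(k, 102))
print(math.isclose(near_start, far_along, abs_tol=1e-9))  # True
```

Shifting both positions by the same amount leaves the attention score unchanged, so the score carries relative rather than absolute position.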
      

ALiBi (Attention with Linear Biases)

Adds a penalty to attention scores based on distance:

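# Add bias to attention scores (m is a fixed slope, different for each head)
score = Q @ K^T - m * distance

# bias is negative and linear in distance
# The farther away a token is, the larger its penalty

The ALiBi authors report that this lets a model trained on short sequences extrapolate to longer ones at inference time.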
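
The bias is simple enough to build by hand. This sketch assumes causal attention and per-head slopes that form the geometric sequence 2^(-8/n), 2^(-16/n), ..., which is the recipe the ALiBi paper describes for head counts that are powers of two; the function names are hypothetical.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: query i gets -slope * (i - j) for each key j <= i, else -inf."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

slopes = alibi_slopes(8)         # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])  # bias matrix for the first head

# With identical raw scores (all zero here), nearer keys get more attention
weights = softmax([0.0 + b for b in bias[3]])
print(slopes[0], slopes[-1])
print([round(w, 3) for w in weights])
```

Because the bias depends only on the distance i - j, it is defined for any sequence length without new parameters.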

Exercises

Exercise 1: Position Calculation

Calculate PE(pos=1, 2i) for i=0 (that is, dimension 0) with d_model=512. What is the value?

Exercise 2: Sequence Length

Why might learned positional embeddings struggle with sequences longer than max_seq_len?

Exercise 3: Relative Position

Why is it useful that PE(pos+k) can be written as a linear function of PE(pos)?