🚧 Lesson 4 of 10 in Level 03

Positional Encoding

How transformers know about word order. Sinusoidal encodings and learned embeddings.

The Problem

Self-attention is permutation-equivariant: it has no built-in notion of token order. If you shuffle the input tokens, you get the same output vectors, shuffled in the same way.

# Without positional information:
"The cat sat" → [embedding(The), embedding(cat), embedding(sat)]
"Sat cat The" → [embedding(sat), embedding(cat), embedding(The)]
# After attention, each token gets the same output vector in both cases;
# only the order of the outputs differs.
# The model has no idea about word order.
Critical Issue: Word order matters! "The dog bit the man" ≠ "The man bit the dog". We need to inject position information into the model.
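To make the shuffling claim concrete, here is a minimal sketch in plain Python. It assumes a single attention head with identity query/key/value projections (a simplification, real layers learn these) and uses made-up 2-d embeddings; `self_attention` and `close` are illustrative helpers, not part of any library.

```python
import math

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product score of this query against every key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

def close(u, v):
    return all(abs(a - b) < 1e-9 for a, b in zip(u, v))

# Toy 2-d embeddings (made-up numbers)
the, cat, sat = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

original = self_attention([the, cat, sat])
shuffled = self_attention([sat, cat, the])

# Each token receives the same output vector; only the row order changed
print(close(original[0], shuffled[2]))  # output for "The"
print(close(original[2], shuffled[0]))  # output for "sat"
```

Both checks print True: without position information, the layer cannot tell the two word orders apart.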

Solution 1: Sinusoidal Positional Encoding

The original transformer uses fixed sinusoidal functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
- pos = position in sequence (0, 1, 2, ...)
- i = index of the sine/cosine pair (0, 1, ..., d_model/2 - 1), so 2i and 2i+1 are the dimension indices
- d_model = model dimension
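The formula translates almost directly into code. The sketch below is one possible plain-Python version; the function name `positional_encoding` is chosen here for illustration, and it assumes d_model is even.

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimension 2i
        pe[2 * i + 1] = math.cos(angle)  # odd dimension 2i+1
    return pe

pe0 = positional_encoding(0, 8)
pe1 = positional_encoding(1, 8)
print([round(x, 3) for x in pe0])
print([round(x, 3) for x in pe1])
```

At position 0 every sine is 0 and every cosine is 1, so the first line is an alternating pattern of 0.0 and 1.0; every later position produces a different vector.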

Intuition

Different dimensions get different frequencies:

Visualizing Sinusoidal Encoding

For d_model = 512, dimensions 0 and 100 look like:

Dimension 0 (highest frequency): sin(pos/10000^0) = sin(pos)

Values: [0, 0.84, 0.91, 0.14, -0.76, ...] (rapidly changing)

Dimension 100 (lower frequency): sin(pos/10000^(100/512)) ≈ sin(pos/6.04)

Values: [0, 0.16, 0.32, 0.48, 0.61, ...] (slowly changing)

Each position gets a unique "fingerprint" pattern!
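One way to see the spread of frequencies is to compute how many positions each dimension needs to complete one full cycle. The sketch below does this for d_model = 512; `wavelength` is a helper name introduced here, not something from a library.

```python
import math

d_model = 512

def wavelength(dim):
    """Positions per full cycle for a dimension (each sin/cos pair shares one)."""
    i = dim // 2
    return 2 * math.pi * 10000 ** (2 * i / d_model)

for dim in (0, 100, 510):
    print(dim, round(wavelength(dim), 1))
```

Dimension 0 repeats roughly every 6.3 positions, dimension 100 every few dozen, and the last pair only after tens of thousands of positions. The fast dimensions distinguish neighbouring positions, while the slow ones distinguish positions that are far apart.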

Why Sinusoids?

Advantages:
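Unique encoding: Each position has a distinct pattern
Bounded values: sin/cos are always in [-1, 1]
Relative positions: For a fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos)
Extrapolation: The encoding is defined for any position, so it can be computed for sequences longer than those seen in training (model quality at unseen lengths is not guaranteed)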

Linear Relationship for Relative Positions

For any fixed offset k:

PE(pos + k) = M_k · PE(pos)
# M_k is a matrix that depends only on the offset k, not on pos
# (it rotates each sin/cos pair by the angle k / 10000^(2i/d_model))
# This allows the model to easily learn relative position relationships
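The linear relationship follows from the angle-addition identities for sine and cosine. The sketch below checks it numerically for one sin/cos pair; the helper names `pair` and `shift` are illustrative, and the frequency is the one dimensions 100 and 101 would have when d_model = 512.

```python
import math

def pair(pos, freq):
    """The (sin, cos) pair for one frequency at one position."""
    return (math.sin(pos * freq), math.cos(pos * freq))

def shift(vec, k, freq):
    """Apply the 2x2 matrix that depends only on the offset k."""
    s, c = vec
    sk, ck = math.sin(k * freq), math.cos(k * freq)
    # sin(a + b) = sin a cos b + cos a sin b
    # cos(a + b) = cos a cos b - sin a sin b
    return (s * ck + c * sk, c * ck - s * sk)

freq = 1 / 10000 ** (100 / 512)
k = 7
results = []
for pos in (0, 3, 50):
    direct = pair(pos + k, freq)
    via_matrix = shift(pair(pos, freq), k, freq)
    results.append(all(abs(a - b) < 1e-12 for a, b in zip(direct, via_matrix)))
print(results)
```

The same matrix works for every starting position, which is why a single learned linear map can in principle represent "k tokens later".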

Solution 2: Learned Positional Embeddings

Alternative: Just learn a separate embedding for each position:

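import torch
import torch.nn as nn

# Learned positional embeddings: one trainable row per position
# (max_seq_len and d_model are chosen when the model is built)
pos_embed = nn.Embedding(max_seq_len, d_model)

# For position i, look up row i (the index must be a tensor, not a Python int)
position_i_embedding = pos_embed(torch.tensor(i))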

Comparison

Aspect           | Sinusoidal           | Learned
Parameters       | 0 (fixed)            | max_seq_len × d_model
Longer sequences | ✓ Extrapolates       | ✗ Can't extrapolate
Performance      | Good                 | Slightly better
Used in          | Original Transformer | BERT, GPT

Models such as GPT and BERT use learned embeddings: the maximum sequence length is fixed in advance, and in practice learned embeddings perform about as well as sinusoidal ones or slightly better.
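The parameter count and the length limit both follow from the fact that a learned embedding is just a lookup table. The toy class below is a plain-Python stand-in for that table, written for illustration only (`LearnedPositionTable` is a hypothetical name, and the sizes 512 and 768 are example values).

```python
import random

class LearnedPositionTable:
    """Toy stand-in for a learned position embedding: one trainable row per position."""

    def __init__(self, max_seq_len, d_model, seed=0):
        rng = random.Random(seed)
        # Rows start random; training would adjust them by gradient descent
        self.rows = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                     for _ in range(max_seq_len)]

    def num_parameters(self):
        return len(self.rows) * len(self.rows[0])

    def lookup(self, pos):
        if not 0 <= pos < len(self.rows):
            raise IndexError(f"position {pos} has no learned row")
        return self.rows[pos]

table = LearnedPositionTable(max_seq_len=512, d_model=768)
print(table.num_parameters())  # 512 * 768 = 393216
```

Calling `table.lookup(512)` raises an IndexError: no row was ever created or trained for that position, whereas a sinusoidal encoding can simply be evaluated there.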

Adding Position to Input

Positional encoding is added to token embeddings:

# Input embedding
X = token_embedding(tokens) + positional_encoding(positions)

# Or with learned embeddings:
X = token_embedding(tokens) + pos_embedding(positions)

# Then feed into transformer
output = transformer(X)
Why add? Both token identity and position matter. Adding the two vectors lets every layer see both kinds of information while keeping the input at d_model dimensions. Alternative: concatenate, which widens the input (less common).
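As a small illustration of the addition step, the sketch below uses made-up 4-d token embeddings and the sinusoidal formula written out for d_model = 4. The names `token_embedding`, `position_vector`, and `embed` are chosen for this example only.

```python
import math

# Toy 4-d token embeddings (made-up numbers standing in for a trained table)
token_embedding = {
    "the": [0.2, -0.1, 0.4, 0.0],
    "dog": [0.9, 0.3, -0.5, 0.1],
    "bit": [-0.4, 0.8, 0.2, -0.7],
    "man": [0.1, -0.6, 0.7, 0.5],
}

def position_vector(pos):
    # The sinusoidal formula written out for d_model = 4
    return [math.sin(pos), math.cos(pos), math.sin(pos / 100), math.cos(pos / 100)]

def embed(tokens):
    # X = token embedding + position vector, element by element
    return [[t + p for t, p in zip(token_embedding[tok], position_vector(pos))]
            for pos, tok in enumerate(tokens)]

a = embed("the dog bit the man".split())
b = embed("the man bit the dog".split())

print(a[0] != a[3])  # the two occurrences of "the" now differ
print(a[1] != b[4])  # "dog" at position 1 differs from "dog" at position 4
```

Both lines print True, so the two sentences are no longer the same set of vectors in a different order.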

Modern Alternatives

Rotary Position Embedding (RoPE)

Used in LLaMA, PaLM. Rotates query/key vectors by position-dependent angles:

# Instead of adding position, rotate the vectors
q_rotated = rotate(q, position_q)
k_rotated = rotate(k, position_k)
# Attention naturally incorporates relative position
# because the dot product of two rotated vectors depends only on
# the difference between their rotation angles
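The relative-position property can be checked numerically. The sketch below is a simplified plain-Python `rotate` that pairs adjacent dimensions and assumes an even vector length and a base of 10000; production implementations often lay out the pairs differently, so treat it as an illustration of the idea rather than any model's actual code.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position / base ** (i / d)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up query and key vectors
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two different absolute positions
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

The check prints True: shifting both positions by the same amount leaves the attention score unchanged, so only the offset between query and key matters.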

ALiBi (Attention with Linear Biases)

Adds a penalty to attention scores based on distance:

# Add bias to attention scores
score = Q @ K.T + bias(distance)
# bias = -m * distance: negative and linear in distance (m is a fixed slope per head)
# Far away tokens get heavily penalized

Because the bias depends only on distance, ALiBi lets a model be evaluated on sequences longer than those it was trained on.
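A sketch of what the bias matrix might look like for a causal model is shown below. The helper names are illustrative; the `-inf` entries are the usual causal mask rather than part of ALiBi itself, and the geometric slope recipe assumes the number of heads is a power of two.

```python
def alibi_bias(seq_len, slope):
    """Bias added to attention scores: row i is the query, column j is the key."""
    return [[slope * (j - i) if j <= i else float("-inf")  # -inf = causal mask
             for j in range(seq_len)]
            for i in range(seq_len)]

def head_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ...; assumes num_heads is a power of two."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

slopes = head_slopes(8)           # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])   # bias matrix for the first head
for row in bias:
    print(row)
```

Each row starts at 0 for the current token and decreases by the head's slope for every step back in the sequence. Nothing in the matrix is learned or tied to a maximum length, so it can be built for any seq_len.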

Exercises

Exercise 1: Position Calculation

Calculate PE(pos=1, 2i) for i=0 (that is, dimension 0) with d_model=512. What is the value?

Exercise 2: Sequence Length

Why might learned positional embeddings struggle with sequences longer than max_seq_len?

Exercise 3: Relative Position

Why is it useful that PE(pos+k) can be written as a linear function of PE(pos)?