The Intuition
Self-attention allows each token to "look at" every token in the sequence, itself included, and decide which ones are relevant.
Example Sentence
"The cat sat on the mat because it was tired."
When processing the word "it", which tokens should the model attend to?
- "cat" - high relevance ("it" refers to "cat")
- "tired" - high relevance (describes the state)
- "mat" - low relevance
- "the", "on" - very low relevance
Nobody assigns these relevance scores by hand. The model learns during training how to compute them from the tokens themselves.
The Three Vectors: Q, K, V
For each token, we create three vectors:
Query, Key, Value
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
Analogy: Database Lookup
- Query = Your search query
- Key = Database index/key
- Value = The actual data/content
Attention = Match queries to keys, retrieve corresponding values
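The three vectors can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the usual setup in which Q, K, and V are linear projections of each token's embedding. The sizes, the random embeddings `X`, and the matrices `W_q`, `W_k`, `W_v` are hypothetical stand-ins; in a trained model the projection matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 3, 8, 4  # hypothetical sizes for illustration

# One embedding row per token (random stand-ins for real embeddings).
X = rng.normal(size=(seq_len, d_model))

# Projection matrices: learned in a real model, random here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # row i: what token i is looking for
K = X @ W_k  # row i: what token i offers for matching
V = X @ W_v  # row i: the information token i contributes

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```

Every token gets its own row in each of the three matrices, so the same token plays all three roles at once.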
Computing Attention Scores
Step 1: Dot Product (Similarity)
For each pair of tokens i and j, compute how well the Query of token i matches the Key of token j:
    score(i, j) = q_i · k_j
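As a small sketch of this step, all pairwise dot products can be computed with one matrix product. The 2-dimensional queries and keys below are made-up values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical 2-dimensional queries and keys for three tokens.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# scores[i, j] = dot product of query i with key j
scores = Q @ K.T
print(scores)
# [[1. 0. 1.]
#  [0. 2. 1.]
#  [1. 2. 2.]]
```

Row i of `scores` holds the raw match between token i's query and every key in the sequence.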
Step 2: Scale
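Divide by √d_k, the square root of the key dimension. Unscaled dot products grow with d_k, and very large scores push softmax into regions where its gradients are tiny:
    scaled_score(i, j) = (q_i · k_j) / √d_k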
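The effect of the scaling can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not something a trained model guarantees) and measures the spread of raw dot products for a few hypothetical values of d_k.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical std of q . k when components are i.i.d. standard normal."""
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    return (q * k).sum(axis=1).std()

for d_k in (4, 64, 512):
    raw = dot_product_std(d_k)
    scaled = raw / np.sqrt(d_k)
    print(f"d_k={d_k:4d}  raw std ~ {raw:6.2f}  scaled std ~ {scaled:.2f}")
```

Under that assumption the raw standard deviation grows roughly like √d_k, while the scaled values stay close to 1 regardless of dimension.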
Step 3: Softmax
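Apply softmax to each row of scaled scores, so that every token's attention weights are non-negative and sum to 1:
    weight(i, j) = exp(s_ij) / Σ_m exp(s_im), where s_ij is the scaled score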
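A row-wise softmax might look like the sketch below. Subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result; the score values are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical scaled scores for three tokens.
scaled_scores = np.array([[2.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0],
                          [1.0, 3.0, 1.0]])

weights = softmax(scaled_scores)
print(weights.round(3))
print(weights.sum(axis=-1))  # each row sums to 1
```

A row of equal scores, like the second one here, turns into uniform weights, and larger scores receive exponentially larger shares.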
Step 4: Weighted Sum
Each token's output is the sum of all Value vectors, weighted by that token's attention weights:
    output_i = Σ_j weight(i, j) · v_j
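The weighted sum is again a single matrix product. The weights and Value vectors below are hand-picked, hypothetical numbers so that each output row can be verified mentally.

```python
import numpy as np

# Hypothetical attention weights (each row sums to 1) and Value vectors.
weights = np.array([[1.0, 0.0, 0.0],
                    [0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

# output[i] = sum over j of weights[i, j] * V[j]
output = weights @ V
print(output)
# [[1.  0. ]
#  [0.5 0.5]
#  [1.2 1.3]]
```

The first token puts all its weight on itself and simply copies its own Value, while the third blends all three Values.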
Complete Example
Sentence: "The cat sat"
3 tokens: ["The", "cat", "sat"]
Step 1: Compute Q, K, V
Step 2: Attention Scores (Q @ K^T)
Step 3: After Softmax
Notice: "cat" pays most attention to itself (0.50), but also attends to "sat" (0.22). Because each row of weights sums to 1, roughly 0.28 is left for "The".
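To trace the same steps end to end, here is a small sketch for the three tokens. The Q, K, and V values are hypothetical and chosen only for illustration, so the printed weights will not match the 0.50 and 0.22 quoted above.

```python
import numpy as np

tokens = ["The", "cat", "sat"]

# Hypothetical Q, K, V with d_k = 2 (illustrative values only).
Q = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)                    # dot product, then scale
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)    # row-wise softmax
output = weights @ V                               # weighted sum of Values

# One row of attention weights per token.
for token, row in zip(tokens, weights):
    print(f"{token:>4}: " + "  ".join(f"{w:.2f}" for w in row))
```

With these particular values, the largest weight in the "cat" row falls on "cat" itself, because its query aligns best with its own key.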
Matrix Form
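In practice, we stack the per-token vectors into matrices Q, K, and V (one row per token) and compute attention for the entire sequence at once:
    Attention(Q, K, V) = softmax(Q @ K^T / √d_k) @ V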
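The sketch below wraps the whole computation in one function and checks it against a token-by-token loop. The function names and the random inputs are hypothetical, and the code covers a single sequence with no masking or multiple heads.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for a whole sequence at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def attention_one_token(q, K, V):
    """The same computation for a single query vector, for comparison."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))  # hypothetical inputs

output, weights = attention(Q, K, V)
looped = np.stack([attention_one_token(q, K, V) for q in Q])
print(np.allclose(output, looped))  # True: matrix form matches the loop
```

The matrix form gives the same numbers as the loop but expresses the work as a few large matrix products.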
Why This Works
Properties of Self-Attention
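1. Long-range dependencies: Any token can directly attend to any other token, regardless of distance. Information travels between them in a single step, so it does not have to survive a long chain of recurrent updates, where gradients tend to vanish.
2. Parallel computation: Unlike RNNs, which process tokens one after another, self-attention computes all attention scores simultaneously as matrix products, which is much faster on GPUs.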
3. Interpretable: We can visualize attention weights to see what the model is "looking at."
4. Content-based: Attention depends on the actual content of tokens, not just their position.
Exercises
Exercise 1: Attention Shape
For a sequence of length 100 with d_k = 64, what are the shapes of:
- Q, K, V matrices?
- The attention score matrix?
- The output?
Exercise 2: Attention Weights
If attention_weights[5, 10] = 0.3, what does this mean in plain English?
Exercise 3: Complexity
What is the computational complexity of self-attention in terms of sequence length n? Why might this be a problem for very long sequences?