What Are Embeddings Really?
An embedding is a mapping from discrete tokens to continuous vectors. But what do these vectors mean? Why do they capture semantic relationships?
Visualizing Embeddings
Real embeddings have hundreds or thousands of dimensions, but we can visualize the concept in 2D. Imagine words arranged in space such that:
- Synonyms are close together
- Related concepts form clusters
- Directions in space represent semantic relationships
[Figure: 2D Embedding Visualization (Simplified). Words cluster by meaning. Directions capture relationships (gender, plural, etc.)]
The Famous Word Analogies
Perhaps the most striking demonstration of embeddings is word analogies:
"Man is to woman as king is to queen"
This works because embeddings capture semantic relationships as vector directions. The "gender" direction is roughly the same for all word pairs:
- king → queen (gender + royalty)
- man → woman (gender)
- actor → actress (gender)
- prince → princess (gender + royalty)
More Analogy Examples
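Analogies like these can be computed directly with vector arithmetic. The sketch below is only an illustration: the 2D vectors are hand-made (hypothetical, not taken from any trained model), with one axis standing in for "gender" and the other for "royalty", and the helper names `cosine` and `analogy` are chosen here for the example.

```python
import math

# Hand-made 2-D toy vectors (hypothetical, NOT from a trained model).
# Axis 0 ~ "gender" (+1 male, -1 female), axis 1 ~ "royalty".
toy_vectors = {
    "man":      [1.0, 0.0],
    "woman":    [-1.0, 0.0],
    "king":     [1.0, 1.0],
    "queen":    [-1.0, 1.0],
    "prince":   [1.0, 0.8],
    "princess": [-1.0, 0.8],
}

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, then pick the closest remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "woman", "king", toy_vectors))
print(analogy("man", "woman", "prince", toy_vectors))
```

With real embeddings the same arithmetic only works approximately, and the query words are usually excluded from the candidates, as they are here.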
Measuring Similarity: Cosine Distance
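How do we quantify how similar two embeddings are? We use cosine similarity:
$$\text{cosine similarity}(\mathbf{a}, \mathbf{b}) = \cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
This measures the cosine of the angle $\theta$ between two vectors, ignoring their magnitude: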
- 1.0 = identical direction (same meaning)
- 0.0 = orthogonal (unrelated)
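- -1.0 = opposite directions (in principle, opposite meaning; in practice antonyms usually score well above this because they appear in similar contexts)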
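The three reference cases above can be checked with a few lines of code. This is a minimal sketch in plain Python. The function name `cosine_similarity` and the example vectors are chosen here for illustration, and a real pipeline would more likely use NumPy or a deep learning framework.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # close to -1

print(same_direction, orthogonal, opposite)
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last decimal places.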
Cosine Similarity Examples
| Word Pair | Similarity | Relationship |
|---|---|---|
| king ↔ queen | 0.72 | Related (royalty) |
| king ↔ man | 0.58 | Related (male) |
| king ↔ apple | 0.12 | Unrelated |
| hot ↔ cold | 0.45 | Antonyms (but related) |
| cat ↔ feline | 0.81 | Synonyms |
Why Cosine and Not Euclidean?
We use cosine similarity because:
- Direction matters more than magnitude: "King" and "kings" should be similar despite different frequencies
- Normalization: Common words don't dominate just because they have larger vectors
- Interpretability: Cosine directly relates to the angle between meanings
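A small numeric comparison may make the first two points concrete. In this sketch the vectors are made up for illustration: `b` points in exactly the same direction as `a` but is ten times longer, while `c` has a similar length to `a` but points elsewhere.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]   # same direction as a, 10x the magnitude
c = [3.0, 2.0, 1.0]      # same length as a, different direction

print("cosine    a-b:", cosine_similarity(a, b))
print("cosine    a-c:", cosine_similarity(a, c))
print("euclidean a-b:", euclidean_distance(a, b))
print("euclidean a-c:", euclidean_distance(a, c))
```

Euclidean distance ranks `c` as closer to `a` than `b` is, purely because of vector length, whereas cosine similarity ranks `b` as the better match because it ignores magnitude.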
Contextual Embeddings: The Game Changer
Traditional embeddings (Word2Vec, GloVe) give each word a single vector. But words have different meanings in different contexts:
- "I went to the bank to deposit money" (financial)
- "I sat by the river bank" (geographic)
- "The plane banked sharply" (verb)
Modern LLMs use contextual embeddings: the vector for "bank" depends on the surrounding words. This is a key innovation of transformer models.
- Static embeddings: word → vector (one fixed vector per word)
- Contextual embeddings: (word, context) → vector (a different vector for each context the word appears in)
How Contextual Embeddings Work
In transformers, each token's representation is computed by attending to all other tokens in the sequence. We'll cover the details in Level 3, but the intuition is:
- Start with static embeddings (like we discussed)
- Let each token "look at" other tokens in the sentence
- Update embeddings based on what would help predict each token's context
- "Bank" near "money" gets financial features; "bank" near "river" gets geographic features
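The intuition above can be sketched with a single, heavily simplified attention step. Everything here is an assumption made for illustration: the 2D static vectors are hand-made (axis 0 stands for "finance", axis 1 for "nature"), and the step uses the raw embeddings as queries, keys, and values, whereas real transformers use learned projections, multiple heads, and many layers.

```python
import math

# Hypothetical 2-D static embeddings: axis 0 ~ "finance", axis 1 ~ "nature".
static = {
    "bank":  [0.5, 0.5],   # ambiguous on its own
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
}

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens, embeddings):
    """One simplified self-attention step with no learned weights."""
    vectors = [embeddings[t] for t in tokens]
    dim = len(vectors[0])
    output = []
    for query in vectors:
        # Each token "looks at" every token (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The new vector is a weighted average of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        output.append(mixed)
    return output

bank_financial = contextualize(["bank", "money"], static)[0]
bank_geographic = contextualize(["bank", "river"], static)[0]
print(bank_financial, bank_geographic)
```

The same starting vector for "bank" ends up leaning toward the finance axis next to "money" and toward the nature axis next to "river".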
Implementing an Embedding Layer
In code, an embedding layer is just a lookup table: row i of a matrix holds the vector for the token with ID i.
The embedding matrix is learned during training. Initially random, it gradually organizes itself to place similar words near each other.
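A minimal sketch of such a lookup table, written in plain Python for illustration. The class name `EmbeddingLayer`, the five-word vocabulary, and the Gaussian initialization are assumptions made for this example, and no training step is shown.

```python
import random

class EmbeddingLayer:
    """A lookup table: row i holds the vector for token ID i."""

    def __init__(self, vocab_size, embedding_dim, seed=0):
        rng = random.Random(seed)
        # Random initialization; training would later adjust these values.
        self.weights = [
            [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(vocab_size)
        ]

    def __call__(self, token_ids):
        # The "forward pass" is plain indexing, with no arithmetic involved.
        return [self.weights[i] for i in token_ids]

# Hypothetical toy vocabulary mapping words to integer IDs.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

layer = EmbeddingLayer(vocab_size=len(vocab), embedding_dim=4)
token_ids = [vocab[w] for w in ["the", "cat", "sat"]]
vectors = layer(token_ids)

print(len(vectors), len(vectors[0]))
```

Deep learning frameworks provide the same idea as a built-in module (for example `torch.nn.Embedding` in PyTorch), with the matrix registered as a trainable parameter.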
Pre-trained Embeddings
You don't always need to train embeddings from scratch. Popular pre-trained options:
- GloVe: Trained on word co-occurrence statistics
- Word2Vec: Skip-gram or CBOW training
- FastText: Handles out-of-vocabulary words via subword units
- BERT/GPT embeddings: Contextual embeddings from transformer models
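As a rough illustration of using pre-trained static vectors, the sketch below parses the plain-text layout in which GloVe vectors are distributed: one word per line, followed by its space-separated vector components. The function name `parse_glove_lines` and the sample values are hypothetical. The numbers are made up and do not come from any real GloVe file.

```python
def parse_glove_lines(lines):
    """Parse 'word v1 v2 ... vn' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors

# Made-up sample in the same layout (NOT real GloVe values).
sample_lines = [
    "king 0.10 0.20 0.30",
    "queen 0.10 0.25 0.30",
    "",
]

embeddings = parse_glove_lines(sample_lines)
print(embeddings["king"])
```

To read a downloaded file instead, the same function could be passed an open file object, for example `parse_glove_lines(open(path, encoding="utf-8"))`, where `path` is wherever the vectors were saved.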