Lesson 2: Embeddings in Detail

What Are Embeddings Really?

An embedding is a mapping from discrete tokens to continuous vectors. But what do these vectors mean? Why do they capture semantic relationships?

The Distributional Hypothesis: Words that appear in similar contexts tend to have similar meanings. Embeddings capture this by placing similar words close together in vector space.
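One way to see the hypothesis in action is to count context words directly. The sketch below is an illustration under assumptions: the five-sentence corpus is made up, `context_counts` and `cosine` are hypothetical helpers, and raw co-occurrence counts stand in for learned embeddings.

```python
from collections import Counter
import math

# Tiny made-up corpus; real embeddings are trained on far more text.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "the car needs fuel",
]

def context_counts(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word != target:
                continue
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            # Add every neighbour in the window except the target itself.
            counts.update(w for j, w in enumerate(words[lo:hi], start=lo) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

cat, dog, car = (context_counts(w, corpus) for w in ("cat", "dog", "car"))
print(f"cat vs dog: {cosine(cat, dog):.2f}")
print(f"cat vs car: {cosine(cat, car):.2f}")
```

On this toy corpus "cat" and "dog" share the contexts "drinks" and "chases", so their count vectors are closer to each other than either is to "car". The shared word "the" inflates every score, which is one reason practical methods reweight or learn the vectors rather than using raw counts.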
      

Visualizing Embeddings

Real embeddings have hundreds or thousands of dimensions, but we can visualize the concept in 2D. Imagine words arranged in space such that:

Synonyms are close together
Related concepts form clusters
Directions in space represent semantic relationships

[Figure: 2D Embedding Visualization (Simplified). Words cluster by meaning. Directions capture relationships (gender, plural, etc.)]

The Famous Word Analogies

Perhaps the most striking demonstration of embeddings is word analogies:

king − man + woman ≈ queen

"Man is to woman as king is to queen"

This works because embeddings capture semantic relationships as vector directions. The offset within each pair below is roughly the same "gender" direction, whether or not both words also share a royalty component:

king → queen (gender; both royal)
man → woman (gender)
actor → actress (gender)
prince → princess (gender; both royal)
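The arithmetic can be sketched with hand-built vectors. Everything here is an assumption made for illustration: the 2D values are invented rather than learned, the axes are labeled "royalty" and "gender" by construction, and `analogy` is a hypothetical helper. Excluding the three query words from the candidates follows common practice in analogy evaluation.

```python
import math

# Hand-built 2-D toy vectors (invented for illustration, not learned).
# Axis 0 ~ "royalty"; axis 1 ~ "gender" (negative = male, positive = female).
vectors = {
    "king":     [0.9, -0.8],
    "queen":    [0.9,  0.8],
    "man":      [0.1, -0.8],
    "woman":    [0.1,  0.8],
    "prince":   [0.6, -0.7],
    "princess": [0.6,  0.7],
    "apple":    [-0.7, 0.0],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the query words so the answer is not trivially one of the inputs.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

With learned embeddings the result is only approximate, which is why the analogy is written with ≈ rather than =.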

More Analogy Examples

Paris − France + Italy ≈ Rome (capital cities)

walking − walk + swim ≈ swimming (verb → gerund)

bigger − big + small ≈ smaller (comparative)

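Why does this work? Training embeddings to predict context (roughly: "what word is likely to appear near X?") forces them to encode semantic relationships. A common explanation for the linear structure is that models such as Word2Vec score a word and a context word with a dot product of their vectors, so consistent differences in context tend to show up as roughly consistent vector offsets.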
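To make the prediction task concrete, here is a minimal sketch of how skip-gram style training pairs could be generated from a sentence. The function name `skipgram_pairs` and the example sentence are assumptions for illustration; the training step itself (for example negative sampling) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: every word within `window` of the center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

tokens = "the queen rules the kingdom".split()
pairs = list(skipgram_pairs(tokens, window=1))
for center, context in pairs:
    print(f"{center:>8} -> {context}")
```

Each pair is one training example of the form "given the center word, predict this neighbour". Words that occur with similar neighbours receive similar training signals, which is what pulls their vectors together.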
      

Measuring Similarity: Cosine Similarity

How do we quantify how similar two embeddings are? We use cosine similarity:

similarity(u, v) = (u · v) / (‖u‖ ‖v‖) = cos(θ)

This measures the cosine of the angle between two vectors, ignoring their magnitude:

1.0 = identical direction (maximally similar)
0.0 = orthogonal (unrelated)
-1.0 = opposite directions (rare for real word embeddings; antonyms usually score positive because they appear in similar contexts)
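The formula translates almost directly into code. This is a minimal pure-Python sketch; the function name and the zero-vector check are choices made here, and in practice a vectorized library routine would normally be used instead.

```python
import math

def cosine_similarity(u, v):
    """Return (u . v) / (||u|| ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])
print(same_direction, orthogonal, opposite)
```

The three calls correspond to the three reference points in the list above: same direction, orthogonal, and opposite direction.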

Cosine Similarity Examples

Word Pair	Similarity	Relationship
king ↔ queen	0.72	Related (royalty)
king ↔ man	0.58	Related (male)
king ↔ apple	0.12	Unrelated
hot ↔ cold	0.45	Antonyms (but related)
cat ↔ feline	0.81	Synonyms

Why Cosine and Not Euclidean?

We use cosine similarity because:

Direction matters more than magnitude: "King" and "kings" should be similar despite different frequencies
Normalization: Common words don't dominate just because they have larger vectors
Interpretability: Cosine is a function of the angle between the two vectors and always lies between -1 and 1
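A small numeric comparison illustrates the first two points. The vectors are arbitrary values chosen for this sketch, not real embeddings, and both helper functions are written here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

v = [3.0, 4.0]
scaled = [10 * x for x in v]   # same direction, 10x the magnitude
rotated = [4.0, 3.0]           # same magnitude, different direction

for name, other in (("scaled", scaled), ("rotated", rotated)):
    print(f"{name:>7}: cosine={cosine_similarity(v, other):.2f}, "
          f"euclidean={euclidean_distance(v, other):.2f}")
```

Euclidean distance treats the scaled copy as far away (45.00) and the rotated vector as close (1.41), while cosine similarity ranks them the other way round (1.00 versus 0.96) because it only looks at direction.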

Contextual Embeddings: The Game Changer

Traditional embeddings (Word2Vec, GloVe) give each word a single vector. But words have different meanings in different contexts:

"I went to the bank to deposit money" (financial)
"I sat by the river bank" (geographic)
"The plane banked sharply" (verb)

Modern LLMs use contextual embeddings: the vector for "bank" depends on the surrounding words. Producing these context-dependent vectors is a core capability of transformer models.

Key Difference:

• Static embeddings: word → vector (one fixed vector per word)

• Contextual embeddings: (word, context) → vector (one word can map to many vectors)

How Contextual Embeddings Work

In transformers, each token's representation is computed by attending to all other tokens in the sequence. We'll cover the details in Level 3, but the intuition is:

Start with static embeddings (like we discussed)
Let each token "look at" other tokens in the sentence
Update each token's vector by mixing in information from the tokens it looked at
"Bank" near "money" gets financial features; "bank" near "river" gets geographic features

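The steps above can be sketched as a toy context-mixing function. This is a heavy simplification and rests on several assumptions: the 2D static vectors are invented, `contextualize` is a hypothetical helper, and raw dot products replace the learned query, key, and value projections that real transformers use.

```python
import math

# Hand-picked 2-D static vectors (invented for illustration).
# Axis 0 ~ "finance"; axis 1 ~ "geography".
static = {
    "bank":  [0.5, 0.5],
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "the":   [0.1, 0.1],
}

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens):
    """Replace each token's vector with a similarity-weighted mix of all tokens."""
    vecs = [static[t] for t in tokens]
    out = []
    for query in vecs:
        # How strongly this token "looks at" every token in the sentence.
        scores = [sum(a * b for a, b in zip(query, key)) for key in vecs]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, vecs))
                 for d in range(len(query))]
        out.append(mixed)
    return out

financial = contextualize(["the", "bank", "money"])[1]
geographic = contextualize(["the", "bank", "river"])[1]
print("bank near money:", [round(x, 3) for x in financial])
print("bank near river:", [round(x, 3) for x in geographic])
```

The same static vector for "bank" ends up leaning toward the finance axis in one sentence and toward the geography axis in the other, which is the behaviour the last step in the list describes.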
Implementing an Embedding Layer

In code, an embedding layer is just a lookup table:

import torch
import torch.nn as nn

# Create an embedding layer
# vocab_size = 50000, embedding_dim = 768
embedding = nn.Embedding(50000, 768)

# Input: token indices (batch_size=2, seq_len=3)
input_tokens = torch.tensor([[1, 45, 892],
                             [23, 1, 3345]])

# Output: embedded vectors (2, 3, 768)
embedded = embedding(input_tokens)
print(embedded.shape)  # torch.Size([2, 3, 768])

# Each token index maps to a row in the embedding matrix
# embedding.weight has shape (50000, 768)
print(embedding.weight.shape)  # torch.Size([50000, 768])
      

The embedding matrix is learned during training. Initially random, it gradually organizes itself to place similar words near each other.
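The lookup-table view can be checked by hand on a tiny matrix. The sketch below uses plain Python lists rather than PyTorch, and the weight values and helper names are placeholders invented for illustration. It shows that selecting a row by index gives the same result as multiplying a one-hot vector by the matrix.

```python
# A tiny embedding matrix: 4 tokens in the vocabulary, 3 dimensions each.
# The values are arbitrary placeholders; a trained model would learn them.
weight = [
    [0.1, 0.2, 0.3],   # token 0
    [0.4, 0.5, 0.6],   # token 1
    [0.7, 0.8, 0.9],   # token 2
    [1.0, 1.1, 1.2],   # token 3
]

def embed(token_ids):
    """Look up one row of the matrix per token index."""
    return [weight[i] for i in token_ids]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def row_times_matrix(row, matrix):
    """Multiply a 1 x V row vector by a V x D matrix."""
    return [sum(row[i] * matrix[i][d] for i in range(len(matrix)))
            for d in range(len(matrix[0]))]

token_ids = [2, 0, 2]
looked_up = embed(token_ids)
via_one_hot = [row_times_matrix(one_hot(i, len(weight)), weight) for i in token_ids]
print(looked_up == via_one_hot)  # True
```

Indexing is preferred in practice because it avoids building and multiplying large, mostly-zero one-hot vectors.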

Pre-trained Embeddings

You don't always need to train embeddings from scratch. Popular pre-trained options:

GloVe: Trained on word co-occurrence statistics
Word2Vec: Skip-gram or CBOW training
FastText: Handles out-of-vocabulary words via subword units
BERT/GPT embeddings: Contextual embeddings from transformer models
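To illustrate the subword idea behind FastText, the sketch below extracts character n-grams from a word wrapped in boundary markers. `char_ngrams` is a hypothetical helper written for this lesson, not the FastText library API, and the sketch omits other parts of the method such as the whole-word vector and n-gram hashing.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in < and > markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Two related words share many n-grams, so an unseen word can borrow
# information from words that were in the training vocabulary.
shared = set(char_ngrams("swimming")) & set(char_ngrams("swimmer"))
print(sorted(shared))
```

Because a word's vector is built from the vectors of its n-grams, a word that never appeared in training can still receive a reasonable embedding from the pieces it shares with known words.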