Lesson 2 of 10 in Level 01

Embeddings in Detail

Understanding how embeddings capture meaning, word analogies, and the geometry of semantic space.

What Are Embeddings Really?

An embedding is a mapping from discrete tokens to continuous vectors. But what do these vectors mean? Why do they capture semantic relationships?

The Distributional Hypothesis: Words that appear in similar contexts tend to have similar meanings. Embeddings capture this by placing similar words close together in vector space.
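As a rough illustration of the distributional hypothesis, the sketch below counts which words co-occur within a small window in a made-up corpus and compares the resulting count vectors. The corpus, the window size, and the helper names `cooccurrence_vectors` and `cosine` are assumptions chosen for illustration; real embeddings are learned from far larger text and are dense rather than raw counts.

```python
from collections import Counter, defaultdict
import math

# Toy corpus, invented for illustration
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell on the market",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)  # Counter returns 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["cat"], vectors["dog"]))     # many shared contexts: high
print(cosine(vectors["cat"], vectors["stocks"]))  # no shared contexts: 0.0
```

In this toy corpus "cat" and "dog" appear next to the same words ("the", "drinks", "chases"), so their vectors point in nearly the same direction, while "stocks" shares no context words with "cat".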

Visualizing Embeddings

Real embeddings have hundreds or thousands of dimensions, but we can visualize the concept in 2D. Imagine words arranged in space such that:

[Figure: 2D Embedding Visualization (Simplified). Labeled points: king, queen, prince, princess (with a "gender" direction); cat, dog (pet); pizza, pasta (food).]

Words cluster by meaning. Directions capture relationships (gender, plural, etc.)
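One common way to produce such a 2D picture is to project the high-dimensional vectors onto their top two principal components. The sketch below does this with plain NumPy. It is only an illustration: the vectors are random stand-ins rather than real embeddings, and `project_to_2d` is a hypothetical helper name.

```python
import numpy as np

def project_to_2d(vectors):
    """Project row vectors onto their top two principal components (PCA via SVD)."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
words = ["king", "queen", "cat", "dog", "pizza", "pasta"]
fake_embeddings = rng.normal(size=(len(words), 300))  # stand-in for real vectors

points = project_to_2d(fake_embeddings)
for word, (x, y) in zip(words, points):
    print(f"{word:>6}: ({x:+.2f}, {y:+.2f})")
```

With random vectors the points show no meaningful clusters. With trained embeddings, related words would be expected to land near each other, although a 2D projection always discards most of the structure.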

The Famous Word Analogies

Perhaps the most striking demonstration of embeddings is word analogies:

king − man + woman ≈ queen

"Man is to woman as king is to queen"

This works because embeddings capture semantic relationships as vector directions. The "gender" direction is roughly the same across such word pairs: the offset from "man" to "woman" is close to the offset from "king" to "queen".

More Analogy Examples

Paris − France + Italy ≈ Rome (capital cities)
walking − walk + swim ≈ swimming (verb → gerund)
bigger − big + small ≈ smaller (comparative)
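Why does this work? Training embeddings to predict context forces them to encode semantic relationships. Linear structure emerges because the prediction task is roughly linear: models such as Word2Vec answer "what word is likely to appear near X?" by scoring dot products between vectors, so consistent relationships tend to show up as consistent vector offsets.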
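The analogy arithmetic can be sketched with a handful of hand-built vectors. Everything here is an assumption for illustration: the three dimensions (royalty, gender, food), the values, and the `analogy` helper are invented, not taken from a trained model. As is usual for this kind of query, the input words themselves are excluded from the candidates.

```python
import numpy as np

# Hand-built toy vectors; dimensions are (royalty, gender, food).
# Values are invented for illustration, not learned.
toy = {
    "king":  np.array([0.9,  0.8, 0.0]),
    "queen": np.array([0.9, -0.8, 0.0]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy))  # queen
```

In real embeddings the result is only approximately equal to the target vector, which is why the lookup uses nearest-neighbor search rather than exact matching.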

Measuring Similarity: Cosine Similarity

How do we quantify how similar two embeddings are? We use cosine similarity:

similarity(u, v) = (u · v) / (||u|| ||v||) = cos(θ)

This measures the cosine of the angle between two vectors, ignoring their magnitude:

Cosine Similarity Examples

Word Pair       Similarity   Relationship
king ↔ queen    0.72         Related (royalty)
king ↔ man      0.58         Related (male)
king ↔ apple    0.12         Unrelated
hot ↔ cold      0.45         Antonyms (but related)
cat ↔ feline    0.81         Synonyms
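The formula translates directly into a few lines of NumPy. This is a minimal sketch with small made-up vectors so the arithmetic can be checked by hand; `cosine_similarity` is a helper defined here, not a library function.

```python
import numpy as np

def cosine_similarity(u, v):
    """(u · v) / (||u|| ||v||), the cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# u · v = 3*4 + 4*3 = 24 and ||u|| = ||v|| = 5, so the similarity is 24 / 25
print(cosine_similarity(u, v))       # 0.96
print(cosine_similarity(u, 10 * u))  # 1.0: same direction, different magnitude
print(cosine_similarity(u, -u))      # -1.0: opposite direction
```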

Why Cosine and Not Euclidean?

We use cosine similarity because:

  1. Direction matters more than magnitude: "King" and "kings" should be similar despite different frequencies
  2. Normalization: Common words don't dominate just because they have larger vectors
  3. Interpretability: Cosine directly relates to the angle between meanings
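To make the first two points concrete, the sketch below rescales one vector and compares how Euclidean distance and cosine similarity respond. It then checks a standard identity: for unit-length vectors, squared Euclidean distance equals 2(1 − cos θ), so the two measures agree once magnitude is normalized away. The vectors are arbitrary illustrative values.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 1.0, 3.0])

# Scaling u changes the Euclidean distance to v but not the cosine similarity
for scale in (1.0, 5.0, 50.0):
    scaled = scale * u
    euclidean = np.linalg.norm(scaled - v)
    print(f"scale={scale:>4}: euclidean={euclidean:7.2f}, "
          f"cosine={cosine_similarity(scaled, v):.4f}")

# After L2-normalizing both vectors, squared Euclidean distance and
# cosine similarity carry the same information.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
squared_distance = float(np.sum((u_hat - v_hat) ** 2))
print(np.isclose(squared_distance, 2 * (1 - cosine_similarity(u, v))))  # True
```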

Contextual Embeddings: The Game Changer

Traditional embeddings (Word2Vec, GloVe) give each word a single vector. But words have different meanings in different contexts: "bank" means one thing near "money" and another near "river".

Modern LLMs use contextual embeddings: the vector for "bank" depends on the surrounding words. This is a central feature of transformer models.

Key Difference:
• Static embeddings: word → vector (one fixed vector per word)
• Contextual embeddings: (word, context) → vector (a different vector for each context the word appears in)

How Contextual Embeddings Work

In transformers, each token's representation is computed by attending to all other tokens in the sequence. We'll cover the details in Level 3, but the intuition is:

  1. Start with static embeddings (like we discussed)
  2. Let each token "look at" other tokens in the sentence
  3. Update embeddings based on what would help predict each token's context
  4. "Bank" near "money" gets financial features; "bank" near "river" gets geographic features

Implementing an Embedding Layer

In code, an embedding layer is just a lookup table:

    import torch
    import torch.nn as nn

    # Create an embedding layer
    # vocab_size = 50000, embedding_dim = 768
    embedding = nn.Embedding(50000, 768)

    # Input: token indices (batch_size=2, seq_len=3)
    input_tokens = torch.tensor([[1, 45, 892], [23, 1, 3345]])

    # Output: embedded vectors (2, 3, 768)
    embedded = embedding(input_tokens)
    print(embedded.shape)  # torch.Size([2, 3, 768])

    # Each token index maps to a row in the embedding matrix
    # embedding.weight has shape (50000, 768)
    print(embedding.weight.shape)  # torch.Size([50000, 768])

The embedding matrix is learned during training. Initially random, it gradually organizes itself to place similar words near each other.
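A short sketch of what "learned" means in practice: the forward pass is row indexing, and a backward pass sends gradient only to the rows that were looked up. The sizes are smaller than in the example above to keep it light, and the loss is a stand-in (a plain sum), not a real training objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Smaller vocabulary and dimension than the example above (illustrative sizes)
embedding = nn.Embedding(num_embeddings=5000, embedding_dim=16)
input_tokens = torch.tensor([[1, 45, 892], [23, 1, 3345]])

embedded = embedding(input_tokens)

# The forward pass is plain row indexing into the weight matrix
assert torch.equal(embedded, embedding.weight[input_tokens])

# Stand-in loss; a real model would compute this from its predictions
loss = embedded.sum()
loss.backward()

grad = embedding.weight.grad
print(grad[1])   # token 1 appears twice, so its row gets gradient from both positions
print(grad[45])  # token 45 appears once
print(grad[0])   # token 0 was not in the batch: all zeros
```

An optimizer step would therefore move only the rows for tokens seen in the batch, which is how the matrix gradually reorganizes as training sees more text.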

Pre-trained Embeddings

You don't always need to train embeddings from scratch. Popular pre-trained options: