Level 01 • Lesson 3

Vocabulary and Token IDs

Understanding how tokens map to integers, vocabulary size trade-offs, and the encoding/decoding process.

From Tokens to Numbers

In the previous lessons, we learned how text is split into tokens. But neural networks can't process "hello" or "world"; they need numbers. This is where the vocabulary comes in.

The Vocabulary: A fixed-size dictionary that maps each unique token to a unique integer ID. Common vocabularies range from roughly 30,000 to 200,000 tokens.

The Encoding Pipeline

Text → Tokens → Token IDs

Text: "Hello world"
→ Tokens: ["Hello", " world"]
→ Token IDs: [15496, 995]
→ Embeddings: [[0.1, -0.3, ...], [...]]

Each token in the vocabulary has a unique integer ID. These IDs are used to look up the corresponding embedding vectors from the embedding matrix.
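
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not any real model's data: the three-token vocabulary, the IDs, and the 4-dimensional embedding values are all made up.

```python
# Toy vocabulary and embedding matrix; all values are made up for illustration.
vocab = {"Hello": 0, " world": 1, "!": 2}

# One row per token ID, each row a 4-dimensional embedding vector.
embedding_matrix = [
    [0.1, -0.3, 0.7, 0.2],   # row 0: "Hello"
    [0.5, 0.0, -0.1, 0.9],   # row 1: " world"
    [-0.4, 0.6, 0.3, -0.2],  # row 2: "!"
]

tokens = ["Hello", " world"]
token_ids = [vocab[token] for token in tokens]         # token -> integer ID
embeddings = [embedding_matrix[i] for i in token_ids]  # ID -> row lookup

print(token_ids)
print(embeddings[0])
```

The token ID is nothing more than a row index, which is why the IDs themselves carry no meaning; only the learned rows do.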

Vocabulary Size Trade-offs

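Choosing vocabulary size is a crucial design decision. It affects the size of the embedding matrix and output layer (a larger vocabulary means more parameters) and the length of tokenized sequences (a smaller vocabulary means more tokens for the same text).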

Vocabulary Sizes in Popular Models

Model Vocabulary size
GPT-2 50,257
BERT 30,522
T5 32,128
GPT-4 100,256
LLaMA 32,000
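
One effect of vocabulary size that is easy to quantify is the size of the embedding matrix, which has one row per token. The sketch below multiplies a few of the vocabulary sizes listed above by a hidden size of 768. That hidden size is an assumption chosen purely for illustration and is not meant to describe every model in the list; `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 768  # assumed for illustration only

for name, vocab_size in [("BERT", 30_522), ("GPT-2", 50_257), ("GPT-4", 100_256)]:
    params = embedding_params(vocab_size, HIDDEN_SIZE)
    print(f"{name}: {params / 1e6:.1f}M embedding parameters")
```

At a fixed hidden size, doubling the vocabulary doubles this count, which is one side of the trade-off.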

Why Not Just Use Characters?

You might wonder: why not just use 256 tokens (one per byte)? Then we'd never have unknown words!

The Character Problem:
"Hello world" as characters: 11 tokens
"Hello world" as BPE tokens: 2 tokens

In this example the sequence becomes about 5x longer (11 tokens instead of 2), which makes training slower and the context window effectively about 5x smaller.

The sweet spot is subword tokenization: common words get their own token, while rare words get split into meaningful subpieces.
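
To make the comparison concrete, here is a minimal sketch that counts character tokens against subword tokens. It uses a greedy longest-match splitter, which is a simplification and not the BPE algorithm itself; `greedy_tokenize` and the five-entry `toy_vocab` are hypothetical names invented for this example.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens


toy_vocab = {"Hello", " world", "un", "believ", "able"}  # hypothetical subword vocabulary

text = "Hello world"
char_tokens = list(text)
subword_tokens = greedy_tokenize(text, toy_vocab)

print(len(char_tokens), len(subword_tokens))
print(greedy_tokenize("unbelievable", toy_vocab))
```

The common phrase collapses into two tokens, while the rarer word is still representable as three subpieces rather than twelve characters.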

Special Tokens Explained

Most vocabularies include special tokens that serve specific purposes. The exact tokens and their IDs differ from model to model:

Token ID (example) Purpose
<|endoftext|> 50256 Marks end of document / separates documents
<|startoftext|> 50257 Marks beginning of generation
<|pad|> 50258 Padding for batching variable-length sequences
<|unk|> 50259 Unknown token (rarely used with BPE)
<|im_start|> 50260 Start of instruction/chat message
<|im_end|> 50261 End of instruction/chat message

Chat Template Example

# Modern LLMs use special tokens to structure conversations
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

These special tokens help the model understand the structure of the conversation and differentiate between system instructions, user messages, and assistant responses.
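
A template like this is usually produced by code rather than typed by hand. The following is a minimal sketch of such a formatter, assuming the `<|im_start|>` / `<|im_end|>` layout shown above. `format_chat` is a hypothetical helper, and the exact whitespace and newline conventions vary between models.

```python
def format_chat(messages: list[dict[str, str]]) -> str:
    """Wrap each message in <|im_start|>role ... <|im_end|> markers."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>")
    return "\n".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = format_chat(conversation)
print(prompt)
```

In a real tokenizer each marker would then be encoded as a single special-token ID rather than as ordinary text.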

Encoding and Decoding

The tokenizer provides two key operations:

# Using a tokenizer (conceptual)

# Encode: text → tokens → token IDs
text = "Hello world!"
tokens = tokenizer.tokenize(text)    # ['Hello', ' world', '!']
token_ids = tokenizer.encode(text)   # [15496, 995, 0] (example IDs)

# Decode: token IDs → tokens → text
decoded = tokenizer.decode(token_ids)  # "Hello world!"

# Note: decode(encode(text)) should equal text
# (with some exceptions for normalization)
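
For a version that actually runs, here is a minimal sketch of the same three operations on a toy tokenizer. `ToyTokenizer` and its three-entry vocabulary are invented for illustration; the IDs do not come from any real model, and the longest-match splitting is a stand-in for a trained tokenizer.

```python
class ToyTokenizer:
    """Toy tokenizer with a hand-written vocabulary; IDs are not from any real model."""

    def __init__(self, vocab: dict[str, int]) -> None:
        self.token_to_id = vocab
        self.id_to_token = {token_id: token for token, token_id in vocab.items()}

    def tokenize(self, text: str) -> list[str]:
        tokens = []
        position = 0
        while position < len(text):
            # Pick the longest vocabulary entry that matches at this position.
            # max() raises ValueError if nothing matches (no unknown-token handling here).
            candidates = [t for t in self.token_to_id if text.startswith(t, position)]
            match = max(candidates, key=len)
            tokens.append(match)
            position += len(match)
        return tokens

    def encode(self, text: str) -> list[int]:
        return [self.token_to_id[token] for token in self.tokenize(text)]

    def decode(self, token_ids: list[int]) -> str:
        return "".join(self.id_to_token[token_id] for token_id in token_ids)


tokenizer = ToyTokenizer({"Hello": 0, " world": 1, "!": 2})
token_ids = tokenizer.encode("Hello world!")
print(token_ids)
print(tokenizer.decode(token_ids) == "Hello world!")
```

Because decoding simply concatenates the token strings, the leading space stored inside " world" is what restores the original spacing.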

Round-trip Encoding

A good tokenizer should satisfy: decode(encode(text)) ≈ text

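However, there can be minor differences due to normalization applied before tokenization, such as lowercasing, Unicode normalization, or whitespace cleanup.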
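
As one illustration of how normalization can break an exact round trip, the sketch below uses a character-level toy tokenizer whose IDs are simply Unicode code points. The choice of lowercasing plus NFKC normalization is an assumption made for the example; real tokenizers differ in which normalization steps, if any, they apply.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and lowercasing before tokenization."""
    return unicodedata.normalize("NFKC", text).lower()


def encode(text: str) -> list[int]:
    # Character-level toy encoding: one ID (the code point) per normalized character.
    return [ord(ch) for ch in normalize(text)]


def decode(token_ids: list[int]) -> str:
    return "".join(chr(token_id) for token_id in token_ids)


original = "\uFB01ne Tokens"  # starts with the single-character ligature "fi" (U+FB01)
round_trip = decode(encode(original))

print(round_trip)
print(round_trip == original)
```

The decoded text reads the same to a human, but the ligature has been expanded and the capital letter lost, so the strings are no longer equal.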

Tokenization Quirks and Edge Cases

1. Leading Whitespace Matters

Most tokenizers treat "hello" and " hello" as different tokens:

# In the GPT-2 tokenizer (example IDs):
"hello" → [31373]   # single token
" hello" → [2031]   # different token!

This is why you'll see spaces attached to words in tokenized output.
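
The attached spaces come from the pre-tokenization step, which splits text into word-like chunks before any subword merging happens. Below is a minimal sketch using a simplified, ASCII-only regular expression. The pattern is an assumption written for this example in the spirit of GPT-2-style pre-tokenization; it is not the exact pattern used by any tokenizer.

```python
import re

# Simplified, ASCII-only stand-in for a GPT-2-style pre-tokenization pattern:
# an optional leading space followed by letters, digits, or punctuation.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

pieces = PRETOKENIZE.findall("hello hello world!")
print(pieces)
```

The first "hello" has no leading space and the second one does, so the two chunks are different strings and end up with different token IDs.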

2. Case Sensitivity

"Hello", "hello", and "HELLO" are usually different tokens (example IDs):

"Hello" → [15496]
"hello" → [31373]
"HELLO" → [21891]

3. Numbers Get Split

Numbers are tokenized based on how frequently each digit sequence appears in the training data (example IDs):

"1" → [16]         # single token
"12" → [530]       # single token
"123" → [1613]     # single token
"1234" → [12983]   # or it might be split into "12" + "34"

Why this matters: LLMs can struggle with arithmetic because numbers get split unpredictably. "123 + 456" might be tokenized as ["12", "3", " +", " 45", "6"], making it harder to learn mathematical patterns.
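
The unpredictability can be reproduced with a toy example. The sketch below assumes that only a handful of digit strings are frequent enough to have their own token and splits everything else greedily. `split_number` and the `known` set are hypothetical, and real tokenizers learn their splits from data rather than from a hand-written list.

```python
def split_number(digits: str, known_chunks: set[str]) -> list[str]:
    """Greedy longest-match split of a digit string into known chunks."""
    pieces = []
    i = 0
    while i < len(digits):
        for size in range(len(digits) - i, 0, -1):
            chunk = digits[i:i + size]
            # Single digits are always allowed as a fallback.
            if chunk in known_chunks or size == 1:
                pieces.append(chunk)
                i += size
                break
    return pieces


# Hypothetical set of digit strings frequent enough to have their own token.
known = {"12", "123", "45", "100", "2024"}

for number in ["123", "1234", "456", "2024", "2025"]:
    print(number, "->", split_number(number, known))
```

Two numbers that differ by one, such as 2024 and 2025, can end up with completely different token boundaries, so the model never sees a consistent digit-by-digit layout.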

4. Unicode and Emojis

Unicode characters can explode into many tokens:

"🎉" → [up to 4 tokens]   # party popper (4 UTF-8 bytes)
"👨‍👩‍👧‍👦" → [10+ tokens]   # family emoji (multiple components)
"中文" → [varies]   # Chinese characters

This is why using emojis in prompts can "waste" your context window!
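
Part of the reason is visible without any tokenizer at all. Byte-level tokenizers start from the UTF-8 bytes of the text, so the byte count is an upper bound on how many tokens a string can produce. This minimal sketch compares code points and UTF-8 bytes for a few strings; the characters are written as escape sequences so the snippet does not depend on editor or font support.

```python
samples = {
    "party popper": "\U0001F389",
    "family emoji": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
    "Chinese text": "\u4e2d\u6587",
    "ASCII text": "hi",
}

for label, text in samples.items():
    byte_count = len(text.encode("utf-8"))
    print(f"{label}: {len(text)} code points, {byte_count} UTF-8 bytes")
```

The family emoji is four separate emoji joined by three invisible zero-width joiners, which is why a single visible glyph can cost so many tokens.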

Practice: Counting Tokens

Token Counter Exercise

Estimate how many tokens each phrase would be:

Phrase Characters Your Guess Actual Tokens
"The quick brown fox" 19 ? ~4-5
"uncharacteristically" 20 ? ~3-4
"1234567890" 10 ? ~2-4
"🎉🎊🎈" 3 ? ~6-12
"https://example.com" 19 ? ~5-8

Rule of thumb: In English, words ≈ 0.75 × tokens (one token is about three-quarters of a word). So 100 tokens ≈ 75 words.
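
The "100 tokens ≈ 75 words" figure can be turned into a quick estimator. This is a rough sketch for English text only: `estimate_tokens` is a hypothetical helper that counts whitespace-separated words, and it is no substitute for running the actual tokenizer of the model you are using.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio for English text


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


sample = " ".join(["word"] * 75)
print(estimate_tokens(sample))
print(estimate_tokens("The quick brown fox"))
```

Expect the estimate to be far off for code, URLs, numbers, emojis, and non-English text, where the quirks above dominate.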