Lesson 3: Vocabulary and Token IDs

From Tokens to Numbers

In the previous lessons, we learned how text is split into tokens. But neural networks can't process "hello" or "world" — they need numbers. This is where the vocabulary comes in.

        The Vocabulary: A fixed-size dictionary that maps each unique token to a unique integer ID. 
        Common vocabularies range from 32,000 to 200,000 tokens.
      

The Encoding Pipeline

Text → Tokens → Token IDs → Embeddings

Text: "Hello world"
→ Tokens: ["Hello", " world"]
→ Token IDs: [15496, 995]
→ Embeddings: [[0.1, -0.3, ...], [...]]

Each token in the vocabulary has a unique integer ID. These IDs are used to look up the corresponding embedding vectors from the embedding matrix.
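
The lookup from tokens to IDs to embedding vectors can be sketched in a few lines. This is a toy illustration only: the three-entry `vocab`, the small IDs, and the 4-dimensional `embedding_matrix` values are made up for the example and do not come from any real model.

```python
# Toy vocabulary: token string -> integer ID (hypothetical values).
vocab = {"Hello": 0, " world": 1, "!": 2}

# Toy embedding matrix: one row per token ID, 4 dimensions per row.
embedding_matrix = [
    [0.1, -0.3, 0.5, 0.0],   # row 0 -> "Hello"
    [0.7, 0.2, -0.1, 0.4],   # row 1 -> " world"
    [-0.2, 0.0, 0.3, 0.9],   # row 2 -> "!"
]

tokens = ["Hello", " world"]

# Step 1: map each token to its integer ID.
token_ids = [vocab[token] for token in tokens]

# Step 2: use each ID as a row index into the embedding matrix.
embeddings = [embedding_matrix[token_id] for token_id in token_ids]

print(token_ids)   # [0, 1]
print(embeddings)
```

In a real model the matrix has one row per vocabulary entry (tens of thousands of rows) and the values are learned during training.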

Vocabulary Size Trade-offs

Choosing vocabulary size is a crucial design decision. It affects:

Memory: Larger vocabulary = larger embedding matrix
Sequence length: Smaller vocabulary = more tokens per word
Out-of-vocabulary rate: Larger vocabulary = fewer unknown tokens
Computation: Larger vocabulary = more expensive softmax
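
To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch. The hidden size of 4096 and the 2 bytes per parameter (16-bit weights) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def embedding_megabytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> float:
    """Approximate size of the embedding matrix in megabytes (10^6 bytes)."""
    return embedding_params(vocab_size, d_model) * bytes_per_param / 1e6


D_MODEL = 4096  # assumed hidden size, not taken from any specific model

for vocab_size in (32_000, 50_257, 100_256):
    params = embedding_params(vocab_size, D_MODEL)
    size_mb = embedding_megabytes(vocab_size, D_MODEL)
    print(f"vocab={vocab_size:>7,}  params={params:>12,}  ~{size_mb:.1f} MB")
```

The cost grows linearly with vocabulary size, and the output (softmax) layer pays a similar price when it is not shared with the input embedding.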

Vocabulary Sizes in Popular Models

GPT-2
50,257

BERT
30,522

T5
32,128

GPT-4
100,256

LLaMA
32,000

Why Not Just Use Characters?

You might wonder: why not just use 256 tokens (one per byte)? Then we'd never have unknown words!

        The Character Problem:

        "Hello world" as characters: 11 tokens

        "Hello world" as BPE tokens: 2 tokens

        Sequences become about 5.5x longer (11 tokens instead of 2), which slows training and shrinks the effective context window by the same factor.

The sweet spot is subword tokenization — common words get their own token, rare words get split into meaningful subpieces.
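
The length difference is easy to check directly. In this small sketch the subword split is copied from the lesson's example rather than produced by a real tokenizer.

```python
text = "Hello world"

char_tokens = list(text)                  # one token per character
byte_tokens = list(text.encode("utf-8"))  # one token per UTF-8 byte
subword_tokens = ["Hello", " world"]      # split taken from the lesson's example

ratio = len(char_tokens) / len(subword_tokens)

print(len(char_tokens))     # 11
print(len(byte_tokens))     # 11 (ASCII text: one byte per character)
print(len(subword_tokens))  # 2
print(ratio)                # 5.5
```

For non-ASCII text the byte count exceeds the character count, so byte-level sequences get even longer.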

Special Tokens Explained

Every vocabulary includes special tokens that serve specific purposes:

Token	ID (example)	Purpose
<|endoftext|>	50256	Marks end of document / separates documents
<|startoftext|>	50257	Marks beginning of generation
<|pad|>	50258	Padding for batching variable-length sequences
<|unk|>	50259	Unknown token (rarely used with BPE)
<|im_start|>	50260	Start of instruction/chat message
<|im_end|>	50261	End of instruction/chat message

Chat Template Example

# Modern LLMs use special tokens to structure conversations

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
      

These special tokens help the model understand the structure of the conversation and differentiate between system instructions, user messages, and assistant responses.
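
A conversation can be rendered into this layout with a small helper. This is a minimal sketch that follows the template shown in this lesson; the function name `render_chat` and the list-of-dicts message format are assumptions for illustration, not the API of any particular library.

```python
def render_chat(messages):
    """Join role/content pairs using <|im_start|> and <|im_end|> markers."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

prompt = render_chat(messages)
print(prompt)
```

In practice the markers must be encoded as single special-token IDs rather than as ordinary text, which is why tokenizers treat them separately from the regular vocabulary.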

Encoding and Decoding

The tokenizer provides two key operations:

encode(text) → token_ids: Convert text to integers
decode(token_ids) → text: Convert integers back to text

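# Using a tokenizer (sketch; assumes the Hugging Face transformers package and its GPT-2 tokenizer)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode: text → tokens → token IDs
text = "Hello world!"
tokens = tokenizer.tokenize(text)
# ['Hello', 'Ġworld', '!']  (GPT-2 displays the leading space as 'Ġ')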

token_ids = tokenizer.encode(text)
# [15496, 995, 0]  (example IDs)

# Decode: token IDs → tokens → text
decoded = tokenizer.decode(token_ids)
# "Hello world!"

# Note: decode(encode(text)) should equal text
# (with some exceptions for normalization)
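
To see the two operations without installing anything, here is a self-contained toy tokenizer. It is a sketch under stated assumptions: the class `ToyTokenizer`, its five-entry vocabulary, and its IDs are hypothetical, and it uses greedy longest-match lookup rather than the merge rules a real BPE tokenizer applies.

```python
class ToyTokenizer:
    """Greedy longest-match tokenizer over a tiny fixed vocabulary (illustration only)."""

    def __init__(self, vocab):
        self.token_to_id = dict(vocab)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}
        self.max_len = max(len(t) for t in self.token_to_id)

    def tokenize(self, text):
        tokens, pos = [], 0
        while pos < len(text):
            # Try the longest candidate first, then shrink until one matches.
            for length in range(min(self.max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + length]
                if piece in self.token_to_id:
                    tokens.append(piece)
                    pos += length
                    break
            else:
                raise ValueError(f"no token covers {text[pos]!r}")
        return tokens

    def encode(self, text):
        return [self.token_to_id[t] for t in self.tokenize(text)]

    def decode(self, token_ids):
        return "".join(self.id_to_token[i] for i in token_ids)


tokenizer = ToyTokenizer({"Hello": 0, " world": 1, "!": 2, " ": 3, "world": 4})

text = "Hello world!"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
decoded = tokenizer.decode(token_ids)

print(tokens)     # ['Hello', ' world', '!']
print(token_ids)  # [0, 1, 2]
print(decoded)    # Hello world!
```

Because decoding simply concatenates the token strings, the round trip is exact here; real tokenizers can differ when they normalize the input first.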
      

Round-trip Encoding

A good tokenizer should satisfy: decode(encode(text)) ≈ text

However, there can be minor differences due to:

Normalization: Unicode normalization (é vs e + ́)
Whitespace: Multiple spaces might collapse
Special characters: Some rare characters might be replaced
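
The first two causes can be reproduced with the standard library alone. This sketch does not call a tokenizer; it only shows the kind of preprocessing that makes the decoded text differ from the input. Which normalization form (if any) a given tokenizer applies is an assumption here, with NFC used as the example.

```python
import re
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The two strings render identically but are different sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# A tokenizer that applies NFC normalization before encoding would
# hand back the composed form, so the round trip is not exact.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)          # True

# Whitespace collapsing has the same effect on runs of spaces.
original = "a   b"
collapsed = re.sub(r" +", " ", original)
print(collapsed == original)           # False
```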

Tokenization Quirks and Edge Cases

1. Leading Whitespace Matters

Most tokenizers treat "hello" and " hello" as different tokens:

# In GPT-2 tokenizer:
"hello"   → [31373]      # single token
" hello"  → [23748]      # different token!
      

This is why you'll see spaces attached to words in tokenized output.
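
The attachment happens in the pre-tokenization step, before any subword merging. The pattern below is a simplified stand-in for the kind of regular expression GPT-2-style tokenizers use: it handles only ASCII letters and digits and is an assumption for illustration, not the actual pattern from any library.

```python
import re

# Simplified pre-tokenization pattern: an optional leading space is
# kept together with the word, number, or punctuation run that follows it.
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_tokenize(text):
    """Split text into chunks, keeping a leading space attached to each chunk."""
    return PATTERN.findall(text)


pieces = pre_tokenize("hello hello world!")
print(pieces)  # ['hello', ' hello', ' world', '!']
```

The first "hello" has no leading space while the second does, so they are different strings and end up with different token IDs.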

2. Case Sensitivity

"Hello", "hello", and "HELLO" are usually different tokens:

"Hello"   → [15496]
"hello"   → [31373]
"HELLO"   → [21891]
      

3. Numbers Get Split

Numbers are tokenized based on how often each digit sequence appeared in the tokenizer's training data (the IDs below are illustrative):

"1"       → [16]         # single token
"12"      → [530]        # single token
"123"     → [1613]       # single token
"1234"    → 2 tokens     # may be split, e.g. ["12", "34"]
      

        Why this matters: LLMs can struggle with arithmetic because numbers get split 
        unpredictably. "123 + 456" might be tokenized as ["12", "3", " +", " 45", "6"], making it 
        harder to learn mathematical patterns.
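
How a number splits depends entirely on which digit sequences made it into the vocabulary. The sketch below uses greedy longest-match over two made-up vocabularies (`vocab_a` and `vocab_b` are hypothetical, and real BPE applies learned merges instead) to show the same digits splitting differently.

```python
def split_digits(digits, vocab):
    """Greedy longest-match split of a digit string over a set of known pieces."""
    longest = max(len(piece) for piece in vocab)
    pieces, pos = [], 0
    while pos < len(digits):
        for length in range(min(longest, len(digits) - pos), 0, -1):
            candidate = digits[pos:pos + length]
            if candidate in vocab:
                pieces.append(candidate)
                pos += length
                break
        else:
            raise ValueError(f"no piece covers {digits[pos]!r}")
    return pieces


single_digits = set("0123456789")
vocab_a = single_digits | {"12", "45"}         # hypothetical vocabulary A
vocab_b = single_digits | {"123", "34", "56"}  # hypothetical vocabulary B

print(split_digits("123", vocab_a))  # ['12', '3']
print(split_digits("456", vocab_a))  # ['45', '6']
print(split_digits("123", vocab_b))  # ['123']
print(split_digits("456", vocab_b))  # ['4', '56']
```

Neither split lines up with place value, which is part of why digit-level patterns are hard for a model to pick up.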
      

4. Unicode and Emojis

Unicode characters can explode into many tokens:

"🎉"      → [4 tokens]   # party popper
"👨‍👩‍👧‍👦"    → [10+ tokens] # family emoji (multiple components)
"中文"    → [varies]     # Chinese characters
      

This is why using emojis in prompts can "waste" your context window!
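
The underlying reason is visible in the UTF-8 encoding. For a byte-level BPE tokenizer, the byte count is an upper bound on the token count, since in the worst case every byte becomes its own token. This sketch only counts code points and bytes; it does not run a tokenizer, and the dictionary labels are just names for the example.

```python
samples = {
    "party popper": "\U0001F389",
    # Four emoji joined by three zero-width joiners (U+200D).
    "family": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
    "Chinese": "\u4e2d\u6587",
}

counts = {}
for name, text in samples.items():
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    counts[name] = (code_points, utf8_bytes)
    print(f"{name}: {code_points} code point(s), {utf8_bytes} UTF-8 bytes")

# party popper: 1 code point(s), 4 UTF-8 bytes
# family: 7 code point(s), 25 UTF-8 bytes
# Chinese: 2 code point(s), 6 UTF-8 bytes
```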

Practice: Counting Tokens

Token Counter Exercise

Estimate how many tokens each phrase would be:

Phrase	Characters	Your Guess	Actual Tokens
"The quick brown fox"	19	?	~4-5
"uncharacteristically"	20	?	~3-4
"1234567890"	10	?	~2-4
"🎉🎊🎈"	3	?	~6-12
"https://example.com"	19	?	~5-8

Rule of thumb: In English, one token is roughly 0.75 words, so words ≈ 0.75 × tokens. For example, 100 tokens ≈ 75 words.
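
The rule of thumb can be turned into a quick estimator. This is a rough sketch that takes 100 tokens ≈ 75 words as its only input; the function name `estimate_tokens` and the whitespace-based word count are assumptions made for illustration, and real counts depend on the tokenizer and the text.

```python
WORDS_PER_TOKEN = 0.75  # from the rule of thumb: 100 tokens ≈ 75 words


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its word count."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


print(estimate_tokens("The quick brown fox"))        # 5
print(estimate_tokens(" ".join(["word"] * 75)))      # 100
```

The estimate is only useful for ordinary English prose; numbers, URLs, emojis, and non-English text usually need far more tokens per word.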