Why Tokenization Matters
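Tokenization is the invisible first step of every LLM. Before any neural network processing happens, your text must be converted to a sequence of integer token IDs. The quality of this conversion affects everything downstream, from how much text fits in the context window to how well the model handles numbers and rare words.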
The Tokenization Trade-off
Tokenization involves balancing three competing goals:
- Vocabulary size: A smaller vocabulary means a smaller embedding table, but more tokens are needed per word
- Sequence length: Shorter sequences mean faster processing, but each token is coarser, so fine-grained (character-level) detail is lost
- Out-of-vocabulary handling: Must handle any input, including typos and rare words
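The trade-off is easiest to see at the two extremes. The sketch below compares character-level and whitespace word-level splitting on a made-up sentence; the sentence and variable names are illustrative assumptions, not part of any real tokenizer.

```python
text = "the lowest tower is lower than the tallest tower"

# Character-level: tiny vocabulary, long sequence.
char_tokens = list(text)
char_vocab = set(char_tokens)

# Word-level: short sequence, but the vocabulary grows with every new word,
# and unseen words (typos, rare terms) have no entry at all.
word_tokens = text.split()
word_vocab = set(word_tokens)

print(f"char-level: vocab={len(char_vocab)}, sequence length={len(char_tokens)}")
print(f"word-level: vocab={len(word_vocab)}, sequence length={len(word_tokens)}")

# A word the word-level vocabulary has never seen cannot be encoded,
# while the character-level vocabulary still covers it.
unseen = "slowest"
print(unseen in word_vocab)
print(all(ch in char_vocab for ch in unseen))
```

Subword methods such as BPE sit between these extremes: frequent words become single tokens, while rare words fall back to smaller pieces.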
Byte Pair Encoding (BPE) Algorithm
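BPE was originally a compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for subword tokenization in machine translation, and OpenAI's GPT-2 popularized a byte-level variant for language models. Here's how it works: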
Training Phase
BPE Training Example
Training corpus: "low lower lowest"
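Step 0: Initialize vocabulary with characters: l, o, w, e, r, s, t
Step 1: Count adjacent pairs: (l, o) = 3, (o, w) = 3, (w, e) = 2, (e, r) = 1, (e, s) = 1, (s, t) = 1. Merge the most frequent pair; (l, o) and (o, w) are tied, so pick one by a fixed rule, here (l, o) → "lo"
Step 2: Count again: (lo, w) = 3, (w, e) = 2, and the rest are 1. Merge (lo, w) → "low"
Step 3: Continue merging: (low, e) = 2 is now the most frequent pair, so merge it into "lowe"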
Continue until vocabulary reaches target size (e.g., 50,000 tokens)
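The first two steps of the example can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the helper name `count_pairs` and the choice to split the corpus on whitespace are assumptions made here.

```python
from collections import Counter

corpus = "low lower lowest"

# Step 0: split each word into characters and collect the initial vocabulary.
word_freqs = Counter(tuple(word) for word in corpus.split())
vocab = sorted({ch for word in word_freqs for ch in word})
print(vocab)


def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


# Step 1: count adjacent pairs across the corpus.
pairs = count_pairs(word_freqs)
for pair, count in pairs.items():
    print(pair, count)
```

Both `("l", "o")` and `("o", "w")` occur three times here, so the first merge depends on how ties are broken; any fixed rule works as long as training and encoding agree on it.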
Implementation
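What follows is a minimal sketch of BPE training and encoding, assuming a whitespace-split corpus and character-level starting symbols. The function names (`train_bpe`, `merge_pair`, `encode_word`) are hypothetical, and production tokenizers differ in important ways: they typically operate on bytes, apply a pre-tokenization regex, and use much faster data structures.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair; ties go to the pair seen first (an arbitrary choice).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def encode_word(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low lower lowest", num_merges=3)
print(merges)                          # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(encode_word("lowest", merges))   # ['lowe', 's', 't']
print(encode_word("slower", merges))   # ['s', 'lowe', 'r']
```

Note that "slower" never appeared in the training corpus, yet it is still encodable: the learned merges apply where they can, and the remaining characters stay as single-character tokens.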
Special Tokens
Most tokenizers reserve special tokens that serve specific purposes. The exact names and the set of tokens vary from model to model; the table below lists common ones:
| Token | Purpose | Example Use |
|---|---|---|
| <|endoftext|> | Document separator / end marker | Separate training documents |
| <|startoftext|> | Beginning of sequence | Mark generation start |
| <|pad|> | Padding token | Fill variable-length sequences |
| <|unk|> | Unknown token | Replace out-of-vocabulary items |
| <|mask|> | Masked position | MLM training (BERT-style) |
| <|im_start|> | Start of a chat message | Chat/instruction templates |
| <|im_end|> | End of a chat message | Chat/instruction templates |
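For example, a short chat exchange might be rendered as follows (this template separates the role from the content with <|im_sep|>; other templates use a newline instead):
<|im_start|>user<|im_sep|>Hello!<|im_end|><|im_start|>assistant<|im_sep|>Hi there!<|im_end|>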
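A chat template like this is just string formatting applied before tokenization. The sketch below is illustrative only: the token strings are copied from the example above, the function name `format_chat` and its `add_generation_prompt` flag are hypothetical, and real models each define their own template.

```python
# Token strings copied from the example above; real models define their own.
IM_START = "<|im_start|>"
IM_SEP = "<|im_sep|>"
IM_END = "<|im_end|>"


def format_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into a single prompt string."""
    parts = [
        f"{IM_START}{m['role']}{IM_SEP}{m['content']}{IM_END}" for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant{IM_SEP}")
    return "".join(parts)


prompt = format_chat([{"role": "user", "content": "Hello!"}])
print(prompt)
```

When the string is tokenized, each special token should map to a single reserved ID instead of being split into ordinary subword pieces.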
Tokenization Quirks
Tokenization has some surprising behaviors you should know about:
1. Leading Space
Most GPT-style tokenizers attach a leading space to the word that follows it. This means "hello" and " hello" map to different tokens!
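The leading-space behavior comes from the pre-tokenization step that splits text into chunks before BPE merges are applied. The pattern below is a simplified, ASCII-only approximation written for this illustration; real tokenizers use more elaborate Unicode-aware patterns, and the name `pretokenize` is hypothetical.

```python
import re

# Simplified, ASCII-only approximation of a GPT-2-style pre-tokenization pattern.
# Each of the first three alternatives optionally absorbs ONE leading space
# into the chunk that follows it; the last one catches leftover whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text):
    """Split text into chunks; BPE merges would then run inside each chunk."""
    return PRETOKENIZE.findall(text)


print(pretokenize("hello world"))      # ['hello', ' world']
print(pretokenize(" hello"))           # [' hello']
print(pretokenize("The answer is "))   # ['The', ' answer', ' is', ' ']
```

In the last call the trailing space becomes a chunk of its own, because there is no following word for it to attach to.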
2. Case Sensitivity
"Hello" and "hello" are usually different tokens. Some tokenizers (for example, the uncased BERT variants) lowercase the input before tokenizing to avoid this, at the cost of discarding case information.
3. Number Splitting
Numbers get split in surprising ways. "123" might become "12" + "3" or "1" + "23" depending on the tokenizer.
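One reason splits look arbitrary is that some tokenizers chunk digit runs into fixed-size groups from left to right before applying merges. The sketch below imitates such a rule with groups of up to three digits; the function name `chunk_digits` and the group size are assumptions for illustration, and the rule in any given tokenizer may differ.

```python
import re


def chunk_digits(number_text, max_len=3):
    """Split a digit string left to right into chunks of at most `max_len` digits."""
    return re.findall(rf"\d{{1,{max_len}}}", number_text)


print(chunk_digits("12345"))        # ['123', '45']
print(chunk_digits("1234567890"))   # ['123', '456', '789', '0']
print(chunk_digits("1000"))         # ['100', '0']
print(chunk_digits("10000"))        # ['100', '00']
```

Because chunking starts from the left, the chunks do not line up with place value: "1000" and "10000" share the first chunk "100" but mean very different things, which makes digit-wise reasoning harder for the model.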
4. Unicode Handling
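Emojis and non-ASCII characters can explode into many tokens. A single-code-point emoji is 4 bytes in UTF-8, so a byte-level tokenizer may spend up to 4 tokens on it, and composite emoji (skin-tone modifiers, joined sequences) can take 5-10 tokens or more!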
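For byte-level tokenizers, the UTF-8 byte count is an upper bound on the token count, since in the worst case every byte becomes its own token. The sketch below only measures bytes, not actual tokens; the helper name `utf8_cost` and the sample strings are chosen here for illustration.

```python
def utf8_cost(text):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


samples = {
    "ASCII letter": "a",
    "accented letter": "\u00e9",                    # é
    "CJK character": "\u4e2d",                      # 中
    "emoji": "\U0001F600",                          # grinning face
    "emoji + skin tone": "\U0001F44D\U0001F3FD",    # thumbs up + modifier
    # Three emoji joined by zero-width joiners (U+200D) render as one glyph.
    "family (ZWJ sequence)": "\U0001F468\u200d\U0001F469\u200d\U0001F467",
}

for label, text in samples.items():
    code_points, n_bytes = utf8_cost(text)
    print(f"{label}: {code_points} code point(s), {n_bytes} UTF-8 byte(s)")
```

What looks like one character on screen can be several code points and many bytes, which is why visually short inputs sometimes consume a surprising share of the context window.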
Token Count Comparison
| Input | Characters | Tokens (GPT-4, approximate) |
|---|---|---|
| "hello" | 5 | 1 |
| "Hello" | 5 | 1 |
| "12345" | 5 | 2 |
| "1234567890" | 10 | 4 |
| "😀" | 1 | 1-2 |
| "🎉" | 1 | 2-4 |
| "supercalifragilistic" | 20 | 3-5 |
Practice Exercises
Exercise 1: Manual BPE
Given the training text: "aa ab ab ac"
Run 3 merge operations of BPE. What tokens are in your vocabulary?
Exercise 2: Token Counting
A model may struggle with simple arithmetic such as "123 + 456 = ?". Why? Hint: Think about how numbers are tokenized.
Exercise 3: Prompt Engineering
You want the model to complete "The answer is" with a number. Why might ending the prompt with "The answer is" work better than ending it with "The answer is " (with a trailing space)? (Think about leading spaces!)