From Tokens to Numbers
In the previous lessons, we learned how text is split into tokens. But neural networks can't process "hello" or "world" directly; they need numbers. This is where the vocabulary comes in.
The Encoding Pipeline
Text → Tokens → Token IDs → Embeddings
Text: "Hello world"
Tokens: ["Hello", " world"]
Token IDs: [15496, 995]
Embeddings: [[0.1, -0.3, ...], [...]]
Each token in the vocabulary has a unique integer ID. These IDs are used to look up the corresponding embedding vectors from the embedding matrix.
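The lookup step can be sketched in a few lines. This is a minimal illustration, not a real model: the three-entry `vocab`, the `embedding_dim` of 4, and the random values are all assumptions made for the example (a trained model learns its embedding values).

```python
import random

# Toy vocabulary (hypothetical): real vocabularies hold tens of thousands of entries.
vocab = {"Hello": 0, " world": 1, "!": 2}
embedding_dim = 4

rng = random.Random(0)
# One row per token ID; random placeholders stand in for learned weights.
embedding_matrix = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]

def embed(tokens):
    """Map tokens to IDs, then use each ID as a row index into the matrix."""
    ids = [vocab[t] for t in tokens]
    vectors = [embedding_matrix[i] for i in ids]
    return ids, vectors

ids, vectors = embed(["Hello", " world"])
print(ids)              # [0, 1]
print(len(vectors[0]))  # 4
```

The token ID is nothing more than a row index, which is why the embedding matrix needs exactly one row per vocabulary entry.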
Vocabulary Size Trade-offs
Choosing vocabulary size is a crucial design decision. It affects:
- Memory: Larger vocabulary = larger embedding matrix
- Sequence length: Smaller vocabulary = more tokens per word
- Out-of-vocabulary rate: Larger vocabulary = fewer unknown tokens
- Computation: Larger vocabulary = more expensive softmax
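To make the memory trade-off concrete, here is a rough calculation of embedding matrix size. The hidden size of 768 and the 4-byte float32 weights are assumptions chosen for illustration, and the helper names are made up for this sketch.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model

def embedding_megabytes(vocab_size, d_model, bytes_per_param=4):
    """Footprint in MB (10^6 bytes), assuming 4-byte float32 weights by default."""
    return embedding_params(vocab_size, d_model) * bytes_per_param / 1e6

for vocab_size in (256, 32_000, 50_257, 100_000):
    params = embedding_params(vocab_size, 768)
    megabytes = embedding_megabytes(vocab_size, 768)
    print(f"{vocab_size:>7} tokens -> {params:>11,} params, {megabytes:8.1f} MB")
```

The same linear scaling applies to the output layer, which produces one logit per vocabulary entry at every position before the softmax.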
Vocabulary Sizes in Popular Models
Why Not Just Use Characters?
You might wonder: why not just use 256 tokens (one per byte)? Then we'd never have unknown words!
"Hello world" as characters: 11 tokens
"Hello world" as BPE tokens: 2 tokens
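In this example the character sequence is 5.5x longer than the BPE one. For typical English text the gap is roughly 4x to 5x, which makes training slower and shrinks the effective context window by the same factor.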
The sweet spot is subword tokenization: common words get their own token, rare words get split into meaningful subpieces.
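The idea can be illustrated with a toy splitter. Note that this sketch uses greedy longest-match against a fixed vocabulary, which is simpler than BPE (BPE applies learned merge rules), and the `toy_vocab` entries are invented for the example.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # No entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary: two common words plus a few reusable subpieces.
toy_vocab = {"Hello", " world", "un", "character", "istic", "ally"}

print(greedy_tokenize("Hello world", toy_vocab))
# ['Hello', ' world']
print(greedy_tokenize("uncharacteristically", toy_vocab))
# ['un', 'character', 'istic', 'ally']
print(len("Hello world".encode("utf-8")))  # 11 byte-level tokens for comparison
```

The common words map to one token each, while the rare word is covered by four reusable pieces instead of twenty characters.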
Special Tokens Explained
Most vocabularies include special tokens that serve specific purposes. The IDs below are examples and vary between tokenizers:
| Token | ID (example) | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 50256 | Marks end of document / separates documents |
| `<\|startoftext\|>` | 50257 | Marks beginning of generation |
| `<\|pad\|>` | 50258 | Padding for batching variable-length sequences |
| `<\|unk\|>` | 50259 | Unknown token (rarely used with BPE) |
| `<\|im_start\|>` | 50260 | Start of instruction/chat message |
| `<\|im_end\|>` | 50261 | End of instruction/chat message |
Chat Template Example
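Here is a minimal sketch of how such a template might be assembled, assuming a ChatML-style layout built from the `<|im_start|>` and `<|im_end|>` tokens in the table above. The `format_chat` helper and the exact placement of newlines are illustrative; each model family defines its own template.

```python
def format_chat(messages):
    """Wrap each message in start/end markers, then open an assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
])
print(prompt)
```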
These special tokens help the model understand the structure of the conversation and differentiate between system instructions, user messages, and assistant responses.
Encoding and Decoding
The tokenizer provides two key operations:
- encode(text) → token_ids: Convert text to integers
- decode(token_ids) → text: Convert integers back to text
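The simplest tokenizer that supports both operations is a byte-level one with a 256-entry vocabulary, where every UTF-8 byte is its own token. The sketch below is that simplified scheme, not a BPE tokenizer, but it exposes the same `encode`/`decode` interface.

```python
from typing import List

def encode(text: str) -> List[int]:
    """Map text to token IDs; each UTF-8 byte is its own token (IDs 0-255)."""
    return list(text.encode("utf-8"))

def decode(token_ids: List[int]) -> str:
    """Map token IDs back to text by reassembling the bytes."""
    return bytes(token_ids).decode("utf-8")

ids = encode("Hi!")
print(ids)          # [72, 105, 33]
print(decode(ids))  # Hi!
```

Because this scheme performs no normalization, decoding the encoded IDs always returns the original string exactly.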
Round-trip Encoding
A good tokenizer should satisfy: decode(encode(text)) ≈ text
However, there can be minor differences due to:
- Normalization: Unicode normalization (precomposed é, U+00E9, vs. e + combining acute accent, U+0301)
- Whitespace: Multiple spaces might collapse
- Special characters: Some rare characters might be replaced
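The first two effects are easy to reproduce with the standard library. The `normalize` helper below is a hypothetical preprocessing step of the kind some tokenizers apply before encoding; real tokenizers differ in which normalization, if any, they use.

```python
import unicodedata

def normalize(text):
    """Apply NFC normalization, then collapse whitespace runs (also trims the ends)."""
    return " ".join(unicodedata.normalize("NFC", text).split())

decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (5 code points)
composed = "caf\u00e9"     # precomposed e-acute (4 code points)

print(decomposed == composed)             # False
print(normalize(decomposed) == composed)  # True
print(normalize("a   b"))                 # a b
```

Both inputs look identical on screen, yet after normalization the decomposed form no longer round-trips to the exact string that was passed in.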
Tokenization Quirks and Edge Cases
1. Leading Whitespace Matters
Most tokenizers treat "hello" and " hello" as different tokens:
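A small sketch shows where the attached spaces come from. The regex here is a simplified stand-in, loosely modeled on the pre-tokenization patterns used by GPT-2-style tokenizers, and the IDs in `toy_vocab` are hypothetical.

```python
import re

# Simplified pattern: an optional leading space stays with the word that follows it.
PRETOKENIZE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

def pretokenize(text):
    """Split text into chunks before any vocabulary lookup happens."""
    return PRETOKENIZE.findall(text)

print(pretokenize("hello hello"))    # ['hello', ' hello']
print(pretokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']

# Hypothetical IDs: the two spellings are separate vocabulary entries.
toy_vocab = {"hello": 0, " hello": 1}
print([toy_vocab[t] for t in pretokenize("hello hello")])  # [0, 1]
```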
This is why you'll see spaces attached to words in tokenized output.
2. Case Sensitivity
"Hello", "hello", and "HELLO" are usually different tokens:
3. Numbers Get Split
Numbers are tokenized based on their frequency in training data:
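The effect can be imitated with a toy vocabulary in which only some digit strings were "frequent enough" to earn their own entry. The vocabulary contents are invented, and the greedy longest-match splitter is a simplification rather than real BPE.

```python
def longest_match_split(text, vocab):
    """Greedy splitter: always take the longest vocabulary entry available."""
    pieces, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            # Character not in the vocabulary: emit it on its own.
            pieces.append(text[i])
            i += 1
    return pieces

# Hypothetical vocabulary: every single digit plus a few frequent chunks.
number_vocab = set("0123456789") | {"100", "123", "456", "789", "2024"}

print(longest_match_split("2024", number_vocab))        # ['2024']
print(longest_match_split("2025", number_vocab))        # ['2', '0', '2', '5']
print(longest_match_split("1234567890", number_vocab))  # ['123', '456', '789', '0']
```

Two numbers of the same length can therefore cost very different token counts, which is one reason arithmetic on long numbers is awkward for language models.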
4. Unicode and Emojis
Unicode characters can explode into many tokens:
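Counting UTF-8 bytes gives a rough upper bound for byte-level tokenizers, which in the worst case spend one token per byte. The actual count depends on which byte sequences the tokenizer has learned to merge, so treat these numbers as a ceiling rather than a prediction.

```python
def utf8_len(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))

samples = {
    "ASCII letter": "a",
    "e with acute": "\u00e9",
    "CJK character": "\u4e2d",
    "waving hand emoji": "\U0001F44B",
    # Three emoji joined by two zero-width joiners, rendered as one glyph.
    "family emoji (ZWJ sequence)": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for name, text in samples.items():
    print(f"{name}: {len(text)} code point(s), {utf8_len(text)} UTF-8 byte(s)")
```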
This is why using emojis in prompts can "waste" your context window!
Practice: Counting Tokens
Token Counter Exercise
Estimate how many tokens each phrase would be:
| Phrase | Characters | Your Guess | Actual Tokens |
|---|---|---|---|
| "The quick brown fox" | 19 | ? | ~4-5 |
| "uncharacteristically" | 20 | ? | ~3-4 |
| "1234567890" | 10 | ? | ~2-4 |
| Three emoji in a row | 3 | ? | ~6-12 |
| "https://example.com" | 19 | ? | ~5-8 |
Rule of thumb: in English, 1 token ≈ 0.75 words, so 100 tokens ≈ 75 words.
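The rule of thumb translates into a quick estimator. This is only a rough guide for English prose, since code, numbers, URLs, and other languages can deviate a lot, and the function names are made up for this sketch.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio for English text

def estimate_tokens(word_count):
    """Approximate token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_words(token_count):
    """Approximate number of English words that fit in a token budget."""
    return round(token_count * WORDS_PER_TOKEN)

print(estimate_words(100))  # 75
print(estimate_tokens(75))  # 100
print(estimate_tokens(len("The quick brown fox".split())))  # 5
```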