🚧 Lesson 1 of 10 in Level 01

Tokenization Deep Dive

Understanding Byte Pair Encoding, special tokens, and how to implement a tokenizer from scratch.

Why Tokenization Matters

Tokenization is the invisible first step of every LLM. Before any neural network processing happens, your text must be converted to numbers. The quality of this conversion affects everything downstream.

Key Insight: Different tokenizers produce different token sequences from the same text. The same model architecture can therefore behave differently depending on which tokenizer it is trained with.

The Tokenization Trade-off

Tokenization involves balancing three competing goals:

  1. Vocabulary size: A smaller vocabulary means a smaller embedding table and output layer, but each word is split into more tokens
  2. Sequence length: Shorter sequences mean faster processing, but they require coarser tokens and therefore a larger vocabulary
  3. Out-of-vocabulary handling: Must handle any input, including typos and rare words
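
To make the trade-off concrete, here is a minimal sketch that compares two extreme tokenizers on the same text. The helpers `char_tokenize` and `word_tokenize` are hypothetical names used only for this illustration; real tokenizers such as BPE sit between these two extremes.

```python
def char_tokenize(text):
    # Character-level: tiny vocabulary, long sequences, no out-of-vocabulary problem
    return list(text)

def word_tokenize(text):
    # Word-level (whitespace split): short sequences, but every new word needs its own entry
    return text.split()

corpus = "the lowest low is lower than the low"
char_tokens = char_tokenize(corpus)
word_tokens = word_tokenize(corpus)

# (vocabulary size, sequence length) for each scheme
print(len(set(char_tokens)), len(char_tokens))  # 12 36
print(len(set(word_tokens)), len(word_tokens))  # 6 8

# A word that never appeared in the corpus
unseen = "slow"
char_vocab, word_vocab = set(char_tokens), set(word_tokens)
print(unseen in word_vocab)                    # False: word-level cannot represent it
print(all(ch in char_vocab for ch in unseen))  # True: character-level can
```

The word-level scheme gives the shortest sequence here but fails on the unseen word, while the character-level scheme handles any input built from known characters at the cost of a sequence more than four times longer.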

Byte Pair Encoding (BPE) Algorithm

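BPE was originally a compression algorithm (Gage, 1994). Sennrich et al. adapted it for subword tokenization in neural machine translation, and OpenAI popularized a byte-level variant with GPT-2. Here's how it works: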

Training Phase

BPE Training Example

Training corpus: "low lower lowest"

Step 0: Initialize vocabulary with characters

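Vocabulary: l o w e r s t

Step 1: Count adjacent pairs

"l o" appears 3 times (tied with "o w"; the first pair wins): merge into "lo"
Vocabulary: l o w e r s t lo

Step 2: Count again

"lo w" appears 3 times: merge into "low"
Vocabulary: l o w e r s t lo low

Step 3: Continue merging

"low e" appears 2 times: merge into "lowe"
Vocabulary: l o w e r s t lo low lowe

Each merge adds one new token; the existing tokens stay in the vocabulary.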

Continue until vocabulary reaches target size (e.g., 50,000 tokens)
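
The three steps of the worked example can be reproduced with a short trace. This is a minimal sketch, assuming no end-of-word marker and that ties are broken in favor of the pair seen first; `count_pairs` and `merge_pair` are hypothetical helper names for this illustration.

```python
from collections import Counter

def count_pairs(words):
    # Count adjacent symbol pairs over all words (each word is a list of symbols)
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts

def merge_pair(symbols, pair):
    # Rewrite one word, fusing every occurrence of `pair` into a single symbol
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

words = [list(w) for w in "low lower lowest".split()]
learned = []
for step in range(1, 4):
    counts = count_pairs(words)
    best = max(counts, key=counts.get)  # ties go to the pair seen first
    learned.append((best, counts[best]))
    words = [merge_pair(w, best) for w in words]
    print(f"Step {step}: merge {best} (count {counts[best]}) -> {words}")
```

After the third merge the corpus is segmented as `['low']`, `['lowe', 'r']`, `['lowe', 's', 't']`.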

Implementation

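from collections import Counter

END_OF_WORD = '</w>'  # marker appended to every word so merges cannot cross word boundaries

def get_pair_counts(tokenized):
    # Count every adjacent symbol pair across all words in all texts
    counts = Counter()
    for text in tokenized:
        for word in text:
            counts.update(zip(word, word[1:]))
    return counts

def apply_merge(tokens, pair):
    # Replace each adjacent occurrence of `pair` in one symbol list with the merged symbol
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def merge_all(tokenized, pair):
    # Apply one merge to every word of every text
    return [[apply_merge(word, pair) for word in text] for text in tokenized]

def train_bpe(texts, vocab_size, special_tokens):
    # Convert texts to initial tokenization (character level, one symbol list per word)
    tokenized = [[list(word) + [END_OF_WORD] for word in text.split()] for text in texts]
    # Initialize vocabulary with all symbols in use, plus the special tokens
    vocab = {symbol for text in tokenized for word in text for symbol in word}
    vocab.update(special_tokens)
    # Initialize merges list
    merges = []
    while len(vocab) < vocab_size:
        # Count all adjacent pairs
        pairs = get_pair_counts(tokenized)
        if not pairs:
            break
        # Find most frequent pair (ties go to the pair seen first)
        best_pair = max(pairs, key=pairs.get)
        # Merge all occurrences
        tokenized = merge_all(tokenized, best_pair)
        # Add to vocabulary and merges
        new_token = best_pair[0] + best_pair[1]
        vocab.add(new_token)
        merges.append(best_pair)
    return vocab, merges

def encode(text, vocab, merges):
    # Start with the same character-level tokenization used in training
    words = [list(word) + [END_OF_WORD] for word in text.split()]
    # Apply merges in the order they were learned
    for merge in merges:
        words = [apply_merge(word, merge) for word in words]
    # `vocab` is not needed for merging; a fuller version would use it to map tokens to ids
    return [token for word in words for token in word]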

Special Tokens

Most tokenizers reserve special tokens that serve specific purposes. The exact names vary between models; the ones below are common conventions:

Token              Purpose                           Example Use
<|endoftext|>      Document separator / end marker   Separate training documents
<|startoftext|>    Beginning of sequence             Mark generation start
<|pad|>            Padding token                     Fill variable-length sequences
<|unk|>            Unknown token                     Replace out-of-vocabulary items
<|mask|>           Masked position                   MLM training (BERT-style)
<|im_start|>       Start of a chat message           Chat/instruction templates
<|im_end|>         End of a chat message             Chat/instruction templates

Chat Templates: Modern chat LLMs use special tokens to structure conversations. In the example below, <|im_sep|> separates the role from the message content; the exact format varies by model:

<|im_start|>user<|im_sep|>Hello!<|im_end|><|im_start|>assistant<|im_sep|>Hi there!<|im_end|>
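
A template like this is usually produced by a small formatting function, not written by hand. The sketch below assumes the exact layout shown above; `render_chat` and its `add_generation_prompt` parameter are hypothetical names for this illustration, and real models ship their own templates that may differ.

```python
def render_chat(messages, add_generation_prompt=False):
    # Wrap each message as <|im_start|>{role}<|im_sep|>{content}<|im_end|>
    parts = [
        f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
]
rendered = render_chat(messages)
print(rendered)

# At inference time, render only the history and leave the assistant turn open
prompt = render_chat(messages[:1], add_generation_prompt=True)
print(prompt)
```

In practice the special tokens must be encoded as single reserved ids, not as ordinary text, so the tokenizer has to be told to treat them specially.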

Tokenization Quirks

Tokenization has some surprising behaviors you should know about:

1. Leading Space

Many tokenizers, including GPT-style BPE tokenizers, attach the leading space to the word that follows it. This means "hello" and " hello" are different tokens!
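
The behavior comes from the pre-tokenization step that runs before BPE merges. The pattern below is a simplified, ASCII-only approximation of a GPT-2-style pre-tokenizer, written with the standard `re` module; it is an assumption for illustration and not the pattern any particular library uses. `pretokenize` is a hypothetical helper name.

```python
import re

# An optional leading space stays attached to the word, number, or
# punctuation run that follows it; leftover whitespace is its own piece.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    return PRETOKENIZE.findall(text)

print(pretokenize("hello hello"))       # ['hello', ' hello']
print(pretokenize("The answer is 42"))  # ['The', ' answer', ' is', ' 42']
```

Because the second "hello" carries its space, it is a different string from the first and ends up with a different token id.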

2. Case Sensitivity

"Hello" and "hello" are usually different tokens. Some tokenizers (for example, uncased BERT variants) lowercase the input first to avoid this, at the cost of losing case information.

3. Number Splitting

Numbers get split in surprising ways. "123" might become "12" + "3" or "1" + "23" depending on the tokenizer.
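
Some tokenizers make number splitting more predictable by chunking digit runs during pre-tokenization, before BPE is applied. Here is a minimal sketch, assuming groups of at most three digits taken left to right; `split_digits` and `max_digits` are hypothetical names, and whether each chunk becomes a single token still depends on the learned vocabulary.

```python
import re

def split_digits(number, max_digits=3):
    # Greedy left-to-right chunking into groups of at most `max_digits` digits
    return re.findall(rf"\d{{1,{max_digits}}}", number)

print(split_digits("12345"))       # ['123', '45']
print(split_digits("1234567890"))  # ['123', '456', '789', '0']
print(split_digits("1000", 1))     # ['1', '0', '0', '0']
```

Note that left-to-right chunks do not line up with place value: "1234" splits as "123" + "4", so the same digits land in different chunks depending on the length of the number.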

4. Unicode Handling

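Emojis and non-ASCII characters can explode into many tokens. A single emoji takes several UTF-8 bytes and often several tokens, and compound emoji built from multiple code points (for example, family or flag emoji) can become 5-10 tokens!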
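
Byte-level tokenizers start from the UTF-8 encoding of the text, so the byte count gives an upper bound on how many tokens a string can need before any merges apply. The sketch below only measures code points and bytes with the standard library; it does not call a real tokenizer, and `utf8_byte_count` is a hypothetical helper name.

```python
def utf8_byte_count(text):
    # Number of bytes in the UTF-8 encoding of the text
    return len(text.encode("utf-8"))

samples = {
    "ASCII letter": "a",
    "accented letter": "\u00e9",          # é
    "grinning face": "\U0001F600",        # 😀
    # man + zero-width joiner + woman + zero-width joiner + girl
    "family emoji": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for name, s in samples.items():
    # code points vs. UTF-8 bytes
    print(f"{name}: {len(s)} code point(s), {utf8_byte_count(s)} byte(s)")
```

The family emoji renders as one glyph but consists of 5 code points and 18 bytes, which is why compound emoji are so expensive in tokens.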

Token Count Comparison

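Input                    Characters   Tokens (GPT-4, approximate)
"hello"                  5            1
"Hello"                  5            1
"12345"                  5            2
"1234567890"             10           4
"😀"                     1            1-2
"🎉"                     1            2-4
"supercalifragilistic"   20           3-5

The GPT-4 tokenizer splits digit runs into groups of at most three, which is why the numeric rows need more than one token. Exact counts depend on the tokenizer version, so check them against the actual tokenizer.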

Practice Exercises

Exercise 1: Manual BPE

Given the training text: "aa ab ab ac"
Run 3 merge operations of BPE. What tokens are in your vocabulary?

Exercise 2: Token Counting

A model may struggle with simple arithmetic like "123 + 456 = ?". Why? Hint: Think about how numbers are tokenized.

Exercise 3: Prompt Engineering

You want the model to complete "The answer is" with a number. Why might a prompt ending in "The answer is" (no trailing space) work better than one ending in "The answer is " (with a trailing space)? (Think about leading spaces!)