Lesson 1: Tokenization Deep Dive

Why Tokenization Matters

Tokenization is the invisible first step of every LLM. Before any neural network processing happens, your text must be converted to numbers. The quality of this conversion affects everything downstream.

Key Insight: Different tokenizers produce different token sequences from the same text.
This means the same model architecture can behave differently depending on how its input is tokenized.
      

The Tokenization Trade-off

Tokenization involves balancing three competing goals:

Vocabulary size: A smaller vocabulary means a smaller embedding table and output layer, but each word is split into more tokens
Sequence length: Shorter sequences mean faster processing, but each token is coarser (less granularity)
Out-of-vocabulary handling: The tokenizer must handle any input, including typos and rare words
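
To make the trade-off concrete, here is a minimal sketch that compares two extreme tokenizers, character-level and whitespace word-level, on a toy sentence. The corpus and the helper names (char_tokenize, word_tokenize) are made up for illustration and do not correspond to any real library.

```python
corpus = "the lowest price is lower than the low price"

def char_tokenize(text):
    # One token per character (spaces included)
    return list(text)

def word_tokenize(text):
    # One token per whitespace-separated word
    return text.split()

char_vocab = set(char_tokenize(corpus))
word_vocab = set(word_tokenize(corpus))

# Small vocabulary, long sequence
print("char:", len(char_vocab), "vocab entries,", len(char_tokenize(corpus)), "tokens")
# Larger vocabulary, short sequence
print("word:", len(word_vocab), "vocab entries,", len(word_tokenize(corpus)), "tokens")

# An unseen word is still representable with characters,
# but is out-of-vocabulary for the word-level tokenizer.
new_word = "slower"
char_can_encode = all(c in char_vocab for c in new_word)
word_can_encode = new_word in word_vocab
print(char_can_encode, word_can_encode)
```

Subword methods such as BPE sit between these two extremes: frequent words become single tokens, while rare words fall back to smaller pieces.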

Byte Pair Encoding (BPE) Algorithm

BPE was originally a data compression algorithm. Sennrich et al. adapted it for subword tokenization in neural machine translation, and OpenAI's GPT-2 later popularized a byte-level variant. Here's how it works:

Training Phase

BPE Training Example

Training corpus: "low lower lowest"

Step 0: Initialize vocabulary with characters

l o w e r s t

Step 1: Count adjacent pairs

"l o" appears 3 times (tied with "o w"; ties are broken arbitrarily, here in favor of the first pair seen) → Merge "lo"

lo w e r s t

Step 2: Count again

"lo w" appears 3 times → Merge "low"

low e r s t

Step 3: Continue merging

"low e" appears 2 times → Merge "lowe"

lowe r s t

Continue until vocabulary reaches target size (e.g., 50,000 tokens)
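
The three steps above can be reproduced with a short script. This is a sketch under simplifying assumptions: words are split on whitespace, no end-of-word marker is used (matching the worked example), and ties are resolved in favor of the first pair encountered. The function names count_pairs and merge_pair are illustrative.

```python
from collections import Counter

def count_pairs(words):
    # words: list of symbol lists, one list per word
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

words = [list(w) for w in "low lower lowest".split()]
merges = []
for step in range(1, 4):
    counts = count_pairs(words)
    best = max(counts, key=counts.get)  # first pair seen wins a tie
    merges.append((best, counts[best]))
    words = merge_pair(words, best)
    print(f"Step {step}: merge {best} (count {counts[best]}) -> {words}")
```

After the third merge the corpus is segmented as low, lowe + r, and lowe + s + t.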

Implementation

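from collections import Counter

# End-of-word marker appended to each word
# ('</w>' is a common convention, assumed here)
END_OF_WORD = '</w>'

def get_pair_counts(tokenized):
    # Count adjacent symbol pairs inside each word (never across words)
    pairs = Counter()
    for text in tokenized:
        for word in text:
            pairs.update(zip(word, word[1:]))
    return pairs

def apply_merge(tokens, pair):
    # Replace every occurrence of `pair` with the merged symbol, left to right
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def merge_all(tokenized, pair):
    return [[apply_merge(word, pair) for word in text] for text in tokenized]

def train_bpe(texts, vocab_size, special_tokens):
    # Convert texts to initial tokenization (character level)
    tokenized = [[list(word) + [END_OF_WORD] for word in text.split()]
                 for text in texts]

    # Initialize vocabulary with all symbols in use, plus special tokens
    vocab = {symbol for text in tokenized for word in text for symbol in word}
    vocab.update(special_tokens)

    # Initialize merges list
    merges = []

    while len(vocab) < vocab_size:
        # Count all adjacent pairs
        pairs = get_pair_counts(tokenized)

        if not pairs:
            break

        # Find most frequent pair (ties go to the first pair seen)
        best_pair = max(pairs, key=pairs.get)

        # Merge all occurrences
        tokenized = merge_all(tokenized, best_pair)

        # Add to vocabulary and merges
        new_token = best_pair[0] + best_pair[1]
        vocab.add(new_token)
        merges.append(best_pair)

    return vocab, merges

def encode(text, vocab, merges):
    # `vocab` is unused here; a full tokenizer would use it to map tokens to IDs
    tokens = []
    for word in text.split():
        # Start with the same character-level tokenization used in training
        word_tokens = list(word) + [END_OF_WORD]

        # Apply merges in the order they were learned
        for merge in merges:
            word_tokens = apply_merge(word_tokens, merge)

        tokens.extend(word_tokens)

    return tokens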
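
As a usage illustration, the sketch below applies a fixed merge list to words that were not in the training corpus and then maps the resulting tokens to integer IDs. The merge list is the one from the "low lower lowest" example. The vocabulary ordering, and therefore the IDs, are arbitrary choices made for this example, and the end-of-word marker is left out to keep the output easy to read.

```python
def apply_merge(tokens, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode_word(word, merges):
    # Start from characters, then replay the merges in learned order
    tokens = list(word)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

# Merges learned in the worked example, in order
merges = [("l", "o"), ("lo", "w"), ("low", "e")]

# Illustrative vocabulary: base characters first, then merged tokens
vocab = ["l", "o", "w", "e", "r", "s", "t", "lo", "low", "lowe"]
token_to_id = {token: i for i, token in enumerate(vocab)}

lowers_tokens = encode_word("lowers", merges)
slow_tokens = encode_word("slow", merges)
lowers_ids = [token_to_id[t] for t in lowers_tokens]

print(lowers_tokens, lowers_ids)  # ['lowe', 'r', 's'] [9, 4, 5]
print(slow_tokens)                # ['s', 'low']
```

Neither "lowers" nor "slow" appeared in training, yet both are encoded from known pieces. This is how BPE avoids out-of-vocabulary failures for text made of known characters.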
      

Special Tokens

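Most tokenizers reserve special tokens that serve specific purposes. The exact names vary by model (BERT, for example, writes its mask and padding tokens as [MASK] and [PAD]), and few tokenizers define all of the tokens below: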

Token	Purpose	Example Use
<|endoftext|>	Document separator / end marker	Separate training documents
<|startoftext|>	Beginning of sequence	Mark generation start
<|pad|>	Padding token	Fill variable-length sequences
<|unk|>	Unknown token	Replace out-of-vocabulary items
<|mask|>	Masked position	MLM training (BERT-style)
<|im_start|>	Message start	Chat/instruction templates
<|im_end|>	Message end	Chat/instruction templates
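
As a rough illustration of how two of these tokens are used, the sketch below maps unknown words to the unknown token and pads a batch to equal length. The tiny word-level vocabulary and the helper names (encode, pad_batch) are assumptions made for this example only.

```python
PAD, UNK = "<|pad|>", "<|unk|>"

# Toy word-level vocabulary; special tokens get the first IDs
vocab = [PAD, UNK, "the", "cat", "sat", "down"]
token_to_id = {token: i for i, token in enumerate(vocab)}

def encode(words):
    # Words missing from the vocabulary fall back to the unknown token
    return [token_to_id.get(w, token_to_id[UNK]) for w in words]

def pad_batch(sequences):
    # Right-pad every sequence to the length of the longest one
    max_len = max(len(s) for s in sequences)
    return [s + [token_to_id[PAD]] * (max_len - len(s)) for s in sequences]

batch = [encode("the cat sat down".split()), encode("the dog sat".split())]
padded = pad_batch(batch)
print(padded)  # [[2, 3, 4, 5], [2, 1, 4, 0]]
```

Here "dog" is not in the vocabulary, so it becomes ID 1 (unknown), and the shorter sequence is filled with ID 0 (padding).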

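Chat Templates: Many chat-tuned LLMs use special tokens to structure conversations. The exact tokens differ between models; in the example below, <|im_sep|> separates the role from the message content: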
        
        <|im_start|>user<|im_sep|>Hello!<|im_end|><|im_start|>assistant<|im_sep|>Hi there!<|im_end|>
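
A template like this can be produced by a small rendering function. The sketch below assumes the exact layout shown above; render_chat and the message dictionary format are hypothetical names chosen for this example, and real models may use different delimiters.

```python
def render_chat(messages):
    # Each message becomes: <|im_start|>ROLE<|im_sep|>CONTENT<|im_end|>
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}<|im_sep|>{message['content']}<|im_end|>"
        )
    return "".join(parts)

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
]
rendered = render_chat(messages)
print(rendered)
```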

Tokenization Quirks

Tokenization has some surprising behaviors you should know about:

1. Leading Space

Many tokenizers (GPT-2-style BPE in particular) attach a leading space to the word that follows it. This means "hello" and " hello" are different tokens!
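
The effect comes from the pre-tokenization step that splits text into chunks before BPE merges are applied. The pattern below is a simplified stand-in written for this example, not the actual pattern of any released tokenizer; it only shows how an optional leading space ends up inside the word chunk.

```python
import re

# Simplified, ASCII-only pattern (an assumption for illustration):
# an optional space followed by letters, digits, or punctuation,
# with any remaining whitespace as its own chunk.
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pre_tokenize(text):
    return PATTERN.findall(text)

print(pre_tokenize("hello hello"))    # ['hello', ' hello']
print(pre_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
```

The first "hello" has no leading space and the second one does, so the two chunks are different strings and usually map to different token IDs.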

2. Case Sensitivity

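"Hello" and "hello" are usually different tokens. Some tokenizers (for example, those of uncased BERT models) lowercase the input before tokenizing, which removes the distinction at the cost of discarding case information.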

3. Number Splitting

Numbers get split in surprising ways. "123" might become "12" + "3" or "1" + "23" depending on the tokenizer.
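
One splitting rule used in practice is to cut each run of digits, left to right, into chunks of at most three digits. The sketch below imitates that rule with a regular expression; chunk_digits is a hypothetical helper, and the rule is an assumption about one family of tokenizers rather than a description of all of them.

```python
import re

def chunk_digits(text, max_digits=3):
    # Greedily split digit runs into chunks of up to `max_digits`, left to right
    return re.findall(r"\d{1,%d}" % max_digits, text)

print(chunk_digits("1234567890"))  # ['123', '456', '789', '0']
print(chunk_digits("1234"))        # ['123', '4']
```

Note that the chunks do not line up with place value: "1234" is split as 123 + 4 rather than 1 + 234, so the same digit can land in a different position within its chunk depending on the length of the number.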

4. Unicode Handling

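Emojis and non-ASCII characters can explode into many tokens. In UTF-8 a single emoji is usually 4 bytes, so a byte-level tokenizer may need up to 4 tokens for it, and emoji built from several code points (skin tones, family sequences) can take 5-10 tokens or more!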
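
The byte counts behind this are easy to check. The sketch below prints, for a few sample strings, the number of Unicode code points and the number of UTF-8 bytes; the byte count is an upper bound on the token count for a byte-level tokenizer that has learned no merges for that character. The samples are written as escape sequences so the code does not depend on how the file is encoded.

```python
samples = {
    "letter a": "a",
    "e with acute accent": "\u00e9",
    "grinning face": "\U0001F600",
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",
    "family (joined with ZWJ)": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

# name -> (code points, UTF-8 bytes)
sizes = {name: (len(s), len(s.encode("utf-8"))) for name, s in samples.items()}

for name, (code_points, n_bytes) in sizes.items():
    print(f"{name}: {code_points} code point(s), {n_bytes} UTF-8 byte(s)")
```

What looks like one character on screen can be five code points and 18 bytes, which is why some emoji cost far more tokens than others.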

Token Count Comparison

Input	Characters	Tokens (GPT-4)
"hello"	5	1
"Hello"	5	1
"12345"	5	2
"1234567890"	10	4
"😀"	1	1-2
"🎉"	1	2-4
"supercalifragilistic"	20	3-5

Practice Exercises

Exercise 1: Manual BPE

Given the training text: "aa ab ab ac"
Run 3 merge operations of BPE. What tokens are in your vocabulary?

Exercise 2: Token Counting

Why might a model struggle with simple arithmetic like "123 + 456 = ?" (Hint: think about how numbers are tokenized.)

Exercise 3: Prompt Engineering

You want the model to complete "The answer is" with a number. Why might the prompt "The answer is" (no trailing space) work better than "The answer is " (with a trailing space)? (Think about leading spaces!)