Lesson 8: Data Pipeline

Tokenization

Convert text to token IDs:

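# BPE tokenization: split text into subword tokens, then map each token to an integer ID
text = "Hello world"
tokens = ["Hello", " world"]  # a rarer word may split further, e.g. ["He", "llo", " world"]
ids = [15496, 995]  # one ID per token; the values depend on the tokenizer's vocabulary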
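
To make the merge idea behind BPE concrete, here is a minimal sketch of a toy trainer and encoder. It is not the implementation of any particular tokenizer library: the function names (`learn_bpe_merges`, `merge_pair`, `bpe_tokenize`, `build_vocab`, `encode`), the four-word corpus, and the ID assignment are all assumptions made for illustration, and production tokenizers typically work on bytes and handle unknown characters.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def bpe_tokenize(word, merges):
    """Split `word` into subword tokens by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocab(words, merges):
    """Assign IDs: sorted single characters first, then merged tokens in learned order."""
    chars = sorted({ch for word in words for ch in word})
    tokens = chars + [a + b for a, b in merges]
    return {tok: i for i, tok in enumerate(tokens)}


def encode(word, merges, token_to_id):
    """Convert a word to token IDs (raises KeyError for characters not in the corpus)."""
    return [token_to_id[tok] for tok in bpe_tokenize(word, merges)]


corpus = ["hello", "hello", "hell", "help"]  # hypothetical training corpus
merges = learn_bpe_merges(corpus, num_merges=3)
token_to_id = build_vocab(corpus, merges)

print(bpe_tokenize("hello", merges))         # ['hell', 'o']
print(encode("hello", merges, token_to_id))  # [7, 3]
```

Frequent character sequences such as "hell" end up as single tokens, while rarer words fall back to smaller pieces. That is why the same text can tokenize as whole words or as subwords depending on the learned vocabulary.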

Batching

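# Dynamic padding: pad each sequence only to the longest one in its batch
# Packing: concatenate short sequences into one row, up to the sequence length

# Padding tokens cost compute but carry no training signal, so less padding means higher throughput
# Tokens per batch = batch_size * sequence_length = 32 * 2048 = 65,536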
batch_size = 32
sequence_length = 2048
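
The two strategies named above can be sketched in a few lines of plain Python. This is a minimal illustration, not a drop-in collate function: `pad_batch` and `pack_sequences` are hypothetical names, and the `pad_id=0` and `eos_id=1` values are assumptions that would normally come from the tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Dynamic padding: pad every sequence to the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def pack_sequences(sequences, max_len, eos_id=1):
    """Greedy packing: concatenate sequences, each followed by eos_id, into rows of at most max_len tokens."""
    rows, current = [], []
    for seq in sequences:
        item = seq + [eos_id]
        if len(item) > max_len:
            raise ValueError("sequence longer than max_len; truncate or split it first")
        if len(current) + len(item) > max_len:
            rows.append(current)  # current row is full, start a new one
            current = []
        current.extend(item)
    if current:
        rows.append(current)
    return rows


examples = [[5, 6, 7], [8], [9, 10]]  # hypothetical token ID sequences

padded, mask = pad_batch(examples)
print(padded)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]

packed = pack_sequences(examples, max_len=8)
print(packed)  # [[5, 6, 7, 1, 8, 1], [9, 10, 1]]
```

In this example padding spends 3 of 9 positions on pad tokens, while packing spends none. The sketch leaves the last packed row shorter than `max_len`. A real pipeline would usually pad that row as well, and may also mask attention across the boundaries between packed documents.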

Data Loading

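Pre-tokenize the dataset once and cache the token IDs on disk
Use memory-mapped files so large datasets do not have to fit in RAM
Shuffle examples and group them into batches without stalling the training loop
Run multiple workers to load data in parallel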
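
The first three points can be sketched with the standard library alone. This is a minimal example under stated assumptions: the file name `tokens.bin`, the raw unsigned-int file layout, and the helper names `write_token_file` and `iter_batches` are all hypothetical, and parallel workers are left out to keep it short.

```python
import mmap
import os
import random
import tempfile
from array import array

ITEMSIZE = array("I").itemsize  # bytes per stored token ID (native unsigned int)


def write_token_file(path, token_ids):
    """Pre-tokenize once and cache: store the IDs as raw unsigned integers."""
    with open(path, "wb") as f:
        array("I", token_ids).tofile(f)


def iter_batches(path, seq_len, batch_size, seed=0):
    """Yield batches of fixed-length token windows in shuffled order.

    The file is memory-mapped, so only the windows that are actually read
    get copied into Python objects; the dataset is never loaded whole.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        num_tokens = len(mm) // ITEMSIZE
        # Non-overlapping windows; a trailing partial window is dropped.
        starts = list(range(0, num_tokens - seq_len + 1, seq_len))
        random.Random(seed).shuffle(starts)  # seeded shuffle for reproducibility
        for i in range(0, len(starts), batch_size):
            batch = []
            for start in starts[i:i + batch_size]:
                window = array("I")
                window.frombytes(mm[start * ITEMSIZE:(start + seq_len) * ITEMSIZE])
                batch.append(window.tolist())
            yield batch


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")
    write_token_file(path, range(100))  # stand-in for real token IDs
    batches = list(iter_batches(path, seq_len=10, batch_size=4))

print([len(batch) for batch in batches])  # [4, 4, 2]
```

Shuffling window offsets instead of the data itself keeps the cached file untouched, so the same file can be reused across epochs with a different seed.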

Key Takeaways

Tokenization converts text into numerical token IDs using algorithms like BPE or WordPiece. The choice of tokenizer affects vocabulary size and model performance.
Batching strategies impact training efficiency: dynamic padding reduces wasted computation on short sequences, while packing maximizes GPU utilization by combining multiple short examples.
Data loading optimization is critical for training throughput. Pre-tokenizing datasets, using memory-mapped files, and parallel workers prevent the data pipeline from becoming a bottleneck.
Sequence length matters: longer sequences provide more context but require more memory. Balance batch size against sequence length so that each batch fits in the available GPU memory.
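
The trade-off in the last point can be illustrated with simple arithmetic. The sketch below assumes that memory use scales roughly with the total number of tokens per batch, which is a simplification: it ignores fixed costs such as model weights and optimizer state, and attention can grow faster than linearly with sequence length. The function name `batch_size_for_budget` is hypothetical.

```python
def batch_size_for_budget(tokens_per_batch, sequence_length):
    """Largest batch size whose total token count stays within the budget."""
    if sequence_length > tokens_per_batch:
        raise ValueError("a single sequence exceeds the token budget")
    return tokens_per_batch // sequence_length


# Token budget taken from the lesson's example: batch_size=32, sequence_length=2048.
budget = 32 * 2048  # 65,536 tokens per batch

for seq_len in (1024, 2048, 4096, 8192):
    print(seq_len, batch_size_for_budget(budget, seq_len))
# 1024 64
# 2048 32
# 4096 16
# 8192 8
```

Under this assumption, doubling the sequence length halves the batch size that fits in the same budget.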

Quick Quiz

Q1: What is the primary purpose of tokenization in LLM training?

A) To compress text files
B) To convert text into numerical token IDs the model can process
C) To translate text between languages
D) To remove punctuation from text

Answer:

B) Tokenization converts raw text into numerical token IDs that the model can process. LLMs operate on numbers, not raw text.

Q2: Which batching strategy combines multiple short sequences to maximize GPU utilization?

A) Dynamic padding
B) Bucketing
C) Packing
D) Dropout batching

Answer:

C) Packing combines multiple short sequences into a single longer sequence up to the maximum length, maximizing GPU utilization by reducing wasted computation on padding tokens.

Q3: Why is pre-tokenizing and caching datasets beneficial?

A) It reduces model accuracy
B) It prevents the data pipeline from becoming a training bottleneck
C) It increases the vocabulary size
D) It makes the dataset smaller

Answer:

B) Pre-tokenizing and caching keep the data pipeline from becoming a bottleneck during training. Tokenization is CPU-intensive, so doing it once up front avoids repeating that work in every training epoch.

Q4: What trade-off exists between batch size and sequence length?

A) Larger batches always improve model quality
B) Longer sequences require more GPU memory, reducing possible batch size
C) Sequence length has no impact on memory usage
D) Batch size only affects training speed, not memory

Answer:

B) Longer sequences require more GPU memory. Since GPU memory is finite, increasing sequence length means you must decrease batch size (or vice versa) to fit within memory constraints.