🚧 Lesson 8 of 35 in Level 04

Data Pipeline

Tokenization, batching, data loading, and preprocessing for training.

Tokenization

Convert text to token IDs:

# BPE tokenization
text = "Hello world"
tokens = ["Hello", " world"]  # or subwords: ["He", "llo", " world"]
ids = [15496, 995]  # token IDs
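
To make the subword idea concrete, here is a minimal sketch of how BPE merge rules can be learned and applied. It is a toy version under stated assumptions: it starts from single characters, ignores the space-prefix convention shown in the tokens above, and the function names (`learn_bpe_merges`, `merge_pair`, `encode`) are hypothetical, not taken from any tokenizer library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy corpus)."""
    # Each distinct word is a tuple of symbols, weighted by how often it occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subwords by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["hello", "hello", "help", "world"], num_merges=3)
print(merges)
print(encode("hello", merges))
```

A production tokenizer would additionally map each subword to an integer ID through a fixed vocabulary table, which is the step that produces values like the `ids` shown above.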

Batching

# Dynamic padding: pad to longest in batch
# Or packing: combine short sequences
# Efficient batching is crucial for throughput
batch_size = 32
sequence_length = 2048
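
The two strategies named in the comments can be sketched in plain Python as follows. This is an illustration only: `pad_batch`, `pack_sequences`, and the `pad_id` and `eos_id` values are assumptions chosen for the example, not part of the lesson's code.

```python
def pad_batch(sequences, pad_id=0):
    """Dynamic padding: pad every sequence to the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


def pack_sequences(sequences, max_len, eos_id=1):
    """Packing: join sequences into one stream and cut it into full max_len rows."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)  # separator so document boundaries stay visible
    n_full = len(stream) // max_len  # the trailing partial row is dropped
    return [stream[i * max_len:(i + 1) * max_len] for i in range(n_full)]


examples = [[5, 6, 7], [8], [9, 10]]
print(pad_batch(examples))                  # rows padded to length 3
print(pack_sequences(examples, max_len=4))  # [[5, 6, 7, 1], [8, 1, 9, 10]]
```

Padding keeps one example per row at the cost of some wasted positions, while packing fills every position but mixes examples within a row. Some training setups also mask attention across the packed boundaries, which this sketch does not do.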

Data Loading
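
One common pattern is to tokenize the dataset once, store the IDs in a flat binary file, and memory-map that file during training so that only the requested windows are read. The sketch below uses only the Python standard library. The file layout (unsigned 16-bit IDs in native byte order, which assumes a vocabulary smaller than 65,536) and the helper names `write_token_file` and `sample_batch` are assumptions made for illustration.

```python
import mmap
import os
import random
import tempfile
from array import array

ITEM_SIZE = array("H").itemsize  # bytes per stored token ID


def write_token_file(path, token_ids):
    """Pre-tokenize once: store the IDs as a flat file of unsigned 16-bit integers."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)


def sample_batch(path, batch_size, seq_len, seed=0):
    """Memory-map the token file and read `batch_size` random windows of `seq_len` IDs."""
    rng = random.Random(seed)
    batch = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_tokens = len(mm) // ITEM_SIZE
        for _ in range(batch_size):
            start = rng.randrange(n_tokens - seq_len + 1)
            # Slicing the map copies only this window, not the whole file.
            chunk = mm[start * ITEM_SIZE:(start + seq_len) * ITEM_SIZE]
            window = array("H")
            window.frombytes(chunk)
            batch.append(window.tolist())
    return batch


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")
    write_token_file(path, list(range(1000)))  # stand-in for real token IDs
    batch = sample_batch(path, batch_size=4, seq_len=8)
    print(batch)
```

Parallel workers are not shown here. In practice, several loader processes can each open the same memory-mapped file and sample from it independently.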

Key Takeaways

  • Tokenization converts text into numerical token IDs using algorithms like BPE or WordPiece. The choice of tokenizer affects vocabulary size and model performance.
  • Batching strategies impact training efficiency: dynamic padding pads only to the longest sequence in each batch, which reduces computation wasted on padding tokens, while packing maximizes GPU utilization by combining multiple short examples into one sequence.
  • Data loading optimization is critical for training throughput. Pre-tokenizing datasets, memory-mapping the token files, and loading with parallel workers keep the data pipeline from becoming a bottleneck.
  • Sequence length matters: longer sequences require more memory but provide more context. Balance batch size and sequence length against the available GPU memory.
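
As a rough illustration of the last point, the sketch below holds the number of tokens per step fixed and shows how the batch size must shrink as sequences get longer. It assumes memory grows linearly with tokens per step, which is only a first-order approximation: attention implementations that materialize the full score matrix grow quadratically with sequence length, so real limits at long lengths can be tighter. `max_batch_size` is a hypothetical helper.

```python
def max_batch_size(token_budget, seq_len):
    """Largest batch size whose total token count stays within `token_budget`."""
    return token_budget // seq_len


# Token budget implied by the example settings: batch_size = 32, sequence_length = 2048.
token_budget = 32 * 2048  # 65,536 tokens per step

for seq_len in (1024, 2048, 4096, 8192):
    print(seq_len, max_batch_size(token_budget, seq_len))
# 1024 64
# 2048 32
# 4096 16
# 8192 8
```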

Quick Quiz

Q1: What is the primary purpose of tokenization in LLM training?

  • A) To compress text files
  • B) To convert text into numerical token IDs the model can process
  • C) To translate text between languages
  • D) To remove punctuation from text

Answer: B) Tokenization converts raw text into numerical token IDs that the model can process. LLMs operate on numbers, not raw text.

Q2: Which batching strategy combines multiple short sequences to maximize GPU utilization?

  • A) Dynamic padding
  • B) Bucketing
  • C) Packing
  • D) Dropout batching

Answer: C) Packing combines multiple short sequences into a single longer sequence up to the maximum length, maximizing GPU utilization by reducing wasted computation on padding tokens.

Q3: Why is pre-tokenizing and caching datasets beneficial?

  • A) It reduces model accuracy
  • B) It prevents the data pipeline from becoming a training bottleneck
  • C) It increases the vocabulary size
  • D) It makes the dataset smaller

Answer: B) Pre-tokenizing and caching prevent the data pipeline from becoming a bottleneck during training. Tokenization is CPU-intensive, so doing it once up front avoids repeating that work in every training epoch.

Q4: What trade-off exists between batch size and sequence length?

  • A) Larger batches always improve model quality
  • B) Longer sequences require more GPU memory, reducing possible batch size
  • C) Sequence length has no impact on memory usage
  • D) Batch size only affects training speed, not memory

Answer: B) Longer sequences require more GPU memory. Since GPU memory is finite, increasing sequence length means you must decrease batch size (or vice versa) to fit within memory constraints.