Lesson 1: The Training Pipeline

The Three Stages

Modern LLM Training Pipeline

Pre-training: Learn general language from vast text corpus
Supervised Fine-tuning (SFT): Adapt to specific tasks/formats
Alignment (RLHF): Make helpful, harmless, honest

Stage 1: Pre-training

Data: Trillions of tokens from web, books, code
Objective: Next token prediction (CLM) or masked prediction (MLM)
Compute: Thousands of GPUs for weeks/months
Cost: Millions of dollars for large models

Stage 2: Fine-tuning

Data: High-quality task-specific datasets
Objective: Same as pre-training, but focused
Compute: Much less (hours to days)

Stage 3: Alignment

Data: Human preferences and demonstrations
Objective: RLHF or similar methods
Goal: Helpful, harmless, honest responses

Knowledge Check

Quiz: The Training Pipeline

1. What is the primary objective during pre-training?

a) Learning human preferences
b) Next token prediction
c) Code optimization

Answer: b) Next token prediction — the model learns to predict the next token in a sequence from vast text data.

2. Which stage uses RLHF (Reinforcement Learning from Human Feedback)?

a) Pre-training
b) Fine-tuning
c) Alignment

Answer: c) Alignment — RLHF is used to align the model with human values and preferences.

3. What type of data is used during supervised fine-tuning?

a) Raw web text
b) High-quality task-specific datasets
c) Random token sequences

Answer: b) High-quality task-specific datasets — curated data for specific tasks and formats.

4. What does "HHH" stand for in alignment?

a) High, Higher, Highest
b) Helpful, Harmless, Honest
c) Human, Hybrid, Hardware

Answer: b) Helpful, Harmless, Honest — the three key goals of alignment training.

5. Which stage typically requires the most compute resources?

a) Pre-training
b) Fine-tuning
c) Alignment

Answer: a) Pre-training — requires thousands of GPUs running for weeks or months on trillions of tokens.

Practice Exercise

Exercise 1: Design a Training Pipeline

Imagine you're training a customer service chatbot. Outline the three stages of your training pipeline:

Pre-training: What data sources would you use? What base model would you start with?
Fine-tuning: What specific datasets would you need? How would you structure the training examples?
Alignment: What human feedback would you collect? How would you ensure the bot is helpful but safe?

Solution Approach:

Start with a general-purpose LLM (like Llama or GPT). Fine-tune on customer service transcripts and support tickets. Use RLHF to align responses with company tone and safety guidelines.

Exercise 2: Compute Estimation

Given the following scenario, estimate the relative compute costs:

Pre-training: 1000 GPUs × 30 days
Fine-tuning: 8 GPUs × 2 days
Alignment: 100 GPUs × 5 days

Question: What percentage of total compute does each stage represent?

Answer:

Pre-training: ~99.7% (30,000 GPU-days)
Fine-tuning: ~0.05% (16 GPU-days)
Alignment: ~1.7% (500 GPU-days)

Quick Quiz: Training Pipeline Concepts

Test Your Understanding

1. Why is pre-training called "unsupervised" learning?

a) No humans are involved
b) No labeled examples are needed—the model learns from raw text patterns
c) It runs without monitoring

Answer: b) No labeled examples are needed—the model learns from raw text patterns. The model predicts next tokens without explicit human annotations.

2. What is the main purpose of the alignment stage?

a) To make the model faster
b) To ensure the model produces helpful, harmless, and honest outputs
c) To reduce model size

Answer: b) To ensure the model produces helpful, harmless, and honest outputs. Alignment shapes model behavior to match human values.

3. Which training stage typically uses the smallest amount of data?

a) Pre-training (trillions of tokens)
b) Fine-tuning (millions to billions of tokens)
c) Alignment (thousands to millions of preference comparisons)

Answer: c) Alignment uses the least data—thousands to millions of human preference comparisons versus trillions of tokens in pre-training.

Coding Exercise: Training Pipeline Simulator

Exercise 3: Build a Mini Training Pipeline

Write a Python function that simulates the three stages of LLM training:

def training_pipeline(model_size, dataset_tokens, stages):
    """
    Simulate LLM training pipeline stages.
    
    Args:
        model_size: Size in billions of parameters (e.g., 7 for 7B)
        dataset_tokens: Total tokens in training data (e.g., 1000000)
        stages: List of stage names ['pretrain', 'finetune', 'align']
    
    Returns:
        Dict with stage names as keys and compute cost (GPU-hours) as values
    """
    costs = {}
    for stage in stages:
        if stage == 'pretrain':
            # Pre-training: ~1000x more compute than fine-tuning
            costs[stage] = model_size * dataset_tokens * 0.0001
        elif stage == 'finetune':
            # Fine-tuning: ~100x less data, focused training
            costs[stage] = model_size * dataset_tokens * 0.000001
        elif stage == 'align':
            # Alignment: RLHF with preference data
            costs[stage] = model_size * dataset_tokens * 0.00001
    return costs

# Test your pipeline
result = training_pipeline(7, 1_000_000, ['pretrain', 'finetune', 'align'])
print(f"Training costs: {result}")
        

Expected Output:

Training costs: {'pretrain': 700.0, 'finetune': 7.0, 'align': 70.0}

Challenge: Modify the function to calculate total cost and identify which stage consumes the most resources.