Lesson 1 of 35 in Level 04

The Training Pipeline

Overview of LLM training: pre-training, fine-tuning, and alignment.

The Three Stages

Modern LLM Training Pipeline

  1. Pre-training: Learn general language patterns from a vast text corpus
  2. Supervised Fine-tuning (SFT): Adapt to specific tasks/formats
  3. Alignment (RLHF): Make the model helpful, harmless, and honest

Stage 1: Pre-training

Stage 2: Fine-tuning

Stage 3: Alignment

Knowledge Check

Quiz: The Training Pipeline

1. What is the primary objective during pre-training?

  • a) Learning human preferences
  • b) Next token prediction
  • c) Code optimization

Answer: b) Next token prediction — the model learns to predict the next token in a sequence from vast text data.
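To make next-token prediction concrete, the following minimal sketch computes the average cross-entropy loss over two positions. The three-token vocabulary, the probability values, and the function name `next_token_loss` are made up for illustration; a real model produces these probabilities over a much larger vocabulary.

```python
import math


def next_token_loss(probs_per_position, target_ids):
    """Average negative log-probability assigned to the correct next token."""
    losses = [
        -math.log(probs[target])
        for probs, target in zip(probs_per_position, target_ids)
    ]
    return sum(losses) / len(losses)


# Hypothetical model outputs: one probability distribution per position,
# over a toy vocabulary of three tokens (ids 0, 1, 2).
predicted = [
    [0.7, 0.2, 0.1],  # the model favours token 0
    [0.1, 0.8, 0.1],  # the model favours token 1
]
targets = [0, 1]  # the tokens that actually came next in the text

loss = next_token_loss(predicted, targets)
print(f"Average next-token loss: {loss:.4f}")
```

The loss shrinks as the model puts more probability on the token that actually follows, which is the signal pre-training optimizes.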

2. Which stage uses RLHF (Reinforcement Learning from Human Feedback)?

  • a) Pre-training
  • b) Fine-tuning
  • c) Alignment

Answer: c) Alignment — RLHF is used to align the model with human values and preferences.

3. What type of data is used during supervised fine-tuning?

  • a) Raw web text
  • b) High-quality task-specific datasets
  • c) Random token sequences

Answer: b) High-quality task-specific datasets — curated data for specific tasks and formats.

4. What does "HHH" stand for in alignment?

  • a) High, Higher, Highest
  • b) Helpful, Harmless, Honest
  • c) Human, Hybrid, Hardware

Answer: b) Helpful, Harmless, Honest — the three key goals of alignment training.

5. Which stage typically requires the most compute resources?

  • a) Pre-training
  • b) Fine-tuning
  • c) Alignment

Answer: a) Pre-training — requires thousands of GPUs running for weeks or months on trillions of tokens.

Practice Exercise

Exercise 1: Design a Training Pipeline

Imagine you're training a customer service chatbot. Outline the three stages of your training pipeline:

  • Pre-training: What data sources would you use? What base model would you start with?
  • Fine-tuning: What specific datasets would you need? How would you structure the training examples?
  • Alignment: What human feedback would you collect? How would you ensure the bot is helpful but safe?

Solution Approach:

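Start from an existing pre-trained general-purpose LLM (such as Llama or GPT) instead of pre-training from scratch. Fine-tune it on customer service transcripts and support tickets, structured as pairs of customer message and agent reply. Then use RLHF, with human raters comparing candidate replies, to align responses with company tone and safety guidelines.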
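One possible way to structure the fine-tuning examples is sketched below. The field names `prompt` and `response`, the `Customer:`/`Agent:` template, and the sample transcripts are all assumptions chosen for illustration, not a required format.

```python
import json


def to_sft_example(customer_message, agent_reply):
    """Turn one transcript turn into a prompt/response training record."""
    return {
        "prompt": f"Customer: {customer_message}\nAgent:",
        "response": f" {agent_reply}",
    }


# Hypothetical transcript turns standing in for real support data.
transcripts = [
    ("Where is my order?",
     "I can check that for you. Could you share your order number?"),
    ("How do I reset my password?",
     "Use the 'Forgot password' link on the sign-in page."),
]

examples = [to_sft_example(question, reply) for question, reply in transcripts]

# One JSON object per line is a common layout for training files.
for example in examples:
    print(json.dumps(example))
```

Keeping the template identical across examples lets the model learn the format as well as the content.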

Exercise 2: Compute Estimation

Given the following scenario, estimate the relative compute costs:

  • Pre-training: 1000 GPUs × 30 days
  • Fine-tuning: 8 GPUs × 2 days
  • Alignment: 100 GPUs × 5 days

Question: What percentage of total compute does each stage represent?

Answer (total: 30,000 + 16 + 500 = 30,516 GPU-days):

  • Pre-training: ~98.3% (30,000 GPU-days)
  • Fine-tuning: ~0.05% (16 GPU-days)
  • Alignment: ~1.6% (500 GPU-days)
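Each share is the stage's GPU-days divided by the total of 30,516 GPU-days. A short script, using only the numbers from the scenario, reproduces the arithmetic (the variable names are illustrative):

```python
# GPU-days per stage = number of GPUs x number of days, from the scenario.
gpu_days = {
    "pre-training": 1000 * 30,
    "fine-tuning": 8 * 2,
    "alignment": 100 * 5,
}

total = sum(gpu_days.values())

# Share of total compute for each stage, as a percentage.
shares = {stage: 100 * days / total for stage, days in gpu_days.items()}

for stage, pct in shares.items():
    print(f"{stage}: {gpu_days[stage]:,} GPU-days, {pct:.2f}% of total")
```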

Quick Quiz: Training Pipeline Concepts

Test Your Understanding

1. Why is pre-training called "unsupervised" learning?

  • a) No humans are involved
  • b) No labeled examples are needed; the model learns from raw text patterns
  • c) It runs without monitoring

Answer: b) No labeled examples are needed; the model learns from raw text patterns. The training targets are simply the next tokens of the text itself, so no human annotation is required. For this reason the setup is often described more precisely as self-supervised learning.
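The sketch below shows how raw text supplies its own targets: every prefix of the token sequence becomes an input, and the token that follows becomes the label. Splitting on whitespace stands in for a real tokenizer, and the function name `make_training_pairs` is hypothetical.

```python
def make_training_pairs(tokens):
    """Pair each context (all tokens so far) with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


raw_text = "the cat sat on the mat"
tokens = raw_text.split()  # assumption: whitespace split instead of a tokenizer

pairs = make_training_pairs(tokens)
for context, target in pairs:
    print(f"{' '.join(context)!r} -> {target!r}")
```

No one had to label anything: six tokens of raw text yield five training examples.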

2. What is the main purpose of the alignment stage?

  • a) To make the model faster
  • b) To ensure the model produces helpful, harmless, and honest outputs
  • c) To reduce model size

Answer: b) To ensure the model produces helpful, harmless, and honest outputs. Alignment shapes model behavior to match human values.

3. Which training stage typically uses the smallest amount of data?

  • a) Pre-training (trillions of tokens)
  • b) Fine-tuning (millions to billions of tokens)
  • c) Alignment (thousands to millions of preference comparisons)

Answer: c) Alignment uses the least data—thousands to millions of human preference comparisons versus trillions of tokens in pre-training.
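One common way to use preference comparisons in RLHF is to train a reward model with a pairwise loss that pushes the preferred response's score above the rejected one's. The sketch below assumes scalar reward scores are already available; the record layout, the scores, and the function name are illustrative only.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log(1 + math.exp(-margin))


# Hypothetical preference comparison: a rater picked "chosen" over "rejected".
comparison = {
    "prompt": "How do I reset my password?",
    "chosen": "Use the 'Forgot password' link on the sign-in page.",
    "rejected": "I don't know.",
}

# Made-up reward scores for the two responses.
loss_agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.5)
loss_disagrees = pairwise_preference_loss(reward_chosen=0.5, reward_rejected=2.0)

print(f"Reward model agrees with rater:    {loss_agrees:.4f}")
print(f"Reward model disagrees with rater: {loss_disagrees:.4f}")
```

The loss is small when the reward model ranks the pair the same way the human did, and large when it ranks them the other way round.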

Coding Exercise: Training Pipeline Simulator

Exercise 3: Build a Mini Training Pipeline

Write a Python function that simulates the three stages of LLM training:

def training_pipeline(model_size, dataset_tokens, stages):
    """
    Simulate LLM training pipeline stages.

    Args:
        model_size: Size in billions of parameters (e.g., 7 for 7B)
        dataset_tokens: Total tokens in training data (e.g., 1000000)
        stages: List of stage names ['pretrain', 'finetune', 'align']

    Returns:
        Dict with stage names as keys and compute cost (GPU-hours) as values
    """
    # The coefficients below are illustrative, not measured values.
    costs = {}
    for stage in stages:
        if stage == 'pretrain':
            # Pre-training: largest coefficient, 100x the fine-tuning rate
            costs[stage] = model_size * dataset_tokens * 0.0001
        elif stage == 'finetune':
            # Fine-tuning: smallest coefficient, short focused training
            costs[stage] = model_size * dataset_tokens * 0.000001
        elif stage == 'align':
            # Alignment: RLHF with preference data, 10x the fine-tuning rate
            costs[stage] = model_size * dataset_tokens * 0.00001
    return costs

# Test your pipeline
result = training_pipeline(7, 1_000_000, ['pretrain', 'finetune', 'align'])
print(f"Training costs: {result}")

Expected Output:

Training costs: {'pretrain': 700.0, 'finetune': 7.0, 'align': 70.0}

Challenge: Modify the function to calculate total cost and identify which stage consumes the most resources.
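One possible approach to the challenge is sketched below. It works on a cost dictionary shaped like the expected output rather than modifying the function itself, and the helper name `summarize_costs` is hypothetical.

```python
def summarize_costs(costs):
    """Return the total cost and the name of the most expensive stage."""
    total = sum(costs.values())
    most_expensive = max(costs, key=costs.get)
    return total, most_expensive


# Cost dictionary in the same shape as the simulator's output.
costs = {'pretrain': 700.0, 'finetune': 7.0, 'align': 70.0}

total, top_stage = summarize_costs(costs)
print(f"Total cost: {total} GPU-hours")
print(f"Most expensive stage: {top_stage}")
```

Keeping the summary in a separate helper leaves the simulator's return value unchanged, so existing callers keep working.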