Lesson 10: Training Your Own LLM

Getting Started

You don't need millions of dollars to train a small LLM from scratch:

Small Model Recipe

Model: 100M-1B parameters
Data: 10B-100B tokens
Compute: 1-8 GPUs for days/weeks
Cost: Hundreds to thousands of dollars
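
As a rough sanity check on the recipe, the sketch below estimates training time and cost at both ends of the range. It uses the 6 × params × tokens approximation from Exercise 1, and the hardware figures are assumptions borrowed from that exercise (A100 at 312 TFLOPS peak, 30% utilization, $3 per GPU-hour). The function name estimate_run is hypothetical, and perfect multi-GPU scaling is assumed.

```python
def estimate_run(params, tokens, n_gpus=1,
                 peak_flops=312e12,       # assumed: A100 peak throughput
                 utilization=0.30,        # assumed: fraction of peak actually achieved
                 usd_per_gpu_hour=3.0):   # assumed: cloud price per GPU-hour
    """Return (wall-clock days, total cost in USD) for one training run."""
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (peak_flops * utilization)
    gpu_hours = gpu_seconds / 3600
    days = gpu_hours / n_gpus / 24        # assumes perfect scaling across GPUs
    cost = gpu_hours * usd_per_gpu_hour
    return days, cost

# Low and high ends of the recipe above
small_days, small_cost = estimate_run(100e6, 10e9, n_gpus=1)
large_days, large_cost = estimate_run(1e9, 100e9, n_gpus=8)

print(f"100M params, 10B tokens, 1 GPU: {small_days:.1f} days, ${small_cost:,.0f}")
print(f"1B params, 100B tokens, 8 GPUs: {large_days:.1f} days, ${large_cost:,.0f}")
```

Under these assumptions the small end finishes in under a day for well under $100, while the large end takes a little over nine days on 8 GPUs for roughly $5,000. Real multi-GPU runs scale less than linearly, so treat the numbers as optimistic.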

Recommended Setup

# TinyLlama-style small model
vocab_size = 32000
d_model = 2048
n_layers = 22
n_heads = 32

# ~1.1B parameters
# Trainable on a single A100 with good throughput
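
To see where the ~1.1B figure comes from, here is a rough parameter count for a Llama-style decoder with these dimensions. The four values above do not fix the total on their own, so the feed-forward width (5632), the 4 key/value heads (grouped-query attention), and the untied input/output embeddings are assumptions taken from the published TinyLlama configuration. count_params is a hypothetical helper, not part of any library.

```python
def count_params(vocab_size, d_model, n_layers, n_heads,
                 n_kv_heads, d_ff, tied_embeddings=False):
    """Approximate parameter count for a Llama-style decoder (no biases)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim

    # Q and O projections are d_model x d_model; K and V are d_model x kv_dim
    attention = 2 * d_model * d_model + 2 * d_model * kv_dim
    mlp = 3 * d_model * d_ff          # gate, up, and down projections (SwiGLU)
    norms = 2 * d_model               # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms

    embeddings = vocab_size * d_model
    if not tied_embeddings:
        embeddings *= 2               # separate output (LM head) matrix
    final_norm = d_model
    return n_layers * per_layer + embeddings + final_norm

total = count_params(vocab_size=32000, d_model=2048, n_layers=22, n_heads=32,
                     n_kv_heads=4,    # assumed (not listed in the config above)
                     d_ff=5632)       # assumed (not listed in the config above)
print(f"{total / 1e9:.2f}B parameters")
```

Most of the parameters sit in the per-layer MLP and attention matrices; the two embedding matrices together contribute only about 0.13B.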
      

Tools

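PyTorch + Hugging Face Transformers: The standard modeling and training stack
DeepSpeed: Microsoft's distributed training library (ZeRO shards optimizer state, gradients, and parameters across GPUs)
FlashAttention: Memory-efficient exact attention kernel, reported to be 2-4x faster than a standard attention implementation
Weights & Biases: Experiment tracking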
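
As one way to try a fused attention kernel without extra dependencies, PyTorch 2.x exposes torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention-style backend on supported GPUs and dtypes. The sketch below only checks that the fused call matches a plain implementation on small random tensors. The shapes are arbitrary, and it makes no claim about the speedup on your hardware.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_heads, seq_len, head_dim = 2, 4, 16, 64  # arbitrary illustrative shapes
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# Fused path: PyTorch selects the backend (math, memory-efficient, or flash)
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Reference path: explicit scores, causal mask, softmax, weighted sum
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
future = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()  # True above the diagonal
scores = scores.masked_fill(future, float("-inf"))
reference = torch.softmax(scores, dim=-1) @ v

max_abs_diff = (fused - reference).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The reference path materializes the full seq_len × seq_len score matrix, which is the memory cost that fused kernels avoid.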

Start small! Train a 100M-parameter model first, and learn the pipeline before scaling up.

Hands-On Exercises

Exercise 1: Calculate Training Compute

Given a model with 500M parameters and a dataset of 20B tokens, calculate:

Total FLOPs needed (use 6 × params × tokens)
Training time on a single A100 (312 TFLOPS peak; assume 30% utilization)
Estimated cost at $3/hour for cloud GPU

Solution:

# FLOPs calculation
params = 500_000_000
tokens = 20_000_000_000
flops = 6 * params * tokens  # 6e19 FLOPs (60 exaFLOPs)

# Training time (assuming 30% utilization)
a100_flops = 312e12  # 312 TFLOPS peak
utilization = 0.30
seconds = flops / (a100_flops * utilization)  # ~641,000 seconds
hours = seconds / 3600  # ~178 hours or ~7.4 days

# Cost
cost = hours * 3  # ~$534 at $3/hour

Exercise 2: Implement Mini Training Loop

Complete the missing parts of this PyTorch training loop:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

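model = GPT2LMHeadModel.from_pretrained('gpt2')
model.train()  # from_pretrained returns the model in eval mode (dropout off)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; padding=True needs one
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)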

# TODO: Complete the training step
def train_step(batch_text):
    # 1. Tokenize the batch
    inputs = tokenizer(batch_text, return_tensors='pt', 
                       padding=True, truncation=True)
    
    # 2. Forward pass (what are the labels?)
    outputs = model(__________, labels=__________)
    
    # 3. Get loss and backprop
    loss = outputs.______
    ______.backward()
    
    # 4. Update weights
    optimizer.______()
    optimizer.______()
    
    return loss.item()
        

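Answer: inputs['input_ids'], inputs['input_ids'], loss, loss, step, zero_grad

The labels are the input IDs themselves: the model shifts them internally so that each position is trained to predict the next token. In a real run, also pass attention_mask and set the labels at padded positions to -100 so the loss ignores them.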
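
To make the "what are the labels?" step concrete, the sketch below reproduces in plain PyTorch what a causal language model loss does with labels equal to the input IDs: shift by one position, then ignore padded positions via the -100 label convention. The token IDs, vocabulary size, and random logits are made up for illustration, and causal_lm_loss is a hypothetical helper rather than a Transformers function.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids, attention_mask):
    """Next-token cross-entropy with padded positions excluded from the loss."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100       # -100 is cross_entropy's ignore_index here

    # Position t predicts token t+1: drop the last logit and the first label
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

torch.manual_seed(0)
vocab_size = 50                               # made-up vocabulary size
input_ids = torch.tensor([[5, 6, 7, 8],
                          [9, 10, 0, 0]])     # second row is right-padded with ID 0
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])
logits = torch.randn(2, 4, vocab_size)        # stand-in for model output

loss = causal_lm_loss(logits, input_ids, attention_mask)

# Changing logits that only predict padding should leave the loss unchanged
perturbed = logits.clone()
perturbed[1, 1:, :] = 0.0
loss_perturbed = causal_lm_loss(perturbed, input_ids, attention_mask)
print(loss.item(), loss_perturbed.item())
```

The two printed values should agree, since the positions whose targets are padding contribute nothing to the loss.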