Getting Started
You don't need millions of dollars to train an LLM:
Small Model Recipe
- Model: 100M-1B parameters
- Data: 10B-100B tokens
- Compute: 1-8 GPUs for days/weeks
- Cost: Hundreds to thousands of dollars
Recommended Setup
# TinyLlama-style small model
vocab_size = 32000
d_model = 2048
n_layers = 22
n_heads = 32
# ~1.1B parameters
# Trainable on single A100 with good throughput
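As a sanity check on the ~1.1B figure, the sketch below counts parameters for a Llama-style decoder with these dimensions. The feed-forward width (5632), the 4 key/value heads, and the untied output head are assumptions taken from TinyLlama's published configuration, not from the settings listed above. `count_params` and `training_state_gb` are illustrative helpers. The 16 bytes per parameter figure is a common rule of thumb for mixed-precision Adam (weights, gradients, optimizer states) and does not include activations.

```python
def count_params(vocab_size=32000, d_model=2048, n_layers=22, n_heads=32,
                 n_kv_heads=4, d_ff=5632, tied_embeddings=False):
    """Approximate parameter count for a Llama-style decoder (no biases).

    n_kv_heads, d_ff and tied_embeddings are assumed values, not part of
    the setup listed in the text.
    """
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim                 # grouped-query attention
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q, O plus K, V
    mlp = 3 * d_model * d_ff                       # gate, up, down projections
    norms = 2 * d_model                            # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embed = vocab_size * d_model
    head = 0 if tied_embeddings else vocab_size * d_model
    final_norm = d_model
    return embed + n_layers * per_layer + final_norm + head


def training_state_gb(n_params, bytes_per_param=16):
    """Rule-of-thumb memory for weights, gradients and Adam states, in GB."""
    return n_params * bytes_per_param / 1e9


n = count_params()
print(f"{n:,} parameters")                              # 1,100,048,384 parameters
print(f"{training_state_gb(n):.1f} GB of training state")  # 17.6 GB of training state
```

Under these assumptions the training state takes roughly 17.6 GB, which leaves room for activations on a 40 GB or 80 GB A100 and is consistent with the single-GPU claim.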
Tools
- PyTorch + Transformers: Standard stack
- DeepSpeed: Microsoft's distributed training library
- Flash Attention: Memory-efficient exact attention, typically a 2-4x speedup of the attention step
- Weights & Biases: Experiment tracking
Start small! Train a 100M parameter model first.
Learn the pipeline before scaling up.
Hands-On Exercises
Exercise 1: Calculate Training Compute
Given a model with 500M parameters and a dataset of 20B tokens, calculate:
- Total FLOPs needed (use 6 × params × tokens)
- Training time on a single A100 (312 TFLOPS peak BF16; assume 30% utilization)
- Estimated cost at $3/hour for cloud GPU
Solution:
# FLOPs calculation
flops = 6 * 500_000_000 * 20_000_000_000 # 6 * params * tokens = 6e19 FLOPs (60 exaFLOPs)
# Training time (assuming 30% utilization)
a100_flops = 312e12 # 312 TFLOPS
utilization = 0.30
seconds = flops / (a100_flops * utilization)
hours = seconds / 3600 # ~178 hours or ~7.4 days
# Cost
cost = hours * 3 # ~$534
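The same arithmetic can be wrapped in a small function to try other model sizes, GPU counts, or utilization levels. This is a sketch under stated assumptions: `estimate_training` is an illustrative helper, the defaults reuse the exercise's numbers (312 TFLOPS peak, 30% utilization, $3 per GPU-hour), and multi-GPU runs are assumed to scale perfectly linearly, which real runs rarely achieve.

```python
def estimate_training(params, tokens, peak_flops=312e12, utilization=0.30,
                      n_gpus=1, dollars_per_gpu_hour=3.0):
    """Estimate wall-clock time and cost from the 6 * params * tokens rule.

    Assumes perfect linear scaling across GPUs (an idealization).
    """
    total_flops = 6 * params * tokens
    effective_flops_per_sec = peak_flops * utilization * n_gpus
    hours = total_flops / effective_flops_per_sec / 3600
    gpu_hours = hours * n_gpus
    return {
        "flops": total_flops,
        "hours": hours,
        "days": hours / 24,
        "cost": gpu_hours * dollars_per_gpu_hour,
    }


# Exercise 1: 500M parameters, 20B tokens, one A100
ex1 = estimate_training(500_000_000, 20_000_000_000)
print(f"{ex1['hours']:.0f} h, {ex1['days']:.1f} days, ${ex1['cost']:.0f}")

# The two extremes of the small-model recipe
small = estimate_training(100_000_000, 10_000_000_000)
large = estimate_training(1_000_000_000, 100_000_000_000, n_gpus=8)
print(f"small: {small['hours']:.1f} h, ${small['cost']:.0f}")
print(f"large: {large['days']:.1f} days on 8 GPUs, ${large['cost']:.0f}")
```

With these defaults, the recipe's smallest configuration comes out at about 18 hours and roughly $50 on one GPU, and the largest at about 9 days on 8 GPUs and roughly $5,300. Both figures are sensitive to the utilization assumption.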
Exercise 2: Implement Mini Training Loop
Complete the missing parts of this PyTorch training loop:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 ships without a pad token; padding=True fails unless one is set
tokenizer.pad_token = tokenizer.eos_token
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train() # from_pretrained returns the model in eval mode

# TODO: Complete the training step
def train_step(batch_text):
    # 1. Tokenize the batch
    inputs = tokenizer(batch_text, return_tensors='pt',
                       padding=True, truncation=True)
    # 2. Forward pass (what are the labels?)
    outputs = model(__________, labels=__________)
    # 3. Get loss and backprop
    loss = outputs.______
    ______.backward()
    # 4. Update weights
    optimizer.______()
    optimizer.______()
    return loss.item()
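Answer (blanks in order): inputs['input_ids'], inputs['input_ids'], loss, loss, step(), zero_grad()
The labels are the input IDs themselves: the model shifts them by one position internally, so each token is predicted from the tokens before it.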
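One refinement worth knowing: with padding=True, shorter texts in a batch are padded, and passing the input IDs unchanged as labels also scores the model on those pad positions. A common remedy is to set the labels at padded positions to -100, the default ignore index of PyTorch's cross-entropy loss. The plain-Python sketch below shows the idea on nested lists; `mask_pad_labels` is a hypothetical helper and the token IDs are illustrative values, not real tokenizer output.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_pad_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids to labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Two sequences; the first is padded to length 4 (mask 0 marks padding)
input_ids = [[15496, 995, 50256, 50256],
             [40, 588, 3563, 13]]
attention_mask = [[1, 1, 0, 0],
                  [1, 1, 1, 1]]

labels = mask_pad_labels(input_ids, attention_mask)
print(labels)  # [[15496, 995, -100, -100], [40, 588, 3563, 13]]
```

On tensors, the equivalent inside train_step would be labels = inputs['input_ids'].masked_fill(inputs['attention_mask'] == 0, -100), passed as labels= alongside attention_mask=inputs['attention_mask'].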