🚧 Lesson 10 of 35 in Level 04

Training Your Own LLM

Practical guide to training small language models from scratch.

Getting Started

You don't need millions of dollars to train an LLM. A small model is within reach of a modest budget:

Small Model Recipe

  • Model: 100M-1B parameters
  • Data: 10B-100B tokens
  • Compute: 1-8 GPUs for days/weeks
  • Cost: Hundreds to thousands of dollars

Recommended Setup

    # TinyLlama-style small model
    vocab_size = 32000
    d_model = 2048
    n_layers = 22
    n_heads = 32
    # ~1.1B parameters
    # Trainable on a single A100 with good throughput
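To see where the ~1.1B figure comes from, the parameters can be counted from the configuration. The sketch below is a rough count for a Llama-style decoder. The values `n_kv_heads=4` and `d_ff=5632` are assumptions taken from the published TinyLlama configuration, not from this lesson, and the helper name `llama_param_count` is hypothetical. It also assumes no bias terms, one norm weight vector before each attention and MLP block, and untied input and output embeddings.

```python
def llama_param_count(vocab_size, d_model, n_layers, n_heads,
                      n_kv_heads, d_ff, tie_embeddings=False):
    """Approximate parameter count for a Llama-style decoder (no biases)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim            # grouped-query attention: fewer K/V heads

    attn = 2 * d_model * d_model              # query and output projections
    attn += 2 * d_model * kv_dim              # key and value projections
    mlp = 3 * d_model * d_ff                  # gate, up, and down projections
    norms = 2 * d_model                       # one norm before attention, one before MLP
    per_layer = attn + mlp + norms

    embed = vocab_size * d_model              # token embedding table
    head = 0 if tie_embeddings else vocab_size * d_model  # output projection
    final_norm = d_model

    return n_layers * per_layer + embed + head + final_norm


# n_kv_heads and d_ff are assumed values (see the note above).
total = llama_param_count(vocab_size=32000, d_model=2048, n_layers=22,
                          n_heads=32, n_kv_heads=4, d_ff=5632)
print(f"{total / 1e9:.2f}B parameters")  # prints 1.10B parameters
```

Most of the parameters sit in the MLP blocks, so changing `d_ff` or `n_layers` moves the total far more than changing `vocab_size`.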

Tools

Start small! Train a 100M parameter model first. Learn the pipeline before scaling up.

Hands-On Exercises

Exercise 1: Calculate Training Compute

Given a model with 500M parameters and a dataset of 20B tokens, calculate:

  • Total FLOPs needed (use 6 × params × tokens)
  • Training time on a single A100 (312 TFLOPS peak; assume 30% utilization)
  • Estimated cost at $3/hour for a cloud GPU

Solution:

    # FLOPs calculation
    flops = 6 * 500_000_000 * 20_000_000_000  # 6 * params * tokens
    # flops = 6e19, i.e. 60 exaFLOPs

    # Training time (assuming 30% utilization)
    a100_flops = 312e12  # 312 TFLOPS peak
    utilization = 0.30
    seconds = flops / (a100_flops * utilization)
    hours = seconds / 3600  # ~178 hours, or ~7.4 days

    # Cost
    cost = hours * 3  # ~$534
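The same arithmetic can be wrapped in a small function to sanity-check the ranges in the Small Model Recipe. This is a minimal sketch: the function name `estimate_training` is hypothetical, the defaults reuse the exercise's numbers (312 TFLOPS peak, 30% utilization, $3 per GPU-hour), and the multi-GPU case assumes perfect linear scaling, which real training runs rarely achieve.

```python
def estimate_training(params, tokens, n_gpus=1, peak_flops=312e12,
                      utilization=0.30, dollars_per_gpu_hour=3.0):
    """Return (wall-clock hours, total cost in dollars).

    Uses the rule of thumb total FLOPs = 6 * params * tokens.
    """
    total_flops = 6 * params * tokens
    gpu_hours = total_flops / (peak_flops * utilization) / 3600
    wall_hours = gpu_hours / n_gpus           # assumes perfect linear scaling
    cost = gpu_hours * dollars_per_gpu_hour   # cost depends on GPU-hours, not GPU count
    return wall_hours, cost


# The two ends of the recipe: 100M params on 10B tokens, 1B params on 100B tokens.
for params, tokens, n_gpus in [(100e6, 10e9, 1), (1e9, 100e9, 8)]:
    hours, cost = estimate_training(params, tokens, n_gpus=n_gpus)
    print(f"{params / 1e6:>6.0f}M params, {tokens / 1e9:>4.0f}B tokens, "
          f"{n_gpus} GPU(s): {hours / 24:5.1f} days, ${cost:,.0f}")
```

Under these assumptions the small end of the recipe finishes in under a day on one GPU, while the large end takes more than a week on eight GPUs.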

Exercise 2: Implement Mini Training Loop

Complete the missing parts of this PyTorch training loop:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model = GPT2LMHeadModel.from_pretrained('gpt2')
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    # GPT-2 ships without a padding token; reuse EOS so padding=True works
    tokenizer.pad_token = tokenizer.eos_token
    model.train()  # from_pretrained returns the model in eval mode
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # TODO: Complete the training step
    def train_step(batch_text):
        # 1. Tokenize the batch
        inputs = tokenizer(batch_text, return_tensors='pt', padding=True, truncation=True)

        # 2. Forward pass (what are the labels?)
        outputs = model(__________, labels=__________)

        # 3. Get loss and backprop
        loss = outputs.______
        ______.backward()

        # 4. Update weights
        optimizer.______()
        optimizer.______()

        return loss.item()

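Answer: inputs['input_ids'], inputs['input_ids'], loss, loss, step(), zero_grad()

The labels are the input IDs themselves: GPT2LMHeadModel shifts them by one position internally, so each position is trained to predict the next token.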
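One detail the answer glosses over: when a batch is padded, passing the raw input IDs as labels means the padded positions also contribute to the loss. A common remedy is to set those label positions to -100, which PyTorch's cross-entropy skips. The sketch below shows the shift-and-mask logic in plain PyTorch on made-up tensors; the function name `causal_lm_loss` and the toy shapes are illustrative assumptions, not part of the exercise.

```python
import math

import torch
import torch.nn.functional as F


def causal_lm_loss(logits, input_ids, attention_mask):
    """Next-token cross-entropy that ignores padded positions."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100        # -100 is skipped via ignore_index below

    # Position t predicts token t+1: drop the last logit and the first label.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )


# Toy batch: vocabulary of 10 tokens, two sequences, the second padded with id 0.
vocab_size = 10
input_ids = torch.tensor([[1, 2, 3, 4],
                          [5, 6, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])
logits = torch.zeros(2, 4, vocab_size)        # uniform predictions over the vocabulary

loss = causal_lm_loss(logits, input_ids, attention_mask)
print(loss.item(), math.log(vocab_size))      # uniform logits give a loss of ln(10)
```

With the Hugging Face model from the exercise, the equivalent would be to build `labels` this way from `inputs['input_ids']` and `inputs['attention_mask']` and pass it as the `labels` argument, since the model applies the shift itself.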