🚧 Lesson 6 of 10 in Level 01

Model Sizes and Capabilities

Understanding parameter counts, emergent abilities, and what bigger models can do that smaller ones can't.

What Are Parameters?

When we say GPT-3 has "175 billion parameters," what does that mean?

Parameters are the learned numbers (weights and biases) that the model adjusts during training. They're the "knowledge" the model has acquired.
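To make "weights and biases" concrete, here is a minimal sketch that counts the parameters of a small fully connected network. The layer widths (784, 128, 10) and the helper names are illustrative assumptions, not taken from any model discussed in this lesson.

```python
def dense_layer_params(n_in, n_out):
    """One weight per input-output pair, plus one bias per output unit."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total parameters of a fully connected network with the given layer widths."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical toy network: 784 inputs -> 128 hidden units -> 10 outputs
total = mlp_params([784, 128, 10])
print(f"{total:,} parameters")  # prints "101,770 parameters"
```

Large language models use more elaborate layers than this, but the counting works the same way: every learned number in every layer adds to the total.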

Think of parameters like this:

Model Size Comparison

  • GPT-2 Small: 124M
  • GPT-2 XL: 1.5B
  • GPT-3: 175B
  • GPT-4: ~1.7T (unofficial estimate; OpenAI has not published a parameter count)

Emergent Abilities

As models get bigger, they don't just get better at existing tasks — they gain entirely new capabilities. These are called emergent abilities.

Emergence: Capabilities that appear suddenly at certain scale thresholds, not gradually. A model might fail at a task at 10B parameters but succeed at 100B.

Capabilities compared across 1B, 10B, and 100B+ parameter models:

  • Basic text completion
  • Simple Q&A
  • Following instructions
  • Chain-of-thought reasoning
  • Code generation
  • Multi-step reasoning

Why Do Abilities Emerge?

Several theories:

Scaling Laws

Researchers discovered that model performance follows predictable patterns as you scale up:

Power Laws: Loss ∝ N^(-α) where N is model size and α ≈ 0.07-0.1
Double the model size → loss is multiplied by 2^(-α), roughly a 5-7% reduction for these values of α
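The power law above can be turned into a quick calculation. This is a minimal sketch that assumes the pure form Loss ∝ N^(-α), with no irreducible loss term and with data and compute not acting as bottlenecks; the function name `loss_ratio` is hypothetical.

```python
def loss_ratio(size_multiplier, alpha):
    """Factor by which loss changes when model size N is multiplied
    by size_multiplier, assuming Loss is proportional to N ** (-alpha)."""
    return size_multiplier ** (-alpha)


# Both ends of the alpha range quoted above
for alpha in (0.07, 0.1):
    ratio = loss_ratio(2.0, alpha)
    print(f"alpha={alpha}: doubling N multiplies loss by {ratio:.3f}")
```

Because the exponent is small, each doubling buys only a modest drop in loss, which is why large gains have historically required order-of-magnitude increases in size.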

Three things can be scaled: model size (parameters), dataset size (training tokens), and compute (training FLOPs).

The Chinchilla paper (2022) found that most large models of the time were undertrained for their size. For compute-optimal training, you need roughly 20 tokens per parameter.

# Chinchilla scaling laws
Optimal training tokens = 20 × parameters

GPT-3 (175B params): Trained on 300B tokens
Chinchilla optimal: Should train on ~3.5T tokens

Chinchilla (70B params): Trained on 1.4T tokens
Better than GPT-3 despite being smaller!
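The same arithmetic as a short script. This is a sketch of the 20-tokens-per-parameter rule of thumb only, not the paper's full fitting procedure; the function and constant names are hypothetical, and the parameter and token counts are the ones quoted above.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in this lesson


def chinchilla_optimal_tokens(n_params):
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params


# (parameters, tokens actually trained on), as quoted above
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n_params, trained_tokens) in models.items():
    optimal = chinchilla_optimal_tokens(n_params)
    print(
        f"{name}: rule-of-thumb budget {optimal / 1e12:.1f}T tokens, "
        f"trained on {trained_tokens / optimal:.0%} of it"
    )
```

By this measure GPT-3 saw under a tenth of its rule-of-thumb token budget, while Chinchilla was trained on the full amount.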

What Size Do You Need?

Model Size Guide

Small (up to 7B)

  • Fast inference
  • Can run on consumer hardware
  • Good for simple tasks
  • Fine-tuning is cheap

Examples: GPT-2, DistilBERT, small LLaMA

Medium (7B-30B)

  • Good instruction following
  • Basic reasoning
  • Can run on a single GPU (given enough memory or quantization)
  • Good balance of cost/performance

Examples: LLaMA-2-13B, Mistral-7B

Large (30B-70B)

  • Strong reasoning
  • Code generation
  • Multi-step tasks
  • Needs multiple GPUs

Examples: LLaMA-2-70B

XLarge (100B+)

  • Advanced reasoning
  • Complex instructions
  • Few-shot learning
  • Expensive to run

Examples: GPT-3 (175B), GPT-4, Claude, PaLM

Memory Requirements

Running models requires significant memory:

# Memory requirement estimates (FP16)
Model Size    | Inference RAM | Training RAM (Adam)
--------------|---------------|--------------------
1B params     | ~2 GB         | ~8 GB
7B params     | ~14 GB        | ~56 GB
13B params    | ~26 GB        | ~104 GB
70B params    | ~140 GB       | ~560 GB
175B params   | ~350 GB       | ~1.4 TB
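The figures in this table follow from simple bytes-per-parameter arithmetic, sketched below. Two assumptions are made here: inference needs 2 bytes per parameter (FP16 weights only), and training needs 8 bytes per parameter, which is the ratio implied by the table rather than a measured value. Real training also needs memory for activations, and mixed-precision Adam setups that keep FP32 master weights and optimizer states commonly budget closer to 16 bytes per parameter. The function and constant names are hypothetical.

```python
FP16_INFERENCE_BYTES = 2  # assumption: FP16 weights only
TRAINING_BYTES = 8        # assumption: ratio implied by the table (4x inference)


def estimate_memory_gb(n_params, bytes_per_param):
    """Rough memory estimate in GB (1 GB = 1e9 bytes), ignoring activations."""
    return n_params * bytes_per_param / 1e9


for n_params in (1e9, 7e9, 13e9, 70e9, 175e9):
    inference = estimate_memory_gb(n_params, FP16_INFERENCE_BYTES)
    training = estimate_memory_gb(n_params, TRAINING_BYTES)
    print(
        f"{n_params / 1e9:>5.0f}B params | "
        f"inference ~{inference:,.0f} GB | training ~{training:,.0f} GB"
    )
```

Changing `bytes_per_param` is enough to explore other precisions or a more conservative training budget.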

Quantization reduces memory by using lower precision (INT8, INT4):