Lesson 6: Model Sizes and Capabilities

What Are Parameters?

When we say GPT-3 has "175 billion parameters," what does that mean?

        Parameters are the learned numbers (weights and biases) that the model adjusts during training. 
        They're the "knowledge" the model has acquired.
      

Think of parameters like this:

Each connection between neurons has a weight
Each neuron has a bias
Embeddings are parameters
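Attention projection matrices (query, key, value, output) are parameters; the attention scores computed from them at run time are activations, not parameters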
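To make the counting concrete, here is a minimal sketch that tallies weights and biases for a small fully connected network and for an embedding table. The layer sizes are made up for illustration, and the helper names `count_dense_params` and `count_embedding_params` are hypothetical, not part of any library.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one weight per connection
        biases = n_out          # one bias per neuron in the receiving layer
        total += weights + biases
    return total


def count_embedding_params(vocab_size, embedding_dim):
    """An embedding table stores one learned vector per vocabulary entry."""
    return vocab_size * embedding_dim


# Illustrative sizes only, not taken from any real model.
mlp_params = count_dense_params([4, 8, 2])
embedding_params = count_embedding_params(vocab_size=1000, embedding_dim=16)
print(mlp_params, embedding_params)
```

The same bookkeeping, applied to every layer of a transformer, is what produces headline figures such as "175 billion parameters."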

Model Size Comparison

Model	Parameters
GPT-2 Small	124M
GPT-2 XL	1.5B
GPT-3	175B
GPT-4	~1.7T (unconfirmed estimate; OpenAI has not disclosed the figure)

Emergent Abilities

As models get bigger, they don't just get better at existing tasks — they gain entirely new capabilities. These are called emergent abilities.

        Emergence: Capabilities that appear suddenly at certain scale thresholds, not gradually. 
        A model might fail at a task at 10B parameters but succeed at 100B.
      

Capability	1B params	10B params	100B+ params
Basic text completion	✓	✓	✓
Simple Q&A	✓	✓	✓
Following instructions	✗	✓	✓
Chain-of-thought reasoning	✗	✗	✓
Code generation	✗	✓	✓
Multi-step reasoning	✗	✗	✓

Why Do Abilities Emerge?

Several theories:

More capacity: Larger models can store more patterns and relationships
Composition: Complex abilities require composing simpler ones
Better representations: Larger models learn more useful internal representations
Critical mass: Some tasks need a minimum amount of "knowledge" to work

Scaling Laws

Researchers discovered that model performance follows predictable patterns as you scale up:

        Power law: Loss ∝ N^(-α), where N is the number of parameters and α ≈ 0.07-0.1

        Doubling the model size multiplies the loss by 2^(-α), a predictable reduction of roughly 5-7%
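As a quick check of what the power law implies, the sketch below computes the factor by which loss changes when model size is multiplied by some amount. It assumes the simple form Loss ∝ N^(-α) holds exactly, which real models only approximate; `loss_ratio` is a hypothetical helper name.

```python
def loss_ratio(scale_factor, alpha):
    """Ratio L(k * N) / L(N) under the power law Loss ~ N^(-alpha)."""
    return scale_factor ** (-alpha)


# Evaluate both ends of the alpha range quoted in the lesson.
for alpha in (0.07, 0.1):
    ratio = loss_ratio(2, alpha)
    print(f"alpha={alpha}: doubling N multiplies loss by {ratio:.3f}")
```

Because the exponent is small, each doubling buys only a few percent of loss, which is why large gains have required order-of-magnitude increases in size.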

Three things can be scaled:

Model size (N): Number of parameters
Data size (D): Number of training tokens
Compute (C): Training FLOPs
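The three quantities are linked. A widely used rule of thumb, which is an approximation and not something this lesson derives, is C ≈ 6 × N × D training FLOPs. The sketch below applies it to GPT-3's commonly cited figures (175B parameters, 300B tokens); `training_flops` is a hypothetical helper name.

```python
def training_flops(n_params, n_tokens):
    """Rough training compute from the C ~ 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens


# GPT-3: 175B parameters trained on 300B tokens.
gpt3_flops = training_flops(175_000_000_000, 300_000_000_000)
print(f"GPT-3 training compute: about {gpt3_flops:.2e} FLOPs")
```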

The Chinchilla paper (2022) found that most large models of the time were undertrained for their size. For compute-optimal training, you need roughly 20 training tokens per parameter.

# Chinchilla scaling laws
Optimal training tokens = 20 × parameters

GPT-3 (175B params): Trained on 300B tokens
Chinchilla optimal: Should train on ~3.5T tokens

Chinchilla (70B params): Trained on 1.4T tokens
Better than GPT-3 despite being smaller!
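The comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the 20-tokens-per-parameter rule of thumb applies uniformly; the function name `chinchilla_optimal_tokens` is made up for illustration.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in this lesson


def chinchilla_optimal_tokens(n_params):
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params


# (parameters, training tokens) as quoted in this lesson
models = {
    "GPT-3": (175_000_000_000, 300_000_000_000),
    "Chinchilla": (70_000_000_000, 1_400_000_000_000),
}

coverage = {}
for name, (n_params, n_tokens) in models.items():
    optimal = chinchilla_optimal_tokens(n_params)
    coverage[name] = n_tokens / optimal  # fraction of the optimal token budget
    print(f"{name}: optimal ~{optimal / 1e12:.1f}T tokens, "
          f"trained on {coverage[name]:.0%} of that")
```

By this measure GPT-3 saw under a tenth of its compute-optimal token budget, while Chinchilla was trained on the full amount.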
      

What Size Do You Need?

Model Size Guide

Small (under 7B)

Fast inference
Can run on consumer hardware
Good for simple tasks
Fine-tuning is cheap

Examples: GPT-2, DistilBERT, small LLaMA

Medium (7B-30B)

Good instruction following
Basic reasoning
Can run on a single GPU (a high-memory card, or a consumer card with quantization)
Good balance of cost/performance

Examples: LLaMA-2-13B, Mistral-7B

Large (30B-70B)

Strong reasoning
Code generation
Multi-step tasks
Needs multiple GPUs

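Examples: LLaMA-2-70B, Falcon-40B

XLarge (100B+)

Advanced reasoning
Complex instructions
Few-shot learning
Expensive to run

Examples: GPT-3 (175B), PaLM (540B); GPT-4 and Claude are believed to fall in this range, though their sizes are undisclosed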

Memory Requirements

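Running models requires significant memory. The estimates below assume 2 bytes per parameter for inference (FP16 weights only) and 8 bytes per parameter for training (weights, gradients, and both Adam moments in 16-bit). Activations and the KV cache add to these figures, and mixed-precision training that keeps FP32 master weights and optimizer state needs roughly 16 bytes per parameter: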

# Memory requirement estimates (FP16)

Model Size    | Inference RAM | Training RAM (Adam)
--------------|---------------|--------------------
1B params     | ~2 GB         | ~8 GB
7B params     | ~14 GB        | ~56 GB
13B params    | ~26 GB        | ~104 GB
70B params    | ~140 GB       | ~560 GB
175B params   | ~350 GB       | ~1.4 TB
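The table follows from multiplying the parameter count by a bytes-per-parameter figure. The sketch below regenerates it under two assumptions: 2 bytes per parameter for FP16 inference, and 8 bytes per parameter for training, which is the ratio the table implies. Real usage also depends on activations, batch size, and context length. The name `memory_gb` is hypothetical.

```python
BYTES_INFERENCE = 2  # FP16 weights only
BYTES_TRAINING = 8   # assumption that matches the table (4x inference)


def memory_gb(n_params, bytes_per_param):
    """Memory in GB (10^9 bytes) for a given per-parameter cost."""
    return n_params * bytes_per_param / 1e9


rows = []
for billions in (1, 7, 13, 70, 175):
    n_params = billions * 10**9
    rows.append((billions,
                 memory_gb(n_params, BYTES_INFERENCE),
                 memory_gb(n_params, BYTES_TRAINING)))

for billions, inference_gb, training_gb in rows:
    print(f"{billions}B params: ~{inference_gb:.0f} GB inference, "
          f"~{training_gb:.0f} GB training")
```

Changing `BYTES_TRAINING` to 16 gives a rough estimate for mixed-precision setups that keep FP32 copies of the weights and optimizer state.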
      

Quantization reduces memory by storing weights at lower precision (INT8, INT4). Savings below are relative to FP32:

FP32 (32-bit): Full precision, the baseline
FP16 (16-bit): Half precision, ~2x memory savings
INT8 (8-bit): A quarter of the bits, ~4x memory savings
INT4 (4-bit): An eighth of the bits, ~8x memory savings
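To show what "lower precision" means in practice, here is a toy sketch of symmetric INT8 quantization with a single shared scale. It is a simplified illustration, not how any particular library implements quantization; production schemes typically use per-channel or per-group scales. The function names and sample weights are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest step bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte instead of four, at the cost of a rounding error no larger than half the scale.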