What Are Parameters?
When we say GPT-3 has "175 billion parameters," what does that mean?
Think of parameters like this:
- Each connection between neurons has a weight
- Each neuron has a bias
- Embeddings are parameters
- Attention weights are parameters
Model Size Comparison
Emergent Abilities
As models get bigger, they don't just get better at existing tasks — they gain entirely new capabilities. These are called emergent abilities.
| Capability | 1B params | 10B params | 100B+ params |
|---|---|---|---|
| Basic text completion | ✓ | ✓ | ✓ |
| Simple Q&A | ✓ | ✓ | ✓ |
| Following instructions | ✗ | ✓ | ✓ |
| Chain-of-thought reasoning | ✗ | ✗ | ✓ |
| Code generation | ✗ | ✓ | ✓ |
| Multi-step reasoning | ✗ | ✗ | ✓ |
Why Do Abilities Emerge?
Several theories:
- More capacity: Larger models can store more patterns and relationships
- Composition: Complex abilities require composing simpler ones
- Better representations: Larger models learn more useful internal representations
- Critical mass: Some tasks need a minimum amount of "knowledge" to work
Scaling Laws
Researchers discovered that model performance follows predictable patterns as you scale up:
Double the model size → predictable improvement in loss
Three things can be scaled:
- Model size (N): Number of parameters
- Data size (D): Number of training tokens
- Compute (C): Training FLOPs
The Chinchilla paper (2022) found that most models were undertrained. For optimal performance, you need roughly 20 tokens per parameter.
What Size Do You Need?
Model Size Guide
Small (1B-7B)
- Fast inference
- Can run on consumer hardware
- Good for simple tasks
- Fine-tuning is cheap
Examples: GPT-2, DistilBERT, small LLaMA
Medium (7B-30B)
- Good instruction following
- Basic reasoning
- Can run on single GPU
- Good balance of cost/performance
Examples: LLaMA-2-13B, Mistral-7B
Large (30B-70B)
- Strong reasoning
- Code generation
- Multi-step tasks
- Needs multiple GPUs
Examples: LLaMA-2-70B, GPT-3
XLarge (100B+)
- Advanced reasoning
- Complex instructions
- Few-shot learning
- Expensive to run
Examples: GPT-4, Claude, PaLM
Memory Requirements
Running models requires significant memory:
Quantization reduces memory by using lower precision (INT8, INT4):
- FP32 (32-bit): Standard precision
- FP16 (16-bit): Half precision, ~2x speedup
- INT8 (8-bit): Quarter precision, ~4x memory savings
- INT4 (4-bit): Eighth precision, ~8x memory savings