Perplexity
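Perplexity is the standard intrinsic metric for language models; lower is better. Informally, it is the number of equally likely options the model is effectively choosing between at each token.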
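As a rough illustration, perplexity can be computed as the exponential of the average negative log-likelihood per token. The sketch below assumes the probabilities the model assigned to each observed token are already available as a plain list; the names `perplexity` and `token_probs` are hypothetical and chosen only for this example.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probability the model gave each observed token.

    `token_probs` is an assumed input: one probability in (0, 1] per token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating turns the average loss back into an "effective
    # number of choices" per token.
    return math.exp(avg_nll)


# A model that spreads probability uniformly over 100 options at every step.
uniform = perplexity([1 / 100] * 20)

# A more confident model assigns higher probability to the observed tokens.
confident = perplexity([0.5, 0.25, 0.5, 0.25])

print(f"uniform over 100 options: {uniform:.2f}")
print(f"more confident model:     {confident:.2f}")
```

The uniform case comes out at roughly 100, which matches the "choosing uniformly from 100 options" reading of the metric. In practice the per-token probabilities would come from the model's output distribution on held-out text.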
Benchmarks
- HellaSwag: Commonsense reasoning
- MMLU: Multi-task understanding
- HumanEval: Code generation
- TruthfulQA: Truthfulness
When to Stop
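A common stopping rule is early stopping: watch the validation loss and halt once it has failed to improve for a set number of epochs, since a falling training loss paired with a rising validation loss signals overfitting. The following is a minimal sketch, assuming per-epoch validation losses are already recorded in a list; the function name `early_stopping_epoch` and the parameters `patience` and `min_delta` are illustrative choices, not taken from the original text.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    stop_epoch is None if the patience budget is never exhausted.
    All names and defaults here are assumptions made for illustration.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return None, best_epoch


# Validation loss falls, then starts rising: the overfitting pattern.
history = [2.0, 1.5, 1.2, 1.1, 1.15, 1.2, 1.3, 1.4]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(f"stop at epoch {stop_epoch}, restore checkpoint from epoch {best_epoch}")
```

A larger `patience` tolerates noisy validation curves at the cost of extra training time; restoring the checkpoint from the best epoch avoids keeping weights from the overfit epochs.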
Knowledge Check
Quiz
1. What does a perplexity of 100 mean?
- A) The model is 100% accurate
- B) The model is as confused as choosing uniformly from 100 options
- C) The model has 100 parameters
Answer: B. A perplexity of 100 means that, on average, the model is as uncertain about each token as if it were choosing uniformly among 100 equally likely options.
2. Which benchmark tests code generation capabilities?
- A) MMLU
- B) HellaSwag
- C) HumanEval
Answer: C. HumanEval is specifically designed to evaluate code generation performance.
3. What indicates overfitting during training?
- A) Both training and validation loss decrease
- B) Training loss decreases while validation loss increases
- C) Both training and validation loss increase
Answer: B. Overfitting occurs when training loss continues to decrease but validation loss starts increasing, indicating the model is memorizing training data rather than generalizing.
4. Which benchmark evaluates truthfulness in model responses?
- A) TruthfulQA
- B) MMLU
- C) HellaSwag
Answer: A. TruthfulQA measures whether a model gives truthful answers, in particular whether it avoids repeating common human misconceptions.