Why Evaluation is Hard
Evaluating language models is surprisingly difficult:
- Many tasks: Translation, summarization, coding, reasoning, conversation...
- No single metric: Accuracy doesn't capture helpfulness or safety
- Subjectivity: "Good" writing is in the eye of the beholder
- Training contamination: Models may have seen test data during training
- Benchmark hacking: Optimizing for metrics rather than real utility
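One common heuristic for spotting training contamination is to look for long n-grams that a test example shares with the training corpus. The sketch below assumes whitespace tokenization and a corpus small enough to hold in memory; the function names and the default `n=8` are illustrative choices, not a standard.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_examples, training_corpus, n=8):
    """Fraction of test examples sharing at least one n-gram with the training corpus."""
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc, n)
    if not test_examples:
        return 0.0
    flagged = [ex for ex in test_examples if ngrams(ex, n) & train_ngrams]
    return len(flagged) / len(test_examples)
```

A non-zero rate does not prove the model memorized the answers, and a zero rate does not rule out paraphrased leakage, so this kind of check is best treated as a first filter.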
Major Benchmarks
MMLU (Massive Multitask Language Understanding)
57 subjects spanning STEM, humanities, social sciences. Multiple choice questions from elementary to professional level.
State of the art (at the time of writing): GPT-4 scores ~86%, close to the estimated human expert accuracy of about 90%.
HumanEval
164 programming problems. The model writes a Python function from its docstring, and the result is checked against unit tests.
Metric: pass@k (the fraction of problems for which at least one of k sampled solutions passes all unit tests)
TruthfulQA
Questions that humans often answer incorrectly due to misconceptions. Tests truthfulness, not just accuracy.
HellaSwag
Commonsense reasoning. Given a context, choose the most likely continuation.
ARC (AI2 Reasoning Challenge)
Grade-school science questions requiring reasoning. Split into an Easy set and a harder Challenge set.
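MMLU, HellaSwag, and ARC are all multiple-choice benchmarks, and one common way to score them is to ask the model for the log-likelihood of each option and pick the highest. The sketch below assumes those log-likelihoods have already been computed; the item layout (`logprobs`, optional `lengths`, `answer`) is hypothetical, and length normalization is one of several conventions in use.

```python
def pick_option(option_logprobs, option_lengths=None):
    """Return the index of the option with the highest score.

    If option_lengths is given, each log-likelihood is divided by the
    option's length so longer continuations are not unfairly penalized.
    """
    if option_lengths is None:
        scores = list(option_logprobs)
    else:
        scores = [lp / max(length, 1)
                  for lp, length in zip(option_logprobs, option_lengths)]
    return max(range(len(scores)), key=scores.__getitem__)


def multiple_choice_accuracy(items):
    """Fraction of items where the top-scoring option is the labeled answer."""
    correct = sum(
        pick_option(item["logprobs"], item.get("lengths")) == item["answer"]
        for item in items
    )
    return correct / len(items)
```

Reported scores can shift by several points depending on the prompt format and on whether normalization is applied, so comparisons across papers need matching setups.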
Evaluation Metrics
Common Metrics
- Perplexity: How "surprised" the model is by test data. Lower is better.
- BLEU: N-gram overlap with a reference. Common for translation.
- ROUGE: Recall-oriented overlap with a reference. Common for summarization.
- F1: Harmonic mean of precision and recall.
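To make the "surprise" measure concrete: perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the test data. A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from the model (the function name and input format are illustrative):

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

A model that assigns probability 0.25 to every token has perplexity 4, as if it were choosing uniformly among four options at each step. Perplexities are only comparable between models that share a tokenizer and test set.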
Limitations of Automatic Metrics
- Don't capture semantic meaning
- Can be gamed (e.g., by stuffing the output with expected keywords)
- Don't measure helpfulness or truthfulness
- Reference-based metrics penalize valid paraphrases
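These limitations are easy to reproduce with a toy overlap metric. The sketch below is a simplified unigram-overlap F1, similar in spirit to ROUGE-1 but not a full BLEU or ROUGE implementation; the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over clipped unigram matches between candidate and reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Counter intersection clips each word's count to the smaller side.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"   # same meaning, no shared words
keyword_stuffed = "the the the mat"         # nonsense, many shared words

paraphrase_score = unigram_f1(paraphrase, reference)
stuffed_score = unigram_f1(keyword_stuffed, reference)
```

Here the valid paraphrase scores 0.0 while the keyword-stuffed string scores about 0.6, so the metric ranks nonsense above a correct answer.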
Human Evaluation
Ultimately, we care about human judgment. Common approaches:
1. Side-by-Side Comparison
Human raters compare two model outputs for the same prompt and pick the better one. The pairwise votes are aggregated into Elo ratings.
2. Absolute Ratings
Rate outputs on scales like 1-5 for helpfulness, accuracy, safety, etc.
3. Task Success
Did the model actually help the user complete their goal?
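The Elo aggregation used for side-by-side comparisons can be sketched as an online update over pairwise votes. This is the textbook Elo rule; the starting rating of 1000 and `k=32` are assumptions, and real leaderboards may fit ratings with a different procedure.

```python
def expected_score(rating_a, rating_b):
    """Predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


def rate_models(comparisons, initial=1000, k=32):
    """Process (model_a, model_b, score_a) votes in order and return ratings."""
    ratings = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings
```

Under this formula a 10-point rating gap corresponds to an expected win rate of only about 51%, so small differences between neighboring models should be read with caution.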
Current Leaderboard (Chatbot Arena)
| Rank | Model | Elo Rating |
|---|---|---|
| 1 | GPT-4 Turbo | ~1250 |
| 2 | Claude 3 Opus | ~1240 |
| 3 | GPT-4 | ~1200 |
| 4 | Claude 3 Sonnet | ~1180 |
| 5 | Gemini Pro | ~1150 |
Red Teaming and Safety
Beyond capabilities, we need to evaluate:
- Harmfulness: Will it help with dangerous tasks?
- Bias: Does it exhibit stereotypes?
- Hallucinations: Does it make things up?
- Jailbreak resistance: Can it be tricked into harmful outputs?
- Privacy: Does it leak training data?