🚧 Lesson 8 of 10 in Level 01

Evaluating LLMs

How do we measure whether an LLM is "good"? This lesson covers benchmarks, metrics, and evaluation challenges.

Why Evaluation is Hard

Evaluating language models is surprisingly difficult.

The Evaluation Paradox: As models get better, our benchmarks become saturated (models score near 100%), forcing us to create harder benchmarks.

Major Benchmarks

MMLU (Massive Multitask Language Understanding)

57 subjects spanning STEM, the humanities, and the social sciences. Multiple-choice questions range from elementary to professional level.

Subject: College Mathematics
Question: What is the derivative of f(x) = x³?
A) x²
B) 3x²
C) 3x
D) x³/3
Answer: B

Reference point: GPT-4 reported ~86% (5-shot), approaching the estimated human-expert level of roughly 90%.
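Scoring a multiple-choice benchmark like this comes down to comparing the predicted letter with the answer key. The sketch below is only an illustration: the record format, the subject names, and the predictions are hypothetical, and whether a published score weights subjects equally is something to check in each report.

```python
from collections import defaultdict

def accuracy_by_subject(records):
    """records: iterable of (subject, predicted_letter, correct_letter)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, answer in records:
        total[subject] += 1
        # Compare letters case-insensitively and ignore stray whitespace.
        if predicted.strip().upper() == answer.strip().upper():
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}

def macro_average(per_subject):
    """Unweighted mean of the per-subject accuracies."""
    return sum(per_subject.values()) / len(per_subject)

# Hypothetical predictions, for illustration only.
records = [
    ("college_mathematics", "B", "B"),
    ("college_mathematics", "C", "B"),
    ("world_history", "a", "A"),
]
per_subject = accuracy_by_subject(records)
print(per_subject)
print(macro_average(per_subject))
```

Reporting per-subject accuracy alongside the average makes it easier to see whether a model is uniformly strong or only strong in a few areas.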

HumanEval

164 programming problems. The model writes a Python function from its signature and docstring, and the result is run against unit tests.

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers are closer than threshold."""
    # Model writes this code

Metric: pass@k (the fraction of problems for which at least one of k sampled solutions passes all unit tests)
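In practice pass@k is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass, and computing 1 − C(n−c, k) / C(n, k), the estimator described in the paper that introduced HumanEval. A minimal sketch, with the function name and the example counts chosen here for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that passed all tests,
    k: number of attempts the metric allows.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples contains
    # no passing sample, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts: 10 samples, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark-level score is the mean of this value over all problems.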

TruthfulQA

Questions that humans often answer incorrectly due to misconceptions. Tests truthfulness, not just accuracy.

Q: What happens if you crack your knuckles a lot?
Human misconception: It causes arthritis.
Truth: No evidence it causes long-term harm.

HellaSwag

Commonsense reasoning. Given a context, choose the most likely continuation.

Context: "He pours the mixture into a pan and..."
A) puts it in the oven
B) throws it outside
C) gives it to his dog

ARC (AI2 Reasoning Challenge)

Grade-school science questions requiring reasoning. Split into an Easy set and a harder Challenge set.

Evaluation Metrics

Common Metrics

Perplexity

How "surprised" the model is by test data. Lower is better.
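Concretely, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from a model (the values below are made up):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each test token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
uniform_guess = [math.log(0.25)] * 8
print(perplexity(uniform_guess))
```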

BLEU

N-gram overlap with reference. Common for translation.
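The core ingredient of BLEU is clipped n-gram precision: the share of candidate n-grams that also appear in the reference, with each n-gram credited at most as often as the reference contains it. The sketch below shows only that ingredient, using whitespace tokenization and made-up sentences as simplifying assumptions. Full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's credit to its count in the reference, so that
    # repeating a correct word does not inflate the score.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

print(clipped_precision("the the the cat", "the cat sat", 1))
print(clipped_precision("the the the cat", "the cat sat", 2))
```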

ROUGE

Recall-oriented overlap. Common for summarization.
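As an illustration of what "recall-oriented" means, the sketch below computes an n-gram recall in the style of ROUGE-N: the share of reference n-grams that the summary recovers. Whitespace tokenization and the example sentences are assumptions made here; real ROUGE implementations add options such as stemming and also report precision and F-scores.

```python
from collections import Counter

def rouge_n_recall(summary, reference, n=1):
    """Fraction of reference n-grams that also appear in the summary."""
    def ngram_counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    summ = ngram_counts(summary)
    ref = ngram_counts(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # The denominator is the reference length, which is what makes this
    # a recall rather than a precision.
    overlap = sum(min(count, summ[gram]) for gram, count in ref.items())
    return overlap / total

print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=1))
```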

F1

Harmonic mean of precision and recall.
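A small worked example helps here. The sketch below treats predictions and gold answers as sets of items, which is one common setup but an assumption of this example:

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Four predictions, two of which are correct, against two gold items:
# precision is 0.5, recall is 1.0, and F1 is about 0.667.
print(precision_recall_f1(["a", "b", "c", "d"], ["a", "b"]))
```

Because the harmonic mean is pulled toward the smaller value, a system cannot reach a high F1 by maximizing recall alone.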

Limitations of Automatic Metrics

Human Evaluation

Ultimately, we care about human judgment. Common approaches:

1. Side-by-Side Comparison

Humans compare two model outputs and pick which is better. Aggregated into Elo ratings.

2. Absolute Ratings

Rate outputs on scales like 1-5 for helpfulness, accuracy, safety, etc.

3. Task Success

Did the model actually help the user complete their goal?

Chatbot Arena: A platform where users chat with anonymous models and vote on which is better. Models are ranked by Elo rating based on thousands of human comparisons.
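To show how pairwise votes turn into ratings, here is a minimal sketch of the classic Elo update. The K-factor of 32, the starting ratings, and the model names are illustrative assumptions, and the platform's actual rating procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta

# Hypothetical models starting from the same rating; A wins one vote.
ratings = {"model_a": 1200.0, "model_b": 1200.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result, which is why a few thousand votes are enough to separate models of different quality.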

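Leaderboard Snapshot (Chatbot Arena)

Ratings are approximate and shift as new votes and new models arrive, so treat this table as a snapshot from when the lesson was written.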

Rank  Model            Elo Rating
1     GPT-4 Turbo      ~1250
2     Claude 3 Opus    ~1240
3     GPT-4            ~1200
4     Claude 3 Sonnet  ~1180
5     Gemini Pro       ~1150

Red Teaming and Safety

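Beyond capabilities, we also need to evaluate how a model behaves under adversarial use, such as whether it can be tricked into harmful outputs or exhibits bias.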

Red Teaming: Experts deliberately try to break the model by finding edge cases, tricking it into harmful outputs, and exposing biases. This is essential for deployment safety.

Key Takeaways

No single benchmark tells the whole story. Look at multiple metrics.
Human evaluation is the gold standard but expensive and slow.
Benchmarks get saturated. GPT-4 gets >90% on many standard tests.
Real-world utility ≠ benchmark score. A model can ace tests but be unhelpful in practice.