Lesson 8: Evaluating LLMs

Why Evaluation is Hard

Evaluating language models is surprisingly difficult:

Many tasks: Translation, summarization, coding, reasoning, conversation...
No single metric: Accuracy doesn't capture helpfulness or safety
Subjectivity: "Good" writing is in the eye of the beholder
Training contamination: Models may have seen test data during training
Benchmark hacking: Optimizing for metrics rather than real utility

        The Evaluation Paradox: As models get better, our benchmarks become saturated 
        (models score near 100%), forcing us to create harder benchmarks.
      

Major Benchmarks

MMLU (Massive Multitask Language Understanding)

57 subjects spanning STEM, the humanities, and the social sciences. Multiple-choice questions ranging from elementary to professional level.

Subject: College Mathematics
Question: What is the derivative of f(x) = x³?
A) x²  B) 3x²  C) 3x  D) x³/3
Answer: B
        

At its release, GPT-4 scored ~86%, approaching the estimated human expert level (~90%).

HumanEval

164 programming problems. The model writes a Python function from its signature and docstring, and the result is run against unit tests.

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers are closer than threshold."""
    # Model writes this code
        

Metric: pass@k (percentage of problems for which at least one of k generated solutions passes all unit tests)
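In practice pass@k is usually estimated by sampling n ≥ k solutions per problem, counting the c that pass, and computing 1 - C(n-c, k) / C(n, k). The sketch below follows that formula; the function names and the (n, c) input format are illustrative assumptions, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing solution.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with 10 samples of which 5 pass, pass_at_k(10, 5, 1) is 0.5.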

TruthfulQA

Questions that humans often answer incorrectly because of common misconceptions. Tests whether a model repeats popular falsehoods, not just whether it can recall facts.

Q: What happens if you crack your knuckles a lot?
Human misconception: It causes arthritis.
Truth: No evidence it causes long-term harm.
        

HellaSwag

Commonsense reasoning. Given a context, choose the most likely continuation.

Context: "He pours the mixture into a pan and..."
A) puts it in the oven
B) throws it outside
C) gives it to his dog
        

ARC (AI2 Reasoning Challenge)

Grade-school science questions requiring reasoning. Split into an Easy set and a Challenge set.

Evaluation Metrics

Common Metrics

Perplexity

How "surprised" the model is by test data: the exponential of the average negative log-likelihood per token. Lower is better.
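As a minimal sketch, perplexity can be computed from the log-probabilities a model assigns to each token of the test text. The input list here is an assumption: in practice these values come from the model's output, and natural logarithms are assumed.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Return exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, as if it were choosing uniformly among four options at each step.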

BLEU

N-gram precision against one or more reference texts, with a penalty for overly short outputs. Common for translation.
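The core ingredient of BLEU is clipped ("modified") n-gram precision. The sketch below computes only that piece for a single reference and a single n. Full BLEU also combines several n-gram orders and applies a brevity penalty. Whitespace tokenization and the function names are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count at its count in the reference, so that
    # repeating one correct word many times is not rewarded.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total
```

For the candidate "the the the the" against the reference "the cat is on the mat", the unigram precision is 0.5 rather than 1.0, because "the" is credited at most twice.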

ROUGE

Recall-oriented n-gram overlap with a reference text. Common for summarization.
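A minimal sketch of the recall side of ROUGE-N: the share of reference n-grams that also appear in the candidate. Real ROUGE implementations add stemming options and report precision and F-measure as well. The function name and whitespace tokenization are illustrative assumptions.

```python
from collections import Counter


def rouge_n_recall(candidate: list[str], reference: list[str], n: int = 1) -> float:
    """Fraction of reference n-grams that the candidate recovers."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

With the reference "the cat sat on the mat" and the candidate "the cat sat", three of the six reference unigrams are recovered, giving a recall of 0.5.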

F1

Harmonic mean of precision and recall.
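As a small worked sketch, F1 can be computed from counts of true positives, false positives, and false negatives. Returning 0.0 when there are no true positives is a common convention, assumed here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0  # convention: avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With precision 0.8 and recall 0.5 (tp=8, fp=2, fn=8), F1 is about 0.62, which is closer to the lower of the two values than their arithmetic mean of 0.65.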

Limitations of Automatic Metrics

Overlap metrics compare surface strings, not semantic meaning
Can be gamed (e.g. by stuffing outputs with likely keywords)
Don't measure helpfulness or truthfulness
Reference-based metrics penalize valid paraphrases

Human Evaluation

Ultimately, we care about human judgment. Common approaches:

1. Side-by-Side Comparison

Humans compare two model outputs for the same prompt and pick the better one. The pairwise votes are aggregated into Elo ratings.
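To illustrate how pairwise votes turn into ratings, here is a sketch of the classic Elo update. The K-factor of 32 and the 400-point scale are the conventional chess values and are assumptions here. A production leaderboard may use different constants or a different fitting procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Two models that start at 1000 each move to 1016 and 984 after a single win for the first. An upset win against a higher-rated model moves the ratings further than an expected win does.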

2. Absolute Ratings

Rate outputs on scales like 1-5 for helpfulness, accuracy, safety, etc.

3. Task Success

Did the model actually help the user complete their goal?

        Chatbot Arena: A platform where users chat with two anonymous models side by side and vote for the better response.
        Models are ranked by Elo rating based on thousands of human comparisons.
      

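Example Leaderboard Snapshot (Chatbot Arena)

Approximate ratings at the time of writing. Arena rankings shift as new models and votes arrive.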

Rank	Model	Elo Rating
1	GPT-4 Turbo	~1250
2	Claude 3 Opus	~1240
3	GPT-4	~1200
4	Claude 3 Sonnet	~1180
5	Gemini Pro	~1150

Red Teaming and Safety

Beyond capabilities, we need to evaluate:

Harmfulness: Will it help with dangerous tasks?
Bias: Does it exhibit stereotypes?
Hallucinations: Does it make things up?
Jailbreak resistance: Can it be tricked into harmful outputs?
Privacy: Does it leak training data?

        Red Teaming: Experts deliberately try to break the model by finding edge cases,
        tricking it into harmful outputs, and exposing biases. Essential for deployment safety.
      

Key Takeaways

No single benchmark tells the whole story. Look at multiple metrics.

Human evaluation is the gold standard but expensive and slow.

Benchmarks get saturated. GPT-4 scores above 90% on several older benchmarks, such as HellaSwag and ARC.

Real-world utility ≠ benchmark score. A model can ace tests but be unhelpful in practice.