Lesson 9: Evaluation

Perplexity

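Perplexity is the standard intrinsic metric for language models:

Perplexity = exp(average per-token cross-entropy loss)

The loss must be measured in nats (natural logarithm) for exp to be the matching inverse. Lower is better. A perplexity of 100 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 100 options.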
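To make the formula concrete, here is a minimal sketch that computes perplexity from per-token log-probabilities. The function name `perplexity` and the input format (a list of natural-log probabilities the model assigned to each correct token) are assumptions made for illustration, not part of any particular library.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the correct tokens."""
    # Cross-entropy is the mean negative log-likelihood per token.
    avg_cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_cross_entropy)

# Hypothetical model that gives probability 1/100 to every correct token:
# this is exactly the "choosing uniformly from 100 options" case.
uniform_100 = [math.log(1 / 100)] * 50
print(perplexity(uniform_100))  # approximately 100

# A more confident hypothetical model (probability 1/2 per token).
confident = [math.log(0.5)] * 50
print(perplexity(confident))  # approximately 2
```

If the loss is reported in bits (log base 2) instead of nats, the equivalent computation would be 2 raised to the average loss.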
      

Benchmarks

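HellaSwag: Commonsense reasoning (picking the most plausible continuation of a scenario)
MMLU: Multi-task understanding (multiple-choice questions across 57 academic and professional subjects)
HumanEval: Code generation (Python functions checked against unit tests)
TruthfulQA: Truthfulness (questions designed to elicit common misconceptions)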

When to Stop

Early stopping: Stop training when validation loss stops improving.
Overfitting: Training loss keeps falling (↓) while validation loss rises (↑).
Checkpointing: Save the model whenever validation loss reaches a new best, and keep that checkpoint.
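The three ideas above can be combined into one loop. The following is a minimal sketch, not a definitive implementation: the function name `early_stopping` and the `patience` and `min_delta` parameters are assumptions introduced here. It takes a list of per-epoch validation losses and reports which epoch would have been checkpointed and at which epoch training would stop.

```python
def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    patience:  epochs to wait without improvement before stopping (assumed).
    min_delta: minimum decrease that counts as an improvement (assumed).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: a real training loop would save a checkpoint here.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # validation loss has plateaued or risen

    # Ran out of epochs without triggering the stop condition.
    return best_epoch, len(val_losses) - 1

# Validation loss improves until epoch 3, then rises (a sign of overfitting).
val_losses = [2.9, 2.5, 2.3, 2.2, 2.25, 2.3, 2.4, 2.6]
print(early_stopping(val_losses))  # (3, 6)
```

A patience greater than 1 avoids stopping on a single noisy epoch. The checkpoint from `best_epoch` is the one to keep, not the weights at `stop_epoch`.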
      

Knowledge Check

Quiz

1. What does a perplexity of 100 mean?

A) The model is 100% accurate
B) The model is as confused as choosing uniformly from 100 options
C) The model has 100 parameters

Answer: B. A perplexity of 100 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 100 equally likely options.

2. Which benchmark tests code generation capabilities?

A) MMLU
B) HellaSwag
C) HumanEval

Answer: C. HumanEval evaluates code generation by checking model-written Python functions against unit tests.

3. What indicates overfitting during training?

A) Both training and validation loss decrease
B) Training loss decreases while validation loss increases
C) Both training and validation loss increase

Answer: B. Overfitting occurs when training loss continues to decrease but validation loss starts increasing, indicating the model is memorizing training data rather than generalizing.

4. Which benchmark evaluates truthfulness in model responses?

A) TruthfulQA
B) MMLU
C) HellaSwag

Answer: A. TruthfulQA measures whether a model's answers are truthful, using questions that tend to elicit common misconceptions.