🚧 Lesson 9 of 35 in Level 04

Evaluation

Perplexity, benchmarks, and knowing when to stop training.

Perplexity

The standard metric for language models:

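Perplexity = exp(average per-token cross-entropy loss)

Lower is better. The cross-entropy must be computed with the natural logarithm for exp to be its inverse. A perplexity (PPL) of 100 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 100 options.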
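To make the formula concrete, here is a minimal sketch in plain Python. It assumes you already have the probability the model assigned to each correct token; the function name `perplexity` and the example probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each correct token."""
    # Per-token cross-entropy is -ln(p); average it over the sequence.
    avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating the average loss gives the perplexity.
    return math.exp(avg_loss)

# A model that always spreads probability uniformly over 100 options
# assigns p = 1/100 to every correct token.
uniform_ppl = perplexity([1 / 100] * 50)

# A more confident model (illustrative numbers) has a lower perplexity.
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7])

print(round(uniform_ppl, 6), confident_ppl < uniform_ppl)
```

The uniform case comes out at 100 (up to floating-point rounding), which matches the "choosing from 100 options" reading of the metric.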

Benchmarks

When to Stop

  • Early stopping: stop training when validation loss plateaus.
  • Overfitting: training loss keeps decreasing (↓) while validation loss increases (↑).
  • Checkpointing: save the best model, as judged by validation loss.
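These three ideas usually work together in one loop. The sketch below is one possible way to do it, not a definitive recipe: it takes a list of per-epoch validation losses in place of a real training run, and the names `early_stopping`, `patience`, and `min_delta` are assumptions chosen for this example.

```python
def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, best_loss, stopped_epoch) for a run of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: this is where a real loop
            # would save a checkpoint of the model.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            # Validation loss has plateaued or risen (possible overfitting).
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, best_loss, epoch

    # Ran out of epochs without triggering early stopping.
    return best_epoch, best_loss, len(val_losses) - 1

# Illustrative losses: improvement until epoch 3, then a steady rise.
losses = [2.0, 1.5, 1.2, 1.1, 1.15, 1.2, 1.3, 1.4]
print(early_stopping(losses, patience=3))  # (3, 1.1, 6)
```

Here training stops at epoch 6, after three epochs without improvement, and the checkpoint to keep is the one from epoch 3. A larger `patience` tolerates noisier validation curves at the cost of extra training time.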

Knowledge Check

Quiz

1. What does a perplexity of 100 mean?

  • A) The model is 100% accurate
  • B) The model is as confused as choosing uniformly from 100 options
  • C) The model has 100 parameters
Answer:

B. A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 equally likely options.

2. Which benchmark tests code generation capabilities?

  • A) MMLU
  • B) HellaSwag
  • C) HumanEval
Answer:

C. HumanEval is designed to evaluate code generation: the model writes functions that are checked against unit tests.

3. What indicates overfitting during training?

  • A) Both training and validation loss decrease
  • B) Training loss decreases while validation loss increases
  • C) Both training and validation loss increase
Answer:

B. Overfitting shows up when training loss keeps decreasing but validation loss starts to increase, which indicates the model is memorizing the training data rather than generalizing.

4. Which benchmark evaluates truthfulness in model responses?

  • A) TruthfulQA
  • B) MMLU
  • C) HellaSwag
Answer:

A. TruthfulQA measures whether a model's answers are truthful, using questions that tend to elicit common misconceptions.