What is a Loss Function?
A loss function (or cost function) measures how wrong our predictions are. It quantifies the difference between what the model predicts and the true target values.
• Loss = 0 means perfect prediction
• Higher loss means worse prediction
• Must be differentiable, at least almost everywhere, so that gradient descent can be applied
Training Objective
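Training a neural network means finding the parameters θ that minimize the average loss over the training data: θ* = argmin_θ (1/N) Σ_i L(f(x_i; θ), y_i), where f(x_i; θ) is the model's prediction for input x_i and y_i is the true target.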
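As a minimal sketch of what minimizing a loss looks like in practice, the snippet below fits a one-parameter model y = w * x with plain gradient descent on a squared-error loss. The function names, the toy data, the learning rate, and the step count are all assumptions chosen for illustration; a real network has many parameters and uses automatic differentiation.

```python
def mse_loss(w, xs, ys):
    """Average squared error of the model y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Derivative of mse_loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, w=0.0, lr=0.05, steps=200):
    """Repeatedly step w against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= lr * mse_grad(w, xs, ys)
    return w


# Toy data generated by y = 2 * x (illustrative values only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_fit = train(xs, ys)
print(f"fitted w = {w_fit:.4f}, final loss = {mse_loss(w_fit, xs, ys):.6f}")
```

Each step moves w a little in the direction that lowers the loss, which is the same idea a neural network applies to all of its weights at once.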
Common Loss Functions
Mean Squared Error (MSE)
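Used for regression. Because errors are squared, large errors are penalized more heavily than small ones: MSE = (1/n) Σ_i (y_i - ŷ_i)², where y_i is the true value and ŷ_i is the prediction.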
When to use: Predicting continuous values (prices, temperatures)
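A small sketch of mean squared error in plain Python may help make the definition concrete. The helper name `mse` and the sample values are illustrative assumptions, not part of any particular library.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values: the errors are 1, 0, and -2.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

# Squared errors are 1, 0, and 4, so the mean is 5 / 3 ≈ 1.667.
print(mse(y_true, y_pred))
```

The error of 2 contributes four times as much as the error of 1, which is the sense in which MSE penalizes large errors more heavily.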
Mean Absolute Error (MAE)
Used for regression. Because errors enter linearly rather than squared, it is more robust to outliers than MSE: MAE = (1/n) Σ_i |y_i - ŷ_i|.
When to use: When outliers shouldn't dominate
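To see the robustness claim in numbers, the sketch below computes both MAE and MSE on the same data, where one target is an outlier the model misses badly. The function names and data are assumptions made up for this comparison.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Three perfect predictions and one outlier that is missed by 38.
y_true = [10.0, 12.0, 11.0, 50.0]
y_pred = [10.0, 12.0, 11.0, 12.0]

print("MAE:", mae(y_true, y_pred))  # 38 / 4 = 9.5
print("MSE:", mse(y_true, y_pred))  # 38**2 / 4 = 361.0
```

The single outlier raises MAE in proportion to its size, while under MSE its contribution grows with the square of the error and dominates the loss.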
Binary Cross-Entropy
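Used for binary classification. Compares the predicted probability p of the positive class to the true label y ∈ {0, 1}: BCE = -[y log(p) + (1 - y) log(1 - p)], averaged over the examples.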
When to use: Yes/no classification (spam, fraud)
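The following sketch computes binary cross-entropy over a small batch using natural logarithms. The function name and the `eps` clipping constant are assumptions added here; clipping is a common guard against taking log(0), not something the definition itself requires.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy; y_true holds 0/1 labels, p_pred holds P(label = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Illustrative batch: one spam message (label 1) and one non-spam message (label 0).
labels = [1, 0]
probs = [0.9, 0.2]

bce = binary_cross_entropy(labels, probs)
print(f"BCE = {bce:.4f}")
```

For the positive example only the log(p) term is active, and for the negative example only the log(1 - p) term is, so each example is scored by the probability the model gave to its true class.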
Categorical Cross-Entropy
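Used for multi-class classification: CE = -Σ_c y_c log(p_c), where y is the one-hot encoded true label and p_c is the predicted probability of class c. Because y is one-hot, only the term for the true class contributes.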
When to use: Multi-class classification (MNIST digits)
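As a sketch of the multi-class case, the function below takes a one-hot target and a predicted probability vector and returns the cross-entropy in nats. The name `categorical_cross_entropy`, the `eps` guard, and the three-class example are illustrative assumptions.

```python
import math


def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))


# Three classes; the true class is the last one.
y_onehot = [0, 0, 1]
probs = [0.1, 0.3, 0.6]

loss = categorical_cross_entropy(y_onehot, probs)
print(f"CE = {loss:.4f}")  # equals -log(0.6)
```

Since the target is one-hot, the sum collapses to the negative log of the probability assigned to the true class.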
Cross-Entropy for Language Models
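Language models predict the next token from a vocabulary that typically contains tens of thousands of tokens or more. This is a multi-class classification problem with a very large number of classes, so the loss at each position is the negative log of the probability assigned to the correct token.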
Example: Predicting "Paris"
Context: "The capital of France is"
Target token: " Paris" (ID: 15496)
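Using natural logarithms, with p the probability the model assigns to the target token:
If the model is confident in the target (p = 0.9): CE = -log(0.9) ≈ 0.105
If the model gives the target little probability (p = 0.01): CE = -log(0.01) ≈ 4.6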
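In practice a model emits unnormalized scores (logits) over the vocabulary rather than probabilities. The sketch below, assuming a toy four-token vocabulary and made-up logits, converts logits to log-probabilities with a numerically stable log-softmax and reads off the loss for the target token. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def token_cross_entropy(logits, target_id):
    """Negative log-probability of the target token under softmax(logits)."""
    return -log_softmax(logits)[target_id]


# Toy vocabulary of 4 tokens; token 0 plays the role of the target.
logits = [2.0, 0.5, 0.1, -1.0]
print(f"loss for target 0: {token_cross_entropy(logits, 0):.3f}")

# The two cases from the example, computed directly from the probabilities.
print(f"p = 0.9  -> CE = {-math.log(0.9):.3f}")
print(f"p = 0.01 -> CE = {-math.log(0.01):.3f}")
```

With uniform logits the loss equals the log of the vocabulary size, which is the value an untrained model would be expected to start near.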
Why Cross-Entropy?
- Probabilistically sound: Minimizing CE = maximizing likelihood
- Gradients flow well: Even wrong predictions get meaningful gradients
- Penalizes confidence in wrong answers: Being confidently wrong is punished heavily
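The point about gradients can be illustrated with a standard result: for softmax followed by cross-entropy, the gradient of the loss with respect to the logits is the predicted probabilities minus the one-hot target. The sketch below, with hypothetical function names and made-up logits, evaluates that gradient for a model that is confidently wrong.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad_wrt_logits(logits, target_id):
    """Gradient of softmax cross-entropy w.r.t. the logits: probs - one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_id else 0.0) for i, p in enumerate(probs)]


# The model strongly prefers token 0, but the correct token is 1.
grad = ce_grad_wrt_logits([5.0, 0.0, 0.0], target_id=1)
print([round(g, 4) for g in grad])
```

The gradient on the target logit is close to -1 and the gradient on the wrongly favored logit is close to +1, so a confidently wrong prediction still receives a strong corrective signal.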
Perplexity
Perplexity is the standard evaluation metric for language models. It is the exponential of the average per-token cross-entropy, measured with natural logarithms: Perplexity = exp(CE).
- Perplexity = 100 means the model is as confused as if it were choosing among 100 equally likely tokens
- Perplexity = 10 means it is as confused as if it were choosing among 10 equally likely tokens
- Lower is better!
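The "equally likely tokens" reading can be checked numerically. The sketch below computes perplexity as the exponential of the average negative log-probability of the correct tokens; the function name and the probability lists are assumptions for illustration.

```python
import math


def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    mean_ce = sum(-math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(mean_ce)


# A model that always gives the correct token probability 1/100
# behaves like a uniform guess over 100 tokens.
print(perplexity([0.01] * 5))

# Mixed confidence: perplexity is the inverse geometric mean of the probabilities.
print(perplexity([0.5, 0.25, 0.125]))
```

Because the average is taken in log space, a few very low-probability tokens can raise perplexity considerably even when most predictions are good.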
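Typical Perplexity Values
These are rough orders of magnitude only; actual values depend on the dataset, the tokenizer, and the vocabulary size. The random-guessing row assumes a vocabulary of about 50,000 tokens.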
| Model | Perplexity |
|---|---|
| Random guessing | ~50,000 |
| Simple n-gram model | ~500 |
| LSTM | ~60 |
| GPT-2 | ~20 |
| GPT-3 | ~10 |
| Human-level | ~5-8 |
Exercises
Exercise 1: MSE Calculation
Predictions: [2.5, 3.0, 4.5], True values: [2.0, 3.5, 4.0]. Calculate MSE.
Exercise 2: Cross-Entropy
A model predicts the correct token with probability 0.7. What is the cross-entropy loss? What if it predicted with probability 0.1?
Exercise 3: Perplexity
If a language model has average cross-entropy of 2.3 on a dataset, what is its perplexity?