🚧 Lesson 4 of 8 in Level 02

Loss Functions

How we measure prediction error. MSE, cross-entropy, and why language models use perplexity.

What is a Loss Function?

A loss function (or cost function) measures how wrong our predictions are. It quantifies the difference between what the model predicts and the true target values.

Key Properties:
• Loss = 0 means perfect prediction
• Higher loss means worse prediction
• Should be differentiable almost everywhere, so that gradient descent can be applied

Training Objective

Training a neural network means:

Find parameters θ that minimize:

Loss(θ) = (1/N) Σ loss(f(xᵢ; θ), yᵢ)

Where:
- f(xᵢ; θ) is the model prediction
- yᵢ is the true target
- N is the number of examples
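The objective above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a training loop: `mean_loss`, `linear_model`, and `squared_error` are hypothetical names chosen for this example, and the toy model f(x; θ) = w·x + b is an assumption, not something the lesson prescribes.

```python
def mean_loss(model, per_example_loss, xs, ys, theta):
    """Average per-example loss: (1/N) * sum(loss(f(x_i; theta), y_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += per_example_loss(model(x, theta), y)
    return total / len(xs)


def linear_model(x, theta):
    # Hypothetical toy model: f(x; theta) = w * x + b
    w, b = theta
    return w * x + b


def squared_error(y_pred, y_true):
    return (y_pred - y_true) ** 2


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x

# Parameters that match the data give zero loss; worse parameters give higher loss.
loss_good = mean_loss(linear_model, squared_error, xs, ys, theta=(2.0, 0.0))
loss_bad = mean_loss(linear_model, squared_error, xs, ys, theta=(1.0, 0.0))
print(loss_good, loss_bad)
```

Training then amounts to searching for the θ that makes this average as small as possible.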

Common Loss Functions

Mean Squared Error (MSE)

MSE = (1/n) Σ(y_pred - y_true)²

Used for regression. Penalizes large errors more heavily.

When to use: Predicting continuous values (prices, temperatures)
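As a rough sketch of why MSE penalizes large errors more heavily, the two prediction sets below have the same total absolute error (3), but one spreads it over three small errors and the other concentrates it in a single large one. The function name `mse` and the sample values are illustrative assumptions.

```python
def mse(y_pred, y_true):
    """Mean squared error between two equal-length sequences."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = [5.0, 5.0, 5.0]

spread_out = mse([6.0, 6.0, 6.0], y_true)    # three errors of size 1
concentrated = mse([5.0, 5.0, 8.0], y_true)  # one error of size 3

print(spread_out)    # 1.0
print(concentrated)  # 3.0
```

Squaring makes the single error of size 3 count for 9, so the concentrated case scores three times worse.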

Mean Absolute Error (MAE)

MAE = (1/n) Σ|y_pred - y_true|

Used for regression. More robust to outliers than MSE.

When to use: When outliers shouldn't dominate
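The robustness claim can be checked with a small experiment. The sketch below, using made-up numbers, compares MAE and MSE on the same predictions with and without one badly wrong value; `mae` and `mse` are hypothetical helper names.

```python
def mae(y_pred, y_true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def mse(y_pred, y_true):
    """Mean squared error, for comparison."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = [10.0, 12.0, 11.0, 13.0]
clean = [11.0, 13.0, 12.0, 14.0]         # every prediction is off by 1
with_outlier = [11.0, 13.0, 12.0, 33.0]  # last prediction is off by 20

print(mae(clean, y_true), mse(clean, y_true))                # 1.0 1.0
print(mae(with_outlier, y_true), mse(with_outlier, y_true))  # 5.75 100.75
```

A single outlier raises MAE by a factor of about 6 but MSE by a factor of about 100, which is why MSE-trained models can end up dominated by a few extreme points.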

Binary Cross-Entropy

BCE = -[y·log(p) + (1-y)·log(1-p)]

Used for binary classification. Here y ∈ {0, 1} is the true label and p is the predicted probability that y = 1; the loss is small when p is close to y.

When to use: Yes/no classification (spam, fraud)
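A minimal sketch of the BCE formula for a single example follows. The clipping constant `eps` is an assumption added here to avoid log(0); it is a common practical safeguard rather than part of the formula itself.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """BCE for one example: y is the true label (0 or 1), p is the predicted P(y = 1)."""
    # Clip p away from 0 and 1 so that log() never receives 0 (assumed safeguard).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(binary_cross_entropy(1, 0.9))  # confident and correct: about 0.105
print(binary_cross_entropy(1, 0.1))  # confident and wrong:   about 2.303
print(binary_cross_entropy(0, 0.1))  # label 0, low p:        about 0.105
```

Only one of the two terms is active for any given label: the first when y = 1, the second when y = 0.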

Categorical Cross-Entropy

CE = -Σ yᵢ·log(pᵢ)

Used for multi-class classification. y is one-hot encoded, so every term except the true class vanishes and the loss reduces to -log(p) for the true class.

When to use: Multi-class classification (MNIST digits)
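The sum over classes can be written out directly, as in the sketch below. The three-class probabilities are made up for illustration, and `eps` is an assumed guard against log(0).

```python
import math


def categorical_cross_entropy(y_one_hot, probs, eps=1e-12):
    """CE = -sum(y_i * log(p_i)) for a one-hot label and a probability vector."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_one_hot, probs))


probs = [0.1, 0.6, 0.3]  # predicted class probabilities (sum to 1)
y = [0, 1, 0]            # true class is index 1

loss = categorical_cross_entropy(y, probs)
print(loss)                 # about 0.511
print(-math.log(probs[1]))  # same value: only the true class contributes
```

Because the label is one-hot, implementations usually skip the sum and index the true class directly.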

Cross-Entropy for Language Models

Language models predict the next token from a vocabulary that typically contains tens of thousands to hundreds of thousands of tokens. This is a multi-class classification problem with a very large number of classes.

Example: Predicting "Paris"

Context: "The capital of France is"

Target token: " Paris" (ID: 15496)

Model outputs logits: [0.5, -1.2, 2.1, ..., -0.3] (50,000 values)
After softmax: [0.02, 0.001, 0.15, ..., 0.0001] (probabilities)
True label (one-hot): [0, 0, ..., 1, ..., 0] (1 at position 15496)

Because the label is one-hot, only the target token's probability contributes. If the model assigns p[15496] = 0.15:
Cross-Entropy = -log(p[15496]) = -log(0.15) ≈ 1.9 (natural log)

If the model were confident in the target (p = 0.9): CE = -log(0.9) ≈ 0.1
If the model gave the target little probability (p = 0.01): CE = -log(0.01) ≈ 4.6
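The logits-to-loss computation in this example can be sketched as follows. The five-element logit vector is a made-up stand-in for a 50,000-entry vocabulary, and `cross_entropy_from_logits` is a hypothetical name. The function uses the log-sum-exp form, which is mathematically the same as softmax followed by -log but avoids overflow for large logits; that choice is an assumption about good practice, not something the lesson specifies.

```python
import math


def cross_entropy_from_logits(logits, target_id):
    """-log(softmax(logits)[target_id]), computed as logsumexp(logits) - logits[target_id]."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]


def softmax(logits):
    """Plain softmax, shown only to cross-check the result."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.5, -1.2, 2.1, 0.0, -0.3]  # toy vocabulary of 5 tokens
target_id = 2

loss = cross_entropy_from_logits(logits, target_id)
probs = softmax(logits)
print(loss, -math.log(probs[target_id]))  # the two values agree
```

With all logits equal, the loss is log(V) for a vocabulary of size V, which is the loss of uniform random guessing.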

Why Cross-Entropy?

Perplexity

Perplexity is the standard evaluation metric for language models. It is the exponential of the average per-token cross-entropy (computed with the natural logarithm):

Perplexity = exp(CrossEntropy)

Intuition: Perplexity ≈ "effective vocabulary size"
- Perplexity = 100 means the model is as confused as if choosing from 100 equally-likely tokens
- Perplexity = 10 means it's as confused as choosing from 10 options
- Lower is better!
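The "equally-likely tokens" intuition can be verified numerically. In this sketch, `perplexity` is a hypothetical helper that takes the probabilities a model assigned to the correct token at each position; the input values are made up.

```python
import math


def perplexity(target_probs):
    """exp of the average negative log-probability assigned to the correct tokens."""
    avg_cross_entropy = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(avg_cross_entropy)


# A model that always spreads probability evenly over 100 tokens
uniform_100 = perplexity([1 / 100] * 8)
print(uniform_100)  # about 100

# Mixed confidence: perplexity is the inverse geometric mean of the probabilities
mixed = perplexity([0.5, 0.25])
print(mixed)  # about 2.83, i.e. sqrt(8)
```

Note that the average is taken over log-probabilities before exponentiating, so a few very unlikely tokens raise perplexity sharply.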

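Typical Perplexity Values

These are rough figures; actual values depend on the dataset and the tokenizer.

Model                                       Perplexity
Random guessing (50,000-token vocabulary)   ~50,000
Simple n-gram model                         ~500
LSTM                                        ~60
GPT-2                                       ~20
GPT-3                                       ~10
Human-level                                 ~5-8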

Exercises

Exercise 1: MSE Calculation

Predictions: [2.5, 3.0, 4.5], True values: [2.0, 3.5, 4.0]. Calculate MSE.

Exercise 2: Cross-Entropy

A model assigns the correct token a probability of 0.7. What is the cross-entropy loss (using the natural logarithm)? What if it assigned a probability of 0.1?

Exercise 3: Perplexity

If a language model has average cross-entropy of 2.3 on a dataset, what is its perplexity?