Lesson 4: Loss Functions

What is a Loss Function?

A loss function (or cost function) measures how wrong our predictions are. It quantifies the difference between what the model predicts and the true target values.

Key Properties:

• Loss = 0 means perfect prediction

• Higher loss means worse prediction

• Should be differentiable, at least almost everywhere (for gradient descent)

Training Objective

Training a neural network means:

Find parameters θ that minimize:  Loss(θ) = (1/N) Σ loss(f(xᵢ; θ), yᵢ)

Where:
- f(xᵢ; θ) is the model prediction
- yᵢ is the true target
- N is the number of examples
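
The objective above can be sketched in a few lines of Python. This is a minimal illustration rather than a training loop: the one-parameter model `f`, the squared-error `loss`, and the toy data are all assumptions chosen for the example.

```python
def f(x, theta):
    """Hypothetical model: a line through the origin with slope theta."""
    return theta * x


def loss(prediction, target):
    """Per-example loss; squared error is assumed here for illustration."""
    return (prediction - target) ** 2


def total_loss(theta, xs, ys):
    """Loss(theta) = (1/N) * sum of the per-example losses."""
    n = len(xs)
    return sum(loss(f(x, theta), y) for x, y in zip(xs, ys)) / n


# Toy data generated by y = 2x, so theta = 2 should give zero loss.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for theta in (0.0, 1.0, 2.0, 3.0):
    print(theta, total_loss(theta, xs, ys))
```

The printed loss is smallest at theta = 2 and grows as theta moves away from it in either direction, which is the landscape gradient descent walks down.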
      

Common Loss Functions

Mean Squared Error (MSE)

MSE = (1/n) Σ(y_pred - y_true)²

Used for regression. Because each error is squared, large errors are penalized disproportionately: doubling an error quadruples its contribution.

When to use: Predicting continuous values (prices, temperatures)
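
A minimal sketch of MSE in plain Python, assuming predictions and targets arrive as equal-length lists of floats. The function name `mse` and the sample values are illustrative.

```python
def mse(y_pred, y_true):
    """Mean squared error over paired predictions and targets."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_pred)


# Errors of 1, 1 and 4: the single large error contributes 16 of the 18 total.
print(mse([1.0, 1.0, 4.0], [0.0, 0.0, 0.0]))  # 6.0
```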

Mean Absolute Error (MAE)

MAE = (1/n) Σ|y_pred - y_true|

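Used for regression. More robust to outliers than MSE, because the penalty grows linearly with the error rather than quadratically.

When to use: Regression where a few outliers shouldn't dominate the loss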
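
The robustness claim can be seen by scoring the same predictions with and without one outlier. This is a sketch with made-up numbers; `mae` and `mse` are small helper functions defined here for the comparison.

```python
def mae(y_pred, y_true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_pred)


def mse(y_pred, y_true):
    """Mean squared error, included only for comparison."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_pred)


y_true = [10.0, 10.0, 10.0, 10.0]
clean = [11.0, 9.0, 11.0, 9.0]          # every error has magnitude 1
with_outlier = [11.0, 9.0, 11.0, 30.0]  # the last error has magnitude 20

print(mae(clean, y_true), mse(clean, y_true))                # 1.0 1.0
print(mae(with_outlier, y_true), mse(with_outlier, y_true))  # 5.75 100.75
```

One bad prediction raises MAE by a factor of about 6 but MSE by a factor of about 100.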

Binary Cross-Entropy

BCE = -[y·log(p) + (1-y)·log(1-p)]

Used for binary classification. Here y ∈ {0, 1} is the true label and p is the predicted probability that y = 1.

When to use: Yes/no classification (spam, fraud)
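
A sketch of the BCE formula for a single example. The `eps` clipping is an added assumption, not part of the formula: it keeps `log(0)` from occurring when the model outputs exactly 0 or 1.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """BCE for one example.

    y is the true label (0 or 1); p is the predicted probability that y = 1.
    p is clipped to [eps, 1 - eps] (an assumed safeguard) so log(0) never occurs.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(round(binary_cross_entropy(1, 0.8), 3))  # 0.223  (label 1, model agrees)
print(round(binary_cross_entropy(1, 0.2), 3))  # 1.609  (label 1, model disagrees)
print(round(binary_cross_entropy(0, 0.2), 3))  # 0.223  (label 0, model agrees)
```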

Categorical Cross-Entropy

CE = -Σ yᵢ·log(pᵢ)

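Used for multi-class classification. y is one-hot encoded, so only the true class's term is nonzero and the loss reduces to -log of the probability assigned to the true class.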

When to use: Multi-class classification (MNIST digits)
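
A sketch of the categorical formula with a one-hot target, assuming `probs` is already a valid probability distribution (for example, the output of a softmax). The three-class values are made up for illustration.

```python
import math


def categorical_cross_entropy(y_one_hot, probs):
    """CE = -sum_i y_i * log(p_i).

    Terms with y_i = 0 are skipped; they contribute nothing, and skipping them
    avoids evaluating log(0) for non-target classes with zero probability.
    """
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, probs) if y > 0)


probs = [0.2, 0.5, 0.3]  # predicted distribution over three classes
y = [0, 1, 0]            # the true class is index 1

ce = categorical_cross_entropy(y, probs)
print(round(ce, 3))  # 0.693
```

With a one-hot target the sum has a single surviving term, so the result equals `-math.log(probs[1])`.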

Cross-Entropy for Language Models

Language models predict the next token from a vocabulary that typically holds tens of thousands of tokens, sometimes hundreds of thousands. This is a multi-class classification problem with a very large number of classes.

Example: Predicting "Paris"

Context: "The capital of France is"

Target token: " Paris" (ID: 15496)

Model outputs logits: [0.5, -1.2, 2.1, ..., -0.3]  (50,000 values)
After softmax:        [0.02, 0.001, 0.15, ..., 0.0001] (probabilities)

True label (one-hot): [0, 0, ..., 1, ..., 0]  (1 at position 15496)

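Suppose the model assigns the target token probability p[15496] = 0.15. Using the natural log:

Cross-Entropy = -log(p[15496]) = -log(0.15) ≈ 1.9

If the model had been confident and correct (p = 0.9): CE = -log(0.9) ≈ 0.1
If the model had given the target little probability (p = 0.01): CE = -log(0.01) ≈ 4.6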
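
The logits-to-loss pipeline in this example can be sketched as follows. A hypothetical 5-token vocabulary stands in for the 50,000-token one, and the function names are illustrative. The log-softmax subtracts the maximum logit before exponentiating, a common trick to avoid overflow.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: shift by the max logit before exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log p[target_id]."""
    return -log_softmax(logits)[target_id]


# Hypothetical 5-token vocabulary (the lesson's example has 50,000 entries).
logits = [0.5, -1.2, 2.1, 0.0, -0.3]
probs = [math.exp(lp) for lp in log_softmax(logits)]

print(round(sum(probs), 6))  # 1.0
# The token with the largest logit (index 2) is the cheapest target to predict.
print(next_token_loss(logits, 2) < next_token_loss(logits, 1))  # True
```

Working with log-probabilities directly, instead of computing softmax and then taking a log, also avoids `log(0)` when a probability underflows.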

Why Cross-Entropy?

- Probabilistically sound: Minimizing CE = maximizing likelihood
- Gradients flow well: Even wrong predictions get meaningful gradients
- Penalizes confidence in wrong answers: Being confidently wrong is punished heavily
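
The last two points can be checked numerically. For softmax followed by cross-entropy, the gradient of the loss with respect to the logits is p - y, where y is the one-hot target; this is a standard result that the lesson does not derive. The sketch below uses a hypothetical 3-token vocabulary and prints the target probability, the loss, and the gradient on the target logit.

```python
import math


def softmax(logits):
    """Stable softmax: shift by the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad(logits, target_id):
    """Gradient of -log softmax(logits)[target_id] w.r.t. the logits: p - y."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target_id else 0.0) for i, p_i in enumerate(p)]


# Same target (index 0) at three levels of model belief in it.
for logits in ([4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [-4.0, 0.0, 0.0]):
    p_target = softmax(logits)[0]
    loss = -math.log(p_target)
    grad_on_target = ce_grad(logits, 0)[0]
    print(round(p_target, 3), round(loss, 3), round(grad_on_target, 3))
```

As the target probability falls, the loss grows without bound while the gradient on the target logit approaches -1 instead of vanishing, so a confidently wrong model still receives a strong corrective signal.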

Perplexity

Perplexity is a standard evaluation metric for language models. It is the exponential of the average per-token cross-entropy (computed with the natural log):

Perplexity = exp(CrossEntropy)

Intuition: Perplexity ≈ "effective number of choices per token"

- Perplexity = 100 means the model is as confused as if choosing from 100 equally-likely tokens

- Perplexity = 10 means it's as confused as choosing from 10 options

- Lower is better!
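
A sketch of the perplexity calculation, assuming we already have the probability the model assigned to each correct token in a sequence. The function name and sample probabilities are illustrative.

```python
import math


def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    mean_ce = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(mean_ce)


# A model that always spreads probability evenly over 100 tokens.
print(round(perplexity([1 / 100] * 20), 6))  # 100.0

# Mixed confidence: the result is the geometric mean of 1/p over the sequence.
print(round(perplexity([0.5, 0.25, 0.5, 0.25]), 6))  # 2.828427
```

The first case matches the "100 equally-likely tokens" reading exactly: a uniform guess over k options always yields perplexity k.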

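Typical Perplexity Values

These are rough, order-of-magnitude figures. Perplexity depends on the dataset and the tokenizer, so values are only comparable under the same evaluation setup. The random-guessing row assumes a uniform distribution over the 50,000-token vocabulary from the example above.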

Model	Perplexity
Random guessing	~50,000
Simple n-gram model	~500
LSTM	~60
GPT-2	~20
GPT-3	~10
Human-level	~5-8

Exercises

Exercise 1: MSE Calculation

Predictions: [2.5, 3.0, 4.5], True values: [2.0, 3.5, 4.0]. Calculate MSE.

Exercise 2: Cross-Entropy

A model predicts the correct token with probability 0.7. What is the cross-entropy loss? What if it predicted with probability 0.1?

Exercise 3: Perplexity

If a language model has average cross-entropy of 2.3 on a dataset, what is its perplexity?