Lesson 7: Maximum Likelihood

Maximum Likelihood Estimation

Find the parameters that make the observed data most probable:

# θ* = argmax_θ P(data | θ)

# For independent samples:
P(data|θ) = P(x1|θ) * P(x2|θ) * ... * P(xn|θ)
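
As a minimal sketch of the product rule above, assume a Bernoulli model (a coin with P(heads) = θ). That model, the function name `likelihood`, and the sample data are chosen here for illustration and are not part of the lesson.

```python
import math

def likelihood(samples, theta):
    """P(data | theta) for independent Bernoulli samples (1 = heads, 0 = tails)."""
    # Each observation contributes theta if it is heads, (1 - theta) otherwise.
    return math.prod(theta if x == 1 else 1.0 - theta for x in samples)

# Hypothetical data: three heads and one tail.
samples = [1, 1, 0, 1]
l_half = likelihood(samples, 0.5)             # 0.5 ** 4
l_three_quarters = likelihood(samples, 0.75)  # 0.75 ** 3 * 0.25
```

Under this assumed model, θ = 0.75 gives the data a higher likelihood than θ = 0.5, which is the comparison the argmax formalizes.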
      

Log-Likelihood

Products become sums (easier to optimize):

# log P(data|θ) = sum_i log P(xi|θ)

# Maximizing the log-likelihood is the same as
# maximizing the likelihood (log is monotonically increasing)

# Minimizing the negative log-likelihood =
# minimizing the cross-entropy loss!
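
The product-to-sum identity can be checked numerically. The sketch below is illustrative only: the probability values are made up, and the floating-point underflow it shows is an added observation, not something the lesson states.

```python
import math

def likelihood(probs):
    """Product of per-sample probabilities."""
    return math.prod(probs)

def log_likelihood(probs):
    """Sum of per-sample log-probabilities."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-sample probabilities.
small = [0.5, 0.25, 0.8]
# For a short list the two agree: log(product) equals sum(logs), up to rounding.
agree = math.isclose(math.log(likelihood(small)), log_likelihood(small))

# With many samples the raw product underflows to 0.0 in floating point,
# while the sum of logs stays finite.
many = [0.01] * 1000
product_many = likelihood(many)
log_sum_many = log_likelihood(many)
```

Besides being easier to differentiate, the sum form stays usable when the raw product is too small to represent.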
      

In Language Models

        Training objective: Maximize probability of next token given context.

        Loss: Negative log-likelihood (cross-entropy).

        Goal: Make model assign high probability to actual next tokens.
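
A rough sketch of this objective, assuming we already have the probability the model assigned to each actual next token. The two probability lists are hypothetical, and averaging over positions is an assumption made here (a common convention, but not stated in the lesson).

```python
import math

def sequence_nll(token_probs):
    """Average negative log-likelihood over the probabilities the model
    assigned to each actual next token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities assigned to the correct next token at 4 positions.
weak_model = [0.10, 0.20, 0.05, 0.10]
strong_model = [0.60, 0.70, 0.50, 0.80]

weak_loss = sequence_nll(weak_model)
strong_loss = sequence_nll(strong_model)
```

The model that places more probability on the actual tokens ends up with the lower loss.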

✏️ Practice Exercises

        Exercise 1: Coin Flip MLE

        You observe 7 heads and 3 tails in 10 coin flips. What is the maximum likelihood estimate for the probability of heads?

        Hint: Derive by maximizing P(data|θ) = θ^7 * (1-θ)^3

        Solution: θ_MLE = 7/10 = 0.7. Take the derivative of the log-likelihood and set it to zero: d/dθ[7·log(θ) + 3·log(1-θ)] = 7/θ - 3/(1-θ) = 0 → 7(1-θ) = 3θ → 7 = 10θ → θ = 0.7
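
The closed-form answer can be sanity-checked with a brute-force search. This is only a sketch: the grid resolution of 0.01 and the function name are arbitrary choices made here.

```python
import math

def log_likelihood(theta, heads=7, tails=3):
    """Log-likelihood of the coin-flip data: 7*log(theta) + 3*log(1 - theta)."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

# Candidate values 0.01, 0.02, ..., 0.99 (endpoints excluded: log(0) is undefined).
grid = [i / 100 for i in range(1, 100)]
theta_best = max(grid, key=log_likelihood)
```

On this grid the best candidate is θ = 0.7, matching the derivation.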

        Exercise 2: Log-Likelihood Calculation

        A language model assigns the following probabilities to the next token:

        - P("cat") = 0.5 (correct token)

        - P("dog") = 0.3

        - P("bird") = 0.2

        Calculate the negative log-likelihood loss. If the model improves to P("cat") = 0.8, what is the new loss?

        Solution:

        Original: -log(0.5) ≈ 0.693 (natural log)

        Improved: -log(0.8) ≈ 0.223

        Lower loss = better model (higher probability assigned to correct token)
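
The same calculation as a short sketch. The helper `nll` is a hypothetical name, and the way the remaining 0.2 is split between "dog" and "bird" in the improved model is an assumption, since the exercise only specifies P("cat") = 0.8.

```python
import math

def nll(probs, correct_token):
    """Negative log-likelihood (natural log) of the correct token."""
    return -math.log(probs[correct_token])

original = {"cat": 0.5, "dog": 0.3, "bird": 0.2}
# Assumed split of the remaining probability mass (not given in the exercise).
improved = {"cat": 0.8, "dog": 0.12, "bird": 0.08}

original_loss = nll(original, "cat")  # about 0.693
improved_loss = nll(improved, "cat")  # about 0.223
```

Only the probability of the correct token enters the loss, so the assumed split does not affect the result.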

📝 Key Takeaways

        MLE Principle: Find parameters θ that maximize P(data|θ) — the probability of observing the given data.
Log-Trick: Convert products to sums using log-likelihood for easier optimization. Maximizing log-likelihood is equivalent to maximizing likelihood.
Connection to Loss: For categorical targets such as next tokens, negative log-likelihood is the cross-entropy loss; minimizing it is what LLMs do during training.
Intuition: MLE finds the "most plausible" explanation for the data under your model family.
LLM Training: The model learns to assign high probability to actual next tokens in the training text, one token at a time.

      

🧠 Quick Quiz

        Question 1: Why do we use log-likelihood instead of likelihood for optimization?

        A) Because log-likelihood gives different optimal parameters

        B) Because sums are easier to differentiate than products

        C) Because likelihood can never exceed 1.0

        D) Because log-likelihood requires less data

        Answer: B) Because sums are easier to differentiate than products. The log function converts products into sums, making gradient computation simpler while preserving the location of the maximum (log is monotonically increasing).

        Question 2: You observe a language model assigning P("hello") = 0.1 to the correct next token. What is the negative log-likelihood loss?

        A) 0.1

        B) 1.0

        C) 2.303

        D) 10.0

        Answer: C) 2.303. The negative log-likelihood is -ln(0.1) = ln(10) ≈ 2.303. Lower probabilities result in higher loss values.

        Question 3: In LLM training, what are we maximizing with MLE?

        A) The probability of generating any valid sentence

        B) The probability of the training data under the model

        C) The model's parameter count

        D) The diversity of generated outputs

        Answer: B) The probability of the training data under the model. MLE finds parameters that make the observed training sequences most probable, which means assigning high probability to actual next tokens in the training text.

        Question 4: If a fair coin (P(heads)=0.5) generates 20 heads in 20 flips, what is the likelihood of this data?

        A) 0.5

        B) 0.5^20 ≈ 9.5×10^-7

        C) 20×0.5 = 10

        D) 0.5/20 = 0.025

        Answer: B) 0.5^20 ≈ 9.5×10^-7. For independent events, the likelihood is the product of the individual probabilities: (0.5)^20. Products of many probabilities shrink toward zero this quickly, which is one reason to work with log-likelihood instead.

        Question 5: What is the relationship between negative log-likelihood and cross-entropy?

        A) They are completely different concepts

        B) Negative log-likelihood equals cross-entropy for categorical distributions

        C) Cross-entropy is always larger

        D) They have opposite signs

        Answer: B) Negative log-likelihood equals cross-entropy for categorical distributions. In classification and language modeling, minimizing negative log-likelihood is exactly the same as minimizing cross-entropy loss.

💡 Practical Examples

        Example 1: Estimating Email Open Rates

        A marketer sends 1000 emails and 120 are opened. Using MLE, what's the estimated open rate?

        Setup: Let θ = probability of opening an email. We observe 120 "successes" (opens) and 880 "failures" (unopened).

        Likelihood: P(data|θ) = θ^120 × (1-θ)^880

        Log-likelihood: log P = 120·log(θ) + 880·log(1-θ)

        Solution: Taking the derivative of the log-likelihood and setting it to zero gives 120/θ = 880/(1-θ) → 120(1-θ) = 880θ → 120 = 1000θ → θ_MLE = 0.12

        Result: The maximum likelihood estimate is a 12% open rate.
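
One way to double-check that 0.12 is a maximum is to look at the sign of the slope of the log-likelihood on either side of it. This is a sketch only; the function name `score` and the probe points 0.10 and 0.14 are arbitrary choices made here.

```python
def score(theta, opens=120, unopened=880):
    """Derivative of the log-likelihood: opens/theta - unopened/(1 - theta)."""
    return opens / theta - unopened / (1.0 - theta)

theta_mle = 120 / (120 + 880)  # closed-form estimate: successes / trials

# The slope is positive below the estimate, about zero at it, and negative
# above it, which is what a maximum looks like.
slope_below = score(0.10)
slope_at = score(theta_mle)
slope_above = score(0.14)
```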

        Example 2: Gaussian Mean Estimation

        Given temperature readings [72, 75, 71, 73, 74]°F, what's the MLE for the mean assuming Gaussian noise?

        Setup: For a Gaussian distribution N(μ, σ²) with the variance σ² treated as fixed, the likelihood of observing data {x₁, x₂, ..., xₙ} depends only on μ.

        Log-likelihood: log P(data|μ) = -n/2·log(2πσ²) - 1/(2σ²)·Σ(xᵢ - μ)²

        Solution: The first term does not depend on μ, so maximizing the log-likelihood means minimizing Σ(xᵢ - μ)². Taking the derivative and setting it to zero: d/dμ[Σ(xᵢ - μ)²] = -2Σ(xᵢ - μ) = 0

        This gives: Σxᵢ = nμ → μ_MLE = (72+75+71+73+74)/5 = 73°F

        Key Insight: For Gaussian data, the MLE for the mean is simply the sample average!
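
A small numerical check of this result, using the readings from the example. The helper name and the offsets used to probe nearby values of μ are arbitrary choices made here.

```python
import statistics

readings = [72, 75, 71, 73, 74]  # temperature readings in °F from the example

def sum_squared_error(mu, xs=readings):
    """Sum of squared deviations, the term the MLE minimizes."""
    return sum((x - mu) ** 2 for x in xs)

mu_mle = statistics.mean(readings)  # sample average

sse_at_mle = sum_squared_error(mu_mle)
# Moving mu away from the sample average in either direction increases the error.
sse_nearby = [sum_squared_error(mu_mle + d) for d in (-1.0, -0.5, 0.5, 1.0)]
```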

        Example 3: LLM Next-Token Prediction

        Consider the sentence fragment "The cat sat on the..." and a tiny vocabulary: {mat, hat, dog, car}.

        Training Data: The complete sentence is "The cat sat on the mat"

        Model Output (before training):

        P(mat) = 0.25, P(hat) = 0.25, P(dog) = 0.25, P(car) = 0.25

        Negative log-likelihood: -log(0.25) = 1.386

        Model Output (after MLE training):

        P(mat) = 0.70, P(hat) = 0.15, P(dog) = 0.10, P(car) = 0.05

        Negative log-likelihood: -log(0.70) = 0.357

        Result: After training, the model assigns much higher probability to "mat" (the correct token), reducing the loss from 1.386 to 0.357. This is exactly how LLMs learn — by adjusting parameters to maximize the likelihood of actual training sequences!
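
To make "adjusting parameters" concrete, here is a toy sketch rather than a real LLM. It assumes the only parameters are four logits, one per vocabulary word, turned into probabilities with a softmax and updated by plain gradient descent. The learning rate and step count are arbitrary, and the resulting probabilities will not match the illustrative 0.70 / 0.15 / 0.10 / 0.05 figures above.

```python
import math

vocab = ["mat", "hat", "dog", "car"]
target = vocab.index("mat")

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 0.0, 0.0, 0.0]  # uniform start: every token has probability 0.25
learning_rate = 0.5            # hypothetical step size
losses = []

for _ in range(10):
    probs = softmax(logits)
    losses.append(-math.log(probs[target]))
    # Gradient of the NLL with respect to the logits is (probs - one_hot(target)).
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits = [z - learning_rate * g for z, g in zip(logits, grads)]

final_probs = softmax(logits)
```

The first recorded loss is -log(0.25), and each step moves probability toward "mat", so the loss falls from there.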