Maximum Likelihood Estimation
Find parameters that make observed data most probable:
# θ* = argmax_θ P(data | θ)
# For independent samples:
# P(data|θ) = P(x1|θ) * P(x2|θ) * ... * P(xn|θ)
Log-Likelihood
Products become sums (easier to optimize):
# log P(data|θ) = sum_i log P(xi|θ)
# Maximizing the log-likelihood is the same as
# maximizing the likelihood (log is monotonically increasing)
# Minimizing the negative log-likelihood =
# minimizing the cross-entropy loss
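The product form and the sum-of-logs form can be checked against each other numerically. The sketch below is only an illustration: the list `probs` holds made-up per-sample probabilities for some fixed θ and does not come from any dataset.

```python
import math

# Hypothetical per-sample probabilities P(x_i | θ) for one fixed θ (illustrative values only).
probs = [0.9, 0.6, 0.8, 0.7]

likelihood = math.prod(probs)                      # P(x1|θ) * P(x2|θ) * ... * P(xn|θ)
log_likelihood = sum(math.log(p) for p in probs)   # sum_i log P(xi|θ)

# The log of the product matches the sum of the logs (up to floating-point error).
assert math.isclose(math.log(likelihood), log_likelihood)

negative_log_likelihood = -log_likelihood          # the quantity minimized as a loss
print(likelihood, log_likelihood, negative_log_likelihood)
```

Because every probability is at most 1, each log term is at most 0, so the negative log-likelihood is never negative.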
In Language Models
Training objective: Maximize the probability of each next token given its context.
Loss: Negative log-likelihood (cross-entropy).
Goal: Make the model assign high probability to the actual next tokens.
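For a whole sequence, the loss is the sum (or average) of the per-token negative log-likelihoods. A minimal sketch, assuming we already have the probability the model gave to each actual next token; the function name `sequence_nll` and the values in `token_probs` are hypothetical:

```python
import math

def sequence_nll(token_probs):
    """Sum of -log p over the probabilities the model gave to each actual next token."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities assigned to the correct next token at each position.
token_probs = [0.5, 0.8, 0.1]

total_loss = sequence_nll(token_probs)
mean_loss = total_loss / len(token_probs)  # per-token average, a common way to report the loss
print(total_loss, mean_loss)
```

The position with probability 0.1 contributes most of the loss, which is why training pushes hardest on tokens the model currently finds unlikely.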
✏️ Practice Exercises
Exercise 1: Coin Flip MLE
You observe 7 heads and 3 tails in 10 coin flips. What is the maximum likelihood estimate for the probability of heads?
Hint: Derive by maximizing P(data|θ) = θ^7 * (1-θ)^3
Solution: θ_MLE = 7/10 = 0.7. Set the derivative of the log-likelihood to zero: d/dθ[7·log(θ) + 3·log(1-θ)] = 7/θ - 3/(1-θ) = 0 → 7(1-θ) = 3θ → 7 = 10θ → θ = 0.7
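The calculus result can be sanity-checked with a brute-force search. The sketch below evaluates the log-likelihood on a grid of candidate θ values and keeps the best one; the grid resolution of 0.001 is an arbitrary choice for illustration.

```python
import math

def log_likelihood(theta, heads=7, tails=3):
    """Log-likelihood of the observed flips: 7*log(θ) + 3*log(1-θ)."""
    return heads * math.log(theta) + tails * math.log(1 - theta)

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded because log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)
print(theta_mle)
```

A grid search like this only finds the best candidate on the grid; it agrees with the derivative-based answer here because 0.7 happens to be one of the grid points.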
Exercise 2: Log-Likelihood Calculation
A language model assigns the following probabilities to the next token:
- P("cat") = 0.5 (correct token)
- P("dog") = 0.3
- P("bird") = 0.2
Calculate the negative log-likelihood loss. If the model improves to P("cat") = 0.8, what is the new loss?
Solution:
Original: -ln(0.5) ≈ 0.693 (natural log)
Improved: -ln(0.8) ≈ 0.223
Lower loss = better model (higher probability assigned to correct token)
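The same numbers can be reproduced in a few lines. This is a sketch, not a library API: `nll` is a hypothetical helper, and the exercise only fixes P("cat") = 0.8 for the improved model, so the split of the remaining 0.2 between "dog" and "bird" is an assumption.

```python
import math

def nll(predicted, correct_token):
    """Negative log-likelihood (natural log) of the correct token under the predicted distribution."""
    return -math.log(predicted[correct_token])

original = {"cat": 0.5, "dog": 0.3, "bird": 0.2}
# Only P("cat") = 0.8 is given; how the other 0.2 is split is an assumption.
improved = {"cat": 0.8, "dog": 0.12, "bird": 0.08}

print(round(nll(original, "cat"), 3))  # 0.693
print(round(nll(improved, "cat"), 3))  # 0.223
```

The loss depends only on the probability of the correct token, so the assumed split over the wrong tokens does not change the result.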
📝 Key Takeaways
- MLE Principle: Find parameters θ that maximize P(data|θ) — the probability of observing the given data.
- Log-Trick: Convert products to sums using log-likelihood for easier optimization. Maximizing log-likelihood is equivalent to maximizing likelihood.
- Connection to Loss: The negative log-likelihood of the correct tokens is the cross-entropy loss; minimizing it is exactly what LLMs do during training.
- Intuition: MLE finds the "most plausible" explanation for the data under your model family.
- LLM Training: The model learns to assign high probability to actual next tokens in the training text, one token at a time.
🧠 Quick Quiz
Question 1: Why do we use log-likelihood instead of likelihood for optimization?
A) Because log-likelihood gives different optimal parameters
B) Because sums are easier to differentiate than products
C) Because likelihood can never exceed 1.0
D) Because log-likelihood requires less data
Answer: B) Because sums are easier to differentiate than products. The log function converts products into sums, making gradient computation simpler while preserving the location of the maximum (log is monotonically increasing).
Question 2: You observe a language model assigning P("hello") = 0.1 to the correct next token. What is the negative log-likelihood loss?
A) 0.1
B) 1.0
C) 2.303
D) 10.0
Answer: C) 2.303. The negative log-likelihood is -ln(0.1) = ln(10) ≈ 2.303. Lower probabilities result in higher loss values.
Question 3: In LLM training, what are we maximizing with MLE?
A) The probability of generating any valid sentence
B) The probability of the training data under the model
C) The model's parameter count
D) The diversity of generated outputs
Answer: B) The probability of the training data under the model. MLE finds parameters that make the observed training sequences most probable, which means assigning high probability to actual next tokens in the training text.
Question 4: If a fair coin (P(heads)=0.5) generates 20 heads in 20 flips, what is the likelihood of this data?
A) 0.5
B) 0.5^20 ≈ 9.5×10^-7
C) 20×0.5 = 10
D) 0.5/20 = 0.025
Answer: B) 0.5^20 ≈ 9.5×10^-7. For independent events, likelihood is the product: (0.5)^20. This extremely small number demonstrates why we use log-likelihood!
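Beyond easier differentiation, small likelihoods also cause a practical floating-point problem. The sketch below compares 20 flips with a much longer run; the count of 2000 flips is an arbitrary illustrative choice, not part of the question.

```python
import math

p = 0.5
likelihood_20 = p ** 20                 # about 9.5e-7, still representable
log_likelihood_20 = 20 * math.log(p)    # about -13.86

# With many more flips (2000 is an arbitrary illustrative count), the product
# underflows to 0.0 in double precision, while the log-likelihood stays finite.
likelihood_2000 = math.prod([p] * 2000)
log_likelihood_2000 = 2000 * math.log(p)

print(likelihood_20, log_likelihood_20)
print(likelihood_2000, log_likelihood_2000)
```

Once the product has underflowed to zero, no optimizer can distinguish between parameter values, whereas the sum of logs remains a usable number.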
Question 5: What is the relationship between negative log-likelihood and cross-entropy?
A) They are completely different concepts
B) Negative log-likelihood equals cross-entropy for categorical distributions
C) Cross-entropy is always larger
D) They have opposite signs
Answer: B) Negative log-likelihood equals cross-entropy for categorical distributions. In classification and language modeling, minimizing negative log-likelihood is exactly the same as minimizing cross-entropy loss.
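One way to see the equality is to compute cross-entropy against a one-hot target, where all the weight sits on the correct class. A minimal sketch, with a hypothetical `cross_entropy` helper and an illustrative three-class prediction:

```python
import math

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i), skipping zero-weight terms."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)

# Illustrative 3-class prediction; the correct class is index 0.
predicted = [0.5, 0.3, 0.2]
one_hot_target = [1.0, 0.0, 0.0]

ce = cross_entropy(one_hot_target, predicted)
nll = -math.log(predicted[0])
print(ce, nll)
```

With a one-hot target, every term except the correct class drops out of the sum, leaving exactly the negative log of the probability assigned to the correct class.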
💡 Practical Examples
Example 1: Estimating Email Open Rates
A marketer sends 1000 emails and 120 are opened. Using MLE, what's the estimated open rate?
Setup: Let θ = probability of opening an email. We observe 120 "successes" (opens) and 880 "failures" (unopened).
Likelihood: P(data|θ) = θ^120 × (1-θ)^880
Log-likelihood: log P = 120·log(θ) + 880·log(1-θ)
Solution: Taking the derivative and setting it to zero: 120/θ = 880/(1-θ) → 120(1-θ) = 880θ → 120 = 1000θ → θ_MLE = 0.12
Result: The maximum likelihood estimate is a 12% open rate.
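The derivative condition can be checked directly. The sketch below evaluates the derivative of the log-likelihood (named `score` here, a label chosen for this example) at the estimate and at two nearby values picked arbitrarily for comparison.

```python
def score(theta, opens=120, unopened=880):
    """Derivative of the log-likelihood: 120/θ - 880/(1-θ)."""
    return opens / theta - unopened / (1 - theta)

theta_mle = 120 / (120 + 880)   # closed form: successes / trials

print(theta_mle)
# Positive below the estimate, roughly zero at it, negative above it.
print(score(0.10), score(theta_mle), score(0.14))
```

The sign change from positive to negative confirms that θ = 0.12 is a maximum of the log-likelihood rather than a minimum.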
Example 2: Gaussian Mean Estimation
Given temperature readings [72, 75, 71, 73, 74]°F, what's the MLE for the mean assuming Gaussian noise?
Setup: For Gaussian distribution N(μ, σ²), the likelihood of observing data {x₁, x₂, ..., xₙ} depends on μ.
Log-likelihood: log P(data|μ) = -n/2·log(2πσ²) - 1/(2σ²)·Σ(xᵢ - μ)²
Solution: The first term does not depend on μ, so maximizing the log-likelihood over μ means minimizing Σ(xᵢ - μ)². Setting the derivative to zero: d/dμ[Σ(xᵢ - μ)²] = -2Σ(xᵢ - μ) = 0
This gives: Σxᵢ = nμ → μ_MLE = (72+75+71+73+74)/5 = 73°F
Key Insight: For Gaussian data, the MLE for the mean is simply the sample average!
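A short numerical check of this result, using the readings from the example. The helper name `sum_squared_error` and the comparison points 72.5 and 73.5 are choices made for this sketch.

```python
from statistics import mean

readings = [72, 75, 71, 73, 74]  # temperature readings in °F from the example

def sum_squared_error(mu):
    """Σ(xᵢ - μ)², the quantity the MLE for μ minimizes."""
    return sum((x - mu) ** 2 for x in readings)

mu_mle = mean(readings)
print(mu_mle)
# The sample mean gives a smaller squared error than nearby candidates.
print(sum_squared_error(mu_mle), sum_squared_error(72.5), sum_squared_error(73.5))
```

Note that σ never appears in the code: the value of μ that maximizes the likelihood is the same for any fixed noise variance.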
Example 3: LLM Next-Token Prediction
Consider the sentence fragment "The cat sat on the..." and a tiny vocabulary: {mat, hat, dog, car}.
Training Data: The complete sentence is "The cat sat on the mat"
Model Output (before training):
P(mat) = 0.25, P(hat) = 0.25, P(dog) = 0.25, P(car) = 0.25
Negative log-likelihood: -log(0.25) = 1.386
Model Output (after MLE training):
P(mat) = 0.70, P(hat) = 0.15, P(dog) = 0.10, P(car) = 0.05
Negative log-likelihood: -log(0.70) = 0.357
Result: After training, the model assigns much higher probability to "mat" (the correct token), reducing the loss from 1.386 to 0.357. This is exactly how LLMs learn — by adjusting parameters to maximize the likelihood of actual training sequences!
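To make "adjusting parameters" concrete, here is a toy sketch of gradient descent on the negative log-likelihood for this single example. It is a deliberate simplification: the four logits themselves are treated as the only parameters, and the learning rate and step count are hypothetical values. A real LLM computes its logits from the context with a neural network.

```python
import math

vocab = ["mat", "hat", "dog", "car"]
target = vocab.index("mat")

def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 0.0, 0.0, 0.0]     # equal logits give the uniform 0.25 starting distribution
learning_rate = 0.5               # hypothetical value
losses = []

for step in range(20):            # hypothetical number of steps
    probs = softmax(logits)
    losses.append(-math.log(probs[target]))
    # Gradient of the NLL with respect to each logit is (probability - one_hot).
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits = [z - learning_rate * g for z, g in zip(logits, grads)]

final_probs = softmax(logits)
print(round(losses[0], 3), round(losses[-1], 3))
print({tok: round(p, 2) for tok, p in zip(vocab, final_probs)})
```

The first recorded loss is the 1.386 from the uniform distribution, and each step raises the logit for "mat" while lowering the others. The exact final probabilities depend on the assumed learning rate and step count, so they will not match the 0.70 / 0.15 / 0.10 / 0.05 split in the example.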