Lesson 6: Probability Distributions

Probability Basics

P(X) = probability of event X occurring

# Discrete: P(X = x_i) = p_i
# Coin flip: P(Heads) = 0.5

# Continuous: described by probability density
# P(a ≤ X ≤ b) = integral from a to b of p(x)dx
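As a rough sketch of the continuous case, the snippet below approximates P(a ≤ X ≤ b) for a standard normal variable by numerically integrating its density, then compares the result with the closed form based on the error function. The helper names (normal_pdf, interval_probability, normal_cdf) and the choice of the midpoint rule are illustrative assumptions, not part of the lesson.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_probability(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Closed-form CDF of the normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

approx = interval_probability(normal_pdf, -1.0, 1.0)
exact = normal_cdf(1.0) - normal_cdf(-1.0)
print(f"midpoint rule: {approx:.4f}, closed form: {exact:.4f}")
```

Both numbers should come out near 0.6827, the probability that a standard normal value lies within one standard deviation of the mean.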
      

Common Distributions

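Uniform: All outcomes in a range equally likely
Normal (Gaussian): Bell curve set by a mean μ and standard deviation σ; common for measurement noise and natural variation
Bernoulli: Binary outcome (0 or 1) with P(X=1) = p
Categorical: One of K discrete outcomes, with probabilities that sum to 1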
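To make the four distributions concrete, here is a minimal sampling sketch using only Python's random module. The parameters (p = 0.3, the class weights, the seed, and the sample size) are arbitrary values chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

uniform_samples = [rng.uniform(0.0, 1.0) for _ in range(n)]             # Uniform on [0, 1]
normal_samples = [rng.gauss(0.0, 1.0) for _ in range(n)]                # Normal, mean 0, std 1
bernoulli_samples = [1 if rng.random() < 0.3 else 0 for _ in range(n)]  # Bernoulli, p = 0.3
categorical_samples = rng.choices(                                      # Categorical, 3 classes
    ["cat", "dog", "bird"], weights=[0.6, 0.3, 0.1], k=n
)

def mean(values):
    return sum(values) / len(values)

print("uniform mean (expect about 0.5):", round(mean(uniform_samples), 3))
print("normal mean (expect about 0.0):", round(mean(normal_samples), 3))
print("bernoulli mean (expect about 0.3):", round(mean(bernoulli_samples), 3))
print("fraction 'cat' (expect about 0.6):", round(categorical_samples.count("cat") / n, 3))
```

With this many samples, each empirical average should sit close to the parameter that generated it.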

In Neural Networks

# Output layer: softmax gives probabilities
P(y=i|x) = exp(z_i) / sum_j(exp(z_j))

# Cross-entropy loss: -log P(y_true|x)
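The two formulas above can be sketched in plain Python as follows. Subtracting the largest logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the function names are illustrative and do not mirror any particular library.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_index):
    """Negative log-probability that softmax assigns to the true class."""
    return -math.log(softmax(logits)[true_index])

probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([2.0, 1.0, 0.1], true_index=0)
print([round(p, 3) for p in probs], round(loss, 3))  # [0.659, 0.242, 0.099] 0.417

# Very large logits would overflow a naive exp(); the shifted version is fine.
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```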
      

Knowledge Check

Q1: What is the key difference between discrete and continuous probability distributions?

A: Discrete distributions have countable outcomes with specific probabilities (e.g., coin flips), while continuous distributions have uncountable outcomes described by probability density functions (e.g., height measurements).

Q2: In a neural network's output layer, why is softmax used instead of raw logits?

A: Softmax converts logits into a valid probability distribution where all values are between 0 and 1 and sum to 1, making them interpretable as class probabilities.

Q3: What loss function is typically paired with softmax outputs?

A: Cross-entropy loss, which measures the difference between predicted probabilities and true labels by computing -log(P(y_true|x)).

Q4: When would you use a Bernoulli distribution versus a Categorical distribution?

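A: Use Bernoulli for a single binary outcome (2 classes, one probability p). Use Categorical for one outcome among K mutually exclusive classes, typically K ≥ 3; with K = 2 it reduces to Bernoulli.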
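The link between the two can be checked numerically: a two-class softmax over the logits [z, 0] gives the same probability as a sigmoid applied to z. This is a small sketch with an arbitrary logit value; the helper functions are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.7  # arbitrary logit chosen for illustration
p_bernoulli = sigmoid(z)              # Bernoulli parameter from one logit
p_categorical = softmax([z, 0.0])[0]  # class-0 probability from a 2-class Categorical
print(p_bernoulli, p_categorical)     # the two values agree
```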

Practical Examples

Example 1: Normal Distribution in Feature Scaling

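When preprocessing data for neural networks, we often standardize each feature to have mean 0 and standard deviation 1. This rescales the data but does not change the shape of its distribution, so the result is standard normal only if the feature was normally distributed to begin with: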

import numpy as np

# Raw feature: heights in cm
heights = np.array([160, 175, 168, 182, 170, 165, 178])

# Standardize to mean 0, std 1
mean = np.mean(heights)
std = np.std(heights)  # population std (ddof=0)
standardized = (heights - mean) / std

# Result: values centered around 0, all within [-2, 2] here
print(standardized)  # approx. [-1.57, 0.54, -0.44, 1.53, -0.16, -0.87, 0.97]
        

Putting features on a comparable scale tends to make gradient-based training better conditioned and keeps any single feature from dominating just because of its units.
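One practical detail worth sketching: the mean and standard deviation are computed once on the training data and then reused for any new data, so that all inputs are on the same scale. The snippet below uses the heights from Example 1 with the standard library's statistics module; the two "new" heights are hypothetical values added for illustration.

```python
import statistics

train_heights = [160, 175, 168, 182, 170, 165, 178]  # heights from Example 1
mu = statistics.fmean(train_heights)
sigma = statistics.pstdev(train_heights)  # population std, like np.std's default

def standardize(values, mu, sigma):
    """Apply (x - mu) / sigma with statistics computed elsewhere."""
    return [(v - mu) / sigma for v in values]

train_z = standardize(train_heights, mu, sigma)
new_z = standardize([172, 190], mu, sigma)  # hypothetical unseen heights

print("train mean:", round(abs(statistics.fmean(train_z)), 6))
print("train std:", round(statistics.pstdev(train_z), 6))
print("new data, same scale:", [round(z, 2) for z in new_z])
```

The standardized training values have mean 0 and standard deviation 1 by construction, while the new values generally will not.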

Example 2: Bernoulli Distribution for Binary Classification

A spam classifier outputs a single probability P(spam|email). This is a Bernoulli distribution:

# Model output: probability email is spam
p_spam = 0.85  # 85% confident it's spam

# Bernoulli: P(X=1) = p, P(X=0) = 1-p
# Expected value: E[X] = p = 0.85
# Variance: Var(X) = p(1-p) = 0.1275

# Decision threshold at 0.5
prediction = 1 if p_spam > 0.5 else 0  # Predict: spam (1)
        

The sigmoid activation in the final layer maps any real-valued logit into the open interval (0, 1), so the output is always a valid Bernoulli parameter p.
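A minimal sketch of how the sigmoid output connects to a loss: binary cross-entropy is the negative log-likelihood of the true label under Bernoulli(p). The function names are illustrative, and the logit is simply the value whose sigmoid equals the 0.85 used above.

```python
import math

def sigmoid(z):
    """Squash a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(p, y):
    """Negative log-likelihood of label y (0 or 1) under Bernoulli(p)."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

logit = math.log(0.85 / 0.15)  # the logit whose sigmoid is 0.85
p_spam = sigmoid(logit)

loss_if_spam = binary_cross_entropy(p_spam, 1)      # equals -log(0.85)
loss_if_not_spam = binary_cross_entropy(p_spam, 0)  # equals -log(0.15)
print(round(p_spam, 2), round(loss_if_spam, 3), round(loss_if_not_spam, 3))
```

The loss is small (about 0.16) when the email really is spam and much larger (about 1.90) when it is not.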

Example 3: Categorical Distribution for Multi-Class

An image classifier predicting among 3 classes (cat, dog, bird):

import torch
import torch.nn.functional as F

# Raw logits from neural network
logits = torch.tensor([2.0, 1.0, 0.1])

# Softmax converts to probability distribution
probs = F.softmax(logits, dim=0)
# Result: [0.659, 0.242, 0.099] — sums to 1.0

# Predicted class
predicted = torch.argmax(probs)  # Class 0 (cat)

# Cross-entropy loss (true label = cat, index 0)
true_label = torch.tensor([0])
loss = F.cross_entropy(logits.unsqueeze(0), true_label)
# loss = -log(0.659) = 0.417
        

Softmax ensures a valid Categorical distribution; cross-entropy penalizes confident wrong predictions heavily.
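To see how sharply the penalty grows, this short sketch evaluates -log P(y_true|x) for a few probabilities the model might assign to the true class. The probability values are arbitrary illustration points.

```python
import math

# Probability assigned to the true class, from confident-and-right to confident-and-wrong
p_true_values = [0.9, 0.5, 0.1, 0.01]
losses = {p: -math.log(p) for p in p_true_values}

for p, loss in losses.items():
    print(f"P(y_true|x) = {p:<5} loss = {loss:.3f}")
```

The loss goes from about 0.105 at p = 0.9 to about 4.605 at p = 0.01, and it grows without bound as the probability of the true class approaches 0.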

Additional Quiz Questions

Q5: What is the expected value of a Bernoulli distribution with parameter p = 0.7?

A: E[X] = p = 0.7. For a Bernoulli distribution, the expected value is simply the probability parameter p.

Q6: In a normal distribution, what percentage of values fall within ±2 standard deviations of the mean?

A: Approximately 95% (more precisely, about 95.45%). This is part of the empirical rule, also called the 68-95-99.7 rule.
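The three percentages in the rule can be reproduced from the error function, since P(|Z| ≤ k) = erf(k / √2) for a standard normal Z. A small sketch, with an illustrative helper name:

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z, using the error function."""
    return math.erf(k / math.sqrt(2.0))

coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within ±{k}σ: {p:.4f}")
```

This prints roughly 0.6827, 0.9545, and 0.9973.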

Q7: Why might you use a uniform distribution for initializing neural network weights?

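A: Random initialization breaks the symmetry between neurons so they can learn different features, and a uniform distribution does this while keeping every weight inside a bounded range with no preference for any value in that range. In practice the range is usually scaled to the layer size rather than fixed.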
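As one concrete example of a scaled uniform range, the sketch below uses Glorot (Xavier) uniform initialization, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). This scheme is not covered in the lesson; it is brought in here only as an illustration, and the layer sizes are arbitrary.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(0)  # fixed seed so the run is repeatable
fan_in, fan_out = 256, 128  # hypothetical layer sizes
weights = glorot_uniform(fan_in, fan_out, rng)

limit = math.sqrt(6.0 / (fan_in + fan_out))
flat = [w for row in weights for w in row]
print("limit:", limit)
print("min, max:", round(min(flat), 4), round(max(flat), 4))
print("mean:", round(sum(flat) / len(flat), 4))
```

For these sizes the limit is 0.125, every weight falls inside that range, and the mean is close to 0.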

Q8: What problem occurs if all logits in a softmax layer are equal?

A: The output probabilities become uniform (each equal to 1/K for K classes), so the model expresses no preference for any class on that input. If this happens for every input, the model has not learned to distinguish the classes.
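A quick check of this case: with K equal logits, softmax returns 1/K for every class, and the cross-entropy loss equals log(K) whatever the true label is. The value K = 4 and the logit 3.0 are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

K = 4
probs = softmax([3.0] * K)   # any constant logit gives the same result
loss = -math.log(probs[0])   # cross-entropy for any choice of true class
print(probs, round(loss, 3), round(math.log(K), 3))
```

A training loss that stays near log(K) is therefore a common sign that a classifier is doing no better than guessing uniformly.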

Key Takeaways

📊 Discrete vs Continuous: Discrete distributions (Bernoulli, Categorical) assign probabilities to countable outcomes, while continuous distributions (Normal, continuous Uniform) assign probabilities to ranges of real values through a probability density function.

🎯 Softmax = Probability Distribution: Neural networks use softmax to convert logits into valid probability distributions where outputs sum to 1, enabling probabilistic interpretations of predictions.

🔗 Cross-Entropy is the Standard: Cross-entropy loss (-log P(y_true|x)) is the natural pairing with softmax outputs, penalizing confident wrong predictions more heavily than uncertain ones.

⚖️ Standardization Matters: Rescaling features to mean 0 and standard deviation 1 puts them on a comparable scale, which tends to improve gradient flow and training stability in neural networks.

🎲 Choose the Right Distribution: Bernoulli for binary classification, Categorical for multi-class, Normal for continuous data modeling, and Uniform for bounded random weight initialization.

Additional Quiz Questions

Q9: What is the variance of a Bernoulli distribution with parameter p = 0.3?

A: Var(X) = p(1-p) = 0.3 × 0.7 = 0.21. The variance measures the spread of the distribution, reaching maximum at p = 0.5.
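The claim about the maximum can be checked by evaluating p(1-p) on a grid of values. A small sketch; the grid spacing of 0.1 is an arbitrary choice.

```python
def bernoulli_variance(p):
    """Variance of a Bernoulli(p) variable."""
    return p * (1.0 - p)

grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
variances = {p: bernoulli_variance(p) for p in grid}
best_p = max(variances, key=variances.get)

print("Var at p = 0.3:", round(bernoulli_variance(0.3), 4))
print("largest variance at p =", best_p, "with value", variances[best_p])
```

The variance is 0 at p = 0 and p = 1, where the outcome is certain, and peaks at 0.25 when p = 0.5.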

Q10: In a standard normal distribution N(0,1), what is the probability of observing a value greater than 2?

A: Approximately 2.28% (0.0228). About 95.45% of values fall within ±2σ, so the remaining 4.55% is split equally between the two tails, leaving roughly 2.28% above +2σ.
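This tail probability can be verified with the complementary error function, using P(Z > x) = 0.5 · erfc(x / √2) for a standard normal Z. The helper name is illustrative.

```python
import math

def upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

tail = upper_tail(2.0)
within = 1.0 - 2.0 * tail  # both tails removed leaves the central mass
print(f"P(Z > 2) = {tail:.4f}, P(|Z| <= 2) = {within:.4f}")
```

The output is about 0.0228 for the tail and 0.9545 for the central region.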

Q11: Why is the softmax function called "soft" max rather than hard max?

A: A hard max (the one-hot vector that argmax selects) puts all the weight on the largest logit and ignores the rest. Softmax instead produces a probability distribution in which larger logits get higher probabilities but every logit contributes, which makes it differentiable and suitable for gradient-based learning.
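The contrast can be sketched by comparing a one-hot hard max with softmax on the same logits, then scaling the logits up to show softmax approaching the hard max. The scaling factor of 10 and the helper names are assumptions made for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_argmax(logits):
    """Hard max: 1.0 at the position of the largest logit, 0.0 elsewhere."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]

logits = [2.0, 1.0, 0.1]
hard = one_hot_argmax(logits)
soft = softmax(logits)
sharper = softmax([10.0 * z for z in logits])  # larger gaps between logits

print("hard max:", hard)
print("softmax:", [round(p, 3) for p in soft])
print("softmax, logits x10:", [round(p, 5) for p in sharper])
```

Every class keeps a nonzero probability under softmax, but as the gaps between logits grow, almost all of the mass moves to the largest one.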

Q12: What happens to the entropy of a categorical distribution as one class probability approaches 1?

A: Entropy approaches 0 (minimum). When one outcome is certain, there's no uncertainty to measure. Maximum entropy occurs with uniform distribution where all classes are equally likely.
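A minimal sketch of this behavior, computing Shannon entropy in nats for three 3-class distributions. The specific probability vectors are arbitrary illustrations.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; outcomes with p == 0 contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

uniform = entropy([1 / 3, 1 / 3, 1 / 3])  # maximum for 3 classes, equals log(3)
peaked = entropy([0.98, 0.01, 0.01])      # one class nearly certain
certain = entropy([1.0, 0.0, 0.0])        # one class certain

print(round(uniform, 3), round(peaked, 3), certain)
```

The uniform case gives about 1.099 (log 3), the peaked case about 0.112, and the fully certain case exactly 0.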

Q13: In weight initialization, why is sampling from a uniform distribution with small range (e.g., [-0.01, 0.01]) preferred over a large range?

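A: Small initial weights keep pre-activations near zero, which avoids saturating activation functions such as sigmoid and tanh and lets gradients flow during backpropagation. Large weights can push neurons into saturation, leading to vanishing gradients. The range should not be arbitrarily small either, since very small weights shrink the signal layer by layer in deep networks.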
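Saturation is easy to see in the sigmoid's derivative, σ'(z) = σ(z)(1 - σ(z)), which is largest at z = 0 and shrinks quickly as |z| grows. A short sketch with arbitrary pre-activation values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid at z."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Pre-activations of increasing size, as larger weights would tend to produce
grads = {z: sigmoid_grad(z) for z in (0.0, 2.0, 5.0, 10.0)}
for z, g in grads.items():
    print(f"z = {z:>4}: gradient = {g:.6f}")
```

The gradient is 0.25 at z = 0 but falls below 0.0001 by z = 10, so a neuron driven that far passes almost no gradient back to earlier layers.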