Probability Basics
P(X) = probability of event X occurring
Common Distributions
- Uniform: All outcomes equally likely
- Normal (Gaussian): Bell curve, very common
- Bernoulli: Binary outcomes (0 or 1)
- Categorical: Multiple discrete outcomes
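As a rough illustration of the four distributions listed above, here is a minimal sampling sketch that uses only Python's standard `random` module. The helper names (`sample_uniform`, `sample_normal`, `sample_bernoulli`, `sample_categorical`) are made up for this example and are not part of any library.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_uniform(low=0.0, high=1.0):
    """Continuous uniform: every value in [low, high] is equally likely."""
    return random.uniform(low, high)

def sample_normal(mean=0.0, std=1.0):
    """Normal (Gaussian): bell curve centered at `mean`."""
    return random.gauss(mean, std)

def sample_bernoulli(p):
    """Bernoulli: returns 1 with probability p, otherwise 0."""
    return 1 if random.random() < p else 0

def sample_categorical(probs):
    """Categorical: returns index k with probability probs[k]."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Empirical check: the mean of many Bernoulli(0.7) draws should land near 0.7.
draws = [sample_bernoulli(0.7) for _ in range(10_000)]
bernoulli_sample_mean = sum(draws) / len(draws)
```

The Bernoulli sampler is just the two-outcome case of the Categorical one: `sample_categorical([1 - p, p])` would draw from the same distribution.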
In Neural Networks
Knowledge Check
Q1: What is the key difference between discrete and continuous probability distributions?
A: Discrete distributions have countable outcomes with specific probabilities (e.g., coin flips), while continuous distributions have uncountable outcomes described by probability density functions (e.g., height measurements).
Q2: In a neural network's output layer, why is softmax used instead of raw logits?
A: Softmax converts logits into a valid probability distribution where all values are between 0 and 1 and sum to 1, making them interpretable as class probabilities.
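To make the answer concrete, here is a minimal softmax sketch in plain Python. The logit values are illustrative and do not come from any real model; subtracting the maximum logit is a common stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Map a list of real-valued logits to probabilities that sum to 1."""
    # Shifting by the max leaves the output unchanged but avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # illustrative values only
probs = softmax(logits)     # each entry is in (0, 1); entries sum to 1
```

Raw logits can be negative or arbitrarily large, so they cannot be read as probabilities directly; the softmax output can.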
Q3: What loss function is typically paired with softmax outputs?
A: Cross-entropy loss, which measures the difference between predicted probabilities and true labels by computing -log(P(y_true|x)).
Q4: When would you use a Bernoulli distribution versus a Categorical distribution?
A: Bernoulli for binary outcomes (2 classes), Categorical for multi-class outcomes (3+ classes). Bernoulli is the two-outcome special case of Categorical.
Practical Examples
Example 1: Normal Distribution in Feature Scaling
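When preprocessing data for neural networks, we often standardize each feature to have mean=0 and std=1 by subtracting its mean and dividing by its standard deviation. This rescales the feature but does not change the shape of its distribution, so the result is standard normal only if the feature was normally distributed to begin with.
Standardizing helps gradients flow better during training and makes the network less sensitive to feature scales.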
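A minimal sketch of standardization (z-scoring) for a single feature column, using only the standard library. The `heights_cm` values are made up, and the population standard deviation is an assumption here; some pipelines use the sample standard deviation instead.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / std, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mean) / std for x in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up feature column
z = standardize(heights_cm)  # now has mean 0 and std 1
```

In practice the mean and standard deviation are computed on the training set and reused unchanged on validation and test data.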
Example 2: Bernoulli Distribution for Binary Classification
A spam classifier outputs a single probability p = P(spam|email). This defines a Bernoulli distribution over the label: P(spam) = p and P(not spam) = 1 - p.
The sigmoid activation in the final layer ensures the output is a valid Bernoulli parameter (0 < p < 1).
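The spam example can be sketched as follows. The logit value is hypothetical, and the function names are chosen for this illustration only; the sigmoid here is the plain textbook form and is not hardened against overflow for very negative inputs.

```python
import math

def sigmoid(z):
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_pmf(y, p):
    """P(Y = y) for y in {0, 1} under Bernoulli(p)."""
    return p if y == 1 else 1.0 - p

def binary_cross_entropy(y, p):
    """Negative log-likelihood of label y under Bernoulli(p)."""
    return -math.log(bernoulli_pmf(y, p))

logit = 1.5                        # hypothetical final-layer output for one email
p_spam = sigmoid(logit)            # P(spam | email)
p_not_spam = bernoulli_pmf(0, p_spam)  # P(not spam | email) = 1 - p_spam
```

Because the logit is positive, `p_spam` is above 0.5, so the loss is smaller if the email really is spam (`y = 1`) than if it is not.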
Example 3: Categorical Distribution for Multi-Class
An image classifier predicting among 3 classes (cat, dog, bird) outputs one logit per class.
Softmax turns these logits into a valid Categorical distribution; cross-entropy penalizes confident wrong predictions heavily.
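A small sketch of the three-class case, assuming the true label is "cat". The logits are invented to contrast a confident correct prediction, an unsure one, and a confident wrong one.

```python
import math

CLASSES = ["cat", "dog", "bird"]

def softmax(logits):
    """Convert logits to a Categorical distribution over CLASSES."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probs[true_index])

true_index = CLASSES.index("cat")

# Invented logits for three scenarios.
loss_confident_right = cross_entropy(softmax([4.0, 0.0, 0.0]), true_index)
loss_unsure = cross_entropy(softmax([0.0, 0.0, 0.0]), true_index)  # equals log(3)
loss_confident_wrong = cross_entropy(softmax([0.0, 4.0, 0.0]), true_index)
```

The three losses are strictly increasing in that order: being unsure costs log(3), while confidently picking "dog" for a cat image costs several times more.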
Additional Quiz Questions
Q5: What is the expected value of a Bernoulli distribution with parameter p = 0.7?
A: E[X] = p = 0.7. For a Bernoulli distribution, the expected value is simply the probability parameter p.
Q6: In a normal distribution, what percentage of values fall within ±2 standard deviations of the mean?
A: Approximately 95% (95.4% to be precise). This is known as the empirical rule or 68-95-99.7 rule.
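The 68-95-99.7 figures can be checked numerically. This sketch relies on the identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal Z; the function name `prob_within` is made up for the example.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

# Coverage for 1, 2, and 3 standard deviations: about 0.6827, 0.9545, 0.9973.
coverage = {k: prob_within(k) for k in (1, 2, 3)}
```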
Q7: Why might you use a uniform distribution for initializing neural network weights?
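A: Sampling from a uniform distribution over a small symmetric range gives each weight a different starting value with no value in the range favored over another. This breaks the symmetry between neurons so they can learn different features, and the bounded range keeps every initial weight small.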
Q8: What problem occurs if all logits in a softmax layer are equal?
A: The output probabilities become uniform (all equal), meaning the model has no preference for any class on that input. If this happens for every input, the model has not learned meaningful distinctions between classes.
Key Takeaways
📊 Discrete vs Continuous: Discrete distributions (Bernoulli, Categorical) handle countable outcomes, while continuous distributions (Normal, continuous Uniform) model uncountable values using probability density functions.
🎯 Softmax = Probability Distribution: Neural networks use softmax to convert logits into valid probability distributions where outputs sum to 1, enabling probabilistic interpretations of predictions.
🔗 Cross-Entropy is the Standard: Cross-entropy loss (-log P(y_true|x)) is the natural pairing with softmax outputs, penalizing confident wrong predictions more heavily than uncertain ones.
⚖️ Standardization Matters: Rescaling features to mean=0 and std=1 improves gradient flow and training stability in neural networks. It does not by itself make a feature normally distributed.
🎲 Choose the Right Distribution: Bernoulli for binary classification, Categorical for multi-class, Normal for continuous data modeling, and Uniform for unbiased initialization.
Additional Quiz Questions
Q9: What is the variance of a Bernoulli distribution with parameter p = 0.3?
A: Var(X) = p(1-p) = 0.3 × 0.7 = 0.21. The variance measures the spread of the distribution, reaching maximum at p = 0.5.
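The closed-form results E[X] = p and Var(X) = p(1-p) can be recovered by summing over the two outcomes directly. This is a minimal sketch; the function names and the grid of p values are chosen for illustration.

```python
def bernoulli_mean(p):
    """E[X] = 1*p + 0*(1-p)."""
    return 1 * p + 0 * (1 - p)

def bernoulli_variance(p):
    """Var(X) = E[(X - E[X])^2], summed over the outcomes 1 and 0."""
    mu = bernoulli_mean(p)
    return (1 - mu) ** 2 * p + (0 - mu) ** 2 * (1 - p)

variance_at_03 = bernoulli_variance(0.3)  # about 0.21, matching p(1-p)

# Scan p over 0.00, 0.01, ..., 1.00 to find where the variance peaks.
grid = [i / 100 for i in range(101)]
p_at_max_variance = max(grid, key=bernoulli_variance)  # 0.5
```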
Q10: In a standard normal distribution N(0,1), what is the probability of observing a value greater than 2?
A: Approximately 2.28% (or 0.0228). About 95.45% of values fall within ±2σ, so the remaining 4.55% is split equally between the two tails. The rounded figure of 95% would give 2.5%, which is only a rough estimate.
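The tail probability can be computed with the standard library's `statistics.NormalDist`, as in this short sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# P(Z > 2) is one minus the cumulative probability up to 2 (about 0.0228).
upper_tail = 1.0 - standard_normal.cdf(2.0)

# Mass inside [-2, 2]; whatever is left over is shared by the two tails.
within_two_sigma = standard_normal.cdf(2.0) - standard_normal.cdf(-2.0)
```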
Q11: Why is the softmax function called "soft" max rather than hard max?
A: Unlike a hard max (a one-hot argmax), which puts a 1 at the maximum value and 0 everywhere else, softmax produces a probability distribution where larger values get higher probabilities but all values contribute. This makes it differentiable and suitable for gradient-based learning.
Q12: What happens to the entropy of a categorical distribution as one class probability approaches 1?
A: Entropy approaches 0 (minimum). When one outcome is certain, there's no uncertainty to measure. Maximum entropy occurs with uniform distribution where all classes are equally likely.
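A minimal sketch of Shannon entropy for a categorical distribution, measured in nats. The three example distributions are invented to show the trend from uniform to certain.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; outcomes with probability 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [1 / 3, 1 / 3, 1 / 3]   # maximum uncertainty over 3 classes
peaked = [0.98, 0.01, 0.01]       # one class is almost certain
certain = [1.0, 0.0, 0.0]         # no uncertainty at all

h_uniform = entropy(uniform)  # equals log(3)
h_peaked = entropy(peaked)
h_certain = entropy(certain)  # zero
```

Entropy falls steadily from `h_uniform` through `h_peaked` to `h_certain` as probability mass concentrates on one class.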
Q13: In weight initialization, why is sampling from a uniform distribution with small range (e.g., [-0.01, 0.01]) preferred over a large range?
A: Small initial weights prevent saturation of activation functions (like sigmoid/tanh) and help gradients flow during backpropagation. Large weights can cause neurons to saturate, leading to vanishing gradients.
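Saturation can be seen in a one-neuron sketch. The input value and both weights are hypothetical; the point is only that the sigmoid's derivative, s(z)(1 - s(z)), shrinks toward zero as the pre-activation z = w * x grows.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25 (at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)

x = 5.0                          # hypothetical input to a single neuron
small_w, large_w = 0.01, 2.0     # hypothetical initial weights

grad_small = sigmoid_grad(small_w * x)  # z = 0.05: gradient close to 0.25
grad_large = sigmoid_grad(large_w * x)  # z = 10: gradient nearly vanishes
```

With the small weight the neuron sits in the sigmoid's steep region and passes gradient back; with the large weight the gradient is thousands of times smaller.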