The Overfitting Problem
Neural networks often have millions of parameters, which gives them enough capacity to memorize the training data. What we actually want is for them to generalize to new data.
Training vs Validation Loss
Underfitting: Both training and validation loss are high. Model too simple.
Good fit: Both losses low and similar. Model generalizes well.
Overfitting: Training loss low, validation loss high. Model memorized training data.
- High bias: Model too simple, underfits
- High variance: Model too complex, overfits
- Goal: Find the sweet spot
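The three regimes can be seen on a toy problem. The sketch below is illustrative only: it assumes NumPy, a noisy sine curve as data, and polynomial degree as a stand-in for model complexity (none of these come from the lesson itself). It fits polynomials of increasing degree and reports training and validation error for each.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical toy task: noisy samples of sin(3x) on [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(12)    # small training set, easy to memorize
x_val, y_val = make_data(200)       # held-out data from the same distribution

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means more capacity.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"val MSE {results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each larger model contains the smaller ones. Validation error is the number to watch: a large gap between the two columns is the signature of overfitting.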
Regularization Techniques
L2 Regularization (Weight Decay)
Penalizes large weights. Encourages small, distributed weights.
L1 Regularization
Encourages sparse weights (many become exactly 0).
Dropout
Randomly sets neuron outputs to 0 during training, which forces the network to learn redundant representations.
Early Stopping
Stop training when validation loss stops improving.
Data Augmentation
Create more training data through transformations.
Batch Normalization
Normalizes layer inputs using the statistics of each mini-batch. The noise in those batch statistics has a mild regularizing effect.
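As a rough illustration of the last technique, here is a minimal sketch of the training-time batch normalization forward pass, assuming NumPy and a 2-D input of shape (batch, features). The function name and the `gamma`, `beta`, and `eps` parameters are illustrative, and running statistics for inference are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # roughly zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

Because `mean` and `var` depend on which examples happen to share a batch, each example is normalized slightly differently from step to step, which is where the regularizing noise comes from.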
L2 Regularization in Detail
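With a penalty of λ Σ w² added to the loss, each gradient step multiplies the weights by (1 - 2αλ), where α is the learning rate and λ is the regularization strength. Because this factor is slightly less than 1, the weights shrink toward zero at every step, hence the name "weight decay."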
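The decay factor comes from expanding one gradient step on the penalized loss: w ← w - α(∂L_data/∂w + 2λw) = (1 - 2αλ)w - α ∂L_data/∂w. The sketch below checks this identity numerically. It assumes plain SGD and a penalty of λ Σ w²; the function names and numeric values are made up for illustration.

```python
import numpy as np

alpha = 0.1   # learning rate (illustrative value)
lam = 0.01    # regularization strength (illustrative value)

def sgd_step_l2(w, grad_data):
    """One gradient step on L_data + lam * sum(w**2)."""
    grad_total = grad_data + 2.0 * lam * w    # the penalty contributes 2*lam*w
    return w - alpha * grad_total

def sgd_step_decay(w, grad_data):
    """The same step, written as weight decay plus the data-gradient update."""
    return (1.0 - 2.0 * alpha * lam) * w - alpha * grad_data

w = np.array([1.0, -2.0, 0.5])
grad_data = np.array([0.3, -0.1, 0.0])

w_a = sgd_step_l2(w, grad_data)
w_b = sgd_step_decay(w, grad_data)
print(np.allclose(w_a, w_b))
```

The third weight has a zero data gradient, so the only thing acting on it is the decay factor: it moves from 0.5 to 0.5 × (1 - 0.002).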
Dropout
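During training, each neuron's output is set to 0 independently with probability p. At test time all neurons are kept, with activations scaled so that their expected value matches training.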
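A minimal sketch of this step, assuming NumPy and the "inverted dropout" convention, in which surviving activations are scaled by 1/(1 - p) during training so that no rescaling is needed at test time. The lesson does not say which convention it uses, and the function name and arguments are illustrative.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p (inverted dropout)."""
    if not training or p == 0.0:
        return activations                    # test time: keep every neuron
    keep_prob = 1.0 - p
    # Boolean mask: True where the neuron survives this training step.
    mask = rng.random(activations.shape) < keep_prob
    # Scale survivors so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 100))
h_train = dropout(h, p=0.5, rng=rng)
h_eval = dropout(h, p=0.5, rng=rng, training=False)
print(h_train.mean())   # close to 1.0 on average
```

A fresh mask is drawn on every call, so each training step sees a different subset of active neurons.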
Why Dropout Works
- Ensemble effect: Each training step uses a different "thinned" network
- Redundancy: The network can't rely on any single neuron
- Co-adaptation: Prevents neurons from depending too much on specific other neurons
Early Stopping
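The simplest form of regularization is to stop training before the model starts to overfit. In practice, monitor validation loss and stop once it has not improved for a set number of epochs (the "patience").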
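The patience rule can be sketched as a small helper. This is an illustration rather than a full training loop: it takes a list of per-epoch validation losses that is assumed to be already recorded, and the function name, `patience`, and `min_delta` are hypothetical choices.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch      # patience exhausted: stop here
    return len(val_losses) - 1, best_epoch    # never triggered: ran to the end

val_losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.66, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)   # 6 3
```

In a real training loop one would usually also save a checkpoint whenever a new best is reached and restore the weights from `best_epoch` after stopping.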
Regularization in LLMs
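Modern LLMs rely on a somewhat different mix of techniques:
- Weight decay: Still used, typically through AdamW, which applies the decay to the weights directly instead of adding it to the gradient
- Dropout: Often reduced or removed in very large models
- Gradient clipping: Caps the gradient norm to prevent exploding gradients
- Large datasets: The best regularization is more data!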
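Of the items above, gradient clipping is the easiest to show in a few lines. The sketch below assumes NumPy and clipping by global norm, meaning one norm computed over all parameter gradients together. The function name and the `max_norm` value are illustrative, not taken from any particular LLM training setup.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale is 1.0 when the norm is already within bounds.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)   # 13.0
```

Every gradient is multiplied by the same factor, so the update direction is preserved and only its length is capped.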
Knowledge Check
Question 1
What is the primary goal of regularization in neural networks?
- A) To speed up training time
- B) To prevent overfitting and improve generalization
- C) To increase model capacity
- D) To reduce memory usage
Answer:
B) To prevent overfitting and improve generalization
Regularization techniques constrain the model to prevent it from memorizing training data, helping it perform better on unseen data.
Question 2
How does L2 regularization (weight decay) affect the weight update rule?
- A) It adds a constant to all weights
- B) It multiplies weights by (1 - 2αλ) before the gradient update
- C) It sets small weights to exactly zero
- D) It doubles the learning rate
Answer:
B) It multiplies weights by (1 - 2αλ) before the gradient update
The weight decay term shrinks the weights toward zero at each step, which favors smoother functions that are less prone to overfitting.
Question 3
Why does dropout create an "ensemble effect"?
- A) It trains multiple separate networks simultaneously
- B) Each training iteration uses a different random subset of neurons
- C) It combines predictions from different models at test time
- D) It duplicates the network layers
Answer:
B) Each training iteration uses a different random subset of neurons
Dropout randomly deactivates neurons during training, effectively training many different "thinned" networks. Using the full network at test time approximates averaging their predictions.
Question 4
What is the main difference between L1 and L2 regularization?
- A) L1 uses the sum of squared weights; L2 uses the sum of absolute weights
- B) L1 encourages sparse weights (many zeros); L2 encourages small distributed weights
- C) L1 is only used in CNNs; L2 is only used in RNNs
- D) L1 requires more computation than L2
Answer:
B) L1 encourages sparse weights (many zeros); L2 encourages small distributed weights
L1 regularization (sum of |w|) drives many weights to exactly zero, performing feature selection. L2 regularization (sum of w²) keeps all weights small.
Question 5
When should early stopping be triggered?
- A) When training loss reaches zero
- B) When validation loss stops improving for a specified number of epochs
- C) When the learning rate becomes too small
- D) When the model reaches a certain size
Answer:
B) When validation loss stops improving for a specified number of epochs
Early stopping monitors validation loss and halts training when it hasn't improved for a set "patience" period, preventing overtraining.