Lesson 6: Regularization

The Overfitting Problem

Neural networks can have millions of parameters, which gives them enough capacity to memorize the training data outright. What we actually want is for them to generalize to new data.

Training vs Validation Loss

Underfitting: Both training and validation loss are high. Model too simple.

Good fit: Both losses low and similar. Model generalizes well.

Overfitting: Training loss low, validation loss high. Model memorized training data.

        Bias-Variance Tradeoff:

        • High bias: Model too simple, underfits

        • High variance: Model too complex, overfits

        • Goal: Find the sweet spot
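
The three regimes above can be reproduced on a toy problem. The following is a minimal sketch, assuming a noisy sine wave as the dataset and polynomial degree as a stand-in for model complexity (both are illustrative choices, not part of the lesson's setup). It prints training and validation error for a low, medium, and high degree so the two can be compared.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine wave on [0, 1] (toy data, assumed for illustration)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set, easy to memorize
x_val, y_val = make_data(200)       # held-out data from the same distribution

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):            # low, medium, high complexity
    model = Polynomial.fit(x_train, y_train, degree)   # least-squares fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, "
          f"val MSE={results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Validation error is the number to watch: a widening gap between the two columns is the overfitting signature described above.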

Regularization Techniques

L2 Regularization (Weight Decay)

Loss = DataLoss + λ·Σw²

Penalizes large weights. Encourages small, distributed weights.

✓ Simple, effective, standard

L1 Regularization

Loss = DataLoss + λ·Σ|w|

Encourages sparse weights (many become exactly 0).

→ Feature selection
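
To make the contrast with L2 concrete, here is a small sketch that applies one penalty-only update to the same weight vector under each scheme. For L1 it uses a soft-thresholding (proximal) step, which is a swapped-in technique: a plain subgradient step on |w| rarely lands on exactly zero, whereas soft-thresholding does. The function names and the values of `lr` and `lam` are assumptions for illustration.

```python
import numpy as np

def l2_shrink(w, lr, lam):
    """Gradient step on lam * sum(w**2) alone: scales every weight by (1 - 2*lr*lam)."""
    return (1.0 - 2.0 * lr * lam) * w

def l1_soft_threshold(w, lr, lam):
    """Proximal step on lam * sum(|w|) alone: moves each weight toward zero
    by lr*lam and clamps it at exactly zero if it would cross."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([2.0, -0.5, 0.03, -0.01])
lr, lam = 0.1, 0.5    # illustrative values; lr * lam = 0.05

w_l2 = l2_shrink(w, lr, lam)
w_l1 = l1_soft_threshold(w, lr, lam)

print("L2:", w_l2)    # all four weights shrink proportionally, none reach zero
print("L1:", w_l1)    # the two smallest weights become exactly zero
```

L2 shrinks large weights the most in absolute terms but never zeroes anything. L1 subtracts the same amount from every weight, so small weights are eliminated, which is where the sparsity comes from.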

Dropout

Randomly set a fraction of neuron activations to 0 during training. Forces the network to learn redundant representations.

✓ Very effective for deep networks

Early Stopping

Stop training when validation loss stops improving.

✓ Simple, prevents overtraining

Data Augmentation

Create more training data through transformations.

✓ More data = better generalization
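
As a rough illustration of what "transformations" can mean for image data, the sketch below applies a random horizontal flip and a small horizontal shift to a single-channel image stored as a NumPy array. The `augment` helper, the shift range, and the wrap-around behavior of the shift are all assumptions made to keep the example short. Real pipelines usually pad or crop instead of wrapping.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and shifted copy of an (H, W) image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                # horizontal flip
    shift = int(rng.integers(-2, 3))      # shift by -2..2 pixels
    out = np.roll(out, shift, axis=1)     # wrap-around shift, for simplicity
    return out

rng = np.random.default_rng(0)
image = np.arange(64, dtype=np.float32).reshape(8, 8)

# Each call yields a different variant of the same underlying example.
batch = np.stack([augment(image, rng) for _ in range(4)])
print(batch.shape)
```

The label of the image is unchanged by these transformations, so each variant acts as an additional training example at no labeling cost.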

Batch Normalization

Normalizes layer inputs using the mean and variance of each mini-batch. The noise in those batch statistics has a mild regularizing effect.

✓ Also speeds up training
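
A minimal sketch of the training-time forward pass, assuming a 2-D input of shape (batch, features). The names `gamma` and `beta` follow common usage for the learnable scale and shift. The running averages that are normally used at inference time are left out here.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift."""
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # inputs far from zero mean
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0))   # close to 0 for every feature
print(out.std(axis=0))    # close to 1 for every feature
```

Because `mean` and `var` depend on which examples happen to share a batch, the same input is normalized slightly differently from step to step, which is the source of the regularizing noise.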

L2 Regularization in Detail

# Loss with L2 regularization
loss = cross_entropy(predictions, targets) + λ * sum(w**2 for w in weights)

# Gradient of the loss with respect to a weight w
∂loss/∂w = ∂data_loss/∂w + 2λw

# Weight update
w = w - α(∂data_loss/∂w + 2λw)
  = w - α·∂data_loss/∂w - 2αλw
  = (1 - 2αλ)w - α·∂data_loss/∂w
      

The factor (1 - 2αλ) shrinks every weight toward zero at each step, hence the name "weight decay."

        Why it works: Small weights mean the function is smoother (less wiggly). 
        Large weights allow the network to fit noise in the training data.
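
The algebra above can be checked numerically. This sketch uses made-up values for the weights, the data gradient, α, and λ, and confirms that a gradient step on the regularized loss gives the same result as shrinking the weights by (1 - 2αλ) and then applying the data gradient.

```python
import numpy as np

alpha, lam = 0.1, 0.01                    # learning rate and L2 strength (illustrative)
w = np.array([1.0, -2.0, 0.5])
data_grad = np.array([0.3, -0.1, 0.2])    # stand-in for d(data_loss)/dw

# Form 1: gradient step on the full regularized loss.
step_on_total_loss = w - alpha * (data_grad + 2 * lam * w)

# Form 2: shrink the weights first, then apply the data gradient.
decay_then_step = (1 - 2 * alpha * lam) * w - alpha * data_grad

print(np.allclose(step_on_total_loss, decay_then_step))   # True
```

The equivalence holds for plain gradient descent. With adaptive optimizers such as Adam, adding the penalty to the loss and decaying the weights directly are no longer the same update, which is the motivation for AdamW.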
      

Dropout

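During training, randomly set a fraction p of neurons to 0, so each neuron is kept with probability 1 - p:

import numpy as np

p = 0.5                      # drop probability
h = np.random.randn(4, 8)    # stand-in for activation(input)

# Training: drop each neuron with probability p
mask = np.random.random(h.shape) > p   # True (keep) with probability 1 - p
train_output = mask * h

# Test time: keep every neuron and scale by the keep probability 1 - p
# (alternatively, divide by 1 - p during training and leave test time unscaled)
test_output = (1 - p) * h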
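
The "scale during training" variant is usually called inverted dropout. Below is a minimal sketch of it, assuming activations stored in a NumPy array. The `dropout` function and its `drop_prob` and `training` arguments are names chosen for this example. Surviving activations are divided by the keep probability so that the expected value is unchanged and nothing needs to be rescaled at test time.

```python
import numpy as np

def dropout(x, drop_prob, rng, training=True):
    """Inverted dropout: zero each element with probability drop_prob
    and rescale the survivors by 1 / (1 - drop_prob)."""
    if not training or drop_prob == 0.0:
        return x                                  # test time: identity
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) < keep_prob        # True means "keep"
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))

train_out = dropout(activations, drop_prob=0.2, rng=rng, training=True)
test_out = dropout(activations, drop_prob=0.2, rng=rng, training=False)

# The training-time mean stays close to the test-time mean of 1.0.
print(train_out.mean(), test_out.mean())
```

Each surviving element here is 1 / 0.8 = 1.25 and roughly 20% of elements are zero, so the average stays near 1.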
      

Why Dropout Works

Ensemble effect: Each training step uses a different "thinned" network
Redundancy: Can't rely on any single neuron
Co-adaptation: Prevents neurons from depending too much on specific others

        Dropout Rate: Typically 0.2-0.5 for hidden layers. Not used on the output layer.
        In large transformer language models, dropout is often reduced or removed entirely.
      

Early Stopping

The simplest form of regularization is to stop training before the model starts to overfit:

best_val_loss = float("inf")
patience = 10  # How many epochs to wait for an improvement before stopping
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train()
    val_loss = validate()
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint()  # Keep the best weights so they can be restored after stopping
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        
    if epochs_without_improvement >= patience:
        print("Early stopping!")
        break
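
To see the patience logic run end to end without a real model, the sketch below replays a hand-written list of validation losses. The `train_with_early_stopping` function and the `simulated` values are hypothetical. In practice each loss would come from a validation pass after an epoch of training.

```python
import math

def train_with_early_stopping(val_losses, patience):
    """Replay per-epoch validation losses and return
    (stop_epoch, best_epoch, best_loss)."""
    best_loss = math.inf
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch   # a checkpoint would be saved here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch, best_epoch, best_loss       # early stop
    return len(val_losses) - 1, best_epoch, best_loss # ran out of epochs

# Validation loss improves until epoch 3, then drifts upward.
simulated = [1.0, 0.8, 0.6, 0.55, 0.57, 0.56, 0.58, 0.60, 0.61]
stop_epoch, best_epoch, best_loss = train_with_early_stopping(simulated, patience=3)
print(stop_epoch, best_epoch, best_loss)   # 6 3 0.55
```

Training stops at epoch 6, three epochs after the best result at epoch 3. The weights worth keeping are the ones checkpointed at epoch 3, not the ones in memory when the loop exits.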
      

Regularization in LLMs

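Modern LLMs rely on a different mix of techniques:

Weight decay: Still used, typically through AdamW, which applies the decay directly to the weights instead of adding it to the loss gradient
Dropout: Often reduced or removed in very large models
Gradient clipping: Prevents exploding gradients (a training-stability measure more than a regularizer)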
Large datasets: The best regularization is more data!
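
Two of the items in this list are short enough to sketch directly. The code below is a simplified NumPy illustration, not any library's implementation: `clip_by_global_norm` rescales a set of gradients so their combined norm does not exceed a limit, and `adamw_step` performs one Adam update with the weight decay applied to the weights separately from the gradient step. Function names and default hyperparameters are assumptions for this example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update with decoupled weight decay (t is the 1-based step count)."""
    w = w * (1.0 - lr * weight_decay)          # decay acts on the weights directly
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Gradient clipping: the global norm of these gradients is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

# AdamW with a zero gradient: only the weight decay changes the weights.
w = np.array([1.0, -2.0])
w_new, m, v = adamw_step(w, g=np.zeros(2), m=np.zeros(2), v=np.zeros(2), t=1)
print(norm_before, norm_after, w_new)
```

With a zero gradient the Adam part of the update vanishes, so the weights shrink by exactly the factor (1 - lr * weight_decay), which makes the decoupling easy to see.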

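        Key Insight: With billions of parameters trained on trillions of tokens, much of the data is seen only a few times,
        so the model has little opportunity to memorize it and overfitting is less of a problem. The challenge shifts toward
        underfitting (not enough capacity or training to learn all the patterns in the data).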
      

Knowledge Check

Question 1

What is the primary goal of regularization in neural networks?

A) To speed up training time
B) To prevent overfitting and improve generalization
C) To increase model capacity
D) To reduce memory usage

Answer:

B) To prevent overfitting and improve generalization
Regularization techniques constrain the model to prevent it from memorizing training data, helping it perform better on unseen data.

Question 2

How does L2 regularization (weight decay) affect the weight update rule?

A) It adds a constant to all weights
B) It multiplies weights by (1 - 2αλ) before the gradient update
C) It sets small weights to exactly zero
D) It doubles the learning rate

Answer:

B) It multiplies weights by (1 - 2αλ) before the gradient update
The weight decay term causes weights to shrink toward zero at each step, making the function smoother and less prone to overfitting.

Question 3

Why does dropout create an "ensemble effect"?

A) It trains multiple separate networks simultaneously
B) Each training iteration uses a different random subset of neurons
C) It combines predictions from different models at test time
D) It duplicates the network layers

Answer:

B) Each training iteration uses a different random subset of neurons
Dropout randomly deactivates neurons during training, effectively training many different "thinned" networks. At test time the full network with scaled activations approximates the average of their predictions.

Question 4

What is the main difference between L1 and L2 regularization?

A) L1 uses the sum of squared weights; L2 uses the sum of absolute weights
B) L1 encourages sparse weights (many zeros); L2 encourages small distributed weights
C) L1 is only used in CNNs; L2 is only used in RNNs
D) L1 requires more computation than L2

Answer:

B) L1 encourages sparse weights (many zeros); L2 encourages small distributed weights
L1 regularization (sum of |w|) drives many weights to exactly zero, performing feature selection. L2 regularization (sum of w²) keeps all weights small.

Question 5

When should early stopping be triggered?

A) When training loss reaches zero
B) When validation loss stops improving for a specified number of epochs
C) When the learning rate becomes too small
D) When the model reaches a certain size

Answer:

B) When validation loss stops improving for a specified number of epochs
Early stopping monitors validation loss and halts training when it hasn't improved for a set "patience" period, preventing overtraining.