Lesson 01: The XOR Problem | Level 02

The Puzzle That Stumped Early AI

In 1969, Marvin Minsky and Seymour Papert published a book called "Perceptrons" that had a devastating impact on neural network research. They proved something shocking: a single perceptron cannot solve the XOR problem.

💡 What is XOR?

XOR (exclusive OR) outputs 1 when exactly one of the inputs is 1. It's like asking "Is one or the other true, but not both?"

The XOR Truth Table

0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0

Notice the pattern: plotted on the unit square, the two 1s sit on one diagonal and the two 0s sit on the other!

Why Can't a Single Line Work?

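Plot the four inputs as points in the plane, with (0,1) and (1,0) as red dots (output=1) and (0,0) and (1,1) as blue dots (output=0). Now try to draw a single straight line that puts every red dot on one side and every blue dot on the other. It's impossible!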

The Key Insight

XOR is not linearly separable. No matter how you rotate or position a single straight line, you cannot separate the positive cases (0,1 and 1,0) from the negative cases (0,0 and 1,1).
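
The claim above can be probed numerically. The following is a minimal sketch, not a proof: it sweeps a line through a grid of rotations and positions and records the best number of XOR points any candidate classifies correctly. The grid resolution (5 degree steps, offsets from -2 to 2 in steps of 0.1) and the names correct_count and best_line_score are assumptions chosen for illustration.

```python
import math

# XOR inputs paired with their target outputs.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def correct_count(theta_deg, offset, data=XOR):
    """Count the points a single line classifies correctly.

    The line is cos(t)*x1 + sin(t)*x2 = offset. A point is predicted 1
    when it lies strictly on the positive side, and 0 otherwise.
    """
    t = math.radians(theta_deg)
    hits = 0
    for (x1, x2), target in data:
        side = math.cos(t) * x1 + math.sin(t) * x2 - offset
        hits += int((1 if side > 0 else 0) == target)
    return hits

def best_line_score(data=XOR):
    """Sweep rotation and position; return the best correct count found."""
    return max(
        correct_count(theta, k / 10, data)
        for theta in range(0, 360, 5)   # rotation of the line's normal
        for k in range(-20, 21)         # position: offsets -2.0 .. 2.0
    )

print(best_line_score())  # prints 3: no candidate gets all 4 points right
```

The sweep only samples a finite set of lines, so it illustrates the result rather than establishing it; the best any line achieves is three of the four points.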

The Solution: Multiple Lines

While one line can't solve XOR, two lines can! This is the foundation of multi-layer networks.

🔑 The Multi-Layer Trick

First layer: Two perceptrons each draw a line
Second layer: A perceptron combines the outputs

This insight - that stacking layers, with a nonlinear activation between them, lets a network solve problems that are not linearly separable - underlies the deep learning methods that emerged decades later.
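
The two-lines idea can be sketched with hand-picked weights. The example below assumes a step activation and one particular choice of lines (x1 + x2 = 0.5 and x1 + x2 = 1.5); many other choices work equally well, and the function names are hypothetical.

```python
def step(z):
    """Threshold activation: 1 on the positive side of the line, else 0."""
    return 1 if z > 0 else 0

def xor_two_lines(x1, x2):
    # First layer: two perceptrons, each one line in the (x1, x2) plane.
    h1 = step(x1 + x2 - 0.5)    # fires above the line x1 + x2 = 0.5 (OR)
    h2 = step(-x1 - x2 + 1.5)   # fires below the line x1 + x2 = 1.5 (NAND)
    # Second layer: fires only when both hidden units fire (AND),
    # that is, when the point lies in the band between the two lines.
    return step(h1 + h2 - 1.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_two_lines(x1, x2))
```

The points (0,1) and (1,0) fall between the two lines, while (0,0) and (1,1) fall outside the band on opposite sides, which is why the second layer can separate them.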

Linear Separability

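A dataset is linearly separable if there exists a hyperplane with all positive examples strictly on one side and all negative examples strictly on the other. In two dimensions, a hyperplane is simply a straight line.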

Mathematical Definition

For a binary classification problem with data points (xᵢ, yᵢ) where yᵢ ∈ {-1, +1}, the dataset is linearly separable if there exists a weight vector w and bias b such that:

            yᵢ(w · xᵢ + b) > 0 for all i
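
As a small illustration of this condition, the sketch below checks whether a given (w, b) satisfies it on a dataset. The AND dataset and the candidate line x1 + x2 = 1.5 are assumptions introduced for the example; they do not appear in the lesson itself.

```python
def is_separated_by(w, b, data):
    """Return True if y * (w . x + b) > 0 holds for every (x, y) in data."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
        for x, y in data
    )

# Labels use the {-1, +1} convention from the definition.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
XOR_DATA = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

w, b = (1.0, 1.0), -1.5   # hypothetical candidate: the line x1 + x2 = 1.5
print(is_separated_by(w, b, AND_DATA))  # True: this candidate separates AND
print(is_separated_by(w, b, XOR_DATA))  # False: the same candidate fails on XOR
```

A single True result is enough to show a dataset is linearly separable. A False result only rules out that one candidate.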
          

The XOR Problem

The XOR function is defined as:

            XOR(x₁, x₂) = (x₁ ∧ ¬x₂) ∨ (¬x₁ ∧ x₂)
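
A quick way to check this formula against the "exactly one input is 1" description is to evaluate it on all four inputs. This is a minimal sketch; the function name xor_formula is hypothetical.

```python
def xor_formula(x1: bool, x2: bool) -> bool:
    """(x1 AND NOT x2) OR (NOT x1 AND x2), written out term by term."""
    return (x1 and not x2) or (not x1 and x2)

for x1 in (False, True):
    for x2 in (False, True):
        # "x1 != x2" is the "exactly one input is true" reading of XOR.
        assert xor_formula(x1, x2) == (x1 != x2)
        print(int(x1), int(x2), int(xor_formula(x1, x2)))
```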
          

Theorem (Minsky & Papert, 1969): The XOR problem is not linearly separable in ℝ².

Proof

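Assume for contradiction that XOR is linearly separable. Then there exist weights w₁, w₂ and bias b such that w₁x₁ + w₂x₂ + b is positive on the inputs with output 1 and negative on the inputs with output 0. Evaluating at the four inputs gives:

1. Input (0,0), output 0: w₁·0 + w₂·0 + b < 0 → b < 0
2. Input (0,1), output 1: w₁·0 + w₂·1 + b > 0 → w₂ + b > 0
3. Input (1,0), output 1: w₁·1 + w₂·0 + b > 0 → w₁ + b > 0
4. Input (1,1), output 0: w₁·1 + w₂·1 + b < 0 → w₁ + w₂ + b < 0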

Adding inequalities 2 and 3: w₁ + w₂ + 2b > 0
From inequality 4: w₁ + w₂ + b < 0
From inequality 1: b < 0, so adding b to the left side of inequality 4 only makes it smaller: w₁ + w₂ + 2b < w₁ + w₂ + b < 0

This contradicts w₁ + w₂ + 2b > 0. ∎
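
The practical consequence of the theorem can be observed with the classic perceptron learning rule. The sketch below assumes a learning rate of 1, zero initial weights, and a cap of 100 epochs. The AND dataset is added for contrast and is not part of the lesson.

```python
def train_perceptron(data, epochs=100):
    """Run the perceptron learning rule.

    Returns True as soon as one full pass over the data makes no
    mistakes, and False if that never happens within `epochs` passes.
    """
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in data:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            delta = target - pred          # -1, 0, or +1
            if delta != 0:
                # Nudge the line toward classifying this point correctly.
                w1 += delta * x1
                w2 += delta * x2
                b += delta
                errors += 1
        if errors == 0:
            return True
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(train_perceptron(AND))  # True: AND is linearly separable
print(train_perceptron(XOR))  # False: no error-free pass ever occurs
```

Raising the epoch cap does not help on XOR: an error-free pass would mean the current weights separate the data, which the proof rules out.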

Multi-Layer Solution

A two-layer network can solve XOR:

First Layer (Hidden)

Two perceptrons compute intermediate features, where σ is a nonlinear activation function (for example, a step function or a sigmoid):

            h₁ = σ(w₁₁x₁ + w₁₂x₂ + b₁)

            h₂ = σ(w₂₁x₁ + w₂₂x₂ + b₂)

Second Layer (Output)

One perceptron combines the hidden features:

            y = σ(v₁h₁ + v₂h₂ + c)
          

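With appropriate weights, this architecture computes the XOR function, which no single perceptron can do. It is a small, concrete case of the extra representational power a hidden layer provides. The universal approximation theorems state the general result: networks with a hidden layer and a suitable nonlinear activation can approximate any continuous function on a compact domain to arbitrary accuracy.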
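
To make "appropriate weights" concrete, here is a minimal sketch of the equations above with σ taken to be the logistic sigmoid. The weight values are hand-picked assumptions for illustration, not values produced by training, and other choices work as well.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation σ."""
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked weights (an assumption for illustration, not learned values).
W11, W12, B1 = 20.0, 20.0, -10.0    # h1 approximates OR
W21, W22, B2 = -20.0, -20.0, 30.0   # h2 approximates NAND
V1, V2, C = 20.0, 20.0, -30.0       # output approximates AND of h1 and h2

def forward(x1, x2):
    """Evaluate the two-layer network on one input pair."""
    h1 = sigmoid(W11 * x1 + W12 * x2 + B1)
    h2 = sigmoid(W21 * x1 + W22 * x2 + B2)
    return sigmoid(V1 * h1 + V2 * h2 + C)

for x1 in (0, 1):
    for x2 in (0, 1):
        y = forward(x1, x2)
        print(x1, x2, round(y))   # rounding y gives the XOR output
```

The large weight magnitudes push the sigmoid close to 0 or 1, so the network behaves almost like the step-function version while remaining differentiable.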