Section 06

The Code

How Neural Networks Learn: Learning Representations by Back-propagating Errors (1986)

Backpropagation from scratch in NumPy

This implements a complete neural network (forward pass, loss, backward pass, weight updates) in about 30 lines of NumPy. No PyTorch, no TensorFlow. Every line maps directly to the mathematics in Section 5.

We train it to learn XOR — the problem that defeated the single-layer Perceptron.
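As a reference while reading the code, the whole computation can be written compactly as below. The symbols reuse the code's variable names and are an assumption made here; Section 5 may use different notation. The error E is taken as half the summed squared error, which is the quantity whose gradient the backward pass computes.

```latex
\begin{aligned}
z_1 &= X W_1, & a_1 &= \sigma(z_1), \\
z_2 &= a_1 W_2, & \hat{y} &= \sigma(z_2), \\
E &= \tfrac{1}{2} \sum_{n} (y_n - \hat{y}_n)^2, \\
\delta_{\text{out}} &= (\hat{y} - y) \odot \sigma'(z_2), &
\frac{\partial E}{\partial W_2} &= a_1^{\top} \delta_{\text{out}}, \\
\delta_{\text{hid}} &= (\delta_{\text{out}} W_2^{\top}) \odot \sigma'(z_1), &
\frac{\partial E}{\partial W_1} &= X^{\top} \delta_{\text{hid}}.
\end{aligned}
```

Here σ'(z) = σ(z)(1 − σ(z)) and ⊙ is elementwise multiplication.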

# What this code does: Implements backpropagation from scratch for a 2-layer network
# Paper: Learning Representations by Back-propagating Errors (1986)
# Run free at: https://colab.research.google.com/

import numpy as np

# XOR training data (the problem the Perceptron could NOT solve)
X = np.array([[0,0], [0,1], [1,0], [1,1]])  # 4 examples, 2 inputs each
y = np.array([[0],   [1],   [1],   [0]])     # XOR outputs

# Initialise weights randomly (small numbers to start near zero); this network has no bias terms
np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.5   # input→hidden: 2 inputs, 4 hidden neurons
W2 = np.random.randn(4, 1) * 0.5   # hidden→output: 4 hidden, 1 output
lr = 0.5                             # learning rate

def sigmoid(z):
    return 1 / (1 + np.exp(-z))     # squash any number to (0,1)

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)              # sigmoid's own derivative: σ(z)×(1-σ(z))

# Training loop: steps 0 through 10,000 (10,001 updates, so step 10000 is printed)
for step in range(10001):
    # ── FORWARD PASS ──────────────────────────────────────────
    z1 = X @ W1                     # (4,2)×(2,4) = (4,4): weighted sum, hidden layer
    a1 = sigmoid(z1)                # apply activation to hidden layer
    z2 = a1 @ W2                    # (4,4)×(4,1) = (4,1): weighted sum, output layer
    yhat = sigmoid(z2)              # final prediction (probability between 0 and 1)

    # ── LOSS ──────────────────────────────────────────────────
    loss = np.mean((y - yhat) ** 2) # mean squared error across all 4 examples

    # ── BACKWARD PASS ─────────────────────────────────────────
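    # Output layer delta: how wrong is the output × how sensitive is sigmoid there.
    # This is the gradient of ½·Σ(y - yhat)², which is 2× the gradient of the
    # reported MSE; the constant factor only rescales the effective learning rate.
    d_out = -(y - yhat) * sigmoid_deriv(z2)   # shape (4,1)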

    # Gradient for W2: hidden activations × output delta
    dW2 = a1.T @ d_out                         # shape (4,1)

    # Propagate error back to hidden layer (chain rule through W2 and sigmoid)
    d_hid = (d_out @ W2.T) * sigmoid_deriv(z1) # shape (4,4)

    # Gradient for W1: inputs × hidden delta
    dW1 = X.T @ d_hid                          # shape (2,4)

    # ── WEIGHT UPDATE (gradient descent) ──────────────────────
    W2 -= lr * dW2
    W1 -= lr * dW1

    if step % 2000 == 0:
        print(f"Step {step:5d} | Loss: {loss:.4f}")

# Test the trained network
print("\nTrained predictions vs correct XOR:")
for i in range(4):
    pred = sigmoid(sigmoid(X[i] @ W1) @ W2)[0]
    print(f"  Input {X[i]} → Predicted: {pred:.3f} | Correct: {y[i][0]}")

What you should see when you run this (approximately):

Step     0 | Loss: 0.2641
Step  2000 | Loss: 0.1823
Step  4000 | Loss: 0.0312
Step  6000 | Loss: 0.0089
Step  8000 | Loss: 0.0041
Step 10000 | Loss: 0.0024

Trained predictions vs correct XOR:
  Input [0 0] → Predicted: 0.045 | Correct: 0
  Input [0 1] → Predicted: 0.961 | Correct: 1
  Input [1 0] → Predicted: 0.962 | Correct: 1
  Input [1 1] → Predicted: 0.048 | Correct: 0

The network has learned XOR, the pattern the single-layer Perceptron could never learn. Predictions are near 0 for (0,0) and (1,1), and near 1 for (0,1) and (1,0). The loss falls from 0.26 to 0.002, a reduction of about 99%.
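One way to gain confidence that the backward pass really computes the gradient is a finite-difference check. The sketch below is an illustration, not part of the original listing: the helper names (loss_fn, backprop_grads, numerical_grad) are hypothetical, and the loss is assumed to be half the summed squared error, which is the quantity whose derivative the delta formulas produce.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(W1, W2, X, y):
    # Half the summed squared error (assumed here; its gradient matches the deltas below)
    yhat = sigmoid(sigmoid(X @ W1) @ W2)
    return 0.5 * np.sum((y - yhat) ** 2)

def backprop_grads(W1, W2, X, y):
    # Same forward and backward steps as the training loop
    a1 = sigmoid(X @ W1)
    yhat = sigmoid(a1 @ W2)
    d_out = -(y - yhat) * yhat * (1 - yhat)      # output delta
    dW2 = a1.T @ d_out
    d_hid = (d_out @ W2.T) * a1 * (1 - a1)       # hidden delta
    dW1 = X.T @ d_hid
    return dW1, dW2

def numerical_grad(f, W, eps=1e-5):
    # Central differences: nudge one weight at a time and re-evaluate the loss
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old                              # restore the weight
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 4)) * 0.5
W2 = rng.standard_normal((4, 1)) * 0.5

dW1, dW2 = backprop_grads(W1, W2, X, y)
num_dW1 = numerical_grad(lambda: loss_fn(W1, W2, X, y), W1)
num_dW2 = numerical_grad(lambda: loss_fn(W1, W2, X, y), W2)

max_err = max(np.abs(dW1 - num_dW1).max(), np.abs(dW2 - num_dW2).max())
print(f"Largest difference between backprop and numerical gradient: {max_err:.2e}")
```

If the two gradients agree to many decimal places, the chain-rule steps are consistent with the loss. A large difference would point to a sign error or a misplaced transpose.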


What to change to experiment:

  1. Run without the hidden layer. Change W1 to shape (2,1), compute yhat = sigmoid(X @ W1) directly, and reduce the backward pass to a single gradient, dW1 = X.T @ d_out. Watch it fail to converge on XOR. A single-layer network cannot solve it, as Minsky and Papert showed in 1969.

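  2. Try fewer hidden neurons. Change 4 to 2 in both weight matrices: W1 = np.random.randn(2, 2) * 0.5 and W2 = np.random.randn(2, 1) * 0.5. Two hidden neurons are enough to represent XOR, though with so few units training is more sensitive to the initial weights. Try 1 hidden neuron: does it still work?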

  3. Watch the hidden layer learn representations. After training, print sigmoid(X @ W1) — the hidden layer activations for each input. You will see that the 4 inputs [0,0], [0,1], [1,0], [1,1] have been mapped to 4 distinct patterns in hidden space — representations that make XOR easy to solve.
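Experiment 1 can be sketched as follows. This is a minimal variant written for illustration: the single weight matrix is named W here (a hypothetical name), and the data, seed, and learning rate are copied from the main listing.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(42)
W = np.random.randn(2, 1) * 0.5    # input→output directly: no hidden layer
lr = 0.5

for step in range(10001):
    yhat = sigmoid(X @ W)                       # forward pass, one layer only
    loss = np.mean((y - yhat) ** 2)             # mean squared error
    d_out = -(y - yhat) * yhat * (1 - yhat)     # output delta
    W -= lr * (X.T @ d_out)                     # the only gradient step

preds = sigmoid(X @ W)
print(f"Final loss: {loss:.4f}")
for i in range(4):
    print(f"  Input {X[i]} → Predicted: {preds[i, 0]:.3f} | Correct: {y[i, 0]}")
```

Because this network has no bias term, the input [0 0] always produces exactly 0.5, whatever the weights. No choice of W puts all four predictions on the correct side of 0.5, so the loss cannot approach zero.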