The Mathematics

Mathematical concepts used in this paper

Concept: Derivatives
Why needed: Backpropagation computes how much the loss changes when any weight changes, and that rate of change is a derivative. Without derivatives, there is no gradient descent, and without gradient descent there is no learning.
Where in paper: Every weight update uses dL/dw for that weight.
Tutorial: Derivatives — Introduction

Concept: Chain Rule
Why needed: The loss depends on the output, which depends on the hidden layer, which depends on the weights. The chain rule is the mathematical tool for computing derivatives of such composed functions. Backpropagation is the chain rule, systematically applied.
Where in paper: Every backward pass step is a chain rule application.
Tutorial: Chain Rule

Concept: Partial Derivatives
Why needed: The loss depends on millions of weights simultaneously. A partial derivative ∂L/∂wᵢ captures how loss changes with wᵢ while holding all other weights fixed. We need one per weight.
Where in paper: The gradient ∇L is the vector of all partial derivatives.
Tutorial: Partial Derivatives

Concept: Gradient Descent
Why needed: Once we have computed ∂L/∂wᵢ for every weight, we update each weight by stepping opposite to the gradient. This is gradient descent, the optimisation algorithm that actually trains the network.
Where in paper: The weight update rule: w ← w - η × ∂L/∂w
Tutorial: Gradient Intuition
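To make "one partial derivative per weight" concrete, here is a minimal sketch that estimates each ∂L/∂wᵢ numerically by nudging one weight while holding the others fixed. The toy loss and the helper name `partial_derivative` are assumptions made for illustration; they do not come from the paper. Backpropagation computes the same numbers analytically and far more cheaply.

```python
def loss(weights):
    """Toy two-weight loss (hypothetical, chosen only for illustration)."""
    w1, w2 = weights
    return (w1 * 2.0 + w2 * 3.0 - 1.0) ** 2


def partial_derivative(f, weights, i, eps=1e-6):
    """Central-difference estimate of df/dw_i with all other weights held fixed."""
    up = list(weights)
    down = list(weights)
    up[i] += eps
    down[i] -= eps
    return (f(up) - f(down)) / (2 * eps)


weights = [0.5, -0.2]

# The gradient is the vector of all partial derivatives, one per weight.
gradient = [partial_derivative(loss, weights, i) for i in range(len(weights))]
print(gradient)
```

This approach needs two loss evaluations per weight, which is why it is useful for checking gradients but impractical for training.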

The key equations

The sigmoid activation function:

σ(z) = 1 / (1 + e^(-z))

Where:

z = the weighted sum input to the neuron (also called pre-activation)
e ≈ 2.718 (Euler’s number)
σ(z) outputs a value strictly between 0 and 1

The sigmoid’s crucial property: it is differentiable everywhere, so gradients can pass through it via the chain rule. Its derivative is:

σ'(z) = σ(z) × (1 - σ(z))

This derivative is cheap to compute: if you already know σ(z) from the forward pass, σ'(z) costs only one subtraction and one multiplication.
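The sigmoid and its derivative can be sketched in a few lines. The function names are illustrative, and this version is not hardened against overflow for very large negative z. The finite-difference comparison at the end is only a sanity check on the identity σ'(z) = σ(z)(1 - σ(z)).

```python
import math


def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """σ'(z) = σ(z) × (1 - σ(z)), reusing the forward value."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = 1.1
eps = 1e-6

# Central-difference estimate of the derivative, for comparison.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid(z), sigmoid_prime(z), numeric)
```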

The loss function (squared error):

L = (y - ŷ)²

Where:

y = correct answer (given in training data)
ŷ = network’s prediction

Derivative:

∂L/∂ŷ = -2(y - ŷ)
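As a quick check of this derivative, the sketch below evaluates the loss and its gradient at ŷ = 0.646 with y = 1 (values borrowed from the worked example later in this section) and compares against a finite-difference estimate. The function names are illustrative.

```python
def squared_error(y, y_hat):
    """L = (y - ŷ)²."""
    return (y - y_hat) ** 2


def squared_error_grad(y, y_hat):
    """∂L/∂ŷ = -2(y - ŷ)."""
    return -2.0 * (y - y_hat)


y, y_hat = 1.0, 0.646
eps = 1e-6

analytic = squared_error_grad(y, y_hat)
numeric = (squared_error(y, y_hat + eps) - squared_error(y, y_hat - eps)) / (2 * eps)
print(analytic, numeric)
```

The gradient is negative whenever ŷ < y, meaning that increasing the prediction would reduce the loss.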

The weight update rule (gradient descent):

w_new = w_old - η × ∂L/∂w

Where:

η (eta) = learning rate (e.g. 0.01, 0.001)
∂L/∂w = the gradient computed by backpropagation
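The update rule is easiest to see on a model with a single weight. The sketch below assumes a hypothetical linear model ŷ = w × x with squared-error loss, so the gradient can be written by hand; in a real network that gradient would come from backpropagation.

```python
def grad(w, x, y):
    """dL/dw for L = (y - w*x)^2, a hypothetical one-weight linear model."""
    return -2.0 * x * (y - w * x)


x, y = 2.0, 1.0   # single training example (illustrative values)
eta = 0.1         # learning rate
w = 0.0           # initial weight

for step in range(20):
    # w_new = w_old - eta * dL/dw
    w = w - eta * grad(w, x, y)

print(w)
```

For this particular example the loss is minimised at w = 0.5. Each step multiplies the remaining error by (1 - 8η), so any η above 0.25 makes the iterates diverge instead of converge.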

The backpropagation equations in full generality

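For a network with L layers (as a superscript, L denotes the final layer; elsewhere L is the loss), define:

aˡ = output (activation) of layer l, with a⁰ = x (the network input)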
zˡ = pre-activation of layer l: zˡ = Wˡ aˡ⁻¹ + bˡ
δˡ = the error signal (delta) at layer l

Forward pass:

zˡ = Wˡ aˡ⁻¹ + bˡ
aˡ = σ(zˡ)

Backward pass — output layer delta:

δᴸ = ∇ₐL ⊙ σ'(zᴸ)

Where ⊙ means element-wise multiplication.

Backward pass — propagate delta to earlier layers:

δˡ = ((Wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ σ'(zˡ)

Gradient for weights at layer l:

∂L/∂Wˡ = δˡ (aˡ⁻¹)ᵀ

Weight update:

Wˡ ← Wˡ - η × ∂L/∂Wˡ

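These four equations (output delta, delta propagation, weight gradient, weight update) completely describe backpropagation for the weights. The biases follow the same pattern, with ∂L/∂bˡ = δˡ. Everything else is implementation detail.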
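The equations above can be sketched in plain Python with nested lists standing in for matrices. This is a minimal illustration under stated assumptions, not the paper's implementation: every layer uses a sigmoid, the loss is the sum of squared errors over the outputs, and the 2-2-1 network values are arbitrary. The bias gradient (∂L/∂bˡ = δˡ) is included although it is not listed explicitly in the text. A finite-difference check on one weight confirms the backward pass.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x):
    """Return the activations of every layer, starting with a^0 = x."""
    activations = [x]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z = W a_prev + b, computed row by row.
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(W, b)]
        activations.append([sigmoid(z_j) for z_j in z])
    return activations


def backward(weights, activations, y):
    """Return (weight gradients, bias gradients) for L = sum((y - a)^2)."""
    a_out = activations[-1]
    # Output delta: grad_a L ⊙ σ'(z), using σ'(z) = a(1 - a).
    delta = [-2.0 * (y_j - a_j) * a_j * (1.0 - a_j)
             for y_j, a_j in zip(y, a_out)]
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        # dL/dW = delta (a_prev)^T, an outer product.
        grads_W.insert(0, [[d_j * a_k for a_k in a_prev] for d_j in delta])
        grads_b.insert(0, list(delta))
        if l > 0:
            W = weights[l]
            # Propagate: delta_prev = (W^T delta) ⊙ σ'(z_prev).
            delta = [sum(W[j][k] * delta[j] for j in range(len(delta)))
                     * a_prev[k] * (1.0 - a_prev[k])
                     for k in range(len(a_prev))]
    return grads_W, grads_b


def loss(weights, biases, x, y):
    a_out = forward(weights, biases, x)[-1]
    return sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a_out))


# Hypothetical 2-2-1 network; values chosen arbitrarily for the check.
weights = [[[0.5, 0.3], [-0.4, 0.2]], [[0.8, -0.6]]]
biases = [[0.1, -0.1], [0.05]]
x, y = [1.0, 2.0], [1.0]

grads_W, grads_b = backward(weights, forward(weights, biases, x), y)

# Finite-difference check on one first-layer weight.
eps = 1e-6
weights[0][1][0] += eps
loss_up = loss(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
loss_down = loss(weights, biases, x, y)
weights[0][1][0] += eps  # restore the original value
numeric = (loss_up - loss_down) / (2 * eps)
print(grads_W[0][1][0], numeric)
```

The weight update itself is then a loop applying Wˡ ← Wˡ - η × ∂L/∂Wˡ to every entry. A practical implementation would use an array library rather than nested lists.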

Worked numerical verification

We verify the result from Section 4’s step-by-step walkthrough.

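Network: x=[1,2], w₁=0.5, w₂=0.3, w₃=0.8, y=1, η=0.1 (one hidden sigmoid neuron with input weights w₁ and w₂, one sigmoid output neuron with weight w₃, no biases)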

Forward:

z_h = 0.5×1 + 0.3×2 = 1.1
h   = σ(1.1) ≈ 0.750

z_out = 0.8 × 0.750 = 0.600
ŷ    = σ(0.600) ≈ 0.646

L = (1 - 0.646)² ≈ 0.125

Backward:

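δ_out = -(y-ŷ) × ŷ(1-ŷ) = -0.354 × 0.229 ≈ -0.0811
(The factor 2 from ∂L/∂ŷ = -2(y - ŷ) is dropped here, which is equivalent to using L = ½(y - ŷ)² or absorbing the 2 into η.)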

∂L/∂w₃ = δ_out × h = -0.0811 × 0.750 = -0.0608
→ w₃: 0.800 → 0.806

δ_h = δ_out × w₃ × h(1-h) = -0.0811 × 0.800 × 0.1875 = -0.01217

∂L/∂w₁ = δ_h × x₁ = -0.01217 × 1 = -0.01217 → w₁: 0.500 → 0.501
∂L/∂w₂ = δ_h × x₂ = -0.01217 × 2 = -0.02434 → w₂: 0.300 → 0.302

(Slight differences from Section 4 are due to rounding at intermediate steps. The direction of all updates is confirmed: all weights increase, moving the prediction toward 1.)
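The arithmetic above can be reproduced with a short script. It follows the walkthrough's own convention, in which δ_out omits the factor 2 from the squared-error derivative; the variable names are illustrative. Because it carries full floating-point precision instead of rounding at each step, the later digits differ slightly from the hand calculation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 1.0, 2.0
w1, w2, w3 = 0.5, 0.3, 0.8
y, eta = 1.0, 0.1

# Forward pass.
h = sigmoid(w1 * x1 + w2 * x2)
y_hat = sigmoid(w3 * h)
loss = (y - y_hat) ** 2

# Backward pass (factor 2 omitted, as in the walkthrough).
delta_out = -(y - y_hat) * y_hat * (1.0 - y_hat)
grad_w3 = delta_out * h
delta_h = delta_out * w3 * h * (1.0 - h)
grad_w1 = delta_h * x1
grad_w2 = delta_h * x2

# Gradient descent step.
w1_new = w1 - eta * grad_w1
w2_new = w2 - eta * grad_w2
w3_new = w3 - eta * grad_w3

print(round(w1_new, 3), round(w2_new, 3), round(w3_new, 3))
```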
