The Mathematics
Mathematical concepts used in this paper
Concept: Derivatives
Why needed: Backpropagation computes how much the loss changes when any weight changes, and that rate of change is a derivative. Without derivatives, there is no gradient descent, and without gradient descent there is no learning.
Where in paper: Every weight update uses dL/dw for that weight.
Tutorial: Derivatives — Introduction
Concept: Chain Rule
Why needed: The loss depends on the output, which depends on the hidden layer, which depends on the weights. The chain rule is the mathematical tool for computing derivatives of such composed functions. Backpropagation is the chain rule, systematically applied.
Where in paper: Every backward pass step is a chain rule application.
Tutorial: Chain Rule
Concept: Partial Derivatives
Why needed: The loss depends on millions of weights simultaneously. A partial derivative ∂L/∂wᵢ captures how loss changes with wᵢ while holding all other weights fixed. We need one per weight.
Where in paper: The gradient ∇L is the vector of all partial derivatives.
Tutorial: Partial Derivatives
Concept: Gradient Descent
Why needed: Once we have computed ∂L/∂wᵢ for every weight, we update each weight by stepping opposite to the gradient. This is gradient descent, the optimisation algorithm that actually trains the network.
Where in paper: The weight update rule: w ← w - η × ∂L/∂w
Tutorial: Gradient Intuition
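To make the idea of a partial derivative concrete, here is a minimal sketch that approximates each ∂L/∂wᵢ by nudging one weight at a time while holding the others fixed. The helper name numerical_gradient and the toy loss are illustrative assumptions, not something taken from the paper; backpropagation computes the same quantities analytically and far more cheaply.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Approximate the gradient of f at w by central differences.

    Each entry perturbs one weight while all others stay fixed,
    which is what a partial derivative measures.
    """
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad


# Hypothetical toy loss, chosen only for illustration:
# L(w) = (w0 - 1)^2 + 3 * w1^2, with analytic gradient [2 * (w0 - 1), 6 * w1].
def toy_loss(w):
    return (w[0] - 1.0) ** 2 + 3.0 * w[1] ** 2


w = [2.0, -1.0]
grad = numerical_gradient(toy_loss, w)  # close to [2.0, -6.0]
```

This needs two loss evaluations per weight, which is why it is useful as a check on a gradient but not as a training method.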
The key equations
The sigmoid activation function:
σ(z) = 1 / (1 + e^(-z))
Where:
- z = the weighted sum input to the neuron (also called pre-activation)
- e ≈ 2.718 (Euler’s number)
- σ(z) outputs a value strictly between 0 and 1
The sigmoid’s crucial property: it is differentiable everywhere, so the chain rule can be applied straight through it. Its derivative is:
σ'(z) = σ(z) × (1 - σ(z))
This derivative is cheap to compute: if you already know σ(z) from the forward pass, σ'(z) costs only one subtraction and one multiplication.
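The two sigmoid formulas above can be sketched in a few lines of Python. This is a naive version for illustration only (math.exp can overflow for very large negative z), and the finite-difference comparison at the end is an added sanity check, not part of the paper.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, reusing the forward value s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Compare the closed-form derivative with a central finite difference.
z = 1.1
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_prime(z)
```

In a real backward pass the value s is usually cached from the forward pass rather than recomputed.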
The loss function (squared error):
L = (y - ŷ)²
Where:
- y = correct answer (given in training data)
- ŷ = network’s prediction
Derivative:
∂L/∂ŷ = -2(y - ŷ)
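As a quick check of the loss and its derivative, the following sketch evaluates both at y = 1 and ŷ = 0.646 (the rounded prediction from the worked example later in this section). The function names are illustrative assumptions.

```python
def squared_error(y, y_hat):
    """Squared-error loss L = (y - y_hat)^2."""
    return (y - y_hat) ** 2


def squared_error_grad(y, y_hat):
    """Derivative of the loss with respect to the prediction: -2 * (y - y_hat)."""
    return -2.0 * (y - y_hat)


y, y_hat = 1.0, 0.646
loss = squared_error(y, y_hat)        # about 0.125
grad = squared_error_grad(y, y_hat)   # about -0.708

# Central finite difference in y_hat as an independent check of the sign and size.
eps = 1e-6
numeric = (squared_error(y, y_hat + eps) - squared_error(y, y_hat - eps)) / (2 * eps)
```

The negative derivative says that increasing ŷ lowers the loss, which matches intuition: the prediction is below the target.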
The weight update rule (gradient descent):
w_new = w_old - η × ∂L/∂w
Where:
- η (eta) = learning rate (e.g. 0.01, 0.001)
- ∂L/∂w = the gradient computed by backpropagation
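The update rule can be illustrated on a one-weight problem. The sketch below assumes a hypothetical loss L(w) = (w - 3)², whose gradient is 2(w - 3); neither the loss nor the function name gradient_descent comes from the paper.

```python
def gradient_descent(grad_fn, w, eta, steps):
    """Repeatedly apply w <- w - eta * dL/dw and return the final weight."""
    for _ in range(steps):
        w = w - eta * grad_fn(w)
    return w


# Hypothetical toy problem: L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3).
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_one_step = gradient_descent(toy_grad, w=0.0, eta=0.1, steps=1)   # 0.0 - 0.1 * (-6.0)
w_final = gradient_descent(toy_grad, w=0.0, eta=0.1, steps=100)    # approaches 3.0
```

With η = 0.1 each step shrinks the distance to the minimum by a factor of 0.8 on this particular loss; a learning rate above 1.0 would make the same iteration diverge.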
The backpropagation equations in full generality
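For a network with L layers (as a superscript, L denotes the last layer; elsewhere L is the loss), define:
- aˡ = output (activation) of layer l, with a⁰ = x, the network input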
- zˡ = pre-activation of layer l: zˡ = Wˡ aˡ⁻¹ + bˡ
- δˡ = the error signal (delta) at layer l
Forward pass:
zˡ = Wˡ aˡ⁻¹ + bˡ
aˡ = σ(zˡ)
Backward pass — output layer delta:
δᴸ = ∇ₐL ⊙ σ'(zᴸ)
Where ⊙ means element-wise multiplication and ∇ₐL is the gradient of the loss with respect to the output activations aᴸ.
Backward pass — propagate delta to earlier layers:
δˡ = ((Wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ σ'(zˡ)
Gradient for weights at layer l:
∂L/∂Wˡ = δˡ (aˡ⁻¹)ᵀ
Weight update:
Wˡ ← Wˡ - η × ∂L/∂Wˡ
These four equations, together with the analogous bias gradient ∂L/∂bˡ = δˡ, completely describe backpropagation. Everything else is implementation detail.
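The equations above translate fairly directly into code. The following is a minimal pure-Python sketch, not the paper's implementation: it assumes sigmoid activations in every layer, a loss equal to the sum of squared errors over the outputs, and 0-based lists where weights[i] maps activations[i] to activations[i + 1]. It also returns the bias gradients, which equal the deltas. The 2-2-1 network and its numbers are arbitrary illustration values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x):
    """Return the activations of every layer; activations[0] is the input x."""
    activations = [x]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(W, b)]
        activations.append([sigmoid(z_j) for z_j in z])
    return activations


def backward(weights, activations, y):
    """Return (grads_W, grads_b) for the loss sum_j (y_j - a_j)^2."""
    a_out = activations[-1]
    # Output delta: grad_a L (element-wise) sigma'(z), with sigma'(z) = a * (1 - a).
    delta = [-2.0 * (y_j - a_j) * a_j * (1.0 - a_j) for y_j, a_j in zip(y, a_out)]
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for i in range(len(weights) - 1, -1, -1):
        a_prev = activations[i]
        # Weight gradient: outer product of delta with the previous activation.
        grads_W[i] = [[d_j * a_k for a_k in a_prev] for d_j in delta]
        grads_b[i] = list(delta)
        if i > 0:
            W = weights[i]
            # Propagate: (W^T delta) (element-wise) sigma'(z) of the earlier layer.
            delta = [sum(W[j][k] * delta[j] for j in range(len(delta)))
                     * a_prev[k] * (1.0 - a_prev[k])
                     for k in range(len(a_prev))]
    return grads_W, grads_b


def loss(weights, biases, x, y):
    a_out = forward(weights, biases, x)[-1]
    return sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a_out))


# Hypothetical 2-2-1 network with arbitrary values.
weights = [[[0.5, 0.3], [-0.4, 0.2]], [[0.8, -0.6]]]
biases = [[0.1, -0.1], [0.05]]
x, y = [1.0, 2.0], [1.0]

grads_W, grads_b = backward(weights, forward(weights, biases, x), y)

# Finite-difference check on a single first-layer weight.
eps = 1e-6
weights[0][0][1] += eps
loss_plus = loss(weights, biases, x, y)
weights[0][0][1] -= 2 * eps
loss_minus = loss(weights, biases, x, y)
weights[0][0][1] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
```

Comparing numeric with grads_W[0][0][1] is a common way to catch sign or indexing mistakes in a hand-written backward pass. A practical implementation would use a matrix library instead of nested lists.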
Worked numerical verification
We verify the result from Section 4’s step-by-step walkthrough.
Network: x=[1,2], w₁=0.5, w₂=0.3, w₃=0.8, y=1, η=0.1 (one hidden neuron h with input weights w₁ and w₂, one output neuron with weight w₃, no biases)
Forward:
z_h = 0.5×1 + 0.3×2 = 1.1
h = σ(1.1) ≈ 0.750
z_out = 0.8 × 0.750 = 0.600
ŷ = σ(0.600) ≈ 0.646
L = (1 - 0.646)² ≈ 0.125
Backward:
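δ_out = -(y-ŷ) × ŷ(1-ŷ) = -0.354 × 0.229 ≈ -0.0811
(Note: this δ_out omits the factor 2 from ∂L/∂ŷ = -2(y - ŷ), which is equivalent to using L = ½(y - ŷ)². With the factor included, every gradient below doubles and all signs stay the same.)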
∂L/∂w₃ = δ_out × h = -0.0811 × 0.750 = -0.0608
→ w₃: 0.800 → 0.806
δ_h = δ_out × w₃ × h(1-h) = -0.0811 × 0.800 × 0.1875 = -0.01217
∂L/∂w₁ = δ_h × x₁ = -0.01217 × 1 = -0.01217 → w₁: 0.500 → 0.501
∂L/∂w₂ = δ_h × x₂ = -0.01217 × 2 = -0.02434 → w₂: 0.300 → 0.302
(Slight differences from Section 4 are due to rounding at intermediate steps. The direction of all updates is confirmed: all weights increase, moving the prediction toward 1.)
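The arithmetic above can be reproduced without intermediate rounding. This sketch follows the walkthrough's own convention δ_out = -(y - ŷ) × ŷ(1 - ŷ), that is, without the factor 2 that ∂L/∂ŷ = -2(y - ŷ) would contribute; the variable names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 1.0, 2.0
w1, w2, w3 = 0.5, 0.3, 0.8
y, eta = 1.0, 0.1

# Forward pass
z_h = w1 * x1 + w2 * x2
h = sigmoid(z_h)
z_out = w3 * h
y_hat = sigmoid(z_out)
loss = (y - y_hat) ** 2

# Backward pass, using the walkthrough's delta (no factor of 2)
delta_out = -(y - y_hat) * y_hat * (1.0 - y_hat)
grad_w3 = delta_out * h
delta_h = delta_out * w3 * h * (1.0 - h)
grad_w1 = delta_h * x1
grad_w2 = delta_h * x2

# Gradient descent step
w1_new = w1 - eta * grad_w1
w2_new = w2 - eta * grad_w2
w3_new = w3 - eta * grad_w3

print(f"w1: {w1:.3f} -> {w1_new:.3f}")
print(f"w2: {w2:.3f} -> {w2_new:.3f}")
print(f"w3: {w3:.3f} -> {w3_new:.3f}")
```

Because nothing is rounded along the way, the later digits of the intermediate values differ slightly from the hand calculation, but the updated weights agree to three decimal places.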