The Core Idea
The key insight
The credit assignment problem seemed unsolvable because hidden neurons have no direct error signal. You cannot tell them “you were wrong by this much.”
Rumelhart, Hinton, and Williams’ insight was: you don’t need to measure a hidden neuron’s error directly. You can calculate it mathematically by working backwards from the output error.
The output layer knows its error — the difference between prediction and correct answer. Each output neuron’s error came partly from each hidden neuron that fed into it. And the chain rule of calculus tells you exactly how to compute those contributions.
So: propagate the error signal backwards through the network, layer by layer, using the chain rule. Each layer receives a “blame signal” from the layer ahead of it, uses the chain rule to compute the gradient for its own weights, and passes the blame further back. Once the gradients are known, the weights are updated.
This is backpropagation — backward propagation of errors.
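To make "calculate it by working backwards" concrete, here is a minimal sketch for the smallest possible case: one input, one hidden neuron, one output. The sigmoid activation, the squared-error loss, and all names (backprop_tiny, hidden_error, and so on) are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, target):
    # Tiny network: x -> hidden neuron (sigmoid) -> linear output.
    h = sigmoid(w1 * x)
    y = w2 * h
    return 0.5 * (y - target) ** 2

def backprop_tiny(w1, w2, x, target):
    # Forward: compute the hidden activation and the prediction.
    h = sigmoid(w1 * x)
    y = w2 * h

    # The output layer knows its error directly.
    output_error = y - target                  # dL/dy
    grad_w2 = output_error * h                 # dL/dw2

    # The hidden neuron's error is never measured. It is computed from
    # the output error via the chain rule: through w2, then through the
    # slope of the sigmoid, h * (1 - h).
    hidden_error = output_error * w2 * h * (1.0 - h)
    grad_w1 = hidden_error * x                 # dL/dw1
    return grad_w1, grad_w2

def numeric_grad_w1(w1, w2, x, target, eps=1e-6):
    # Independent check: nudge w1 slightly and see how the loss moves.
    up = loss(w1 + eps, w2, x, target)
    down = loss(w1 - eps, w2, x, target)
    return (up - down) / (2 * eps)

grad_w1, grad_w2 = backprop_tiny(0.5, -1.5, 2.0, 1.0)
print("chain rule gradient for w1:", grad_w1)
print("finite-difference estimate:", numeric_grad_w1(0.5, -1.5, 2.0, 1.0))
```

The two printed numbers should agree closely, which is the point: the hidden neuron's share of the blame can be derived from the output error alone.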
The analogy: tracing a factory defect backwards
Imagine a biscuit factory with three stages: mixing ingredients, baking, and packaging.
At the end, quality control finds that 15% of biscuits are too salty. They measured the final product — the output. Now they need to find where in the process the saltiness entered.
They work backwards:
- Packaging adds no salt — not responsible
- Baking does not add salt either — rule it out
- Mixing — the mixer is adding too much salt. Found it.
And crucially, they can quantify the blame. Here all of the excess salt came from mixing, so mixing is 100% responsible. Had mixing contributed only half of the problem, with the rest coming from, say, a preservative added in packaging that amplified the saltiness, the responsibility would be split accordingly.
Backpropagation follows the same logic: working backwards from the final error, using the mathematical structure of the network to assign proportional blame to each parameter, including those buried deep inside.
A second analogy: the exam coaching chain
Priya tutors Arjun, who tutors Sunita. Sunita sits the JEE exam and gets a poor score in mathematics.
How much is Priya responsible?
To find out: how much of Sunita’s weakness came from Arjun’s teaching? And how much of Arjun’s weakness came from Priya’s teaching? The chain of responsibility flows backwards: Sunita’s score ← Arjun’s teaching quality ← Priya’s teaching quality.
The chain rule tells you how to quantify each link in this chain. Backpropagation applies this logic to millions of neurons simultaneously.
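Written out for this chain, and assuming purely for illustration that each quantity can be expressed as a number (S for Sunita's score, A for Arjun's teaching quality, P for Priya's teaching quality), the chain rule multiplies the sensitivities of the individual links:

```latex
\frac{dS}{dP} = \frac{dS}{dA} \cdot \frac{dA}{dP}
```

With hypothetical numbers: if a one-unit improvement in Arjun's teaching raises the score by 3 marks, and a one-unit improvement in Priya's teaching raises Arjun's quality by 0.5 units, then a one-unit improvement in Priya's teaching is worth 3 × 0.5 = 1.5 marks.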
The two passes
Backpropagation works in two passes through the network:
Forward pass (left to right): Input data enters. Each layer computes its output from the layer before it. Information flows forward until the final output is produced. The output is compared to the correct answer, and the loss (error) is computed.
Backward pass (right to left): Starting from the loss, gradients are computed layer by layer in reverse. Each layer receives the gradient from the layer ahead of it, uses the chain rule to compute gradients for its own weights, and passes gradients further back. When the backward pass finishes, every weight in the network has its gradient: a number saying “if you increase this weight by a tiny amount, the loss changes by about this much per unit of increase.”
Then gradient descent updates all weights simultaneously.
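The forward pass, backward pass, and update can be sketched together in a few lines of plain Python. This is an illustrative toy, not the paper's setup: the network shape (2 inputs, 3 sigmoid hidden neurons, 1 linear output), the squared-error loss, the learning rate, and every function name here are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: input -> hidden layer -> single output."""
    W1, W2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    output = sum(w * h for w, h in zip(W2, hidden))
    return hidden, output

def backward(params, x, hidden, output, target):
    """Backward pass for loss = 0.5 * (output - target) ** 2."""
    W1, W2 = params
    d_output = output - target                    # gradient at the output
    grad_W2 = [d_output * h for h in hidden]      # output-layer weights
    # Gradient passed back to each hidden neuron: through its outgoing
    # weight, then through the sigmoid slope h * (1 - h).
    d_hidden = [d_output * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[dh * xi for xi in x] for dh in d_hidden]
    return grad_W1, grad_W2

def train_step(params, x, target, lr=0.1):
    W1, W2 = params
    hidden, output = forward(params, x)
    loss = 0.5 * (output - target) ** 2
    grad_W1, grad_W2 = backward(params, x, hidden, output, target)
    # Gradient descent: every weight moves a small step against its
    # gradient, and only after all gradients have been computed.
    new_W1 = [[w - lr * g for w, g in zip(row, g_row)]
              for row, g_row in zip(W1, grad_W1)]
    new_W2 = [w - lr * g for w, g in zip(W2, grad_W2)]
    return (new_W1, new_W2), loss

rng = random.Random(0)
params = ([[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],
          [rng.uniform(-1, 1) for _ in range(3)])
x, target = [0.5, -1.0], 0.8

losses = []
for _ in range(200):
    params, loss = train_step(params, x, target)
    losses.append(loss)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Note that train_step builds new weight lists rather than modifying weights in place during the backward pass, so every gradient is computed against the same set of weights before any of them change.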
What “representations” means
The paper’s title says “Learning Representations.” This word is key.
When a multi-layer network trains on images of cats and dogs, the hidden layers do not just learn to tell cats from dogs. They develop intermediate representations of the input — internal encodings that capture useful features: edges, textures, shapes, parts of faces.
These representations were not designed by anyone. They emerged from training. The network discovered, on its own, that certain internal patterns are useful for the task.
This is why deep learning is so powerful: it learns not just to classify, but to represent data in ways that make classification (and translation, and generation, and reasoning) easier. Every layer builds on the representations of the layer before it, constructing increasingly abstract and useful descriptions of the input.
This idea — that learning useful representations is the central task — runs through all 24 papers in this timeline. It was first articulated clearly here.