The Problem

What went wrong with the Perceptron

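The Perceptron could learn, but only linearly separable patterns. The fix seemed obvious: add a hidden layer between the input and the output. A two-layer network (one hidden layer plus an output layer) can solve XOR, which no single Perceptron can. Given enough hidden units, such a network can approximate any continuous function on a bounded range of inputs as closely as you like.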
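The XOR claim can be checked by hand. The sketch below is illustrative only: the weights are chosen by hand rather than learned, and the names `step` and `xor_net` are made up for this example. It wires two hidden threshold units (one acting as OR, one as AND) into a single output unit.

```python
def step(z):
    """Threshold unit: outputs 1 when its weighted input is positive, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer. Weights and thresholds are hand-picked, not learned.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output unit: "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)

# Evaluate the network on all four input combinations.
xor_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```

The network can represent XOR, but nothing here says how to find these weights automatically. That is the question the rest of this section is about.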

The problem: how do you train it?

For the output layer, training is straightforward. You compare the output to the correct answer, compute the error, and adjust the output layer’s weights using the Perceptron learning rule.
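As a rough sketch of that output-layer update, assuming a single threshold output unit fed by the hidden layer's activations (the function name `update_output_layer`, the learning rate `lr`, and the example numbers are all hypothetical):

```python
def step(z):
    """Threshold unit: outputs 1 when its weighted input is positive, else 0."""
    return 1 if z > 0 else 0

def update_output_layer(weights, bias, hidden, target, lr=0.1):
    """One Perceptron-rule update for a single output unit.

    `hidden` holds the hidden layer's activations, which act as this
    unit's inputs. `target` is the training label for this example.
    """
    y = step(sum(w * h for w, h in zip(weights, hidden)) + bias)
    error = target - y  # directly measurable: label minus actual output
    new_weights = [w + lr * error * h for w, h in zip(weights, hidden)]
    new_bias = bias + lr * error
    return new_weights, new_bias, error

# Illustrative call: the unit outputs 0 but the label says 1.
w, b, err = update_output_layer([0.0, 0.0], 0.0, hidden=[1, 0], target=1)
print(w, b, err)
```

The update depends on `target`, and a training label exists only for the output unit.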

But the hidden layer? Its outputs are not directly compared to any training label. When the network makes a mistake at the output, you know the output layer is wrong. But is the hidden layer wrong? Which neurons in the hidden layer contributed to the mistake? By how much? In which direction should their weights move?

This is the credit assignment problem: when the final output is wrong, how do you assign blame to the neurons buried deep inside the network, far from the output where the error is measured?

Why it had seemed impossible

For seventeen years, people had tried and failed to solve this.

Some attempted random perturbations: nudge each weight by a tiny amount and see whether the loss goes up or down. This works in principle but requires one forward pass per weight. A network with 1,000 weights needs 1,000 forward passes per training step, and the cost keeps growing in proportion to the number of weights. That is impractical for any network of useful size.
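To make the cost concrete, here is a minimal sketch of the perturbation approach. It assumes a tiny network with 2 inputs, 2 hidden units, and 1 output, tanh units, and a squared-error loss. The network shape, the names `forward`, `loss`, and `perturbation_gradient`, and the step size `eps` are illustrative choices, not taken from any historical implementation.

```python
import math

forward_passes = 0  # counts how many times the network is evaluated

def forward(weights, x):
    """Tiny 2-2-1 network with tanh units; `weights` is a flat list of 9 numbers."""
    global forward_passes
    forward_passes += 1
    w = weights
    h1 = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
    return math.tanh(w[6] * h1 + w[7] * h2 + w[8])

def loss(weights, x, target):
    """Squared error for one training example (costs one forward pass)."""
    return (forward(weights, x) - target) ** 2

def perturbation_gradient(weights, x, target, eps=1e-5):
    """Estimate d(loss)/d(weight) for every weight by nudging one at a time."""
    base = loss(weights, x, target)  # one baseline pass
    grad = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps
        # One extra forward pass for each weight.
        grad.append((loss(nudged, x, target) - base) / eps)
    return grad

weights = [0.1 * (i + 1) for i in range(9)]
grad = perturbation_gradient(weights, x=[1.0, 0.0], target=1.0)
print(len(grad), forward_passes)
```

With 9 weights this sketch makes 10 forward passes (one baseline plus one per weight) to get a single gradient estimate for a single training example. The count scales linearly with the number of weights.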

Some tried to define “local” learning rules — rules that only use information available at each neuron without global coordination. These worked for simple cases but did not generalise.

The core difficulty seemed fundamental: the hidden layer neurons have no “correct answer” to compare against. You cannot directly measure their error. How can you correct something if you cannot measure how wrong it is?

The plain-language problem statement

How can a multi-layer neural network learn from its mistakes — adjusting the weights of hidden neurons that have no direct feedback about whether they are right or wrong?

In technical language:

How can we efficiently compute the gradient of a scalar loss function with respect to every weight in a multi-layer feedforward network, enabling gradient descent to train all layers simultaneously?
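Written out, with notation assumed here for illustration (a loss L, weights w_1 through w_N collected across all layers, and a learning rate η), the goal is to obtain every component of the gradient so that gradient descent can update each weight:

```latex
\nabla_{w} L = \left( \frac{\partial L}{\partial w_1}, \dots, \frac{\partial L}{\partial w_N} \right),
\qquad
w_k \leftarrow w_k - \eta \, \frac{\partial L}{\partial w_k}
\quad \text{for } k = 1, \dots, N
```

The update rule itself is simple. The hard part is computing the partial derivatives for weights in the hidden layers cheaply.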

Why it mattered

The answer to this question is worth trillions of dollars and has changed the world.

Every product that uses deep learning — and that is most AI products today — depends on being able to train multi-layer networks. Without backpropagation:

There are no image classifiers (multi-layer convolutional networks)
There are no language models (GPT, Claude, BERT — all trained with backprop)
There is no speech recognition, no machine translation, no protein structure prediction
There is no modern AI

The credit assignment problem was not an academic curiosity. It was the single bottleneck between primitive single-layer perceptrons and the full power of deep learning.

Why a student in small-town India should care

When you use Google Translate to turn an English article into Hindi, a multi-layer neural network is doing the translation. When the government uses AI to verify your Aadhaar details from a photo, a multi-layer neural network reads your face. When Sarvam AI processes speech in ten Indian languages, multi-layer networks do the work.

All of these were made possible by a four-page paper published in 1986 that solved the credit assignment problem. Understanding how they work starts here.
