What Came Next

The problem this paper left on the table

Backpropagation solved credit assignment. You could now train multi-layer networks. XOR was learnable. Complex representations could emerge. The first AI winter was over.

But a new problem appeared the moment researchers tried to train deep networks (networks with many layers) or recurrent networks (networks that process sequences like text or speech by feeding their own hidden state back in at the next step).

The vanishing gradient problem struck.

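In deep feedforward networks, gradients shrank toward zero as they propagated back through many sigmoid layers. The sigmoid's derivative is never larger than 0.25, and each layer multiplies the gradient by one such factor (along with the layer's weights). Early layers learned almost nothing.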

In recurrent networks, the problem was even more severe. A recurrent network processes a sequence one element at a time: a word, a character, a stock price. To model long-range dependencies (“the pronoun at position 50 refers to the noun at position 3”), gradients had to travel back through nearly 50 time steps. Multiplied by a sigmoid derivative at each step, they typically shrank to almost nothing before reaching the relevant early inputs.
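
The shrinking can be made concrete with a small sketch. The Python below assumes a deliberately tiny recurrent network with a single hidden unit, h_t = sigmoid(w * h_(t-1) + x_t). The function name, the weight value, and the all-zero input sequence are illustrative choices, not taken from any paper. It multiplies the per-step derivatives together, as backpropagation through time does, and reports how much gradient survives a given number of steps back.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gradient_magnitudes(w: float, inputs: list[float], h0: float = 0.0) -> list[float]:
    """Return |d h_T / d h_(T-k)| for k = 1..T in a one-unit sigmoid RNN.

    Hypothetical toy model: h_t = sigmoid(w * h_(t-1) + x_t).
    """
    h = h0
    local = []  # local[t] holds d h_(t+1) / d h_t
    for x in inputs:
        h = sigmoid(w * h + x)
        local.append(w * h * (1.0 - h))  # chain rule through the sigmoid

    magnitudes = []
    grad = 1.0
    for factor in reversed(local):  # walk backwards through time
        grad *= factor
        magnitudes.append(abs(grad))
    return magnitudes


# Illustrative run: 50 time steps, weight 1.0, all-zero inputs.
grads = gradient_magnitudes(w=1.0, inputs=[0.0] * 50)
for k in (1, 10, 50):
    print(f"{k:2d} steps back: {grads[k - 1]:.3e}")
```

With w = 1, every per-step factor is at most 0.25, so after 50 steps the gradient is bounded by 0.25^50, which is about 8e-31. Larger weights can slow the decay, or tip it into the opposite failure, exploding gradients.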

In practice, recurrent networks trained with backpropagation through time (BPTT) struggled to learn dependencies spanning more than roughly 10 steps. For language, where meaning often depends on words far apart in a sentence, this was crippling.

The researchers who solved it

In 1997, Sepp Hochreiter and Jürgen Schmidhuber, who had begun working together as student and advisor at the Technical University of Munich, published “Long Short-Term Memory” in the journal Neural Computation.

Hochreiter had identified the vanishing gradient problem clearly in his 1991 diploma thesis. For six years, he had been searching for a solution.

Their insight: in a standard recurrent network, the hidden state is rewritten at every time step, so both the stored information and the error signal flowing back through it are repeatedly squashed and rescaled. What if you gave the network a memory, a channel through which information could flow unchanged over many steps, bypassing the gradient-shrinking computation?

They designed an architecture with explicit memory cells and learned gates: valves that control what information to store and what to output at each time step. (A third gate, controlling what to discard, was added in a follow-up paper in 2000 and is part of the standard LSTM today.) The gates are differentiable (you can backpropagate through them), but the memory cell itself can hold a value almost unchanged for hundreds of steps, allowing gradients to flow back without shrinking.
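
The gating idea can be sketched in a few lines of Python. This is a minimal single-unit cell, assuming the common modern formulation with a forget gate (a later addition to the 1997 design). The parameter names and the hand-picked values are hypothetical, chosen only to show a cell holding its contents. The key line is the additive update of the cell state c.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell (illustrative sketch).

    Returns the new output h, the new cell state c, and the forget-gate
    value f, which is the factor applied to gradient on the cell path.
    """
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate: what to store
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate: what to keep
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate: what to expose
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g  # additive update of the memory cell
    h = o * math.tanh(c)
    return h, c, f


# Hand-picked (not learned) parameters: all weights zero, with biases that
# hold the input gate nearly closed and the forget gate nearly open.
params = {name: 0.0 for name in (
    "w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
    "w_o", "u_o", "b_o", "w_g", "u_g", "b_g")}
params["b_i"] = -10.0
params["b_f"] = 10.0

h, c, cell_grad = 0.0, 1.0, 1.0
for _ in range(100):
    h, c, f = lstm_step(0.0, h, c, params)
    cell_grad *= f  # d c_t / d c_(t-1) along the direct cell path

print(f"cell state after 100 steps: {c:.4f}")
print(f"gradient along the cell path: {cell_grad:.4f}")
```

With these settings the cell value and the gradient along the cell path both stay above 0.99 after 100 steps, whereas 100 sigmoid derivatives multiplied together would be at most 0.25^100. In a trained network the gate values are learned rather than fixed, so how long a value is kept depends on the data.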

This was the LSTM — Long Short-Term Memory.

What the LSTM enabled

LSTMs went on to power much of sequence modelling, and by the mid-2010s they were the dominant architecture in natural language processing and speech:

Machine translation (Google Translate switched to an LSTM-based neural system in 2016)
Speech recognition (Apple’s Siri, Amazon’s Alexa)
Text generation
Sentiment analysis
Named entity recognition

Every time you spoke to a voice assistant in the second half of the 2010s, an LSTM was very likely processing your words. Once Google Translate moved to neural translation in 2016, LSTMs were doing the translation.

And crucially, LSTMs helped validate the core idea of backpropagation. Adoption was slow at first, but by the late 2000s LSTM-based systems were winning handwriting recognition competitions, and by the mid-2010s they were setting the state of the art across sequence tasks. The connectionist approach (neural networks trained by gradient descent) was proving itself empirically, even if theoretical understanding lagged.

Other directions from backpropagation

Convolutional networks (1989–2012): Yann LeCun applied backpropagation to convolutional architectures, networks designed for images in which the same small set of weights is reused at every position. LeNet (1989) could read handwritten digits. AlexNet (2012), built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, could classify images into 1,000 object categories. CNNs trained with backpropagation became the dominant approach in computer vision.

Reinforcement learning: Researchers applied backpropagation not just to supervised prediction tasks but to learning from reward signals, training networks to play games, control robots, and optimise decisions. Gerald Tesauro's TD-Gammon (1992) learned to play backgammon at close to the level of the best human players, using a network trained with backpropagation. AlphaGo (2016) used networks trained with backpropagation to defeat Lee Sedol, one of the world's top Go players.

Generative models: Autoencoders, networks trained with backpropagation to compress and then reconstruct data, led to generative models for creating images, audio, and text.

Next paper: Long Short-Term Memory (1997) →

Hochreiter and Schmidhuber design the LSTM, giving neural networks a long-term memory and allowing them to process language, music, and time series over long horizons. The architecture that went on to dominate NLP in the 2010s.