1. Historical context — 1997, and the winter nobody talks about

By the mid-1990s, the AI world was quiet. The Perceptron hype (1958, Paper 02) had collapsed after Minsky and Papert’s XOR attack in 1969. Backpropagation (1986, Paper 03) had rescued things nearly two decades later by showing that multi-layer networks could learn what a single perceptron couldn’t. For a while, it looked like the problem was solved.

Then researchers tried something new: they fed the network sequences.

Why sequences matter

Most real problems are not a single snapshot. They are a stream.

Speech is a sequence of sound frames.
A sentence is a sequence of words.
A cricket innings is a sequence of balls.
Your electricity bill is a sequence of monthly readings.

A normal feed-forward network — the kind we built in Papers 02 and 03 — takes one input and gives one output. It has no memory. Show it the word “bank” and it cannot know whether the previous word was “river” or “HDFC”.

To handle sequences, researchers in the 1980s invented the Recurrent Neural Network (RNN). The idea was beautifully simple: at each time step, feed the network not only the new input but also its own hidden state from the previous step. The network would then carry a small “summary of the past” forward in time, like a person carrying the memory of a conversation from one sentence to the next.
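The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any of the papers: the tanh activation, the names rnn_step and rnn_forward, and the toy weight values are all assumptions chosen for readability.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step: combine the new input with the previous hidden state."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run the step over a whole sequence, carrying the hidden state forward."""
    h = [0.0] * len(b_h)  # empty "summary of the past" before the first input
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Hypothetical toy weights: 2 input features, 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.4, 0.2], [-0.6, 0.3]]
b_h = [0.0, 0.0]

# Two sequences with the same last input but a different first input,
# like "river bank" versus "HDFC bank".
seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.0, 1.0], [0.0, 1.0]]

final_a = rnn_forward(seq_a, W_xh, W_hh, b_h)[-1]
final_b = rnn_forward(seq_b, W_xh, W_hh, b_h)[-1]
print(final_a)
print(final_b)
```

The two final hidden states differ even though the last input is identical. That difference is the memory: the earlier input reaches the present only through the hidden state.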

On paper, this was elegant. In practice, it broke in a strange way.

The discovery that started it all

In 1991, a graduate student named Sepp Hochreiter wrote his diploma thesis at TU München. His advisor was Jürgen Schmidhuber. The thesis, written in German, showed something devastating: when you train an RNN using backpropagation through time, the gradient either vanishes (shrinks exponentially toward zero) or explodes (grows exponentially, without bound) as it travels backward through many time steps.

For most sequences longer than about 10 steps, the gradient vanished. The weights that governed long-range behaviour got updates of essentially zero. The network literally could not learn to connect something that happened in step 1 to something that happened in step 50.

In plain language: early RNNs had short-term memory only.
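The vanishing and exploding behaviour described above can be reproduced with a toy example. The sketch below is not Hochreiter’s analysis. It assumes a single-unit tanh RNN with one recurrent weight w, and the function name gradient_through_time is made up for illustration. With all inputs set to zero the hidden state stays at zero, so each backward step multiplies the gradient by exactly w.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| for t = 1..T in a scalar tanh RNN.

    Forward rule (an assumption for this sketch): h_t = tanh(w * h_{t-1} + x_t).
    """
    h = h0
    grad = 1.0
    magnitudes = []
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        magnitudes.append(abs(grad))
    return magnitudes

steps = 50
inputs = [0.0] * steps  # zero inputs keep the tanh in its linear region

vanishing = gradient_through_time(0.9, inputs)  # |w| < 1
exploding = gradient_through_time(1.5, inputs)  # |w| > 1

for t in (1, 10, 50):
    print(t, vanishing[t - 1], exploding[t - 1])
```

With w = 0.9 the gradient after 50 steps is 0.9 to the power 50, roughly 0.005 of its starting size. With w = 1.5 it has grown past a hundred million. Real networks have many weights and nonzero inputs, but the same repeated multiplication is at work.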

Hochreiter’s thesis is now considered one of the most important documents in deep learning that never appeared in a journal. It caused little stir at the time: it was written in German, and through the 1990s much of the machine learning community turned to other methods such as support vector machines, decision trees, and hand-crafted features. Neural nets were unfashionable.

The 1997 paper

Six years after the thesis, Hochreiter and Schmidhuber published Long Short-Term Memory in the journal Neural Computation. It proposed an architectural fix for the exact problem the thesis had identified. The paper was long, dense, and full of mathematical proofs. It was cited fewer than 50 times in its first five years.

Nobody knew yet that this paper would end up behind Google Translate, Apple’s Siri, Amazon Alexa, reinforcement learning agents at DeepMind, and a large share of the production sequence models built up to 2017.

It was a patient paper. It waited two decades for hardware, data, and the rest of the field to catch up.

Why the name “Long Short-Term Memory”

Read it carefully. It is not “long-term memory”. It is long short-term memory. Ordinary RNNs have a short-term memory that lasts only a few steps. LSTMs extend that short-term memory so it can last for hundreds of steps — but it is still, mechanically, a short-term working memory, not a permanent store. A good analogy is a student’s running notes during a lecture: not a textbook, not a diary — just a notebook that lives as long as the lecture does, which the student updates continuously.

What to carry into the next section

RNNs were the first serious attempt at sequence learning.
Hochreiter showed in 1991 that they silently failed on long sequences because the gradient vanished.
The 1997 LSTM paper proposed the fix — and we’re about to see why that fix needed such a strange architecture.

Next: the problem itself, visualised.