2. The problem — vanishing gradients and the XOR echo

The setup in one picture

An RNN processes a sequence one step at a time. At each step t:

Take the new input xₜ (say, a word or a sound frame).
Take the hidden state hₜ₋₁ from the previous step.
Combine them: hₜ = tanh(W · xₜ + U · hₜ₋₁ + b).
Pass hₜ forward as both the output of step t and the input to step t+1.

The hidden state is the memory. Everything the network “remembers” about the past has to be squeezed into this single vector hₜ.
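
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `rnn_step` and `run_rnn`, the layer sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One RNN update: h_t = tanh(W @ x_t + U @ h_prev + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

def run_rnn(xs, h0, W, U, b):
    """Feed a sequence through the RNN and return every hidden state."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b)  # the new state replaces the old one
        states.append(h)
    return np.stack(states)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 5, 4
W = rng.standard_normal((hidden_size, input_size)) * 0.5
U = rng.standard_normal((hidden_size, hidden_size)) * 0.5
b = np.zeros(hidden_size)
xs = rng.standard_normal((seq_len, input_size))

states = run_rnn(xs, np.zeros(hidden_size), W, U, b)
print(states.shape)  # (4, 5): one hidden vector per time step
```

Note that `h` is overwritten at every step: whatever the network keeps from earlier inputs has to survive inside that one vector.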

An Indian-life analogy

Think of a cricket commentator who, between balls, must whisper a summary of the match so far into their own ear. Every ball, they listen to their old whisper, add the new ball’s result, and produce a fresh whisper for the next ball. By ball 200, the commentator is listening to a whisper that has been passed down 200 times — each pass slightly distorting the signal. Important information from ball 1 is almost certainly gone.

This is exactly how an RNN treats the past.

Why the gradient vanishes — the math intuition

When we train the RNN, we use backpropagation through time (BPTT): unroll the network across all time steps, then apply ordinary backpropagation (Paper 03) to the unrolled graph.

The gradient from the loss at time T has to travel backward to update the weights that mattered at time 1. That journey goes through every hidden state in between.
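
Written out, that journey is a product with one factor per hop. The following is a sketch in the notation of the update rule above; the symbol L_T for the loss at time T is introduced here only for illustration.

```latex
\frac{\partial L_T}{\partial h_1}
  = \frac{\partial L_T}{\partial h_T}
    \cdot \frac{\partial h_T}{\partial h_{T-1}}
    \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}}
    \cdots
    \frac{\partial h_2}{\partial h_1}
```

There are T − 1 hop factors in this product, so anything that holds for a single factor is compounded T − 1 times.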

At each hop, the chain rule (see the chain rule tutorial) multiplies in a derivative that looks roughly like:

∂hₜ / ∂hₜ₋₁  ≈  U · tanh'(·)

Two things about this term:

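tanh'(·) is at most 1 (reached only when its input is exactly 0), and often much less once a unit starts to saturate: values like 0.1 to 0.4 are common.
U is a weight matrix, usually initialised with small values, so it scales the gradient by a factor of, say, 0.5.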

So each hop multiplies the gradient by something like 0.5 × 0.2 = 0.1.

Now travel backward 30 steps. The gradient gets multiplied by this small number 30 times:

0.1³⁰  =  10⁻³⁰

That is effectively zero. The weight updates that should teach the network “ball 1 matters for ball 200” are rounded off to nothing. The network cannot learn long-range connections.
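
The same effect can be measured in a small tanh RNN rather than with round numbers. The sketch below multiplies the per-hop Jacobians diag(tanh'(·)) · U together and tracks the size of the result. The sizes, the random weights, and the choice to rescale U so that its largest singular value is 0.5 are all assumptions for illustration; the helper name `hidden_jacobian_norms` is hypothetical.

```python
import numpy as np

def hidden_jacobian_norms(W, U, b, xs, h0):
    """Return the spectral norm of d h_t / d h_0 for every step t."""
    h = h0
    total = np.eye(h0.shape[0])  # d h_0 / d h_0 is the identity
    norms = []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        # One-hop Jacobian: diag(tanh'(z_t)) @ U, where tanh'(z_t) = 1 - h_t**2.
        step = (1.0 - h**2)[:, None] * U
        total = step @ total  # chain rule: stack one more hop on the product
        norms.append(np.linalg.norm(total, 2))
    return norms

rng = np.random.default_rng(0)
hidden_size, input_size, steps = 8, 4, 30
U = rng.standard_normal((hidden_size, hidden_size))
U *= 0.5 / np.linalg.norm(U, 2)  # rescale so the largest singular value is 0.5
W = rng.standard_normal((hidden_size, input_size))
b = np.zeros(hidden_size)
xs = rng.standard_normal((steps, input_size))

norms = hidden_jacobian_norms(W, U, b, xs, np.zeros(hidden_size))
for t in (1, 10, 20, 30):
    print(f"after {t:2d} hops: {norms[t - 1]:.3e}")
```

Because tanh'(·) never exceeds 1 and U is scaled to 0.5 here, each hop can shrink the product by at least half, so after 30 hops it is no larger than 0.5³⁰ (about 10⁻⁹), and in practice smaller still.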

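(If instead U is large enough that the per-hop factor exceeds 1, the same repeated multiplication makes the gradient grow exponentially: it explodes. Either way, training a plain RNN on long sequences breaks down.)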
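
Both failure modes come from the same arithmetic: one number raised to the power of the number of hops. A toy calculation, treating the per-hop factor as a single scalar (a simplification; in a real RNN it is a matrix that changes from step to step, and the function name `gradient_scale` is hypothetical):

```python
def gradient_scale(per_hop_factor, hops):
    """Size of a gradient after being multiplied by the same factor at every hop."""
    return per_hop_factor ** hops

for factor in (0.1, 0.9, 1.0, 1.1, 1.5):
    print(f"factor {factor}: after 30 hops -> {gradient_scale(factor, 30):.3e}")
```

Only a factor very close to 1 keeps the gradient at a usable size over many hops; anything below shrinks toward zero and anything above blows up.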

For a deeper feel for why this matters, see the gradient intuition tutorial.

The echo of XOR

Remember the XOR problem from Paper 02? A single perceptron could not learn XOR because XOR is not linearly separable: the positive and negative examples cannot be split by one straight line. This limitation of Rosenblatt's perceptron, highlighted by Minsky and Papert in 1969, contributed to the first AI winter.
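
As a reminder of why no single line works, here is a short argument. The weights w_1, w_2 and bias b are hypothetical names for the parameters of a perceptron that outputs 1 when w_1 x_1 + w_2 x_2 + b > 0 and 0 otherwise. XOR would require all four of these conditions at once:

```latex
\begin{aligned}
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(0,0) \mapsto 0 &: \quad b \le 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \le 0
\end{aligned}
```

Adding the first two gives w_1 + w_2 + 2b > 0, while adding the last two gives w_1 + w_2 + 2b ≤ 0. Both cannot hold, so no choice of weights solves XOR.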

A plain RNN facing a long sequence is in structurally the same position. Consider this task:

Read a string of 20 characters. Output 1 if the first and last characters are the same, else 0.

With a two-letter alphabet, this is the negation of XOR applied to the first and last characters: in essence, XOR stretched across time. To solve it, the network must carry information from position 1 all the way to position 20 without losing it. In practice, plain RNNs struggle badly here. The gradient shrinks toward zero long before it travels back to position 1, so training never discovers that the first character matters.
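
To make the task concrete, here is a small sketch that generates examples for it. The two-letter alphabet "01", the helper names, and the random seed are assumptions chosen for illustration; with binary characters the label is the negation of XOR between the first and last bit.

```python
import random

def first_last_label(s):
    """Return 1 if the first and last characters match, else 0."""
    return 1 if s[0] == s[-1] else 0

def make_example(rng, length=20, alphabet="01"):
    """Draw a random string and its label for the first/last matching task."""
    s = "".join(rng.choice(alphabet) for _ in range(length))
    return s, first_last_label(s)

rng = random.Random(0)
examples = [make_example(rng) for _ in range(5)]
for s, y in examples:
    # For binary strings, the label equals NOT (first XOR last).
    xor = int(s[0]) ^ int(s[-1])
    print(s, y, "xor of ends:", xor)
```

The 18 characters in the middle are pure distraction: they do not affect the label, yet the hidden state has to pass through every one of them.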

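In both cases (perceptron with XOR, RNN with long sequences) the simple architecture, trained in the standard way, could not solve the task. The perceptron cannot represent the answer at all; the plain RNN can represent it in principle, but its training signal dies out before the network can learn it. Either way, the model itself was the bottleneck.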

And in both cases, the fix was the same idea dressed differently: add structure that the simple version was missing. The perceptron needed a hidden layer (Paper 03). The RNN needed a protected memory line. That memory line is what LSTMs invent.

The three things an architecture must do

Before we look at the LSTM, let’s name what any fix must achieve:

Keep information alive across many steps — the past cannot be compressed into a single distorted whisper.
Let the network choose what to remember — not everything from step 1 is useful at step 200.
Let gradients flow backward undistorted — so the training signal can reach early weights.

Hold these three requirements in mind. Every design decision in the LSTM exists to serve at least one of them.

Next: the core idea — a memory line protected by three gates.