Long Short-Term Memory 1997

3. The core idea — a notebook plus three clerks

The big reframe

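A plain RNN has one piece of memory: the hidden state h. It tries to do two jobs at once: serve as the network’s answer at this step, and store everything the network needs to remember for the future. These two jobs pull the memory in different directions. On top of that, backpropagation through time (BPTT) pushes the gradient through the same recurrent weights and tanh at every step, so the learning signal from distant steps shrinks or blows up.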

Hochreiter and Schmidhuber’s idea is shockingly simple, once you see it:

Give the network a second memory. A protected one. Let the first memory do the talking, and let the second memory just remember.

That second memory is called the cell state, written c. It runs alongside the hidden state h like a second track on a train line. Inputs come in, the hidden state gets updated (and spoken out loud), but the cell state mostly keeps flowing forward, almost untouched.

An Indian-life analogy — a student’s running notes

You are sitting in a physics class. The teacher is explaining rotational motion. You do two things at once:

  • You speak in your head, reasoning about each sentence as it arrives. That running internal monologue is your hidden state h.
  • You take notes in a notebook — but not every word. You decide which lines to write down, occasionally strike out an earlier line that turned out to be wrong, and refer back to earlier pages when a new concept connects to an old one. That notebook is your cell state c.

Your internal monologue is loud and changes every second. Your notebook is quiet and changes carefully. Days later, long after the inner monologue has faded, the notebook is still readable.

This is, in spirit, how an LSTM works.

The three gates — three clerks at three desks

The cell state is precious. It must not change randomly at every step, or we’re back to the whispered-commentary problem. So the LSTM puts three small neural networks in charge of deciding what happens to the notebook. We call these networks gates. Each gate is just a tiny dense layer followed by a sigmoid, which squashes every number to a value between 0 and 1 — a “how much” knob.
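To make the “how much” knob concrete, here is a minimal sketch of one gate in plain Python. The function names (sigmoid, gate) and all the numbers are made up for illustration; a real LSTM learns its weights during training.

```python
import math

def sigmoid(z):
    # Squash any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def gate(weights, bias, inputs):
    # One dense layer (weighted sum plus bias per slot) followed by a sigmoid.
    return [
        sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]

# Hypothetical numbers: 2 notebook slots, 3 incoming values.
weights = [[4.0, 0.0, 0.0], [-4.0, 0.0, 0.0]]
bias = [0.0, 0.0]
knobs = gate(weights, bias, [2.0, 0.5, -0.5])
print(knobs)  # first knob is close to 1, second is close to 0
```

Every knob lands strictly between 0 and 1, which is what lets a gate act as a soft switch instead of a hard on/off decision.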

Gate 1 — the forget gate (fₜ)

“Which lines in the notebook should I strike out?”

Before writing anything new, the student reviews the existing notes. Some lines are now irrelevant — the teacher has moved from rotational motion to simple harmonic motion, and the old angular velocity equation is no longer needed. The student draws a line through it.

Mechanically, the forget gate produces a number between 0 and 1 for every slot in the cell state. 1 means “keep this line”, 0 means “erase it”, and anything in between fades the entry by that fraction. The gate looks at both the current input xₜ (the teacher’s new sentence) and the previous hidden state hₜ₋₁ (what the student was just thinking) to decide.
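A small sketch of the strike-out step, assuming two notebook slots, two hidden units and one input feature. The weights W_f and bias b_f are hand-picked here (not learned) so that the new input tells the gate to keep slot 0 and erase slot 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    # The gate sees the previous hidden state and the current input together.
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, z)) + b)
        for row, b in zip(W_f, b_f)
    ]

# Hypothetical, hand-picked parameters (a real LSTM learns these).
W_f = [[0.0, 0.0, 6.0], [0.0, 0.0, -6.0]]
b_f = [0.0, 0.0]
c_prev = [0.8, -0.5]

f_t = forget_gate(W_f, b_f, h_prev=[0.1, -0.2], x_t=[1.0])
c_kept = [f * c for f, c in zip(f_t, c_prev)]  # element-wise product
print(f_t)     # slot 0 close to 1 (keep), slot 1 close to 0 (erase)
print(c_kept)  # slot 0 nearly unchanged, slot 1 nearly zero
```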

Gate 2 — the input gate (iₜ)

“Of the new ideas I could write down, which ones actually deserve a line?”

When the teacher says something new, not everything belongs in the notebook. “The exam is next Thursday” belongs. “Please close the window” does not. The input gate is the filter that decides how much of the candidate new information to add to the cell state.

It works in two parts:

  • A candidate vector (c̃ₜ) — what the student could write down, computed using a tanh layer (so it can go up or down, positive or negative, between −1 and +1).
  • A gate value (iₜ) — how much of each candidate should actually be written down. Also between 0 and 1.

The actual addition to the cell state is the element-wise product of these two: iₜ ⊙ c̃ₜ.
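The write step can be sketched in a few lines. This is an illustration under simplifying assumptions: i_scores and cand_scores stand in for the raw outputs of the two dense layers, and the forget gate is fixed at 1 (keep everything) so that only the effect of the input gate is visible. The function name write_to_cell is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def write_to_cell(c_prev, f_t, i_scores, cand_scores):
    i_t = [sigmoid(s) for s in i_scores]           # how much to write, in (0, 1)
    c_tilde = [math.tanh(s) for s in cand_scores]  # what could be written, in (-1, 1)
    addition = [i * g for i, g in zip(i_t, c_tilde)]
    # Keep what the forget gate allows, then add the gated candidate.
    c_t = [f * c + a for f, c, a in zip(f_t, c_prev, addition)]
    return addition, c_t

c_prev = [0.5, 0.5]
f_t = [1.0, 1.0]  # assumption: nothing is forgotten in this toy example
addition, c_t = write_to_cell(
    c_prev, f_t, i_scores=[10.0, -10.0], cand_scores=[-2.0, -2.0]
)
print(addition)  # slot 0 gets a large negative write, slot 1 gets almost nothing
print(c_t)       # slot 0 changes sign, slot 1 stays near 0.5
```

Both slots see the same candidate, but only slot 0 has its input gate open, so only slot 0 changes.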

Gate 3 — the output gate (oₜ)

“Of everything in my notebook right now, which lines should I read out to answer this question?”

At the end of each step, the LSTM has to produce a hidden state hₜ — the thing passed to the next step and to the rest of the network. The output gate decides which parts of the cell state are relevant right now. The cell state might contain “we are still inside an if-statement” and “the current loop index is i” — but only one of those is relevant for predicting the next line of code.

The hidden state is computed as:

hₜ = oₜ ⊙ tanh(cₜ)

Here tanh(cₜ) squashes the notebook entries into a readable range, and oₜ picks which entries to emit.
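As a quick illustration of the read-out, the sketch below applies the formula above to a two-slot notebook. The o_scores values are hypothetical raw outputs of the output gate’s dense layer, chosen so that slot 0 stays silent and slot 1 is read out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def read_out(c_t, o_scores):
    o_t = [sigmoid(s) for s in o_scores]  # which entries to emit, in (0, 1)
    # h_t = o_t * tanh(c_t), element by element
    return [o * math.tanh(c) for o, c in zip(o_t, c_t)]

c_t = [3.0, -0.4]  # notebook entries can be larger than 1 in magnitude
h_t = read_out(c_t, o_scores=[-8.0, 8.0])
print(h_t)  # slot 0 is almost muted, slot 1 is close to tanh(-0.4)
```

Notice that slot 0 still holds 3.0 in the notebook. The output gate only hides it from the hidden state at this step; it does not erase it.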

How the three gates serve the three requirements

Recall the three things any fix had to achieve (from Section 2):

Requirement                                    | How the LSTM delivers it
-----------------------------------------------|-------------------------------------------------------------
1. Keep information alive across many steps    | The cell state is updated by addition (and a forget multiplier), not replaced.
2. Let the network choose what to remember     | The forget and input gates are learned — they decide, per input, what to erase and add.
3. Let gradients flow backward undistorted     | The cell state path has no tanh and no big matrix multiply; gradients travel almost freely.

That last row is the mathematical heart of the LSTM. We’ll see the exact equations in the math section. For now, the intuition is enough: the cell state is a highway, and the gates are the on-ramps and off-ramps.
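A toy calculation can make the highway intuition tangible. The sketch below is a scalar simplification, not the full derivation: it assumes a one-unit plain RNN hₜ = tanh(w · hₜ₋₁), and for the LSTM it holds the forget gate fixed so that the gradient along the cell-state path is just the product of the forget values. The function names and numbers are made up.

```python
import math

def rnn_gradient(w, steps, h0=0.5):
    # Scalar plain RNN: h_t = tanh(w * h_{t-1}).
    # Each step multiplies the gradient by w * (1 - h_t^2).
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)
    return grad

def cell_path_gradient(forget_values):
    # With the gates held fixed, d c_t / d c_{t-1} is just f_t,
    # so the gradient along the cell path is the product of forget values.
    grad = 1.0
    for f in forget_values:
        grad *= f
    return grad

steps = 100
print(rnn_gradient(w=0.9, steps=steps))     # tiny: the signal has all but vanished
print(cell_path_gradient([0.99] * steps))   # roughly 0.37: still a usable signal
```

With a forget gate near 1, a hundred steps cost the cell path only a modest factor, while the plain RNN’s gradient has shrunk by several orders of magnitude.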

One picture to hold in your head

                      ┌──────── forget gate ────────┐
                      ▼                              │
      cₜ₋₁ ──────────⊗──────────⊕────────► cₜ       │
                                ▲                    │
                                │                    │
                       input gate ⊗ candidate        │

      hₜ₋₁ ─────────┬──────────┼──────┬──────────────┘
                    │          │      │
      xₜ  ──────────┴──────────┴──────┴──── tanh(cₜ) ⊗ output gate ──► hₜ

The top line is the notebook (cell state). The bottom line is the student’s inner voice (hidden state). The three gates sit between them, deciding what to erase, what to add, and what to speak out loud.
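For readers who prefer code to pictures, the same flow can be sketched as one function. This is a minimal illustration in plain Python, assuming each gate reads the concatenation of hₜ₋₁ and xₜ. The parameter names (W_f, b_f and so on), the sizes and the random weights are all made up, and a real implementation would use a tensor library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v):
    # Weighted sum plus bias for every output slot.
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(params, c_prev, h_prev, x_t):
    z = h_prev + x_t  # concatenate [h_{t-1}, x_t]
    f_t = [sigmoid(s) for s in dense(params["W_f"], params["b_f"], z)]        # erase
    i_t = [sigmoid(s) for s in dense(params["W_i"], params["b_i"], z)]        # write
    o_t = [sigmoid(s) for s in dense(params["W_o"], params["b_o"], z)]        # speak
    c_tilde = [math.tanh(s) for s in dense(params["W_c"], params["b_c"], z)]  # candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return c_t, h_t

# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
hidden, n_in = 3, 2
params = {}
for name in "fioc":
    params["W_" + name] = [
        [rng.uniform(-0.5, 0.5) for _ in range(hidden + n_in)] for _ in range(hidden)
    ]
    params["b_" + name] = [0.0] * hidden

c, h = [0.0] * hidden, [0.0] * hidden
for x_t in [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]:
    c, h = lstm_step(params, c, h, x_t)
print(c)
print(h)
```

The only line that touches the notebook is the one computing c_t: a multiply by the forget gate and an addition. Everything else either decides how to change it or how to read it.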

Next: how it works, one step drawn out in detail.