Long Short-Term Memory 1997

4. How it works — one LSTM step, drawn out

We will now walk through exactly what happens inside an LSTM cell at time step t. Read slowly. Every symbol matters, and the whole architecture is really just five moves, written as six short equations, repeated in a loop.

What goes in, what comes out

At step t, the LSTM cell receives three things:

  • xₜ — the new input at this step (say, today’s word in a sentence).
  • hₜ₋₁ — the hidden state from the previous step (the student’s inner voice at the end of the last sentence).
  • cₜ₋₁ — the cell state from the previous step (the student’s notebook so far).

It produces two things:

  • hₜ — the new hidden state. This is what gets passed up to the next layer or used to make a prediction.
  • cₜ — the new cell state. This just gets passed to the next time step. Nothing else reads it directly.

The five moves, in order

Imagine a student in a physics class, sitting with a notebook open. A new sentence from the teacher just arrived (that is xₜ). Here is what happens, step by step:

Move 1 — look at the inputs together

Concatenate hₜ₋₁ (what I was just thinking) and xₜ (what the teacher just said) into one long vector. Call this joint vector [hₜ₋₁, xₜ]. Every gate will look at this joint vector — that’s how the LSTM makes each decision in light of both the past and the present.
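As a minimal sketch of Move 1, the concatenation looks like this in NumPy. The sizes (3 hidden units, 2 input features) and the numbers are made up for illustration; they do not come from the text.

```python
import numpy as np

# Illustrative sizes and values only (assumptions, not from the text).
h_prev = np.array([0.1, -0.2, 0.3])   # h_{t-1}: previous hidden state, 3 units
x_t = np.array([1.0, 0.5])            # x_t: current input, 2 features

# The joint vector [h_{t-1}, x_t]: the hidden state followed by the input.
joint = np.concatenate([h_prev, x_t])

print(joint.shape)  # one vector of length 3 + 2
```

Every weight matrix in the later moves therefore has hidden_size rows and hidden_size + input_size columns.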

Move 2 — decide what to erase from the notebook (forget gate)

Push the joint vector through a small linear layer followed by a sigmoid:

fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)

fₜ is a vector the same size as the cell state. Every entry is between 0 and 1. A value near 1 means “keep this notebook slot as it is”. A value near 0 means “erase this slot”.

Analogy: the student reviews the notebook, pen in hand, and strikes out lines that are no longer relevant.
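A small sketch of the forget gate, assuming NumPy. The weights W_f here are random placeholders rather than trained values, and the sizes are hypothetical; the point is only to show the shapes and the 0-to-1 range.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 3, 2                    # illustrative sizes
rng = np.random.default_rng(0)

h_prev = np.array([0.1, -0.2, 0.3])               # h_{t-1}
x_t = np.array([1.0, 0.5])                        # x_t
joint = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]

# Placeholder parameters; in a real network these are learned.
W_f = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = sigmoid(W_f @ joint + b_f)                  # one keep/erase value per slot
print(f_t)
```

With small random weights every entry sits close to 0.5; training is what pushes individual slots toward 0 or 1.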

Move 3 — decide what new information to write down (input gate)

This happens in two micro-steps that run in parallel:

3a. Compute a candidate of what we could add. A tanh layer produces a vector of possible updates, each between −1 and +1:

c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)

Think of c̃ₜ as a rough draft of what the student might write down.

3b. Compute a gate that decides how much of the draft to actually write. Another sigmoid layer:

iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)

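Like fₜ, iₜ is a vector with one entry per notebook slot, each between 0 and 1. A value near 1 means “write this part of the draft in full”. A value near 0 means “skip it”.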
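Both halves of Move 3 can be sketched the same way. Again the weights are random placeholders and the sizes are assumptions made for the example; note that the two computations do not depend on each other.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 3, 2                    # illustrative sizes
rng = np.random.default_rng(0)
joint = np.concatenate([np.array([0.1, -0.2, 0.3]),   # h_{t-1}
                        np.array([1.0, 0.5])])        # x_t

shape = (hidden_size, hidden_size + input_size)
W_c, b_c = rng.normal(scale=0.1, size=shape), np.zeros(hidden_size)  # placeholders
W_i, b_i = rng.normal(scale=0.1, size=shape), np.zeros(hidden_size)  # placeholders

c_tilde = np.tanh(W_c @ joint + b_c)   # 3a: the rough draft, each entry in (-1, 1)
i_t = sigmoid(W_i @ joint + b_i)       # 3b: how much of the draft to write, in (0, 1)

written = i_t * c_tilde                # what actually reaches the notebook in Move 4
print(written)
```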

Move 4 — update the notebook (cell state update)

Now put it all together:

cₜ = fₜ ⊙ cₜ₋₁  +  iₜ ⊙ c̃ₜ

The symbol ⊙ means element-wise multiplication (multiply slot 1 by slot 1, slot 2 by slot 2, etc.). Read this equation out loud:

“My new notebook equals my old notebook, with some lines erased (multiplied by the forget gate), plus some new lines added in (scaled by the input gate).”

That’s it. That’s the whole memory update. Crucially, this is the only thing that happens to the cell state. On its way to the next step it is never multiplied by a weight matrix or pushed through stacked nonlinearities; there is only element-wise multiplication and addition.

That simplicity is why gradients survive. When we later differentiate this equation with respect to cₜ₋₁ along this direct path (treating the gate values as given), the answer is just fₜ, slot by slot. If the forget gate stays near 1 for an important slot, the gradient through that slot travels backward almost untouched, even across many steps. That is the “long” in “long short-term memory”.
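The gradient claim can be checked numerically. The sketch below holds the gates fixed at made-up values (in a real LSTM they also depend on hₜ₋₁, which adds indirect paths that are ignored here) and estimates the Jacobian of the cell update by finite differences.

```python
import numpy as np

# Gate values held constant for this check (hypothetical numbers).
f_t = np.array([0.9, 0.1])
i_t = np.array([0.0, 0.7])
c_tilde = np.array([0.3, 0.8])

def cell_update(c_prev):
    """c_t = f_t * c_{t-1} + i_t * c_tilde, with the gates treated as constants."""
    return f_t * c_prev + i_t * c_tilde

c_prev = np.array([2.0, -1.0])
eps = 1e-6
jac = np.zeros((2, 2))
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    # Central difference: column j holds d c_t / d c_{t-1}[j].
    jac[:, j] = (cell_update(c_prev + step) - cell_update(c_prev - step)) / (2 * eps)

print(jac)                      # numerically a diagonal matrix with f_t on the diagonal

# Along the direct path, the gradient after 100 steps is the product of 100 gate values.
survival = {f: f ** 100 for f in (0.99, 0.5)}
print(survival)
```

A slot whose forget gate stays at 0.99 keeps roughly a third of its gradient after 100 steps, while a slot at 0.5 keeps practically nothing.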

Move 5 — read out the hidden state (output gate)

The new cell state exists, but we still need to produce hₜ — the thing the rest of the network can see and use. Two small steps:

5a. Decide what parts of the cell state to emit. A sigmoid layer looking at the same joint vector:

oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)

5b. Produce the hidden state. Squash the cell state through tanh and multiply by the output gate:

hₜ = oₜ ⊙ tanh(cₜ)

Analogy: the student has a full notebook, but when asked a question they only read out the specific lines relevant to the question. The output gate is that selective reading.
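A brief sketch of Move 5 in NumPy. The output gate values are picked by hand for illustration (one slot mostly exposed, one mostly hidden) rather than computed from weights.

```python
import numpy as np

c_t = np.array([1.80, 0.46])      # a cell state with two slots (example values)
o_t = np.array([0.95, 0.05])      # hypothetical gate: read out slot 1, hide slot 2

squashed = np.tanh(c_t)           # 5b, first half: bring every slot into (-1, 1)
h_t = o_t * squashed              # 5b, second half: the gate picks what to expose

print(h_t)
```

The second slot still holds 0.46 in the notebook, but almost none of it reaches hₜ at this step.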

The whole step, as one block

Put the five moves together, and an LSTM cell at step t is exactly:

fₜ  = σ(W_f · [hₜ₋₁, xₜ] + b_f)          # forget gate
iₜ  = σ(W_i · [hₜ₋₁, xₜ] + b_i)          # input gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)        # candidate content
cₜ  = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ                # cell state update
oₜ  = σ(W_o · [hₜ₋₁, xₜ] + b_o)          # output gate
hₜ  = oₜ ⊙ tanh(cₜ)                      # hidden state

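Six lines. That’s the entire LSTM step. You will see these equations, sometimes in slightly different notation (for example, separate weight matrices for hₜ₋₁ and xₜ instead of one matrix on the joint vector), in most textbooks, blog posts, and papers. Now you know what every symbol means and why it is there.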
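The six equations translate almost line for line into NumPy. This is a minimal sketch, not a reference implementation: the function names lstm_step and init_params, the params dictionary, the sizes, and the random initialisation are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: the six equations, in the same order as in the text."""
    joint = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ joint + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ joint + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ joint + params["b_c"])  # candidate content
    c_t = f_t * c_prev + i_t * c_tilde                        # cell state update
    o_t = sigmoid(params["W_o"] @ joint + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                  # hidden state
    return h_t, c_t

def init_params(hidden_size, input_size, seed=0):
    """Small random weights and zero biases (placeholders, not trained values)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params[f"W_{name}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size))
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

# "Repeated in a loop": run the same step over a short random sequence.
hidden_size, input_size, seq_len = 4, 3, 5
params = init_params(hidden_size, input_size)
xs = np.random.default_rng(1).normal(size=(seq_len, input_size))

h = np.zeros(hidden_size)   # initial hidden state
c = np.zeros(hidden_size)   # initial cell state (an empty notebook)
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

The same params are reused at every time step; only h and c change as the loop advances.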

Why this design, and not something simpler?

Students often ask: could we not just use one gate? Why three?

  • Forget alone would let you erase, but never add new information.
  • Input alone would let you write, but the notebook would fill up with garbage because you could never clear old stuff.
  • Output alone would let you read, but the notebook would be whatever you copied in blindly at every step.

Each gate does one job. Together they give the network fine-grained control over its own memory: what to erase, what to write, and what to read out. The LSTM is one of the earliest architectures in which a neural network explicitly learns to read from and write to its own memory.
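The first two bullets can be made concrete with a toy loop over a single notebook slot. The gate values are held constant and chosen by hand, which a real LSTM would not do; this is only a rough illustration of the failure modes.

```python
import numpy as np

def run(f, i, c_tilde, steps=10):
    """Iterate c_t = f * c_{t-1} + i * c_tilde from c_0 = 0 with constant gates."""
    c = 0.0
    for _ in range(steps):
        c = f * c + i * c_tilde
    return c

# Hypothetical constant values for one slot.
c_no_forget = run(f=1.0, i=0.5, c_tilde=0.8)  # cannot erase: grows by 0.4 every step
c_forget = run(f=0.5, i=0.5, c_tilde=0.8)     # can erase: levels off below 0.8
c_no_input = run(f=0.9, i=0.0, c_tilde=0.8)   # cannot write: the slot stays empty

print(c_no_forget, np.tanh(c_no_forget))      # tanh of the growing slot is close to 1
print(c_forget, c_no_input)
```

Without a way to erase, the slot keeps accumulating and its tanh readout saturates, so new information barely changes what the rest of the network sees.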

A worked miniature in plain numbers

Say the cell state has just 2 slots. At step t:

  • cₜ₋₁ = [2.0, −1.0] (two lines in the notebook)
  • Forget gate fires fₜ = [0.9, 0.1] — keep line 1, mostly erase line 2.
  • Candidate c̃ₜ = [0.3, 0.8].
  • Input gate iₜ = [0.0, 0.7] — do not add to line 1, add 70% of the candidate to line 2.

Then:

cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ
   = [0.9·2.0, 0.1·(−1.0)] + [0.0·0.3, 0.7·0.8]
   = [1.80,  −0.10]       + [0.00,   0.56]
   = [1.80,   0.46]

Line 1 stayed almost the same. Line 2 was mostly wiped and then rewritten with fresh content. Notice how additive the whole thing is — the old memory is not multiplied by a huge matrix, it is just gently filtered.
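The same arithmetic can be reproduced in a few lines of NumPy, using exactly the numbers from the miniature above.

```python
import numpy as np

c_prev = np.array([2.0, -1.0])    # c_{t-1}: two lines in the notebook
f_t = np.array([0.9, 0.1])        # forget gate: keep line 1, mostly erase line 2
c_tilde = np.array([0.3, 0.8])    # candidate content
i_t = np.array([0.0, 0.7])        # input gate: write only to line 2

kept = f_t * c_prev               # [1.80, -0.10]
added = i_t * c_tilde             # [0.00,  0.56]
c_t = kept + added                # element-wise sum

assert np.allclose(c_t, [1.80, 0.46])
print(c_t)
```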

You will work through a full numeric example, including all six equations, in the math section.

Next: the math, fully worked.