9. What came next — the road from LSTMs to transformers

The LSTM gave recurrent networks a workable long-term memory. That exposed a cascade of follow-up problems, each of which produced a landmark paper. Here is the road from 1997 to the transformer era, as it unfolded.

Step 1 — give words themselves better representations

An LSTM trained on words is only as good as the numbers used to represent each word. Early systems fed in one-hot vectors — “the” was [1, 0, 0, …], “bank” was [0, 1, 0, …] — which threw away all meaning.
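To see what "threw away all meaning" looks like in numbers, here is a minimal sketch using NumPy. The three-word vocabulary is made up for illustration; the point is only that every pair of distinct one-hot vectors has the same similarity, namely zero.

```python
import numpy as np

# Hypothetical three-word vocabulary, chosen only for illustration.
vocab = ["the", "bank", "river"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

the, bank, river = one_hot  # unpack the rows

# The dot product of any two distinct one-hot vectors is 0, so "bank" looks
# exactly as unrelated to "river" as it does to "the".
bank_river = float(bank @ river)
bank_the = float(bank @ the)
print(bank_river, bank_the)
```

Whatever the vocabulary, the result is the same: the representation tells the model which word it is looking at, and nothing about how that word relates to any other.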

The answer came in 2013: Word2Vec (Paper 05). Train a small network to predict nearby words, and the hidden layer spontaneously learns vectors where similar words are close together. “King − man + woman ≈ queen” became the world’s most famous analogy. Word2Vec didn’t replace LSTMs — it fed them better inputs.
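The analogy arithmetic can be sketched in a few lines. The vectors below are hand-made 2-D toys (one axis loosely "royalty", the other loosely "gender"), not trained Word2Vec embeddings, and the helper names `cosine` and `analogy` are hypothetical. Real embeddings have hundreds of dimensions and the match is only approximate.

```python
import numpy as np

# Hand-made toy vectors; NOT the output of a trained Word2Vec model.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the three input words from the candidates is the usual convention for this test; without it, the nearest vector is often one of the inputs.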

Read next: Paper 05 — Word2Vec

Step 2 — use two LSTMs, one to encode and one to decode

If LSTMs can read, they can also write. Sutskever, Vinyals, and Le (2014) connected two LSTMs back-to-back:

The encoder reads a sentence in English and squeezes it into a single fixed-size context vector.
The decoder starts with that context vector and produces a sentence in French, one word at a time.

This was Seq2Seq (Paper 06), one of the founding papers of neural machine translation. A deeper LSTM encoder-decoder, with attention added, was the architecture behind Google Translate's switch to neural translation in 2016.
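The encoder-decoder idea can be sketched as follows. To keep the example short, this is an assumption-heavy toy: plain tanh recurrent cells stand in for the LSTMs, the weights are random and untrained, and the sizes, token ids, and `EOS` marker are all made up. It shows the data flow only, so the output is not a meaningful translation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 6, 8   # hypothetical toy sizes
EOS = 0                     # hypothetical end-of-sentence token id

def init(*shape):
    """Small random weights; a real model would learn these."""
    return rng.normal(scale=0.1, size=shape)

# The encoder and decoder are separate networks with separate parameters.
enc_embed, enc_Wx, enc_Wh = init(vocab_size, hidden), init(hidden, hidden), init(hidden, hidden)
dec_embed, dec_Wx, dec_Wh = init(vocab_size, hidden), init(hidden, hidden), init(hidden, hidden)
dec_out = init(hidden, vocab_size)

def encode(tokens):
    """Read the source left to right; the final hidden state is the context vector."""
    h = np.zeros(hidden)
    for t in tokens:
        h = np.tanh(enc_embed[t] @ enc_Wx + h @ enc_Wh)
    return h

def decode(context, max_len=5):
    """Start from the context vector and emit one token at a time (greedy choice)."""
    h, token, output = context, EOS, []
    for _ in range(max_len):
        h = np.tanh(dec_embed[token] @ dec_Wx + h @ dec_Wh)
        token = int(np.argmax(h @ dec_out))  # pick the highest-scoring word
        if token == EOS:
            break
        output.append(token)
    return output

source = [3, 1, 4, 2]          # made-up source token ids
context = encode(source)       # the whole sentence, squeezed into one vector
translation = decode(context)
print(context.shape, translation)
```

Notice that `decode` receives nothing but `context`: however long the source sentence is, everything the decoder knows about it has to fit in that one vector of length `hidden`.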

Read next: Paper 06 — Seq2Seq

Step 3 — break the bottleneck with attention

Seq2Seq had a hidden flaw: the decoder only saw the encoder’s final hidden state. For long sentences, that one vector was too small to carry everything. Bahdanau, Cho, and Bengio (2014) proposed attention: let the decoder, at each output step, look back at all the encoder’s hidden states and decide which ones to focus on.

This single fix dramatically improved translation quality on long sentences. And, critically, it planted a seed: what if attention wasn’t just a patch on top of LSTMs? What if attention was the whole model?
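Here is a minimal sketch of one attention step, using the additive scoring form (a small tanh layer that scores each encoder state against the current decoder state). The sizes and the weight names `W_enc`, `W_dec`, and `v` are assumptions for illustration, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, hidden, attn = 5, 8, 4   # hypothetical toy sizes

encoder_states = rng.normal(size=(src_len, hidden))  # one hidden state per source word
decoder_state = rng.normal(size=hidden)              # decoder state at the current output step

# Learned in a real model; random here.
W_enc = rng.normal(size=(hidden, attn))
W_dec = rng.normal(size=(hidden, attn))
v = rng.normal(size=attn)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Additive scoring: one scalar per source position.
scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v
weights = softmax(scores)            # how much to focus on each source word
context = weights @ encoder_states   # weighted average of ALL encoder states

print(weights.round(3), context.shape)
```

The context vector is recomputed at every output step from a fresh set of weights, so the decoder is no longer limited to whatever the encoder managed to pack into its final state.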

Read next: Paper 07 — Bahdanau Attention

Step 4 — throw away the LSTM entirely

Vaswani et al. (2017) answered the question directly in the title of their paper: “Attention Is All You Need”. They built a sequence model with no recurrence at all: only attention, plus a feed-forward block, plus positional encodings that tell the model where each token sits in the sequence. It was the transformer (Paper 08).

The transformer addressed the LSTM's main limitations at once:

Training parallelised across all time steps.
No hidden-state bottleneck — every position could attend to every other directly.
Scaled gracefully to billions of parameters and beyond.
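The first two points above can be seen in a sketch of scaled dot-product self-attention. This is a simplified single-head version with random weights and made-up sizes; it leaves out masking, multiple heads, positional encodings, and the feed-forward block.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8   # hypothetical toy sizes

X = rng.normal(size=(seq_len, d_model))  # one row per position in the sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def softmax_rows(x):
    """Numerically stable softmax applied to each row."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Queries, keys, and values for ALL positions come from three matrix products.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Entry (i, j) says how much position i attends to position j.
weights = softmax_rows(Q @ K.T / np.sqrt(d_model))
output = weights @ V

print(weights.shape, output.shape)
```

There is no loop over time steps: the whole sequence is processed with a handful of matrix multiplications, which is what lets training run in parallel. And because `weights` has an entry for every pair of positions, any position can draw on any other directly instead of through a chain of hidden states.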

Nearly every language model you have heard of since (BERT, GPT, Claude, Gemini, LLaMA, Mistral, Mixtral) is a transformer.

The deeper lesson

Notice the shape of this arc. LSTMs tamed the vanishing gradient problem by introducing carefully designed structure: gates, a cell state, six equations. Transformers, two decades later, solved the next set of problems (parallelism, bottlenecks, scaling) by removing structure: no recurrence, no hand-crafted memory, just a lot of attention and a lot of data.

This is a pattern that repeats in AI history: an elegant structured solution wins, rules for a decade, then loses to a simpler, dumber, more scalable alternative. We’ll see this pattern again with mixture-of-experts (Paper 09), scaling laws (Paper 13), and the shift to test-time compute (Paper 23).

You now have the background to read all the papers that follow. Each one builds on the others. Every modern AI system sits somewhere on this lineage.

Where to go from here

Read Paper 05 (Word2Vec) next — it’s a short, beautiful paper that changes how you think about words.
Revisit the glossary whenever a term feels fuzzy.
Try the quiz to test what stuck.
Browse the further reading for blog posts that go deeper on specific equations.

Thanks for sticking with all nine sections. The hard part is behind you — from here, every paper we read will feel like a variation on a theme you already know.