1. Context — the RNN wall

By 2017, the AI research community had a clear picture of both what neural networks could do and where they were hitting a ceiling.

Three years of rapid progress had followed Bahdanau’s attention paper. Nearly every competitive machine translation system now used some form of attention on top of a recurrent encoder-decoder, typically an LSTM. The alignment heatmaps were beautiful, BLEU scores were climbing, and attention had spread beyond translation into reading comprehension, image captioning, and speech recognition. The field was healthy and advancing.

But anyone training these models at scale could feel a common frustration: recurrent networks were painfully slow to train.

The problem was fundamental. An LSTM processes a sequence one step at a time. To compute the hidden state at position 5, you must first compute position 4. To compute position 4, you need position 3. There is no way around this chain of dependencies: it is the definition of a recurrent network. On a GPU with thousands of parallel processing cores, most of those cores sit idle most of the time, waiting for the previous step to finish.

For a sentence of 50 words with a hidden size of 1000, the arithmetic inside each step (matrix multiplications over 1000-dimensional vectors) could be spread across the GPU, but the 50 steps themselves had to run one after another. Nothing could be parallelised across the sequence dimension. Training on millions of sentence pairs meant waiting days for a single experiment. Researchers were trying to scale their models (more layers, bigger hidden sizes, more data), and the sequential bottleneck made this increasingly painful.
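The dependency chain can be sketched in a few lines of NumPy. This is a minimal illustration, not an LSTM: it assumes a plain tanh recurrent cell with randomly initialised weights, and the names `run_recurrent`, `W_x`, `W_h`, and `b` are hypothetical. The sizes match the example above (50 positions, hidden size 1000).

```python
import numpy as np

def run_recurrent(inputs, W_x, W_h, b):
    """Run a plain tanh recurrent cell over a sequence, one position at a time."""
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        # Each step reads the hidden state produced by the previous step,
        # so the iterations of this loop cannot run side by side.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 50, 1000, 1000

# Random stand-ins for word vectors and learned weights (illustrative only).
inputs = rng.standard_normal((seq_len, input_size))
W_x = rng.standard_normal((hidden_size, input_size)) * 0.01
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.01
b = np.zeros(hidden_size)

states = run_recurrent(inputs, W_x, W_h, b)
print(states.shape)  # (50, 1000)
```

Each matrix-vector product inside the loop can use many cores at once, but the loop itself still takes 50 sequential steps, and a longer sentence means a longer chain.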

Meanwhile, a separate idea was taking shape: that attention might be able to stand on its own. Researchers had noticed that attention weights were doing most of the interesting work, deciding which words to look at and how to combine them. The LSTM around the attention was arguably just scaffolding. What if you removed the scaffolding?

At Google Brain and Google Research in 2017, eight researchers asked exactly this question. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin worked on a model built entirely from attention operations, with no recurrence whatsoever.

They gave the paper a deliberately provocative title: “Attention Is All You Need.”

The paper was presented at NeurIPS (then still abbreviated NIPS) in December 2017. On English-to-German translation it reported 28.4 BLEU, more than 2 BLEU above the best previous results, ensembles included. On English-to-French it set a new single-model state of the art: 41.0 BLEU in the original preprint, revised to 41.8 in later versions. And it trained in a fraction of the time of recurrent models, because every position could be processed in parallel.
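To see why attention allows every position to be processed in parallel, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the basic operation the paper builds on. It leaves out multi-head projections, masking, and positional encodings. The function name `self_attention`, the projection matrices `W_q`, `W_k`, `W_v`, and the sizes are illustrative assumptions, with random values standing in for learned weights.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a whole sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # One matrix product scores every position against every other position.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 50, 64

X = rng.standard_normal((seq_len, d_model))          # one vector per position
W_q = rng.standard_normal((d_model, d_model)) * 0.1  # stand-ins for learned weights
W_k = rng.standard_normal((d_model, d_model)) * 0.1
W_v = rng.standard_normal((d_model, d_model)) * 0.1

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)   # (50, 64)
print(weights.shape)  # (50, 50)
```

There is no loop over positions: all 50 outputs come from a handful of matrix multiplications, which is exactly the kind of work a GPU handles well.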

But the true impact of the paper was not its translation scores. It was the architecture it introduced: the Transformer. Nearly every major language model in use today, including GPT, Claude, Gemini, BERT, T5, and LLaMA, is a direct descendant of the architecture described in this paper, and image generators such as Stable Diffusion rely on Transformer components as well. You are likely interacting with a system built on this architecture right now.