Paper 08
Intermediate

Attention Is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin · NeurIPS 2017 · arXiv:1706.03762


What this paper did

It replaced recurrence entirely.

Bahdanau’s attention (Paper 07) improved the decoder’s memory by letting it look back at all encoder states. But the encoder and decoder were still recurrent networks: each step depends on the previous one, so the computation cannot be parallelised across the sequence.

The Transformer removed recurrence completely. Every layer is built purely from attention operations and position-wise feed-forward networks, both of which process all positions in parallel during training. Training became far cheaper: the paper reports about 12 hours on eight GPUs for its base model. The number of steps a signal must travel between any two positions dropped from O(T) in an RNN to O(1).

The core equation:

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

Every position computes a Query vector (what am I looking for?), a Key vector (what do I offer?), and a Value vector (what do I send when selected?). All pairwise Query-Key scores are computed at once via matrix multiplication, normalised via softmax, and used to blend Value vectors. Eight of these attention “heads” run in parallel per layer.
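The formula and the multi-head idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation: the function names, the toy sizes, and the random weight matrices `W_q`, `W_k`, `W_v`, `W_o` are assumptions made for the example, and masking, batching, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis first so exp() cannot overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (T, d_k); V: (T, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (T, T): all pairwise Query-Key scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # blend the Value vectors

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split the projected Q, K, V into num_heads slices and attend per slice."""
    d_model = X.shape[-1]
    d_k = d_model // num_heads            # assumes d_model is divisible by num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs = []
    for h in range(num_heads):
        cols = slice(h * d_k, (h + 1) * d_k)
        out, _ = scaled_dot_product_attention(Q[:, cols], K[:, cols], V[:, cols])
        head_outputs.append(out)
    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(head_outputs, axis=-1) @ W_o

# Toy sizes (illustrative only; the paper uses d_model = 512 with 8 heads).
rng = np.random.default_rng(0)
T, d_model, num_heads = 4, 16, 8
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

single_out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
multi_out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(weights.sum(axis=-1))   # every row of attention weights sums to 1
print(multi_out.shape)        # (4, 16)
```

Each head works on its own slice of the projected vectors, so different heads can attend to different positions. The per-head loop is written for readability; a practical implementation would reshape the arrays and compute all heads in one batched matrix multiplication.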

Stack six encoder layers and six decoder layers built from these blocks, add positional encodings (attention on its own has no notion of token order), residual connections, and layer normalisation, and you have the Transformer.
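As one concrete example of a positional encoding, the paper's fixed sinusoidal scheme sets PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below computes it in NumPy; the function name and the toy sizes are assumptions for illustration, and it assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) matrix of fixed sinusoidal encodings."""
    positions = np.arange(num_positions)[:, None]       # (T, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]       # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings before the first layer.
rng = np.random.default_rng(0)
T, d_model = 10, 16
embeddings = rng.normal(size=(T, d_model))   # stand-in for learned token embeddings
pe = sinusoidal_positional_encoding(T, d_model)
encoder_input = embeddings + pe
print(encoder_input.shape)   # (10, 16)
```

Because every position gets a different pattern of sines and cosines, two identical tokens at different positions enter the first layer as different vectors, which is what lets attention distinguish word order.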


The Indian analogy

Instead of students answering one at a time (the RNN way), the whole classroom compares notes simultaneously. Every student sends a question (Query) to every other student, checks it against the label each student advertises (Key), decides how much weight to give each match (softmax weights), and blends what those students share (weighted Values). The classroom learns in parallel: one round of this gives everyone full context.


Read in this order

| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | The RNN wall in 2017 | 🟢 | 4 min |
| 2. The Problem | Sequential bottleneck and long-range limits | 🟢 | 3 min |
| 3. The Idea | Self-attention, Q/K/V, multi-head, positional encoding | 🟡 | 5 min |
| 4. The Math | Full attention formula with numerical worked example | 🔴 | 12 min |
| 5. Worked Example | One full encoder layer on “The chai is hot” | 🔴 | 12 min |
| 6. The Code | Scaled dot-product attention in NumPy | 🟡 | 6 min |
| 7. Limitations | Quadratic cost, positional encoding, compute requirements | 🟡 | 4 min |
| 8. Impact | BERT, GPT, AlphaFold, every AI system today | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 3 min |

Also: Glossary · Quiz · Further Reading



The full architecture at a glance

INPUT TOKENS
    ↓
Embedding + Positional Encoding
    ↓

┌─────────────── Encoder × 6 ───────────────┐
│  Multi-Head Self-Attention                 │
│  Add & Norm                                │
│  Feed-Forward Network                      │
│  Add & Norm                                │
└────────────────────────────────────────────┘
    ↓ (encoder output to decoder cross-attention)
┌─────────────── Decoder × 6 ───────────────┐
│  Masked Multi-Head Self-Attention          │
│  Add & Norm                                │
│  Multi-Head Cross-Attention (←encoder)     │
│  Add & Norm                                │
│  Feed-Forward Network                      │
│  Add & Norm                                │
└────────────────────────────────────────────┘

    ↓
Linear + Softmax → Output Probabilities
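The wiring in the diagram can be sketched in NumPy: each sublayer is wrapped as LayerNorm(x + Sublayer(x)), and the decoder's masked self-attention hides future positions. This is a simplified illustration under stated assumptions: it uses a single attention head with no output projection, layer normalisation without learned gain and bias, and a hypothetical `init_params` helper with random weights. None of the names come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalise each position's vector; learned gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, W_q, W_k, W_v, causal=False):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Lower-triangular mask: position i may only attend to positions <= i.
        allowed = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(allowed, scores, -1e9)
    return softmax(scores) @ V

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise two-layer network with a ReLU in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    # "Add & Norm" after each sublayer: LayerNorm(x + Sublayer(x)).
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

def init_params(rng, d_model, d_ff):
    # Hypothetical helper: small random weights, only for this demo.
    return {
        "W_q": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W_k": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W_v": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
T, d_model, d_ff = 4, 16, 32
x = rng.normal(size=(T, d_model))

# Encoder stack: six layers, each with its own parameters.
layers = [init_params(rng, d_model, d_ff) for _ in range(6)]
h = x
for p in layers:
    h = encoder_layer(h, p)
print(h.shape)   # (4, 16)

# Causal masking: changing the last token must not affect earlier positions.
p0 = layers[0]
x_changed = x.copy()
x_changed[-1] += 1.0
a = self_attention(x, p0["W_q"], p0["W_k"], p0["W_v"], causal=True)
b = self_attention(x_changed, p0["W_q"], p0["W_k"], p0["W_v"], causal=True)
print(np.allclose(a[:-1], b[:-1]))   # True
```

The residual connection keeps the input shape unchanged through every sublayer, which is why the same layer can be stacked six times. The masking check at the end shows why the decoder can be trained on all positions at once without seeing the tokens it is supposed to predict.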
