Paper 07
Intermediate

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho & Bengio · 2014 · arXiv:1409.0473


What this paper did

It broke the translation bottleneck.

The seq2seq model (Paper 06) forced all source information through a single fixed-size vector. For long sentences, this was like trying to describe an entire film in one sentence: details get lost. Translation quality deteriorated rapidly once sentences grew beyond ~20 words.

Bahdanau’s team replaced the fixed vector with a dynamic attention mechanism: at each decoding step, the decoder computes a fresh context vector by taking a weighted sum of all encoder hidden states, where the weights reflect how relevant each source word is right now. These weights are the attention weights.
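The weighted sum at the heart of this idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the helper name `context_vector`, the two-dimensional hidden states, and the weights are all made-up assumptions, and the weights are simply given here rather than computed by the model.

```python
def context_vector(weights, hidden_states):
    """Weighted sum of encoder hidden states: c = sum_i weights[i] * hidden_states[i]."""
    dim = len(hidden_states[0])
    return [
        sum(w * h[k] for w, h in zip(weights, hidden_states))
        for k in range(dim)
    ]

# Toy encoder states for a three-word source sentence (made-up numbers).
hidden_states = [
    [1.0, 0.0],  # h_1
    [0.0, 1.0],  # h_2
    [1.0, 1.0],  # h_3
]

# Hypothetical attention weights for one decoding step; they sum to 1.
weights = [0.5, 0.25, 0.25]

c = context_vector(weights, hidden_states)
print(c)  # [0.75, 0.5]
```

At the next decoding step the weights would change, so the same three hidden states would produce a different context vector.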

The result: the model never has to compress the whole source into one vector. At every decoding step it can look back at any source position, using weights it learned rather than a fixed rule. In the paper's experiments, translation quality no longer degraded on sentences of 50 words or more. And the attention weights, visualised as a heatmap, showed that the model had learned on its own to align source and target words, a job that earlier statistical translation systems handed to a separate alignment model.
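To make the heatmap idea concrete, here is a small sketch that reads an alignment off a matrix of attention weights. The sentence pair and every number in `attention` are invented for illustration (they are not results from the paper), and `strongest_alignment` is a hypothetical helper name. Taking the largest weight in each row turns the soft alignment into a hard one, which is only a convenient way to inspect it.

```python
source = ["the", "black", "cat"]
target = ["le", "chat", "noir"]

# attention[t][i] = weight on source word i while producing target word t.
# Each row sums to 1. These values are made up for illustration.
attention = [
    [0.8, 0.1, 0.1],  # "le"
    [0.1, 0.2, 0.7],  # "chat"
    [0.1, 0.7, 0.2],  # "noir"
]

def strongest_alignment(attention, source, target):
    """Pair each target word with the source word that received the most weight."""
    pairs = []
    for t, row in enumerate(attention):
        i = max(range(len(row)), key=row.__getitem__)  # index of the largest weight
        pairs.append((target[t], source[i]))
    return pairs

alignment = strongest_alignment(attention, source, target)
for tgt, src in alignment:
    print(f"{tgt} <- {src}")
```

In this toy matrix the second and third rows peak off the diagonal, which is how a heatmap reveals word reordering between the two languages.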


The Indian analogy

A student answering a board exam essay question does not memorise the textbook and close it. She keeps it open, glancing back at the relevant paragraph for each sentence she writes. The encoder’s hidden states are the open textbook. The attention weights decide where her eyes point. The context vector is what she just read.


Read in this order

| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | Why translation needed this fix in 2014 | 🟢 | 4 min |
| 2. The Problem | The fixed context vector bottleneck | 🟢 | 3 min |
| 3. The Idea | Attention weights, soft alignment, bidirectional encoder | 🟢 | 4 min |
| 4. The Math | Alignment scores, softmax, context vector (worked by hand) | 🔴 | 10 min |
| 5. Worked Example | Full decoding walkthrough with toy numbers | 🔴 | 10 min |
| 6. The Code | Attention in 25 lines of NumPy | 🟡 | 6 min |
| 7. Limitations | Quadratic complexity, sequential bottleneck, no self-attention | 🟡 | 4 min |
| 8. Impact | How this paper created the Transformer era | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 3 min |

Also: Glossary · Quiz · Further Reading


The key equations

eₜᵢ  = vₐᵀ · tanh(Wₐ · sₜ₋₁ + Uₐ · hᵢ)     ← alignment score (how relevant is source word i at step t?)
αₜᵢ  = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)               ← attention weight (probability, sums to 1)
cₜ   = Σᵢ αₜᵢ · hᵢ                           ← context vector (fresh at every step)
sₜ   = f(sₜ₋₁, yₜ₋₁, cₜ)                    ← decoder update
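
The first three equations can be traced in plain Python. This is a sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the parameters `W_a`, `U_a`, `v_a` are tiny hand-picked matrices rather than learned values, and the decoder update `f` (a gated recurrent unit in the paper) is left out. The softmax subtracts the maximum score before exponentiating, which is a standard numerical-stability step that does not change the result.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def alignment_score(v_a, W_a, U_a, s_prev, h_i):
    """e_ti = v_a^T tanh(W_a s_{t-1} + U_a h_i)."""
    pre = [a + b for a, b in zip(matvec(W_a, s_prev), matvec(U_a, h_i))]
    return sum(v * math.tanh(p) for v, p in zip(v_a, pre))

def softmax(scores):
    """alpha_ti = exp(e_ti) / sum_j exp(e_tj), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(v_a, W_a, U_a, s_prev, H):
    """Return (attention weights, context vector) for one decoding step."""
    scores = [alignment_score(v_a, W_a, U_a, s_prev, h) for h in H]
    alphas = softmax(scores)
    dim = len(H[0])
    context = [sum(a * h[k] for a, h in zip(alphas, H)) for k in range(dim)]
    return alphas, context

# Toy parameters (assumed, not learned): identity matrices and a vector of ones.
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]

s_prev = [1.0, 0.0]                          # previous decoder state s_{t-1}
H = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # encoder states h_1..h_3

alphas, context = attend(v_a, W_a, U_a, s_prev, H)
print("attention weights:", [round(a, 3) for a in alphas])
print("context vector:  ", [round(c, 3) for c in context])
```

Calling `attend` again with a different `s_prev` gives different weights over the same `H`, which is what "fresh at every step" means in the third equation.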
