9. Summary — one page on Bahdanau attention
The paper in one sentence
When neural networks translate language, the decoder should look back at every source word with learned relevance weights at each generation step — not translate blindly from a single compressed memory.
The problem it solved
The seq2seq model (2014) required the encoder to compress the entire source sentence into a single fixed-size vector before decoding began. For long sentences, this bottleneck caused information loss and degraded translation quality: details from the start of the sentence were often lost by the time the encoder reached the end.
The core idea
Attention weights: At each decoding step t, the model scores every source hidden state hᵢ against the decoder's previous state sₜ₋₁. A softmax converts these scores into attention weights αₜᵢ, which are non-negative and sum to 1 over the source positions.
Context vector: A fresh context vector cₜ is computed as the weighted sum of all encoder states:
cₜ = Σᵢ αₜᵢ · hᵢ
Bidirectional encoder: A forward RNN and a backward RNN (the paper uses gated recurrent units, not LSTMs) each read the source sentence, one left to right and one right to left. Their states at each position are concatenated, so every hᵢ reflects context from both sides of the sentence.
The fixed bottleneck is replaced by a dynamic, query-dependent lookup.
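To make the lookup concrete, here is a minimal sketch of the softmax and weighted-sum steps for a single decoding step. The encoder states and the alignment scores are made-up toy values, and the helper names (`softmax`, `context_vector`) are illustrative only. Plain Python lists are used to keep the example dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alpha, H):
    """c_t = sum_i alpha[i] * H[i], computed one dimension at a time."""
    dim = len(H[0])
    return [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(dim)]

# Toy values (assumed): three source positions with 2-dimensional encoder states.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]  # hypothetical alignment scores for one step t

alpha = softmax(scores)          # attention weights, sum to 1
c_t = context_vector(alpha, H)   # weighted sum of encoder states
print(alpha, c_t)
```

With these scores the first source position receives most of the weight (about 0.79), so the context vector is pulled toward its encoder state. Changing the scores changes the context vector, which is what "query-dependent" means here.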
The analogy
A student writing a board exam essay does not memorise the textbook and close it. She keeps it open, glancing back at the relevant paragraph for each sentence she writes. The encoder states are the open textbook. The attention weights are where her eyes point. The context vector is the information her eyes just extracted.
The key equations
eₜᵢ = vₐᵀ · tanh(Wₐ · sₜ₋₁ + Uₐ · hᵢ) ← alignment score (additive attention)
αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ) ← attention weight (softmax)
cₜ = Σᵢ αₜᵢ · hᵢ ← context vector (weighted sum)
sₜ = f(sₜ₋₁, yₜ₋₁, cₜ) ← decoder update
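The first three equations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the parameters Wₐ, Uₐ and vₐ are random rather than trained, the dimensions are arbitrary toy sizes, and the decoder update f (a gated recurrent unit in the paper) is left out. The helper names are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (alpha, c) for one decoding step.

    e_i     = v_a . tanh(W_a s_prev + U_a h_i)
    alpha_i = softmax(e)_i
    c       = sum_i alpha_i * h_i
    """
    Ws = matvec(W_a, s_prev)  # identical for every source position, so compute once
    scores = []
    for h in H:
        Uh = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    alpha = softmax(scores)
    c = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return alpha, c

# Assumed toy sizes: 6 source positions, encoder states of size 4,
# decoder state of size 3, attention hidden size 5. Values are random, not trained.
rng = random.Random(0)

def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

H = rand_mat(6, 4)
s_prev = rand_vec(3)
W_a, U_a, v_a = rand_mat(5, 3), rand_mat(5, 4), rand_vec(5)

alpha, c_t = additive_attention(s_prev, H, W_a, U_a, v_a)
print([round(a, 3) for a in alpha], [round(x, 3) for x in c_t])
```

Because the weights are non-negative and sum to 1, each component of the context vector lies between the smallest and largest value of that component across the encoder states. In a real model the three parameter arrays are learned jointly with the rest of the network.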
What it unlocked
- Standard practice in NMT within months
- Luong attention (2015): simpler multiplicative scoring, including a plain dot-product variant
- The Transformer (2017, Paper 08): removed the RNN entirely and kept the softmax-weighted sum, replacing the additive score with scaled dot-product scoring
- Transformer-based language models (GPT, BERT, Claude, Gemini) use a direct descendant of these equations
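For comparison with the additive score, the dot-product variants can be sketched as follows. Luong's "dot" score uses the plain dot product, and the Transformer divides it by the square root of the key dimension. The query, key and value names follow Transformer terminology. In the encoder-decoder setting of this paper, the query plays the role of the decoder state and the keys and values are both the encoder states. The function name and the toy vectors are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, keys, values, scaled=True):
    """Score each key by its dot product with the query, then average the values."""
    factor = 1.0 / math.sqrt(len(query)) if scaled else 1.0
    scores = [factor * sum(q * k for q, k in zip(query, key)) for key in keys]
    alpha = softmax(scores)
    dim = len(values[0])
    out = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(dim)]
    return alpha, out

# Toy vectors (assumed): the query matches the first key exactly.
keys = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]

alpha_plain, _ = dot_product_attention(query, keys, keys, scaled=False)
alpha_scaled, out = dot_product_attention(query, keys, keys, scaled=True)
print(alpha_plain, alpha_scaled, out)
```

Unlike the additive score, this form has no learned vₐ or tanh layer, and it requires the query and keys to share a dimension. Scaling shrinks the scores, so the scaled weights are less peaked than the unscaled ones.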
What it left open
- O(T×S) attention cost (T target steps, S source positions), roughly quadratic when the two lengths are similar, limits sequence length
- Still sequential (RNN bottleneck) — slow to train at scale
- Only cross-attention (decoder→encoder), not self-attention within a sequence
- Additive attention is slower than dot-product variants
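The O(T×S) cost in the first bullet follows from computing one alignment score per pair of target step and source position. A small counting sketch, with a hypothetical helper name, shows how quickly that grows when both lengths increase together:

```python
def score_evaluations(target_len, source_len):
    """One alignment score per (target step, source position) pair."""
    return target_len * source_len

# Equal source and target lengths, chosen only for illustration.
for n in (10, 100, 1000):
    print(n, score_evaluations(n, n))
```

Making both sentences ten times longer makes the number of scores a hundred times larger.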
Difficulty
🔴 The math (Section 4) is advanced undergrad — matrix products, tanh, softmax. 🟡 The concept (Sections 1–3) and the code (Section 6) are first-year college. 🟢 The summary and analogy are accessible to anyone.
Next paper: Paper 08 — Attention Is All You Need (Transformer) → Back to: Paper 06 — Seq2Seq →