9. Summary — one page on Bahdanau attention

The paper in one sentence

When neural networks translate language, the decoder should look back at every source word with learned relevance weights at each generation step — not translate blindly from a single compressed memory.

The problem it solved

The original seq2seq model (2014) compressed the entire source sentence into a single fixed-size vector before decoding. For long sentences this bottleneck lost information and degraded translation quality: details from the start of the source had often faded from the vector by the time the encoder reached the end.

The core idea

Attention weights: At each decoding step t, the model scores every source hidden state hᵢ against the decoder’s previous state sₜ₋₁. A softmax over the source positions turns these scores into attention weights αₜᵢ, which are non-negative and sum to 1.
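To make the softmax step concrete, here is a minimal sketch in plain Python. The scores in `scores_t` are made-up values for a four-word source sentence, not numbers from the paper, and the helper name `softmax` is chosen here for illustration.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical alignment scores e_t1..e_t4 at one decoding step t.
scores_t = [2.0, 0.5, -1.0, 0.5]
alpha_t = softmax(scores_t)

for i, a in enumerate(alpha_t, start=1):
    print(f"alpha_t{i} = {a:.3f}")
```

The highest score gets the largest weight, and equal scores get equal weights; only the differences between scores matter, not their absolute size.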

Context vector: A fresh context vector cₜ is computed as the weighted sum of all encoder states:

cₜ = Σᵢ αₜᵢ · hᵢ
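The weighted sum can be sketched in a few lines of plain Python. The encoder states and weights below are toy values chosen for readability, and `context_vector` is an illustrative helper name, not something from the paper.

```python
def context_vector(alpha, encoder_states):
    """Weighted sum of encoder states: c_t = sum_i alpha_ti * h_i."""
    dim = len(encoder_states[0])
    return [
        sum(a * h[k] for a, h in zip(alpha, encoder_states))
        for k in range(dim)
    ]

# Three toy encoder states h_1..h_3, each of dimension 2 (assumed values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Attention weights for one decoding step; they sum to 1.
alpha_t = [0.7, 0.2, 0.1]

c_t = context_vector(alpha_t, encoder_states)
print(c_t)  # close to [0.8, 0.3], up to floating-point rounding
```

Because the weights sum to 1, cₜ always lies inside the range spanned by the encoder states; a weight near 1 on one position makes cₜ nearly a copy of that state.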

Bidirectional encoder: A forward RNN and a backward RNN read the source in opposite directions (the paper uses gated recurrent units rather than LSTMs). Their states at each position are concatenated, giving each source position a representation that reflects both its left and right context.
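The concatenation step can be illustrated with a deliberately simplified sketch. A scalar tanh cell with fixed weights stands in for the gated recurrent cell used in practice, and the names `run_rnn` and `bidirectional_states` are hypothetical; only the forward/backward/concatenate pattern is the point.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.5):
    """Toy scalar RNN: h_i = tanh(w_in * x_i + w_rec * h_{i-1}), with h_0 = 0."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs)
    # Run over the reversed source, then flip back so index i matches position i.
    backward = run_rnn(inputs[::-1])[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

source = [1.0, -1.0, 2.0]  # toy scalar "embeddings", assumed values
annotations = bidirectional_states(source)
for i, h in enumerate(annotations, start=1):
    print(f"h_{i} = {h}")
```

The first component of each pair has seen everything up to that position, and the second has seen everything from that position to the end, so together they summarise the whole sentence around position i.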

The fixed bottleneck is replaced by a dynamic, query-dependent lookup.

The analogy

A student writing a board exam essay does not memorise the textbook and close it. She keeps it open, glancing back at the relevant paragraph for each sentence she writes. The encoder states are the open textbook. The attention weights are where her eyes point. The context vector is the information her eyes just extracted.

The key equations

eₜᵢ = vₐᵀ · tanh(Wₐ · sₜ₋₁ + Uₐ · hᵢ)     ← alignment score (additive attention)
αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)                ← attention weight (softmax)
cₜ  = Σᵢ αₜᵢ · hᵢ                            ← context vector (weighted sum)
sₜ  = f(sₜ₋₁, yₜ₋₁, cₜ)                      ← decoder update
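The first three equations can be put together as one attention step. This is a sketch under stated assumptions: the matrices `W_a`, `U_a` and the vector `v_a` are set to simple fixed values rather than learned, all dimensions are 2, and the decoder update f is left out because it depends on the chosen recurrent cell.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One attention step: additive scores, softmax weights, context vector."""
    Ws = matvec(W_a, s_prev)  # identical for every source position, so computed once
    scores = []
    for h in H:
        Uh = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * z for v, z in zip(v_a, hidden)))  # e_ti
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]  # alpha_ti
    dim = len(H[0])
    context = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]  # c_t
    return alpha, context

# Toy values (assumptions for illustration, not trained parameters).
identity = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [1.0, 0.0]                          # previous decoder state s_{t-1}
H = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # encoder states h_1..h_3
alpha, context = additive_attention(s_prev, H, identity, identity, [1.0, 1.0])
print(alpha, context)
```

In a trained model `W_a`, `U_a` and `v_a` are learned jointly with the rest of the network, which is what lets the alignment emerge without any alignment supervision.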

What it unlocked

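Attention became standard practice in NMT shortly after publication
Luong attention (2015): simpler multiplicative scoring, including a plain dot-product variant
The Transformer (2017, Paper 08): removed the RNN entirely and kept the softmax-weighted sum, replacing the additive score with a scaled dot product
Every modern language model (GPT, BERT, Claude, Gemini) is built on attention that descends directly from these equations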
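For comparison with the additive score, here is a minimal sketch of dot-product attention in plain Python. With `scaled=False` the score is a bare dot product; with `scaled=True` it is divided by √d, as in the Transformer. The function name and toy vectors are assumptions for illustration only.

```python
import math

def dot_product_attention(query, keys, values, scaled=True):
    """Score each key by its dot product with the query, then softmax and sum."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scaled:
        scores = [s / math.sqrt(d) for s in scores]  # scaling by sqrt(d)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
    return weights, output

# Toy vectors (assumed values); keys double as values here for simplicity.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, output = dot_product_attention(query, keys, keys)
print(weights, output)
```

The softmax and the weighted sum are unchanged from the additive version; only the scoring function differs, and it needs no tanh layer or extra parameters.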

What it left open

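O(T×S) cost for T target and S source positions (quadratic when the two are similar in length), since every decoding step scores every source position
Still sequential: the RNN encoder and decoder process tokens one step at a time, which makes training slow at scale
Only cross-attention (decoder→encoder), not self-attention within a sequence
Additive attention is slower in practice than dot-product variants, which reduce to plain matrix multiplications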

Difficulty

🔴 The math (Section 4) is advanced undergrad — matrix products, tanh, softmax. 🟡 The concept (Sections 1–3) and the code (Section 6) are first-year college. 🟢 The summary and analogy are accessible to anyone.

Next paper: Paper 08: Attention Is All You Need (Transformer) →
← Back to: Paper 06: Seq2Seq