7. Limitations — what attention still got wrong
Bahdanau attention was a genuine breakthrough. Looking back from 2025, though, the paper clearly solved one problem while leaving several others untouched, and one of those limitations turned out to be a serious constraint on scaling.
1. Quadratic time and memory complexity
At each decoding step t, the model computes one alignment score for every source position. If the source sentence has T words and the target sentence has S words, the total number of alignment scores is T × S.
For a 50-word source translated into a 50-word target, this is 2,500 scores. Fine. But for a 1,000-word document with a translation of similar length, it is 1,000,000. For 10,000 words, roughly the length of a research paper, it is 100,000,000. The cost grows as O(T × S), which is quadratic in length whenever the target grows in proportion to the source.
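The arithmetic is easy to check with a few lines of Python. This is only a counting sketch: the function name `alignment_score_count` is made up for illustration, and it assumes exactly one score per (decoding step, source position) pair, as described above.

```python
def alignment_score_count(source_len: int, target_len: int) -> int:
    """One alignment score per (decoding step, source position) pair."""
    return source_len * target_len


# Equal source and target lengths, matching the examples in the text.
for n in (50, 1_000, 10_000):
    print(f"{n} x {n} words: {alignment_score_count(n, n):,} scores")

# Making both sequences 10 times longer multiplies the work by 100.
assert alignment_score_count(500, 500) == 100 * alignment_score_count(50, 50)
```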
This is not a minor inconvenience. It is one reason RNN-based attention models were rarely applied to very long texts. The Transformer (Paper 08) has the same quadratic cost in its self-attention, now in the length of a single sequence, and this became one of the major research problems of the 2020s. It spawned sparse and approximate variants such as Longformer and Performer, as well as FlashAttention, which keeps attention exact and still does a quadratic amount of computation but avoids materialising the full attention matrix in memory.
2. Sequential computation — no parallelism
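Because the decoder is a recurrent network (a GRU-style gated unit in the original paper, an LSTM in many later systems), each hidden state depends on the previous one. The model must process word 1 before word 2, word 2 before word 3, and so on. The computation is inherently sequential: you cannot parallelise across the output sequence, even during training when the target words are already known.
On a modern GPU with thousands of cores, this sequential bottleneck is wasteful. GPUs are designed to process many things simultaneously, but a recurrent decoder can exploit that parallelism only across the batch and within a single step, not across time steps. Training on large datasets was slow.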
The Transformer architecture (Paper 08) addressed this directly. By removing recurrence entirely and replacing it with self-attention, the Transformer can process all positions simultaneously during training. This training efficiency, as much as the improvement in translation quality, is a major reason Transformers replaced RNNs.
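The difference in dependency structure can be shown with a toy sketch. Everything here is a simplification chosen for illustration: the states are scalars rather than vectors, the weights `w` and `u` are arbitrary constants, and the attention function has no learned projections. The point is only that each recurrent state needs the previous one, while each attention output depends on the inputs alone and can be computed in any order (and therefore in parallel).

```python
import math


def recurrent_states(inputs, w=0.5, u=1.0):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t)."""
    h, states = 0.0, []
    for x in inputs:  # step t cannot start until step t-1 has finished
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


def attention_output(inputs, i):
    """Toy scalar self-attention for position i; uses only the raw inputs."""
    scores = [inputs[i] * x for x in inputs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))


xs = [0.1, -0.4, 0.7, 0.2]

# Attention outputs are independent of each other: computing them
# back-to-front gives exactly the same values as front-to-back.
in_order = [attention_output(xs, i) for i in range(len(xs))]
reverse_order = {i: attention_output(xs, i) for i in reversed(range(len(xs)))}
assert in_order == [reverse_order[i] for i in range(len(xs))]

# The recurrent states have no such freedom: the loop order is fixed.
print(recurrent_states(xs))
```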
3. Still an RNN at its core
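Bahdanau’s model is still a seq2seq RNN with attention bolted on. The encoder (a bidirectional RNN) and the decoder are still gated recurrent networks: GRU-style units in the original paper, LSTMs in many follow-up systems. These have known limitations: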
- They still struggle with very long-range dependencies (beyond ~100 words), even with attention helping the decoder
- Gradient flow through long sequences remains difficult despite the attention shortcut
- The encoder still reads the source sequentially — it cannot be parallelised during encoding
Attention fixes the decoder’s inability to re-read the source, but it does not fix the fundamental sequential nature of RNNs.
4. Additive attention is slower than dot-product
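Bahdanau’s alignment function, vₐᵀ tanh(Wₐs + Uₐh), requires two matrix-vector products, an addition, a tanh, and a dot product for each (source position, decoding step) pair, although Uₐh does not depend on the decoding step and can be computed once per source position. This is more expensive than the plain dot product s · h, one of the scoring functions in Luong attention (2015), or the scaled dot product later used by the Transformer.
In 2015, Luong et al. compared several scoring functions and reported that the plain dot product worked well in their global attention model, at lower computational cost. By 2017, the Transformer paper described dot-product attention as much faster and more space-efficient in practice, because it maps onto optimised matrix multiplication, while noting that it needs a 1/√d scaling factor to keep up with additive attention at large dimensions.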
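To make the cost difference concrete, here is a minimal sketch of the three scoring functions in plain Python. The helper names (`matvec`, `dot`) and the tiny 2-dimensional example values are assumptions for illustration; real implementations use learned weight matrices and batched tensor operations.

```python
import math


def matvec(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def additive_score(s, h, W_a, U_a, v_a):
    """Bahdanau-style score: v_a^T tanh(W_a s + U_a h)."""
    pre_activation = [p + q for p, q in zip(matvec(W_a, s), matvec(U_a, h))]
    return dot(v_a, [math.tanh(z) for z in pre_activation])


def dot_score(s, h):
    """Plain dot-product score: s . h (no parameters at all)."""
    return dot(s, h)


def scaled_dot_score(s, h):
    """Transformer-style score: (s . h) / sqrt(d)."""
    return dot(s, h) / math.sqrt(len(s))


# Illustrative values only; identity matrices stand in for learned weights.
s = [1.0, 0.0]            # decoder state
h = [0.5, -0.5]           # one encoder state
identity = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]

print(additive_score(s, h, identity, identity, v_a))
print(dot_score(s, h))
print(scaled_dot_score(s, h))
```

The additive form does two matrix-vector products and a tanh per pair, while the dot-product forms do a single inner product, which is why scoring a whole sentence with the latter collapses into one matrix multiplication.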
5. Only cross-attention, not self-attention
Bahdanau attention is a cross-attention mechanism: the decoder looks at the encoder states. There is no mechanism for each encoder position to attend to other encoder positions, or for the decoder to attend to its own past outputs.
Self-attention, where every position in a sequence can attend to every other position in the same sequence, had appeared in earlier work, but the Transformer was the first translation model to rely on it entirely, with no recurrence. It allows the model to capture relationships within a single sequence (e.g., in “The server crashed because it ran out of memory”, the pronoun “it” refers to “server”, not “memory”). Bahdanau attention has no direct way to express this; within-sequence relationships are left to the recurrent states.
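The distinction comes down to which sequences supply the queries and the keys. The sketch below is a simplification under stated assumptions: it uses plain dot-product scores and raw state vectors with no learned projections, and the toy states are invented for illustration. It shows that cross-attention yields one row of weights per target position over the source, while self-attention yields a square matrix over a single sequence.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Row i holds the weights that query i places on every key."""
    return [
        softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
        for query in queries
    ]


# Toy 2-dimensional states, chosen only for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[0.5, 0.5], [1.0, -1.0]]             # 2 target positions

# Cross-attention: decoder queries, encoder keys -> 2 x 3 weights.
cross = attention_weights(decoder_states, encoder_states)

# Self-attention: queries and keys from the same sequence -> 3 x 3 weights.
self_attn = attention_weights(encoder_states, encoder_states)

print(len(cross), "x", len(cross[0]))
print(len(self_attn), "x", len(self_attn[0]))
```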
What the paper got just right
Despite these limitations, it is worth noting what Bahdanau’s paper got right: a fixed-size context vector is a bottleneck, and the solution is a dynamic, query-dependent weighted combination of all source representations. That insight, stripped of the RNN scaffolding, is what the Transformer built upon. The core steps of attention (align, normalise with softmax, take a weighted sum) carried over into the Transformer and the models built on it after 2017.
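Those three steps fit in a few lines. The following is a minimal sketch rather than any particular paper's implementation: the function name `attend`, the pluggable `score` argument, and the toy vectors are all assumptions made for illustration. Passing an additive scorer would give Bahdanau-style attention, and passing a dot product gives the later style.

```python
import math


def attend(query, keys, values, score):
    """Return (context, weights) for one query over a set of key/value pairs."""
    # 1. Align: score the query against every key.
    scores = [score(query, key) for key in keys]

    # 2. Normalise: softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # 3. Weighted sum: mix the values according to the weights.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


def dot_score(q, k):
    return sum(a * b for a, b in zip(q, k))


# With a zero query every score is 0, so the weights are uniform and the
# context is the plain average of the values.
states = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attend([0.0, 0.0], states, states, dot_score)
print(weights, context)
```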