2. The problem — sequential computation cannot scale
Bahdanau’s attention model was a genuine improvement, but it still had the recurrence problem baked in. It improved the decoder’s access to source information, yet both the encoder and the decoder were still recurrent networks (gated RNNs in the original paper, LSTMs in many later systems): still sequential, still unable to parallelise across the sequence length.
Two distinct problems held the field back.
Problem 1: Sequential computation wastes GPUs
Modern GPUs are massively parallel processors. A high-end GPU in 2017 had thousands of cores, each capable of running computations simultaneously. Their power comes entirely from doing many things at once.
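An LSTM running on a GPU keeps only a small fraction of those cores busy. Within a single time step the matrix multiplications do run in parallel, but the time steps themselves must run one after another, so most of the cores sit idle for much of the computation. The GPU is being used more like a single-threaded CPU. You have rented an entire cricket stadium and invited one player.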
The cause: hidden state hₜ depends on hₜ₋₁. You cannot compute step t until step t−1 is done. For a sequence of length T, the minimum number of sequential steps is T — and you cannot shrink this no matter how powerful your hardware.
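The dependency described above can be sketched in a few lines of NumPy. This is a minimal illustration using a plain tanh RNN rather than a full LSTM; the function name rnn_forward, the weight names W_x and W_h, and all sizes are assumptions made for the example. The input projections for every position can be computed in one batched multiplication, but the loop over t cannot be removed, because each iteration needs the state produced by the previous one.

```python
import numpy as np

def rnn_forward(x, W_x, W_h, h0):
    """Run a plain tanh RNN over x of shape (T, d_in); returns states of shape (T, d_h)."""
    # The input projections do not depend on the hidden state,
    # so all T of them can be computed in one parallel matrix multiply.
    xs = x @ W_x
    h = h0
    states = []
    for t in range(x.shape[0]):
        # This line needs h from step t-1, so the T iterations must run in order.
        h = np.tanh(xs[t] + h @ W_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_in, d_h = 50, 8, 16          # illustrative sizes only
x = rng.normal(size=(T, d_in))
W_x = rng.normal(size=(d_in, d_h)) * 0.1
W_h = rng.normal(size=(d_h, d_h)) * 0.1

H = rnn_forward(x, W_x, W_h, np.zeros(d_h))
print(H.shape)  # (50, 16)
```

However many cores are available, the loop body runs T times in sequence; extra hardware only speeds up the work inside each step.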
Training on the large datasets that progress in AI requires meant weeks per experiment when days should have been enough. A researcher with a new idea could wait weeks to find out whether it worked.
Problem 2: Long-range dependencies are hard
Even with Bahdanau attention helping the decoder, the encoder still read the source sequentially. Information about word 1 had to travel through hidden states h₂, h₃, h₄… before reaching h₅₀ at the end of the sentence. During training, the gradient signal for word 1 had to flow backward through all of those steps, and gradients are prone to vanishing over long paths.
This means an LSTM encoder, even with attention, struggles to preserve fine-grained information about words that appeared far back in the sequence. Words separated by 30 positions are much harder to relate than words 3 positions apart.
The attention mechanism in Bahdanau is “cross-attention” — the decoder attends over encoder states. But within the encoder itself, word 1 cannot directly look at word 30. It can only influence word 30 indirectly, by passing information through all the intermediate hidden states. That indirect path is noisy and lossy.
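To make the "lossy path" concrete, here is a toy sketch of how the gradient between two hidden states shrinks with their distance. It assumes a plain tanh RNN (not an LSTM, whose gating reduces but does not remove the effect) and hypothetical recurrent weights scaled so that each step can shrink a gradient by a factor of at least 0.9. The function grad_norm and all sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h = 50, 16                      # illustrative sizes only

# Hypothetical recurrent weights: an orthogonal matrix scaled to spectral norm 0.9.
Q, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))
W_h = 0.9 * Q

# Forward pass of h_t = tanh(h_{t-1} @ W_h + input_t), keeping every state.
inputs = rng.normal(size=(T, d_h)) * 0.5
h = np.zeros(d_h)
states = []
for t in range(T):
    h = np.tanh(h @ W_h + inputs[t])
    states.append(h)

def grad_norm(distance):
    """Spectral norm of d(last state)/d(state `distance` steps earlier)."""
    G = np.eye(d_h)
    for t in range(T - 1, T - 1 - distance, -1):
        # Per-step Jacobian: J[i, j] = d h_t[i] / d h_{t-1}[j]
        J = (1.0 - states[t] ** 2)[:, None] * W_h.T
        G = G @ J                    # chain rule: multiply one Jacobian per step
    return np.linalg.norm(G, 2)

near, far = grad_norm(3), grad_norm(30)
print(f"3 steps apart:  {near:.4f}")
print(f"30 steps apart: {far:.6f}")
```

Because the gradient is a product with one factor per step, the signal linking positions 30 steps apart is bounded by 0.9³⁰ (about 0.04) in this setup, against 0.9³ (about 0.73) for positions 3 steps apart.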
What the ideal solution would look like
The ideal model would:
- Process all positions in parallel — no sequential dependencies during training
- Allow any position to directly attend to any other position — path length between any two words is exactly 1, not proportional to sequence length
- Keep the attention mechanism — it was demonstrably useful for alignment and interpretability
- Work for both encoding and decoding — including masked decoding for generation
The Transformer provides all four. It replaces recurrence with self-attention: a mechanism where every position in a sequence attends to every other position simultaneously, using a handful of matrix multiplications that GPUs can execute in parallel across the whole sequence.
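As a rough sketch of what this looks like in practice, the snippet below implements single-head scaled dot-product self-attention in NumPy, with an optional causal mask for the decoding case in the list above. The names self_attention, W_q, W_k and W_v and the sizes are assumptions for illustration; a real Transformer adds multiple heads, positional information, residual connections and more.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head self-attention over X of shape (T, d_model); no loop over positions."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T): every position scored against every other
    if causal:
        # Masked decoding: position i may not attend to positions j > i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_k = 50, 32, 16                  # illustrative sizes only
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) * 0.1 for _ in range(3))

out, w = self_attention(X, W_q, W_k, W_v)
out_c, w_c = self_attention(X, W_q, W_k, W_v, causal=True)
print(out.shape, w.shape)  # (50, 16) (50, 50)
```

The weight w[49, 0] is a direct connection from position 50 to position 1: a path of length 1, with no intermediate hidden states. In the causal variant, every entry above the diagonal of w_c is zero, so no position sees the future.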