7. Limitations — what the Transformer still gets wrong

The Transformer is one of the most consequential architectures in the history of AI. It is also, like all designs, a set of trade-offs. Understanding its limitations is essential for making sense of the research agenda from 2018 onwards: much of the influential work that followed was at least partly a response to one of these problems.

1. Quadratic complexity in sequence length

This is the Transformer’s defining constraint.

The attention score matrix Q · Kᵀ has shape (T × T), where T is the sequence length. Computing it costs O(T² · dₖ) time and O(T²) memory for each attention head in each layer. For a sequence of 512 tokens, that is 262,144 scores. For 4,096 tokens, it is 16,777,216. For 100,000 tokens (a short book) it is 10 billion.
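
The arithmetic above can be checked with a small sketch. The function names and the assumption of 2 bytes per score (16-bit storage) are illustrative and not from the paper; the counts are for one attention head in one layer.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of entries in the (T x T) score matrix for one head."""
    return seq_len * seq_len


def score_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Memory needed to hold one full score matrix.

    bytes_per_score=2 assumes 16-bit scores (an illustrative choice).
    """
    return attention_score_count(seq_len) * bytes_per_score


if __name__ == "__main__":
    for seq_len in (512, 4_096, 100_000):
        count = attention_score_count(seq_len)
        gib = score_matrix_bytes(seq_len) / 2**30
        print(f"T={seq_len:>7,}: {count:>14,} scores, {gib:8.3f} GiB per head")
```

Doubling T quadruples both numbers, which is why context length rather than model width is the pressure point.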

The original Transformer was designed for sentences of about 100 tokens. Extending it to long documents required either chunking (losing global context) or accepting enormous memory costs. The maximum context length of early GPT models (512 tokens for GPT-1, 1024 for GPT-2) was driven largely by this quadratic cost.

This spawned an entire sub-field of efficient attention research: Longformer (2020), BigBird (2020), Linformer (2020), Performer (2020), FlashAttention (2022). These papers propose sparse attention patterns (Longformer, BigBird), low-rank or kernel approximations (Linformer, Performer), or exact algorithms that avoid materialising the full T × T matrix in memory (FlashAttention). Each reduces some part of the quadratic bottleneck, in compute, in memory, or both. Paper 19 (Ring Attention) in this curriculum directly addresses scaling to extremely long sequences.
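
To see why a sparse pattern helps, here is a rough count for a sliding-window mask, the kind of local pattern Longformer uses as one of its components. The function name and the window size in the usage lines are assumptions for illustration; real implementations add global tokens and other details.

```python
def sliding_window_score_count(seq_len: int, window: int) -> int:
    """Count (query, key) pairs when each query attends only to keys
    within `window` positions on either side of itself."""
    total = 0
    for q in range(seq_len):
        lo = max(0, q - window)            # first key this query can see
        hi = min(seq_len - 1, q + window)  # last key this query can see
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    seq_len, window = 4_096, 256  # illustrative values
    dense = seq_len * seq_len
    sparse = sliding_window_score_count(seq_len, window)
    print(f"dense: {dense:,}  windowed: {sparse:,}")
```

The windowed count grows roughly as T × (2 · window + 1), which is linear in T for a fixed window.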

2. Positional encoding is a hand-engineered hack

Recurrent networks knew word order by construction — they processed words in sequence. The Transformer, processing everything in parallel, had to add positional information as an explicit engineering choice.

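The sinusoidal positional encoding works but is somewhat arbitrary. Why sine and cosine? Why that particular frequency range? The paper offers a motivation (the encoding of position pos + k is a linear function of the encoding of position pos, which should make relative offsets easy to learn) but no proof that this is optimal, and it reports that learned positional embeddings performed almost identically.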
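
For reference, the encoding in question can be sketched in a few lines. This follows the formula in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name is an illustrative choice.

```python
import math


def sinusoidal_encoding(pos: int, d_model: int) -> list[float]:
    """Encoding for one position: sine on even dims, cosine on odd dims."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    encoding = []
    for i in range(d_model // 2):
        # Each pair of dimensions uses its own wavelength.
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding


if __name__ == "__main__":
    for pos in (0, 1, 2):
        print(pos, [round(v, 4) for v in sinusoidal_encoding(pos, 8)])
```

One consequence of the sine and cosine pairing is that the dot product between the encodings of two positions depends only on their offset, not on where the pair sits in the sequence.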

More practically: sinusoidal positional encodings are absolute — each position has a fixed encoding. But language often cares about relative position: “the word two positions before” rather than “the word at position 7.” Relative positional encodings (Shaw et al., 2018) and rotary positional embeddings (RoPE, used in LLaMA and many modern models) were developed to address this.
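
A minimal sketch of the rotary idea, assuming the usual formulation in which each consecutive pair of dimensions in a query or key is rotated by an angle proportional to the token's position. The function names and the base of 10000 are illustrative; production implementations differ in memory layout and batching.

```python
import math


def rotate(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each (even, odd) pair of dims by the angle pos * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_score(q: list[float], k: list[float], q_pos: int, k_pos: int) -> float:
    """Dot product of a rotated query and a rotated key."""
    rq = rotate(q, q_pos)
    rk = rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


if __name__ == "__main__":
    q = [1.0, 2.0, -0.5, 0.3]
    k = [0.4, -1.0, 0.7, 0.2]
    # Same offset (3) at different absolute positions gives the same score.
    print(rope_score(q, k, 2, 5), rope_score(q, k, 10, 13))
```

Because rotations compose, the score between a query at position m and a key at position n depends only on n - m, which is the relative behaviour that absolute encodings lack.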

3. No recurrence means no natural length generalisation

An RNN trained on sentences up to length 50 can often handle sentences of length 100 at inference time, because the same recurrent update is applied at every step. A Transformer trained on sequences of length 512 typically fails on sequences of length 1024 at inference time, because the model never saw those positions during training: a learned position table has no entries for them, and sinusoidal encodings, although defined for any position, produce inputs outside the training distribution.
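
The bluntest form of this failure appears with a learned position table, as used in GPT-1 and GPT-2: there is simply no row for an unseen position. The class below is a hypothetical stand-in with random values in place of trained ones, not code from any particular library.

```python
import random


class LearnedPositionTable:
    """Hypothetical stand-in for a learned positional embedding table."""

    def __init__(self, max_len: int, d_model: int, seed: int = 0):
        rng = random.Random(seed)
        # One row per position up to max_len; values are random here,
        # standing in for trained ones.
        self.rows = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(max_len)
        ]

    def lookup(self, pos: int) -> list[float]:
        if not 0 <= pos < len(self.rows):
            raise IndexError(f"position {pos} has no learned embedding")
        return self.rows[pos]


if __name__ == "__main__":
    table = LearnedPositionTable(max_len=512, d_model=8)
    print(len(table.lookup(511)))  # last trained position is available
    try:
        table.lookup(1023)
    except IndexError as err:
        print("lookup failed:", err)
```

Sinusoidal encodings avoid the hard failure, since they are defined for every position, but the model still has to cope with inputs it was never trained on.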

This is a structural limitation: the model has no inductive bias that says “patterns should repeat across longer sequences.” Researchers have worked around this with various positional encoding schemes and training on longer contexts, but it remains an ongoing challenge.

4. Feed-forward layers are position-wise and stateless

The FFN sub-layer processes each position independently. It has no memory of what was at other positions (that is attention’s job). Within a layer, the same weights are applied at every position. Each layer has its own FFN parameters, but nothing in the design lets an FFN specialise by token or content.

This means every token in a layer passes through the same generic MLP transformation, whatever its content. Researchers have since shown that FFN layers act as key-value memories (Geva et al., 2021) and that their capacity scales with width. But in the original dense design every FFN parameter is used for every token, so adding capacity raises the compute cost per token in proportion.
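
A small sketch makes the position-wise property concrete. It uses the paper's formula FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ with plain Python lists; the helper names and the toy weights in the usage lines are illustrative.

```python
def linear(x, weight, bias):
    """x: [d_in]; weight: [d_in][d_out]; bias: [d_out]."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def ffn(x, w1, b1, w2, b2):
    """Two linear maps with a ReLU in between, for a single position."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def ffn_over_sequence(seq, w1, b1, w2, b2):
    """Apply the same FFN weights to every position independently."""
    return [ffn(x, w1, b1, w2, b2) for x in seq]


if __name__ == "__main__":
    w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]   # d_model=2 -> d_ff=3
    b1 = [0.0, 0.0, 0.0]
    w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d_ff=3 -> d_model=2
    b2 = [0.0, 0.0]
    print(ffn_over_sequence([[1.0, 1.0], [2.0, -1.0]], w1, b1, w2, b2))
```

Changing the vector at one position leaves the output at every other position untouched, which is exactly what "stateless" means here.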

Mixture-of-Experts models (Paper 09, Shazeer et al.) address this by replacing a single FFN with many expert FFNs and a learned router that sends each token to a small subset of them (top-k routing). This gives the model conditional computation: different experts are activated for different tokens, so the parameter count can grow without a matching growth in compute per token.
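
The routing step can be sketched as follows. This is a simplified top-k gate without the noise term or load-balancing losses that published MoE systems use; the function names are hypothetical, and the gate logits are passed in directly, whereas a real router would compute them from the token with a learned projection.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only.

    Assumes every expert maps a vector to a vector of the same length.
    """
    output = [0.0] * len(x)
    for idx, weight in route_top_k(gate_logits, k):
        y = experts[idx](x)  # unselected experts are never evaluated
        output = [o + weight * yi for o, yi in zip(output, y)]
    return output


if __name__ == "__main__":
    # Four toy "experts" that just scale their input.
    experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
    print(moe_layer([1.0, 1.0], [0.1, 2.0, -1.0, 2.0], experts, k=2))
```

With four experts and k = 2, each token pays for two expert evaluations while the layer holds four experts' worth of parameters.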

5. Training requires enormous data and compute

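The original Transformer’s English-French experiments used the WMT 2014 dataset of 36 million sentence pairs. The big model trained for 3.5 days on 8 P100 GPUs; the base model took about 12 hours on the same hardware. At the time this was a substantial budget, although the paper noted it was a small fraction of the training cost of the best previous models.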

By 2020, the largest language models had over a hundred billion parameters (GPT-3 has 175 billion) and were trained on hundreds of billions of tokens across thousands of GPUs; within a few years, training sets reached trillions of tokens. The Transformer architecture scales well, but that scalability is a double-edged sword: reaching frontier results requires very large amounts of compute.
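
For a sense of scale, a widely used rule of thumb (an assumption introduced here, not something the original paper states) estimates training compute for a dense Transformer as roughly 6 × parameters × training tokens floating-point operations. The sketch below applies it to the publicly reported GPT-3 figures of 175 billion parameters and about 300 billion training tokens. The hardware throughput and utilisation in the usage lines are hypothetical placeholders, not measurements.

```python
SECONDS_PER_DAY = 86_400


def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def gpu_days(total_flops: float, peak_flops_per_gpu: float,
             utilisation: float) -> float:
    """GPU-days needed at a given peak throughput and utilisation."""
    sustained = peak_flops_per_gpu * utilisation
    return total_flops / sustained / SECONDS_PER_DAY


if __name__ == "__main__":
    flops = approx_training_flops(175e9, 300e9)
    print(f"estimated training compute: {flops:.2e} FLOPs")
    # Hypothetical accelerator: 100 TFLOP/s peak at 40% utilisation.
    print(f"GPU-days: {gpu_days(flops, 100e12, 0.4):,.0f}")
```

Estimates of this kind are coarse, but they show why parameter count and dataset size together, rather than either alone, determine the bill.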

This created a divide in AI research: organisations without large compute budgets effectively could not compete on the frontier. Papers 12, 13, 17, and 18 in this curriculum all grapple with questions of compute efficiency.

What the Transformer got definitively right

Despite all these limitations, two things the paper got exactly right have proven remarkably durable:

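The core attention formula softmax(QKᵀ/√dₖ)V remains the basis of attention in essentially every major model. Later variants change how heads share keys and values or how the computation is scheduled, but the scaled dot-product operation itself has survived. Systems such as GPT-4, Claude, and Gemini are generally understood to build on it, although their full architectures are not public.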
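
The formula is short enough to write out directly. This is a minimal single-head sketch on plain Python lists, without masking, batching, or multiple heads; the function names are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for one head.

    queries: [T_q][d_k], keys: [T_k][d_k], values: [T_k][d_v].
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # One score per key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(d_v)
        ])
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[10.0, 0.0], [0.0, 10.0]]
    print(scaled_dot_product_attention(q, k, v))
```

The inner list of scores is one row of the T × T matrix discussed under limitation 1, so this loop also makes the quadratic cost visible.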

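The block pattern of attention and feed-forward sub-layers wrapped in residual connections and layer normalisation has proven robustly trainable across a very wide range of scales. Later models changed the details: many are decoder-only rather than encoder-decoder, and many apply layer normalisation before each sub-layer rather than after. The specific hyperparameters of the base model (N=6, h=8, d_model=512) were chosen for the translation task, but the architectural pattern has held.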
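
The wrapper around each sub-layer can be sketched as below. The first function follows the arrangement in the original paper, LayerNorm(x + Sublayer(x)); the second shows the pre-normalisation variant adopted by many later models, starting with GPT-2. The learned gain and bias of layer normalisation are omitted for brevity, and the function names are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_sublayer(x, sublayer):
    """Original arrangement: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_sublayer(x, sublayer):
    """Later variant: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


if __name__ == "__main__":
    double = lambda v: [2.0 * t for t in v]  # toy stand-in for a sub-layer
    x = [1.0, 2.0, 3.0, 4.0]
    print(post_norm_sublayer(x, double))
    print(pre_norm_sublayer(x, double))
```

In both arrangements the input x reaches the output through an identity path, which is what keeps gradients flowing through deep stacks.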