Section 07

Limitations: what GPT-1 could not do

Improving Language Understanding by Generative Pre-Training 2018

GPT-1 was a landmark result, but understanding its limitations is as important as understanding its achievements. Several of these limitations directly motivated the next generation of models.


1. Unidirectional context

GPT-1 uses a causal decoder — each token attends only to previous tokens, never to future ones. This is necessary for the autoregressive pre-training objective, but it is a real constraint during fine-tuning.

Consider the sentence: “The bank was steep, not a place to deposit money.”

The word “bank” has two meanings. To understand which meaning is intended, you need to see “not a place to deposit money” — which comes after “bank.” A unidirectional model reading left-to-right builds the representation of “bank” without access to this disambiguating context.
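The difference between the two attention patterns can be sketched with plain boolean masks. This is a minimal illustration, not GPT-1's actual implementation: the helper names (`causal_mask`, `bidirectional_mask`, `visible_tokens`) are hypothetical, and the sentence is split into whole words for readability, whereas GPT-1 operates on byte-pair-encoded subword tokens.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_tokens(tokens, mask, position):
    """Return the tokens that the given position is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


# Word-level split is an assumption made for readability only.
tokens = ["The", "bank", "was", "steep", ",", "not", "a",
          "place", "to", "deposit", "money", "."]
bank = tokens.index("bank")

causal_view = visible_tokens(tokens, causal_mask(len(tokens)), bank)
full_view = visible_tokens(tokens, bidirectional_mask(len(tokens)), bank)

print(causal_view)             # ['The', 'bank']
print("deposit" in full_view)  # True
```

Under the causal mask, the representation of "bank" is built from "The" and "bank" alone; the disambiguating phrase is visible only under the bidirectional mask.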

BERT (Paper 11) was published about four months after GPT-1 (October versus June 2018) and directly addressed this limitation by using a bidirectional encoder, in which every token can attend to every other token. For tasks requiring deep understanding of a fixed input (reading comprehension, entailment), BERT’s bidirectional context proved significantly better.

The trade-off: BERT cannot generate text autoregressively (because its bidirectional attention requires seeing the whole sequence). GPT-1 can. This is not a flaw — it is a fundamental architectural choice. Decoder-only models (GPT family) became the standard for language generation. Encoder-only models (BERT family) became the standard for understanding tasks where the full input is given.


2. Fine-tuning still requires labelled data

GPT-1 dramatically reduced the amount of labelled data needed, but it did not eliminate it entirely. For tasks with very few examples (say, fewer than 100), GPT-1’s fine-tuning still struggled.

GPT-2 (2019) and especially GPT-3 (2020) addressed this through scale: by training on far more data with a far larger model, they showed that few-shot and zero-shot performance could become competitive without any fine-tuning at all. But GPT-1 still required at least hundreds of labelled examples per task.
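To make the contrast with fine-tuning concrete, here is a rough sketch of how a few-shot prompt can be assembled. The labelled examples are placed in the input text itself and no weights are updated. The function name, the "Review:/Sentiment:" template, and the sample reviews are all illustrative assumptions, not a format prescribed by any of the papers.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Pack labelled examples and one unlabelled query into a single prompt string.

    The template below is a hypothetical choice for illustration.
    """
    blocks = []
    if instruction:
        blocks.append(instruction)
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The final block is left open so the model's continuation is the prediction.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


# Made-up examples, purely for demonstration.
examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
prompt = build_few_shot_prompt(
    examples,
    query="A forgettable film with one good scene.",
    instruction="Classify the sentiment of each review.",
)
print(prompt)
```

With GPT-1, the same two examples would have had to be part of a much larger labelled set used for gradient-based fine-tuning.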

This was a real barrier for low-resource languages and specialised domains (e.g., medical records, legal documents) where annotated data is expensive.


3. Context window of 512 tokens

GPT-1 could only see 512 tokens at a time — roughly 400 words. For tasks requiring understanding of long documents (legal contracts, research papers, books), this was insufficient. Information beyond 512 tokens was simply invisible to the model.
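The effect of a fixed window can be sketched in a few lines. This is an illustrative sketch: the function name is hypothetical, the token ids are dummy integers, and keeping the first 512 tokens (rather than the last) is an assumption, since the truncation strategy depends on the task pipeline.

```python
CONTEXT_WINDOW = 512  # GPT-1's maximum sequence length, in tokens


def truncate_to_window(token_ids, window=CONTEXT_WINDOW):
    """Split a token sequence into the part the model sees and the part it never sees."""
    visible = token_ids[:window]
    dropped = token_ids[window:]
    return visible, dropped


# A stand-in for a 2,000-token document (dummy ids, not real tokenizer output).
document = list(range(2000))
visible, dropped = truncate_to_window(document)

print(len(visible), len(dropped))  # 512 1488
```

In this example roughly three quarters of the document never reaches the model, regardless of how relevant that text is to the task.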

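Subsequent architectures addressed this through various means, including relative positional embeddings, sparse attention, and sliding-window attention. Combined with inference-time optimisations such as key-value caching, which makes generation over long sequences cheaper rather than extending the window itself, these techniques led to modern long-context models (100,000+ tokens).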
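One of these ideas, sliding-window attention, can be sketched as a modified causal mask in which each position attends only to a fixed number of recent positions. The function names and the toy sizes below are assumptions chosen for illustration; real implementations differ in many details.

```python
def sliding_window_causal_mask(n, window):
    """mask[i][j] is True when j is one of the `window` most recent positions up to i."""
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]


def allowed_pairs(mask):
    """Count the (query, key) pairs that attention has to compute."""
    return sum(sum(row) for row in mask)


n = 8
full_causal = sliding_window_causal_mask(n, window=n)  # window covers everything: plain causal mask
local = sliding_window_causal_mask(n, window=3)

print(allowed_pairs(full_causal))  # 36
print(allowed_pairs(local))        # 21
```

The number of attended pairs grows roughly as n × window instead of n², which is what makes much longer sequences affordable, at the price of each layer seeing only local context.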


4. No instruction following

GPT-1 was trained to predict the next word. It was not trained to follow instructions or to be helpful in a conversational sense. If you typed “Summarise this document for me,” the model would — at best — continue generating text in a plausible document-summary style. It would not understand that you were asking it to do something for you.

Making language models actually follow instructions required RLHF (Reinforcement Learning from Human Feedback), described in Papers 15 and 16. This was not obvious in 2018 — it took until 2022 to become standard practice.


5. The pre-training data was not diverse enough

BooksCorpus contained about 7,000 unpublished English-language books, mostly fiction. Books are high-quality, long-form text, which is good for learning long-range dependencies. But they are also a narrow slice of human language use. Scientific writing, legal language, code, non-native English, Indian English patterns, and many other registers were absent or underrepresented.

GPT-2 switched to WebText (40GB of web pages), and GPT-3 to a mix of filtered Common Crawl, WebText2, two book corpora, and Wikipedia (trained on about 300 billion tokens). The improvement in data scale and diversity was as important as the improvement in model scale.


6. No multilingual support

GPT-1 was trained exclusively on English text. A student reading this in Patna or Jaipur would find the model useless for Hindi, Bengali, or any other Indian language. This was not an inherent limitation of the architecture — subsequent models (mBERT, XLM-R, and others) showed that multilingual pre-training works well. But GPT-1 did not pursue it.


The limitations as a roadmap

Looking at these limitations in 2018 gives you a roadmap of the next five years of AI research:

Limitation                 | Solution (which paper)
Unidirectional context     | BERT: bidirectional encoder (Paper 11)
Needs labelled fine-tuning | GPT-3 few-shot, then RLHF (Papers 12, 15)
Short context window       | Sparse attention, sliding window, RoPE
No instruction following   | InstructGPT / RLHF (Paper 15)
Narrow data                | GPT-3 data pipeline (Paper 12)
English only               | Multilingual models (mBERT, XLM-R, Gemini)

GPT-1 is not the end of the story. It is the beginning of a specific, increasingly powerful paradigm — one that is still being extended today.