Paper 10 — Improving Language Understanding by Generative Pre-Training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever · OpenAI · 2018
What this paper did
It showed that a single pre-trained model, fine-tuned with minimal changes, could outperform purpose-built models across a wide range of language tasks.
Before GPT-1, the standard approach to NLP was to gather labelled data for a specific task (sentiment, question answering, textual entailment), design a task-specific architecture, and train it largely from scratch. This worked, but it required an expensive labelled dataset for every new task, and each model started with little prior knowledge beyond pre-trained word embeddings.
Radford et al. took the decoder half of the Transformer and pre-trained it on roughly 800 million words of BooksCorpus using a single objective: predict the next word. No labels are needed, because the supervision comes from the text itself. After pre-training, they fine-tuned the same model on labelled task datasets with one key constraint: the Transformer body stays as it is, and only a linear output layer is added. Instead of redesigning the model for each task, they transformed the input to match the pre-training format.
The resulting model improved on the state of the art on 9 of the 12 benchmarks studied, spanning natural language inference, question answering, semantic similarity, and text classification.
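To make the idea of reshaping the input concrete, here is a minimal sketch of how different tasks can be flattened into one token sequence. The paper describes a start token, a delimiter, and an end ("extract") token; the exact strings `<s>`, `$`, and `<e>` below, the whitespace tokeniser, and all function names are illustrative assumptions, not the authors' code.

```python
# Hypothetical special tokens (illustrative strings, not the paper's vocabulary).
START, DELIM, EXTRACT = "<s>", "$", "<e>"


def classification_input(text):
    """Single text: wrap it in start and extract tokens."""
    return [START] + text.split() + [EXTRACT]


def entailment_input(premise, hypothesis):
    """Two ordered texts: join them with a delimiter token."""
    return [START] + premise.split() + [DELIM] + hypothesis.split() + [EXTRACT]


def similarity_inputs(text_a, text_b):
    """No natural order, so build both orderings; each one is run
    through the model and the two representations are combined."""
    return [entailment_input(text_a, text_b), entailment_input(text_b, text_a)]


def multiple_choice_inputs(context, answers):
    """One sequence per candidate answer; the model scores each one."""
    return [entailment_input(context, answer) for answer in answers]


example = entailment_input("a man is sleeping", "a man is awake")
```

In every case the model still sees a plain left-to-right token sequence, which is why the pre-trained body can be reused without modification.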
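The key equations (each is a log-likelihood to be maximised; the training loss is its negative):
Pre-training objective: L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ,...,uᵢ₋₁; Θ)
Fine-tuning objective: L₂(C) = Σ_(x,y) log P(y | x¹,...,xᵐ)
Combined objective: L₃(C) = L₂(C) + λ·L₁(C)
Where U is the unlabelled text corpus, k is the size of the context window, Θ denotes the model parameters, C is the labelled downstream dataset of token sequences x¹,...,xᵐ with labels y, and λ is a weight (0.5 in the paper) that keeps the language modelling objective active during fine-tuning.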
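A small numeric sketch may help show how the three quantities fit together. It assumes the model has already produced a probability for each correct next token and each correct label; the function names and the toy probabilities are made up for illustration, and λ = 0.5 is the value reported in the paper.

```python
import math


def lm_objective(token_probs):
    """L1: sum of log P(u_i | previous k tokens) over all positions."""
    return sum(math.log(p) for p in token_probs)


def task_objective(label_probs):
    """L2: sum of log P(y | x^1..x^m) over all labelled examples."""
    return sum(math.log(p) for p in label_probs)


def combined_objective(label_probs, token_probs, lam=0.5):
    """L3 = L2 + lambda * L1, both evaluated on the labelled dataset C."""
    return task_objective(label_probs) + lam * lm_objective(token_probs)


# Toy values: probabilities the model assigned to the correct tokens/label.
token_probs = [0.5, 0.25]   # two next-token predictions
label_probs = [0.5]         # one labelled example

l1 = lm_objective(token_probs)
l2 = task_objective(label_probs)
l3 = combined_objective(label_probs, token_probs)
```

All three values are negative log-likelihood sums with the sign flipped, so training pushes them towards zero from below.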
The Indian analogy
Consider a student who, before the Board exams, spent three years reading every novel, newspaper, science magazine, and history book they could find. They never crammed any specific exam syllabus — they just read broadly and deeply.
Now, one month before the exam, they spend a week on each subject’s past papers (fine-tuning). Because they already understand how arguments are constructed (language), how stories develop (reasoning), and how facts relate (knowledge), they need very few practice examples to ace each specific test.
Contrast this with a classmate who started studying only when the syllabus was announced, with no prior reading. That classmate needs months of subject-specific coaching and still knows only what was explicitly taught.
GPT-1’s pre-training is the three years of broad reading. Fine-tuning is the one-month sprint. The pre-trained model has a head start that a model trained from scratch on a single task lacks, because language understanding transfers across tasks.
Read in this order
| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | NLP in 2018 — the labelled data bottleneck | 🟢 | 4 min |
| 2. The Problem | Why task-specific models fail to generalise | 🟢 | 3 min |
| 3. The Idea | Pre-train on books, fine-tune on tasks — no architecture changes | 🟡 | 5 min |
| 4. The Math | Autoregressive LM objective, fine-tuning loss, input transformations | 🔴 | 10 min |
| 5. Worked Example | Forward pass through GPT-1 on a sentiment classification task | 🔴 | 8 min |
| 6. The Code | Causal language model in NumPy; input transformation for classification | 🟡 | 6 min |
| 7. Limitations | Unidirectional context, no instruction following, fine-tuning still needs labels | 🟡 | 4 min |
| 8. Impact | GPT-2, GPT-3, and how GPT-1’s paradigm took over AI | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 2 min |
Also: Glossary · Quiz · Further Reading
Before you read: math tutorials you need
- Conditional Probability → — the autoregressive objective is built on P(wₜ | w₁,…,wₜ₋₁) ✅
- Cross-Entropy Loss → — pre-training minimises cross-entropy over next-token predictions ✅
- Softmax Function → — converts logits to token probabilities at every decoding step ✅
- Transformer (Paper 08) → — GPT-1 uses the decoder stack from this paper ✅
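As a quick refresher on how the first three prerequisites connect, the sketch below turns a vector of logits into next-token probabilities with softmax and then scores the correct token with cross-entropy. The three-word vocabulary and the logit values are invented for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_index])


# Toy vocabulary and logits for P(next token | "the").
vocab = ["the", "cat", "sat"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
loss = cross_entropy(logits, vocab.index("cat"))
```

Summing this loss over every position in a text gives the negative of the autoregressive objective, which is what pre-training minimises.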
GPT-1 architecture at a glance
Input tokens (text + special markers)
        │
        ▼
Token Embedding + Positional Embedding
        │
        ▼
┌───────────────────────────────────────┐
│  Transformer Decoder Block × 12       │
│                                       │
│  Masked Multi-Head Self-Attention     │  ← causal: each token sees only past
│  Feed-Forward Network                 │
│  Layer Norm + Residual                │
└───────────────────────────────────────┘
        │
        ▼
Linear layer → Softmax → P(next token)   [pre-training]
        OR
Linear layer → Softmax → P(class label)  [fine-tuning]
The same 12-layer decoder handles both. Nothing in the Transformer body changes between pre-training and fine-tuning; only the output head is swapped.
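The two ideas in the diagram, the causal mask and the swappable output head, can be sketched in a few lines of NumPy. This is a single-head, single-layer toy with random weights, not GPT-1 itself; the dimensions, weight names, and the choice to classify from the last position's hidden state are assumptions made for illustration.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head masked self-attention; x has shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask future positions: token i may attend only to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, vocab_size, n_classes = 5, 8, 20, 2

x = rng.normal(size=(seq_len, d_model))  # stand-in for token + position embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
hidden, weights = causal_self_attention(x, w_q, w_k, w_v)

# Two different linear heads read the same hidden states.
w_lm = rng.normal(size=(d_model, vocab_size))   # pre-training head
w_cls = rng.normal(size=(d_model, n_classes))   # fine-tuning head
next_token_logits = hidden @ w_lm     # one prediction per position
class_logits = hidden[-1] @ w_cls     # one prediction per sequence
```

Because of the mask, editing the final token leaves the hidden states of all earlier positions unchanged, which is what allows every position to be trained on next-token prediction in a single pass.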