Paper 10 — Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever · OpenAI · 2018

What this paper did

It showed that a single pre-trained model, fine-tuned with minimal changes, could beat purpose-built models across a wide range of language tasks.

Before GPT-1, the standard approach to NLP was to gather labelled data for a specific task (sentiment, question answering, textual entailment), design a task-specific architecture, and train it from scratch. This worked, but it required an expensive labelled dataset for every new task, and each model started with little prior knowledge beyond, at best, pre-trained word embeddings.

Radford et al. took the decoder half of the Transformer and pre-trained it on roughly 800 million words of BooksCorpus using a single objective: predict the next token. No labels are needed, because the supervision comes from the text itself. After pre-training, they fine-tuned the same model on small labelled datasets with one key constraint: the Transformer body stays as it is, and only a linear output layer is added on top. Rather than redesigning the model for each task, they transformed each task’s input into a single token sequence that matches the pre-training format.

The result improved on the state of the art on 9 of the 12 benchmarks studied, spanning natural language inference, question answering, semantic similarity, and text classification.
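To make the input transformation concrete, here is a rough sketch of how a single text or a sentence pair can be packed into one token sequence with start, delimiter, and extract markers, following the layout the paper describes for classification and entailment. The marker strings, the whitespace tokeniser, and both function names are hypothetical stand-ins; the real model uses a byte-pair-encoding vocabulary and learned embeddings for the special tokens.

```python
# Hypothetical marker strings; the paper uses learned start, delimiter and extract tokens.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"


def to_classification_input(text):
    """Single-text tasks (e.g. sentiment): <s> text <e>."""
    return [START] + text.split() + [EXTRACT]


def to_entailment_input(premise, hypothesis):
    """Sentence-pair tasks: <s> premise <$> hypothesis <e>."""
    return [START] + premise.split() + [DELIM] + hypothesis.split() + [EXTRACT]


tokens = to_entailment_input("A man is sleeping", "A person is awake")
print(tokens)
```

Because every task ends up as one flat sequence, the pre-trained decoder can read it exactly as it read book text, and the classifier only needs the hidden state at the final (extract) position.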

The key equations:

Pre-training loss:  L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ,...,uᵢ₋₁; Θ)

Fine-tuning loss:   L₂(C) = Σ log P(y | x¹,...,xᵐ)

Combined loss:      L₃(C) = L₂(C) + λ·L₁(C)

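Where U is the unlabelled text corpus, k is the size of the context window, Θ are the model parameters, and C is the labelled downstream dataset of input sequences x¹,…,xᵐ with labels y (the sum in L₂ runs over every (x, y) pair in C). λ weights the language modelling objective that stays active during fine-tuning; the paper sets λ = 0.5. All three quantities are log-likelihoods to be maximised, so the loss minimised in practice is their negative.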
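The three objectives above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: random logits stand in for the outputs of a real model, the toy sizes are arbitrary, and the helper names (`log_softmax`, `lm_objective`, `clf_objective`) are made up for this example. Only λ = 0.5 comes from the paper.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def lm_objective(token_logits, next_tokens):
    """L1: sum over positions of log P(next token | previous tokens)."""
    log_probs = log_softmax(token_logits)                       # (T, vocab)
    return log_probs[np.arange(len(next_tokens)), next_tokens].sum()


def clf_objective(class_logits, label):
    """L2 for one labelled example: log P(y | x^1..x^m)."""
    return log_softmax(class_logits)[label]


rng = np.random.default_rng(0)
vocab_size, num_classes, seq_len = 50, 2, 6   # toy sizes, chosen arbitrarily

# Stand-ins for model outputs; a real model would produce these logits.
token_logits = rng.normal(size=(seq_len, vocab_size))
next_tokens = rng.integers(0, vocab_size, size=seq_len)
class_logits = rng.normal(size=num_classes)
label = 1

lam = 0.5                                      # value reported in the paper
l1 = lm_objective(token_logits, next_tokens)
l2 = clf_objective(class_logits, label)
l3 = l2 + lam * l1                             # combined objective to maximise
print(l1, l2, l3)
```

In a training loop one would minimise `-l3`; keeping the language modelling term is what the combined objective adds over plain supervised fine-tuning.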

The Indian analogy

Consider a student who, before the Board exams, spent three years reading every novel, newspaper, science magazine, and history book they could find. They never crammed any specific exam syllabus — they just read broadly and deeply.

Now, one month before the exam, they spend a week on each subject’s past papers (fine-tuning). Because they already understand how arguments are constructed (language), how stories develop (reasoning), and how facts relate (knowledge), they need very few practice examples to ace each specific test.

Contrast this with a classmate who started studying only when the syllabus was announced, with no prior reading. That classmate needs months of subject-specific coaching and still knows only what was explicitly taught.

GPT-1’s pre-training is the three years of broad reading. Fine-tuning is the one-month sprint. The pre-trained model begins with a head start that a task-specific model trained from scratch does not have, because language understanding transfers across tasks.

Read in this order

Section	What you will learn	Difficulty	Time
1. Context	NLP in 2018 — the labelled data bottleneck	🟢	4 min
2. The Problem	Why task-specific models fail to generalise	🟢	3 min
3. The Idea	Pre-train on books, fine-tune on tasks — no architecture changes	🟡	5 min
4. The Math	Autoregressive LM objective, fine-tuning loss, input transformations	🔴	10 min
5. Worked Example	Forward pass through GPT-1 on a sentiment classification task	🔴	8 min
6. The Code	Causal language model in NumPy; input transformation for classification	🟡	6 min
7. Limitations	Unidirectional context, no instruction following, fine-tuning still needs labels	🟡	4 min
8. Impact	GPT-2, GPT-3, and how GPT-1’s paradigm took over AI	🟢	4 min
9. Summary	One-page recap	🟢	2 min

Also: Glossary · Quiz · Further Reading

Before you read: math tutorials you need

Conditional Probability: the autoregressive objective is built on P(wₜ | w₁,…,wₜ₋₁)
Cross-Entropy Loss: pre-training minimises cross-entropy over next-token predictions
Softmax Function: converts logits to token probabilities at every decoding step
Transformer (Paper 08): GPT-1 uses the decoder stack from this paper

GPT-1 architecture at a glance

Input tokens (text + special markers)
       │
       ▼
 Token Embedding + Positional Embedding
       │
       ▼
 ┌───────────────────────────────────────┐
 │  Transformer Decoder Block × 12       │
 │                                       │
 │  Masked Multi-Head Self-Attention     │  ← causal: each token sees only past
 │  Feed-Forward Network                 │
 │  Layer Norm + Residual                │
 └───────────────────────────────────────┘
       │
       ▼
 Linear layer → Softmax → P(next token)       [pre-training]
       OR
 Linear layer → Softmax → P(class label)      [fine-tuning]

The same 12-layer decoder handles both. The Transformer body is unchanged between pre-training and fine-tuning; only the output head is swapped, and embeddings for the special marker tokens are added.
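The diagram can be approximated with a small NumPy sketch showing the two ideas that matter here: the causal mask, and one shared set of hidden states feeding two different output heads. It is a deliberate simplification, assuming a single attention head, one sub-layer, random weights, and no feed-forward network, layer norm, or residual connection. All names and sizes are illustrative, and the separate `w_lm` matrix is an assumption made for brevity (the paper reuses the token embedding matrix for the language modelling output).

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_self_attention(h, w_q, w_k, w_v):
    """Single-head masked self-attention: position t attends only to positions <= t."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    seq_len = h.shape[0]
    scores = q @ k.T / np.sqrt(k.shape[-1])                     # (T, T)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)                  # hide future positions
    weights = softmax(scores)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, vocab_size, num_classes = 5, 8, 20, 2         # toy sizes

h = rng.normal(size=(seq_len, d_model))        # stand-in for token + positional embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = causal_self_attention(h, w_q, w_k, w_v)

w_lm = rng.normal(size=(d_model, vocab_size))      # pre-training head (assumed separate here)
w_clf = rng.normal(size=(d_model, num_classes))    # fine-tuning head

next_token_probs = softmax(out @ w_lm)         # (T, vocab): one distribution per position
class_probs = softmax(out[-1] @ w_clf)         # (num_classes,): final position only
print(next_token_probs.shape, class_probs)
```

The attention weights come out lower-triangular, so changing a later token cannot affect the output at an earlier position. That property is what lets the same body be trained on next-token prediction and then reused for classification.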

