Further Reading — Improving Language Understanding by Generative Pre-Training
Further Reading — GPT-1
The original paper
Improving Language Understanding by Generative Pre-Training Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever — OpenAI, 2018 https://openai.com/research/language-unsupervised
The paper is 12 pages, well written, and accessible to anyone who has read Paper 08 and this guide. Pay special attention to Section 3 (the framework), Tables 2 to 4 (comparisons with the prior state of the art), and Table 5 (ablation studies).
Immediate successors (read these next)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin, Chang, Lee, Toutanova — Google, 2018 (Paper 11 on this site) Published about four months after GPT-1 (October vs. June 2018). Takes the encoder approach (bidirectional). Dominated NLU benchmarks for roughly the next two years. Understanding both GPT-1 and BERT gives you the full picture of the trade-off between left-to-right (decoder) and bidirectional (encoder) pre-training.
Language Models are Unsupervised Multitask Learners (GPT-2) Radford, Wu, Child, Luan, Amodei, Sutskever — OpenAI, 2019 https://openai.com/research/better-language-models Roughly 13× the parameters of GPT-1 (1.5B vs. 117M), trained on WebText. Shows strong zero-shot behaviour across tasks. The paper that attracted the first wave of mainstream AI coverage.
Code resources
minGPT — Andrej Karpathy https://github.com/karpathy/minGPT A clean, from-scratch implementation of GPT in ~300 lines of PyTorch. The best way to understand every component of the GPT architecture. Extensively commented and well-explained.
nanoGPT — Andrej Karpathy (the successor to minGPT) https://github.com/karpathy/nanoGPT Optimised for actual training. You can train a small character-level GPT on a laptop; reproducing GPT-2 (124M) takes a multi-GPU node.
The Annotated Transformer — Harvard NLP https://nlp.seas.harvard.edu/annotated-transformer/ Line-by-line annotation of the Transformer paper in code. Essential for understanding the self-attention mechanism that GPT-1 is built on.
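For a taste of what these resources walk through, the causal self-attention at the core of GPT can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not code taken from any of the repositories above; the function name `causal_self_attention`, the tensor shapes, and the random weights are all illustrative assumptions.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention (illustrative sketch only).

    x:             (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

The mask is what makes the model autoregressive: every attention weight above the diagonal is zero, so the output at each position depends only on that token and the ones before it. The implementations listed above add multiple heads, an output projection, dropout, and batching on top of this idea.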
Conceptual explanations
The Illustrated GPT-2 — Jay Alammar https://jalammar.github.io/illustrated-gpt2/ Visual, intuitive walkthrough of how GPT-2 (and by extension GPT-1) generates text. The attention visualisations are especially helpful.
Let’s build GPT: from scratch, in code, spelled out — Andrej Karpathy (YouTube) https://www.youtube.com/watch?v=kCc8FmEb1nY Roughly 2-hour video building a character-level language model from scratch in PyTorch, culminating in a GPT-like model. Essential viewing for anyone who wants to deeply understand autoregressive language modelling.
Historical context
ELMo: Deep contextualized word representations — Peters et al., 2018 https://arxiv.org/abs/1802.05365 Published just before GPT-1. Showed that contextualised representations improve NLP — but still required task-specific architectures. GPT-1 improved on this by using a single architecture for all tasks.
Semi-supervised Sequence Learning — Dai and Le, 2015 https://arxiv.org/abs/1511.01432 An earlier paper that pre-trained LSTMs on unlabelled text and fine-tuned on labelled tasks — a predecessor to GPT-1’s paradigm, but at smaller scale and without the Transformer architecture.
What to read on this site next
| Paper | Why |
|---|---|
| Paper 11 — BERT | The bidirectional counterpart to GPT-1; defines the other half of modern NLP |
| Paper 12 — GPT-3 | 175B parameters; in-context learning; the commercial tipping point |
| Paper 13 — Scaling Laws | Why more data + bigger models = better results, and how to predict by how much |
| Paper 15 — RLHF / InstructGPT | How GPT becomes an assistant; the missing piece that makes language models useful |