Language Models are Few-Shot Learners
Authors: Tom Brown, Benjamin Mann, Nick Ryder, and 28 others
Venue: NeurIPS 2020
Year: 2020
URL: https://arxiv.org/abs/2005.14165
What This Paper Did
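GPT-3 took the decoder-only transformer architecture of GPT-1 and GPT-2 and scaled it to 175 billion parameters, trained on roughly 300 billion tokens drawn mostly from filtered web text, plus books and Wikipedia. The innovation was not in the architecture (the main change from GPT-2 was alternating dense and sparse attention layers); it was almost entirely in the scale.
The breakthrough: at this scale, a language model can perform new tasks without any fine-tuning and without updating a single weight, just by reading examples in the prompt. You write a few demonstrations and the model continues the pattern. The paper distinguishes three settings: zero-shot (a task description but no examples), one-shot (one example), and few-shot (as many examples as fit in the 2,048-token context window, typically 10 to 100). This phenomenon is called in-context learning (ICL).
GPT-3 was evaluated on dozens of NLP benchmarks (the paper's aggregate results cover 42 accuracy-based benchmarks). On several of them, including translation, question answering, and cloze tasks, few-shot GPT-3 approached or matched fine-tuned state-of-the-art systems. On SuperGLUE it beat a fine-tuned BERT-Large baseline on several tasks but stayed well below the fine-tuned state of the art. It also handled tasks it was never explicitly trained for: two- and three-digit arithmetic, unscrambling words, using a newly defined word in a sentence, and writing news articles that human readers struggled to tell apart from human-written ones. Results were uneven, though: the paper reports weak few-shot performance on some natural language inference and reading comprehension datasets.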
The key equations are the same as GPT-1:
Autoregressive language model loss:
L = -Σ log P(u_i | u_1, ..., u_{i-1})
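To make the loss concrete, here is a minimal sketch that evaluates the sum for one short sequence. The per-token probabilities are made-up numbers standing in for what a model would assign to each correct next token; they are not from the paper.

```python
import math

def autoregressive_nll(token_probs):
    """Return -sum(log P(u_i | u_1..u_{i-1})) for one sequence.

    token_probs[i] is the probability the model assigned to the
    token that actually appeared at position i.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities for a three-token sequence (illustrative only).
probs = [0.5, 0.25, 0.125]
loss = autoregressive_nll(probs)
print(f"loss = {loss:.4f}")
```

The loss falls as the model puts more probability on the tokens that actually occur, which is all that training optimizes.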
In-context learning setup:
(x_1, y_1), (x_2, y_2), ..., (x_k, y_k), x_test → model predicts y_test
where (x_i, y_i) are examples in the prompt, x_test is a new input, and
the model outputs y_test without any weight updates.
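The setup above amounts to string formatting: demonstrations are concatenated ahead of the test input and the model is asked to continue. The sketch below assumes a simple "Q: / A:" template and a hypothetical helper named `build_prompt`; the paper uses task-specific templates, so treat the format as illustrative.

```python
def build_prompt(examples, x_test, k=None):
    """Format up to k (x_i, y_i) demonstrations followed by x_test.

    k=0 gives a zero-shot prompt, k=1 one-shot, and k=None uses
    every example supplied (few-shot).
    """
    shots = examples if k is None else examples[:k]
    parts = [f"Q: {x}\nA: {y}" for x, y in shots]
    # The prompt ends with an unanswered "A:" for the model to complete.
    parts.append(f"Q: {x_test}\nA:")
    return "\n\n".join(parts)

examples = [("What is 2 + 2?", "4"), ("What is 5 + 3?", "8")]
zero_shot = build_prompt(examples, "What is 7 + 4?", k=0)
one_shot = build_prompt(examples, "What is 7 + 4?", k=1)
few_shot = build_prompt(examples, "What is 7 + 4?")
print(few_shot)
```

Only the prompt text changes between the three settings; the model weights are identical in every case.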
Key numbers:
- 175 billion parameters
- 96 transformer layers
- 96 attention heads
- Hidden dimension 12,288
- Trained on 300 billion tokens
- About 570 GB of filtered Common Crawl text (down from roughly 45 TB before filtering), plus smaller curated corpora
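These numbers can be sanity-checked with a back-of-the-envelope parameter count. The sketch below assumes a standard transformer block (four d×d attention projections plus a feedforward layer with 4× expansion), ignores biases and layer-norm parameters, and assumes the output projection shares weights with the token embedding. The vocabulary size (50,257) and context length (2,048) are the values shown in the architecture diagram on this page.

```python
def approx_params(n_layers, d_model, vocab_size, n_ctx):
    """Rough parameter count for a decoder-only transformer."""
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    ffn = 2 * d_model * (4 * d_model)   # two linear layers, 4x expansion
    blocks = n_layers * (attn + ffn)
    # Token embeddings plus learned positional embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    return blocks + embeddings

total = approx_params(n_layers=96, d_model=12288, vocab_size=50257, n_ctx=2048)
head_dim = 12288 // 96
print(f"approx. parameters: {total / 1e9:.1f}B, head dimension: {head_dim}")
```

The estimate lands at roughly 174.6 billion, close to the headline 175 billion, with nearly all of it in the 96 blocks rather than in the embeddings.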
This single paper changed the field. It provided strong evidence that scale is a primary lever for capability. It turned “prompt engineering” into a practical skill. It showed that, for many tasks, the fine-tuning paradigm (train a task-specific model for each new task) could be replaced with a prompting paradigm (use one giant model, adapt via the prompt).
The Indian Analogy
Imagine a brilliant student who reads voraciously—thousands of textbooks, novels, news articles, encyclopedias. This student has absorbed so much knowledge that you can describe almost any new task with just 2–3 examples written on an exam sheet, and they’ll figure out the pattern.
Example: You write on the exam paper:
- Q: “What is 2 + 2?” A: “4”
- Q: “What is 5 + 3?” A: “8”
- Q: “What is 7 + 4?” A: [student writes “11” without being taught addition]
The student was never told the task was addition; they inferred it from the examples and applied what they already knew. That’s few-shot learning.
If you gave zero examples—just the problem 7 + 4 with no prior context—the student might still guess it’s arithmetic and attempt it. That’s zero-shot learning. With a single example, one-shot. With several examples, few-shot.
The reason this works is that the student’s massive background knowledge (from reading billions of words) includes patterns of how language is used, mathematical reasoning, factual information, logic, and more. When you give examples in the prompt, you’re activating that latent knowledge without retraining the student.
In contrast, consider BERT (the previous champion). BERT is like a well-read student who still needs a tutoring course (fine-tuning) for each new task: you give them labeled examples and have them practice repeatedly before they can solve new problems. A fine-tuned BERT is task-specific. GPT-3 is task-agnostic: it adapts from prompt context.
Comparison: GPT-3 vs BERT vs GPT-1
| Aspect | GPT-1 (2018) | BERT (2018) | GPT-3 (2020) |
|---|---|---|---|
| Parameters | 117M | 340M | 175B |
| Architecture | Decoder-only Transformer | Encoder-only Transformer | Decoder-only Transformer |
| Training objective | Causal language modeling | Masked language modeling + NSP | Causal language modeling |
| Task adaptation | Fine-tune on labeled data | Fine-tune on labeled data | Prompt from examples (no tuning) |
| Few-shot capability | Minimal | Minimal | Strong |
| Abilities emerging with scale | Not reported | Not reported | Few-shot arithmetic, translation, word unscrambling |
| Training compute | ~1 month on 8 GPUs | ~4 days on 16 Cloud TPUs (BERT-Large) | ~3,640 petaflop/s-days |
| What changed | Showed decoder architecture works | Showed masked LM is powerful | Proved scale unlocks in-context learning |
Read in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01. Context | Why 2020 needed a new approach; state of fine-tuning; why scale mattered | 🟢 Beginner | 8 min |
| 02. The Problem | Why fine-tuning every task is expensive; why BERT wasn’t enough | 🟡 Intermediate | 7 min |
| 03. The Idea | In-context learning; zero/one/few-shot prompting; why scale enables it | 🟡 Intermediate | 10 min |
| 04. The Math | Cross-entropy loss (review); in-context learning as conditional probability | 🟡 Intermediate | 8 min |
| 05. Worked Example | Tracing few-shot sentiment classification step-by-step | 🟡 Intermediate | 10 min |
| 06. The Code | Few-shot prompting in Python; sentiment classification | 🟡 Intermediate | 6 min |
| 07. Limitations | Compute cost; hallucinations; prompt sensitivity; no fine-tuning | 🟢 Beginner | 6 min |
| 08. Impact | ChatGPT, Copilot, InstructGPT, products; prompt engineering as a skill | 🟢 Beginner | 5 min |
| 09. Summary | One-pager recap | 🟢 Beginner | 3 min |
Architecture Diagram
GPT-3 Architecture (175B parameter version)
═════════════════════════════════════════════
Input tokens
│
↓
Token embedding (vocab=50,257, dim=12,288)
│
↓
Positional embedding (max 2048 tokens)
│
↓
┌─────────────────────────────────────────┐
│ Decoder Block 1 (Self-Attention) │
│ - 96 attention heads │
│ - Head dimension: 12,288/96 = 128 │
│ - Feedforward: 4×12,288 = 49,152 │
│ - Layer norm, residual connections │
└─────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Decoder Block 2 │
└─────────────────────────────────────────┘
│
↓
[... 94 more blocks, same structure ...]
│
↓
┌─────────────────────────────────────────┐
│ Decoder Block 96 │
└─────────────────────────────────────────┘
│
↓
Layer Normalization
│
↓
Output projection (dim=12,288 → vocab=50,257)
│
↓
Softmax
│
↓
Next token probabilities
│
↓
Sample or argmax to get next token
│
↓
Repeat (autoregressive generation)
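The last steps of the diagram (softmax, pick a token, repeat) can be sketched as a short greedy decoding loop. The function `toy_logits` is a hypothetical stand-in for the whole transformer stack: it simply favors the token whose id is one more than the last token, so the loop can run without a real model. The tiny vocabulary of 8 is also an assumption for illustration.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(tokens, vocab_size):
    """Hypothetical model: prefers (last token + 1) mod vocab_size."""
    target = (tokens[-1] + 1) % vocab_size
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]

def generate(prompt, steps, vocab_size=8, max_context=2048):
    """Greedy autoregressive generation: argmax, append, repeat."""
    tokens = list(prompt)
    for _ in range(steps):
        context = tokens[-max_context:]  # keep only what fits in the window
        probs = softmax(toy_logits(context, vocab_size))
        next_token = max(range(vocab_size), key=probs.__getitem__)  # argmax
        tokens.append(next_token)
    return tokens

output = generate([3], steps=6)
print(output)
```

Replacing the argmax with a random draw from `probs` would give sampling instead of greedy decoding; the surrounding loop stays the same.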