Language Models are Few-Shot Learners

Authors: Tom Brown, Benjamin Mann, Nick Ryder, and 28 others
Venue: NeurIPS 2020
Year: 2020
URL: https://arxiv.org/abs/2005.14165

What This Paper Did

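GPT-3 took the decoder-only transformer architecture of GPT-1 and GPT-2 and scaled it to 175 billion parameters, trained on roughly 300 billion tokens drawn from filtered Common Crawl, WebText2, two book corpora, and English Wikipedia. The architecture is essentially GPT-2's, with one modification: the attention layers alternate between dense and locally banded sparse patterns. The main contribution was scale, not architectural novelty.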

The central finding: at this scale, the model can perform new tasks without fine-tuning and without updating a single weight, just by reading a task description and examples in the prompt. The paper distinguishes three settings: zero-shot (an instruction but no examples), one-shot (one example), and few-shot (as many examples as fit in the 2,048-token context window, typically 10 to 100). In each case the model infers the task from context alone. This phenomenon is called in-context learning (ICL), and the paper shows that it improves steadily with model size.

GPT-3 was evaluated on dozens of benchmarks; the paper's aggregate plot covers 42 accuracy-based ones. On some of them, such as closed-book TriviaQA and LAMBADA, few-shot GPT-3 matched or exceeded fine-tuned state-of-the-art systems. On others, including natural language inference (ANLI) and some reading-comprehension datasets (RACE, QuAC), it fell well short of fine-tuned models. It also showed abilities it was never explicitly trained for: two- and three-digit arithmetic, word unscrambling, using a made-up word in a sentence, and writing news articles that human raters had difficulty telling apart from real ones.

The key equations are the same as GPT-1:

Autoregressive language model loss:
L = -Σ log P(u_i | u_1, ..., u_{i-1})
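
As a rough illustration of the loss above, the sketch below computes the negative log-likelihood of one short sequence. The per-token probabilities are made-up numbers standing in for a model's outputs P(u_i | u_1, ..., u_{i-1}), and the function name autoregressive_nll is hypothetical, not something from the paper.

```python
import math

def autoregressive_nll(token_probs):
    """Return L = -sum_i log P(u_i | u_1..u_{i-1}) for one sequence.

    token_probs[i] is the probability the model assigned to the token
    that actually appeared at position i, given all earlier tokens.
    """
    return -sum(math.log(p) for p in token_probs)

# Made-up probabilities for a 4-token sequence (illustration only).
probs = [0.5, 0.25, 0.8, 0.1]
loss = autoregressive_nll(probs)
print(f"total NLL: {loss:.4f}")
print(f"per-token NLL: {loss / len(probs):.4f}")
```

The four probabilities multiply to 0.01, so the total is -log(0.01), about 4.6052. A confident, correct model pushes each probability toward 1 and the loss toward 0.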

In-context learning setup:
(x_1, y_1), (x_2, y_2), ..., (x_k, y_k), x_test  →  model predicts y_test
where (x_i, y_i) are examples in the prompt, x_test is a new input, and 
the model outputs y_test without any weight updates.
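
The setup above can be sketched as plain string formatting. This is a minimal sketch, assuming a "Q:/A:" layout chosen only for illustration (the paper uses task-specific formats); build_prompt and its k parameter are hypothetical names. No model is called here: the point is only that the "training data" for the task lives entirely in the prompt text.

```python
def build_prompt(examples, x_test, k=None):
    """Format (x_i, y_i) demonstrations followed by the test input.

    k=0 gives a zero-shot prompt, k=1 one-shot, larger k few-shot.
    k=None uses every example supplied.
    """
    shots = examples if k is None else examples[:k]
    blocks = [f"Q: {x}\nA: {y}" for x, y in shots]
    blocks.append(f"Q: {x_test}\nA:")  # the model continues from here
    return "\n\n".join(blocks)

demos = [("What is 2 + 2?", "4"), ("What is 5 + 3?", "8")]
zero_shot = build_prompt(demos, "What is 7 + 4?", k=0)
one_shot = build_prompt(demos, "What is 7 + 4?", k=1)
few_shot = build_prompt(demos, "What is 7 + 4?")
print(few_shot)
```

Whatever text the model generates after the final "A:" is taken as y_test. Switching between zero-, one-, and few-shot changes only the prompt string, never the weights.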

Key numbers:

175 billion parameters
96 transformer layers
96 attention heads
Hidden dimension 12,288
Trained on 300 billion tokens
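570 GB of filtered Common Crawl text (down from about 45 TB of compressed plaintext before filtering), plus smaller curated corpora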
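
As a sanity check on these numbers, the sketch below estimates the parameter count from the layer count and hidden dimension. It assumes the common approximation of about 12 × d_model² weights per transformer block (4 × d² for the attention projections, 8 × d² for the feedforward layers) and ignores biases and layer-norm parameters, so it is an estimate, not the paper's exact accounting. The vocabulary size of 50,257 and the 2,048-token context length are the standard GPT-3 figures.

```python
n_layers = 96
d_model = 12_288
vocab_size = 50_257   # token vocabulary
n_ctx = 2_048         # maximum context length

# Per block: ~4*d^2 for attention (Q, K, V, output projections)
# plus ~8*d^2 for the feedforward layers (d -> 4d -> d).
block_params = 12 * d_model ** 2

# Token embedding table plus learned positional embeddings.
embedding_params = (vocab_size + n_ctx) * d_model

total = n_layers * block_params + embedding_params
print(f"approx. parameters: {total / 1e9:.1f}B")
```

The estimate comes out near 174.6 billion, close to the quoted 175 billion. Almost all of it sits in the 96 blocks; the embeddings contribute well under 1%.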

This single paper changed the field. It provided strong evidence that scale is a major lever for capability. It popularized “prompt engineering” as a practical skill. It showed that the fine-tuning paradigm (train a task-specific model for each new task) could, for many tasks, be replaced with a prompting paradigm (use one large model and adapt it via the prompt).

The Indian Analogy

Imagine a brilliant student who reads voraciously—thousands of textbooks, novels, news articles, encyclopedias. This student has absorbed so much knowledge that you can describe almost any new task with just 2–3 examples written on an exam sheet, and they’ll figure out the pattern.

Example: You write on the exam paper:

Q: “What is 2 + 2?” A: “4”
Q: “What is 5 + 3?” A: “8”
Q: “What is 7 + 4?” A: [student writes “11” without being taught addition]

The student did not learn addition on the spot; they recognized the pattern from the examples and applied what they already knew. That’s few-shot learning.

If you gave zero examples—just the problem 7 + 4 with no prior context—the student might still guess it’s arithmetic and attempt it. That’s zero-shot learning. With a single example, one-shot. With several examples, few-shot.

The reason this works is that the student’s massive background knowledge (from reading billions of words) includes patterns of how language is used, mathematical reasoning, factual information, logic, and more. When you give examples in the prompt, you’re activating that latent knowledge without retraining the student.

In contrast, consider BERT (the previous champion). BERT is also well read, but it is like a student who needs a tutoring course (fine-tuning) for each new task: you give them labeled examples and have them practice repeatedly before they can solve new problems. A fine-tuned BERT model is task-specific. GPT-3 is task-agnostic: it adapts from prompt context.

Comparison: GPT-3 vs BERT vs GPT-1

Aspect	GPT-1 (2018)	BERT (2018)	GPT-3 (2020)
Parameters	117M	340M (BERT-Large)	175B
Architecture	Decoder-only Transformer	Encoder-only Transformer	Decoder-only Transformer
Training objective	Causal language modeling	Masked language modeling + NSP	Causal language modeling
Task adaptation	Fine-tune on labeled data	Fine-tune on labeled data	Prompt from examples (no tuning)
Few-shot capability	Minimal	Minimal	Strong
Emergent abilities	None	None	Arithmetic, translation, word manipulation, some reasoning
Training compute	~1 month on 8 GPUs	~4 days on 16 Cloud TPUs (BERT-Large)	~3,640 petaflop/s-days
What changed	Showed decoder architecture works	Showed masked LM is powerful	Showed scale unlocks in-context learning
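
To make the "Training objective" row concrete, the sketch below builds training examples for both objectives from one toy sentence. It is a simplification under stated assumptions: real BERT masks about 15% of positions at random and sometimes substitutes other tokens instead of [MASK], whereas here the masked positions are passed in by hand. Both function names are hypothetical.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

def causal_lm_examples(tokens):
    """Each prefix predicts the next token (GPT-style objective)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_positions):
    """Hide chosen positions; predict them from both sides (BERT-style)."""
    corrupted = ["[MASK]" if i in mask_positions else t
                 for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return corrupted, targets

for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

corrupted, targets = masked_lm_example(tokens, mask_positions={2})
print(corrupted, targets)
```

The causal objective only ever looks left, which is what lets the same model continue a prompt at inference time. The masked objective sees context on both sides of the gap, which suits classification after fine-tuning but does not directly give a text generator.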

Read in This Order

Section	What You Will Learn	Difficulty	Time
01. Context	Why 2020 needed a new approach; state of fine-tuning; why scale mattered	🟢 Beginner	8 min
02. The Problem	Why fine-tuning every task is expensive; why BERT wasn’t enough	🟡 Intermediate	7 min
03. The Idea	In-context learning; zero/one/few-shot prompting; why scale enables it	🟡 Intermediate	10 min
04. The Math	Cross-entropy loss (review); in-context learning as conditional probability	🟡 Intermediate	8 min
05. Worked Example	Tracing few-shot sentiment classification step-by-step	🟡 Intermediate	10 min
06. The Code	Few-shot prompting in Python; sentiment classification	🟡 Intermediate	6 min
07. Limitations	Compute cost; hallucinations; prompt sensitivity; no fine-tuning	🟢 Beginner	6 min
08. Impact	ChatGPT, Copilot, InstructGPT, products; prompt engineering as a skill	🟢 Beginner	5 min
09. Summary	One-pager recap	🟢 Beginner	3 min

Before You Read: Math Tutorials You Need

Architecture Diagram

GPT-3 Architecture (175B parameter version)
═════════════════════════════════════════════

Input tokens
    │
    ↓
Token embedding (vocab=50,257, dim=12,288)
    │
    ↓
Add learned positional embedding (max 2,048 tokens)
    │
    ↓
┌─────────────────────────────────────────┐
│  Decoder Block 1  (Self-Attention)      │
│  - 96 attention heads (causal mask)     │
│  - Head dimension: 12,288/96 = 128      │
│  - Feedforward: 4×12,288 = 49,152       │
│  - Layer norm, residual connections     │
└─────────────────────────────────────────┘
    │
    ↓
┌─────────────────────────────────────────┐
│  Decoder Block 2                        │
└─────────────────────────────────────────┘
    │
    ↓
  [... 94 more blocks, same structure ...]
    │
    ↓
┌─────────────────────────────────────────┐
│  Decoder Block 96                       │
└─────────────────────────────────────────┘
    │
    ↓
Layer Normalization
    │
    ↓
Output projection (dim=12,288 → vocab=50,257)
    │
    ↓
Softmax
    │
    ↓
Next token probabilities
    │
    ↓
Sample or argmax to get next token
    │
    ↓
Repeat (autoregressive generation)
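
The last steps of the diagram (softmax, pick a token, repeat) can be sketched with a toy stand-in for the network. Everything here is an assumption for illustration: the four-word vocabulary, the hand-written logits table, and the generate function are hypothetical, and the toy "model" looks only at the last token, whereas GPT-3 conditions on the whole context.

```python
import math

# Toy stand-in for the network: logits over a tiny vocabulary, chosen
# from the last token only. A real model conditions on the full context.
VOCAB = ["<eos>", "the", "cat", "sat"]
TOY_LOGITS = {
    "the": [0.0, 0.1, 2.0, 0.5],
    "cat": [0.2, 0.0, 0.1, 2.5],
    "sat": [3.0, 0.3, 0.1, 0.0],
}

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: append the argmax token until <eos> or the limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])
        next_token = VOCAB[probs.index(max(probs))]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"]))
```

Greedy argmax is the simplest choice. Sampling from probs instead (for example with random.choices) gives more varied text at the cost of determinism.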
