Paper 12
Intermediate

Language Models are Few-Shot Learners

Authors: Tom Brown, Benjamin Mann, Nick Ryder, and 28 others
Venue: NeurIPS 2020
Year: 2020
URL: https://arxiv.org/abs/2005.14165


What This Paper Did

GPT-3 took the decoder-only transformer architecture of GPT-1 and GPT-2 and scaled it to 175 billion parameters, trained on about 300 billion tokens drawn from filtered Common Crawl, WebText2, two book corpora, and English Wikipedia. The innovation was not a new architecture; it was almost entirely scale.

The breakthrough: at this scale, a new behavior becomes reliable. Without any fine-tuning, without updating a single weight, the model can perform new tasks just by reading examples in the prompt. You write a few examples, and the model picks up the pattern. The paper defines three settings: zero-shot (a task description but no examples), one-shot (one example), and few-shot (as many examples as fit in the 2048-token context window, typically 10 to 100). In every case the model works from context alone. This phenomenon is called in-context learning (ICL).

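GPT-3 was evaluated on more than two dozen NLP datasets plus several new synthetic tasks. On some of them, such as the LAMBADA cloze task and closed-book TriviaQA, few-shot GPT-3 matched or beat fine-tuned state-of-the-art systems. On others, such as natural language inference and some reading comprehension sets, it still lagged behind fine-tuned models. Most remarkably, it could do things few expected: two- and three-digit arithmetic, unscrambling words, using a made-up word in a sentence, and writing news articles that human raters struggled to tell apart from human-written ones, all without being explicitly trained on those tasks.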

The key equations are the same as GPT-1:

Autoregressive language model loss:
L = -Σ log P(u_i | u_1, ..., u_{i-1})
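
To make the loss concrete, here is a minimal sketch that evaluates it for one short sequence. The probabilities are made-up values standing in for what a model would assign to each actual next token; they are an assumption for illustration, not numbers from the paper.

```python
import math

def autoregressive_nll(token_probs):
    """Sum of negative log-probabilities assigned to each actual next token."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical values of P(u_i | u_1, ..., u_{i-1}) for a 4-token sequence.
probs = [0.5, 0.25, 0.8, 0.1]
loss = autoregressive_nll(probs)

# The product of the probabilities is 0.01, so the total loss equals ln(100).
print(f"total loss = {loss:.3f}, per-token = {loss / len(probs):.3f}")
```

A confident, correct model assigns probabilities near 1 to each token, which drives every term (and the sum) toward zero.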

In-context learning setup:
(x_1, y_1), (x_2, y_2), ..., (x_k, y_k), x_test  →  model predicts y_test
where (x_i, y_i) are examples in the prompt, x_test is a new input, and 
the model outputs y_test without any weight updates.
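
The setup above can be sketched as plain string formatting. The `build_prompt` helper and the `Q:`/`A:` layout are illustrative assumptions, not the exact format used in the paper; the point is only that the "training data" for the task lives in the prompt text.

```python
def build_prompt(examples, x_test, instruction=""):
    """Format k demonstrations (x_i, y_i) followed by the unanswered x_test."""
    parts = [instruction] if instruction else []
    for x, y in examples:
        parts.append(f"Q: {x}\nA: {y}")
    parts.append(f"Q: {x_test}\nA:")  # the model is expected to continue from here
    return "\n\n".join(parts)

demos = [("What is 2 + 2?", "4"), ("What is 5 + 3?", "8")]

zero_shot = build_prompt([], "What is 7 + 4?")        # k = 0
one_shot = build_prompt(demos[:1], "What is 7 + 4?")  # k = 1
few_shot = build_prompt(demos, "What is 7 + 4?")      # k = 2

print(few_shot)
```

Moving between zero-, one-, and few-shot only changes how many demonstrations are passed in; no weights change in any of the three cases.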

Key numbers:

  • 175 billion parameters
  • 96 transformer layers
  • 96 attention heads
  • Hidden dimension 12,288
  • Trained on 300 billion tokens
  • About 570 GB of filtered Common Crawl text, plus smaller curated corpora
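
As a sanity check on these numbers, a rough parameter count can be recovered from the layer count and hidden dimension. This is a back-of-the-envelope sketch: it assumes the standard breakdown of 12·d² weights per block (4·d² for attention, 8·d² for a 4×-wide feedforward), takes the vocabulary size of 50,257 and the 2048-token context from the architecture, and ignores biases and layer-norm parameters.

```python
n_layers, d_model = 96, 12_288
vocab_size, n_ctx = 50_257, 2_048

# Per block: Q, K, V, and output projections (4 * d^2)
# plus two feedforward matrices of shape d x 4d (8 * d^2).
block_params = 12 * d_model**2

# Token embeddings plus learned positional embeddings.
embedding_params = (vocab_size + n_ctx) * d_model

total = n_layers * block_params + embedding_params
print(f"about {total / 1e9:.1f}B parameters")
```

The estimate lands just under 175 billion, which suggests that almost all of the parameters sit in the 96 decoder blocks rather than in the embeddings.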

This paper reshaped the field. It gave strong evidence that scale is a major lever for capability. It made “prompt engineering” a practical skill. It showed that the fine-tuning paradigm (train a task-specific model for each new task) could, for many tasks, be replaced with a prompting paradigm (use one giant model, adapt via the prompt).


The Indian Analogy

Imagine a brilliant student who reads voraciously—thousands of textbooks, novels, news articles, encyclopedias. This student has absorbed so much knowledge that you can describe almost any new task with just 2–3 examples written on an exam sheet, and they’ll figure out the pattern.

Example: You write on the exam paper:

  • Q: “What is 2 + 2?” A: “4”
  • Q: “What is 5 + 3?” A: “8”
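  • Q: “What is 7 + 4?” A: [student writes “11” without being told the task is addition]

The student did not learn addition from two examples; they already knew it from their reading. The examples only showed which task was being asked. That’s few-shot learning.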

If you gave zero examples—just the problem 7 + 4 with no prior context—the student might still guess it’s arithmetic and attempt it. That’s zero-shot learning. With a single example, one-shot. With several examples, few-shot.

The reason this works is that the student’s massive background knowledge (from reading billions of words) includes patterns of how language is used, mathematical reasoning, factual information, logic, and more. When you give examples in the prompt, you’re activating that latent knowledge without retraining the student.

In contrast, consider BERT (the previous champion). BERT is like a smart, well-read student who still needs a tutoring course (fine-tuning) for each new task: you give them labeled examples and have them practice repeatedly before they can solve new problems. Each fine-tuned BERT is task-specific. GPT-3 is task-agnostic: it adapts from prompt context.


Comparison: GPT-3 vs BERT vs GPT-1

| Aspect | GPT-1 (2018) | BERT (2018) | GPT-3 (2020) |
|---|---|---|---|
| Parameters | 117M | 340M | 175B |
| Architecture | Decoder-only Transformer | Encoder-only Transformer | Decoder-only Transformer |
| Training objective | Causal language modeling | Masked language modeling + NSP | Causal language modeling |
| Task adaptation | Fine-tune on labeled data | Fine-tune on labeled data | Prompt from examples (no tuning) |
| Few-shot capability | Minimal | Minimal | Strong |
| Emergent abilities | None | None | Arithmetic, code, translation, reasoning |
| Training compute | ~4 GPU-years | ~4 GPU-years | ~3,640 petaflop/s-days |
| What changed | Showed decoder architecture works | Showed masked LM is powerful | Showed scale unlocks in-context learning |
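
The GPT-3 training compute can be roughly reproduced from the parameter and token counts. The sketch below assumes the common approximation of about 6 floating-point operations per parameter per training token (forward plus backward pass); it is an estimate, not an exact accounting of the training run.

```python
n_params = 175e9   # model parameters
n_tokens = 300e9   # training tokens

# Assumed rule of thumb: ~6 FLOPs per parameter per token.
total_flops = 6 * n_params * n_tokens

# One petaflop/s-day = 10^15 FLOP/s sustained for 86,400 seconds.
pf_day = 1e15 * 86_400
pf_days = total_flops / pf_day

print(f"{total_flops:.2e} FLOPs, roughly {pf_days:,.0f} petaflop/s-days")
```

The result comes out within about one percent of the roughly 3,640 petaflop/s-days reported in the paper for the 175B model.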

Read in This Order

| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01. Context | Why 2020 needed a new approach; state of fine-tuning; why scale mattered | 🟢 Beginner | 8 min |
| 02. The Problem | Why fine-tuning every task is expensive; why BERT wasn’t enough | 🟡 Intermediate | 7 min |
| 03. The Idea | In-context learning; zero/one/few-shot prompting; why scale enables it | 🟡 Intermediate | 10 min |
| 04. The Math | Cross-entropy loss (review); in-context learning as conditional probability | 🟡 Intermediate | 8 min |
| 05. Worked Example | Tracing few-shot sentiment classification step-by-step | 🟡 Intermediate | 10 min |
| 06. The Code | Few-shot prompting in Python; sentiment classification | 🟡 Intermediate | 6 min |
| 07. Limitations | Compute cost; hallucinations; prompt sensitivity; no fine-tuning | 🟢 Beginner | 6 min |
| 08. Impact | ChatGPT, Copilot, InstructGPT, products; prompt engineering as a skill | 🟢 Beginner | 5 min |
| 09. Summary | One-pager recap | 🟢 Beginner | 3 min |


Architecture Diagram

GPT-3 Architecture (175B parameter version)
═════════════════════════════════════════════

Input tokens
    ↓
Token embedding (vocab=50,257, dim=12,288)
    ↓
Positional embedding (max 2048 tokens)
    ↓
┌─────────────────────────────────────────┐
│  Decoder Block 1  (Self-Attention)      │
│  - 96 attention heads                   │
│  - Head dimension: 12,288/96 = 128      │
│  - Feedforward: 4×12,288 = 49,152       │
│  - Layer norm, residual connections     │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Decoder Block 2                        │
└─────────────────────────────────────────┘
    ↓
  [... 94 more blocks, same structure ...]
    ↓
┌─────────────────────────────────────────┐
│  Decoder Block 96                       │
└─────────────────────────────────────────┘
    ↓
Layer Normalization
    ↓
Output projection (dim=12,288 → vocab=50,257)
    ↓
Softmax
    ↓
Next token probabilities
    ↓
Sample or argmax to get next token
    ↓
Repeat (autoregressive generation)
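
The last steps of the diagram (softmax, pick a token, repeat) can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_logits` replaces the 96-block decoder stack with a trivial rule, and the four-word vocabulary exists only so the loop can run end to end.

```python
import math

VOCAB = ["<eos>", "the", "cat", "sat"]

def toy_logits(tokens):
    """Hypothetical stand-in for the decoder stack: favors the next vocab id."""
    favored = (tokens[-1] + 1) % len(VOCAB)
    return [5.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=8):
    """Greedy autoregressive decoding: append the argmax token, then repeat."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == 0:  # stop when <eos> is chosen
            break
        tokens.append(next_id)
    return tokens

out = generate([1])
print(" ".join(VOCAB[t] for t in out))  # prints "the cat sat"
```

Replacing the argmax with a random draw from `probs` would give sampling instead of greedy decoding; the surrounding loop stays the same.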

