Scaling Laws for Neural Language Models

Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, and others
Venue: OpenAI Technical Report (arXiv)
Year: 2020
URL: https://arxiv.org/abs/2001.08361

What This Paper Did

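The authors ran hundreds of language model training experiments at varying scales (different parameter counts, dataset sizes, and compute budgets). The result: smooth power-law relationships between each of these quantities and test loss.

The core finding: As you increase model size (N parameters) or dataset size (D tokens), loss decreases predictably according to power laws, as long as the other quantity is not the bottleneck. The paper reports that some of these trends span more than seven orders of magnitude. The models it actually trained ranged from under a thousand to roughly 1.5 billion non-embedding parameters, so predictions for models with hundreds of billions of parameters are extrapolations.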

The Key Equations:

Cross-entropy loss as a function of model size:
L(N) = (N_0 / N)^α_N

where:
  L(N) = cross-entropy loss
  N = number of model parameters (the paper counts non-embedding parameters)
  N_0 = a reference parameter count
  α_N ≈ 0.076 (the exponent, empirically estimated)
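
To get a feel for what an exponent of 0.076 implies, the formula can be evaluated directly. The sketch below is a minimal illustration: the value given to `N_0` is a placeholder chosen for the example, not a constant quoted in this section, and the function name is hypothetical. The ratio between two losses does not depend on `N_0` at all.

```python
ALPHA_N = 0.076  # exponent from the text above
N_0 = 1.0e14     # hypothetical reference parameter count (an assumption)


def loss_from_params(n_params: float, n_0: float = N_0, alpha: float = ALPHA_N) -> float:
    """Evaluate L(N) = (N_0 / N)^alpha_N."""
    return (n_0 / n_params) ** alpha


# Ratio of losses when the parameter count grows tenfold.
# N_0 cancels, so the ratio equals 10 ** (-ALPHA_N).
ratio_10x = loss_from_params(1e9) / loss_from_params(1e8)
print(f"10x more parameters multiplies the loss by {ratio_10x:.3f}")
```

Under this exponent a tenfold increase in parameters lowers the loss by roughly 16%, which is why the gains look small per step but add up over many orders of magnitude.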

Cross-entropy loss as a function of dataset size:
L(D) = (D_0 / D)^α_D

where:
  D = number of training tokens
  D_0 = a reference dataset size
  α_D ≈ 0.103 (the exponent)
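
The same power laws can be inverted to ask how much more model or data a given loss improvement costs. The following is a small sketch under the assumption that only one quantity is varied and the other is not the bottleneck; `scale_needed` is a hypothetical helper name.

```python
ALPHA_N = 0.076  # model-size exponent from the text
ALPHA_D = 0.103  # dataset-size exponent from the text


def scale_needed(loss_factor: float, alpha: float) -> float:
    """Factor k by which N (or D) must grow so the loss is multiplied by loss_factor.

    From L ∝ X^(-alpha): loss_factor = k^(-alpha), hence k = loss_factor^(-1 / alpha).
    """
    return loss_factor ** (-1.0 / alpha)


# How much scale does a 10% lower loss require?
k_params = scale_needed(0.9, ALPHA_N)
k_data = scale_needed(0.9, ALPHA_D)
print(f"parameters: x{k_params:.1f}, data: x{k_data:.1f}")
```

With these exponents, a 10% lower loss needs roughly 4 times the parameters or roughly 2.8 times the data.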

Compute-optimal frontier (the sweet spot):
Given a fixed compute budget C (in FLOPs), the optimal allocation scales as:
  N_opt ∝ C^0.73
  D_opt ∝ C^0.27
The constants of proportionality are fitted empirically; only the exponents matter for how the split changes with budget.
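
The exponents 0.73 and 0.27 describe how the optimal model and dataset sizes grow when the budget grows. A minimal sketch of that relationship, working only with growth factors so that no fitted constants are needed (`allocation_factors` is a hypothetical name):

```python
from typing import Tuple

N_EXPONENT = 0.73  # N_opt grows as C^0.73 (from the text)
D_EXPONENT = 0.27  # D_opt grows as C^0.27 (from the text)


def allocation_factors(budget_factor: float) -> Tuple[float, float]:
    """Return (growth in N_opt, growth in D_opt) when compute grows by budget_factor."""
    return budget_factor ** N_EXPONENT, budget_factor ** D_EXPONENT


for factor in (10.0, 100.0, 1000.0):
    n_factor, d_factor = allocation_factors(factor)
    # The exponents sum to 1, so the two factors multiply back to the budget factor.
    print(f"budget x{factor:g}: N x{n_factor:.1f}, D x{d_factor:.1f}")
```

Because the exponents sum to 1, the two growth factors always multiply back to the budget factor, which matches compute being proportional to N times D.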

Compute as a function of model and data:
  C ≈ 6 * N * D  (FLOPs, counting both the forward and the backward pass)
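
As a worked example of this approximation, the sketch below plugs in the GPT-3 figures quoted in this document (175B parameters, 300B tokens). The function name is hypothetical, and the result is an estimate from the rule of thumb, not a measured value.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


gpt3_flops = training_flops(175e9, 300e9)
print(f"Estimated GPT-3 training compute: {gpt3_flops:.2e} FLOPs")
```

The estimate comes out at about 3.15 × 10^23 FLOPs.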

Key Insight: The loss curves are smooth and predictable, and they form straight lines on a log-log plot. Within the range the paper tested there were no surprises and no phase transitions. Scale works reliably.

What this meant for GPT-3:

OpenAI could predict: “If we train a 175B parameter model on 300B tokens, we’ll get a loss of X and benchmark performance of Y.”
The scaling laws guided the design of GPT-3.
Later research (Chinchilla, 2022) refined these laws, showing GPT-3 was compute-suboptimal: it had too many parameters for the number of tokens it was trained on, so a smaller model trained on more data would have reached a lower loss for the same compute.
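
One way to see the gap that the Chinchilla work pointed to is to compare training tokens per parameter. The sketch below uses the GPT-3 figures from this document; the Chinchilla figures (70B parameters, 1.4T tokens) come from the Chinchilla paper and are not stated in this document, so treat them as an outside assumption.

```python
def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


# GPT-3 figures as quoted in this document.
gpt3_ratio = tokens_per_param(175e9, 300e9)
# Chinchilla figures: an assumption taken from the Chinchilla paper, not from this document.
chinchilla_ratio = tokens_per_param(70e9, 1.4e12)

print(f"GPT-3: {gpt3_ratio:.1f} tokens/param, Chinchilla: {chinchilla_ratio:.1f} tokens/param")
```

GPT-3 saw fewer than 2 tokens per parameter, while Chinchilla was trained at about 20.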

The Indian Analogy

Imagine a factory producing clothes. You have a fixed budget to spend on:

Workers (parameters): Each worker can do more or better work with training.
Raw materials (data): More fabric, more patterns, more examples.
Total budget (compute): You have a limited amount of money to spend.

The question: How should you split your budget between hiring workers and buying materials?

If you hire too many workers with too little material, they’re idle. Waste.
If you buy too much material with too few workers, it piles up. Waste.

There’s an optimal ratio. The scaling laws tell you how each side should grow as the budget grows: scale workers in proportion to budget^0.73 and materials in proportion to budget^0.27. These are exponents, not shares of the budget.

If you double your budget, you should:

Increase the workers by ~66% (N ∝ C^0.73, and 2^0.73 ≈ 1.66)
Increase materials by ~21% (D ∝ C^0.27, and 2^0.27 ≈ 1.21)

Together, 1.66 × 1.21 ≈ 2, so the doubled budget is fully spent.

The scaling laws are the factory optimization handbook.

Comparison: Before Scaling Laws vs. After

Aspect	Before Scaling Laws (2019)	After Scaling Laws (2020)
Assumption	Larger models = diminishing returns	Scale reliably improves loss
Design question	“What architecture should we use?”	“How large should we go?”
Research focus	Architecture innovation	Scale experiments
Model size limit	~2B parameters (conventional wisdom)	Hundreds of billions (justified by laws)
Confidence in scale	Low; risky	High; predictable
Planning tool	None	Power-law extrapolation
Field impact	Incremental improvements	Paradigm shift to scaling

Read in This Order

Section	What You Will Learn	Difficulty	Time
01. Context	The pre-2020 skepticism about scale; why the field thought bigger = worse returns	🟢 Beginner	7 min
02. The Problem	How to allocate compute optimally? How much does scale help?	🟡 Intermediate	6 min
03. The Idea	Power laws; log-linear relationships; why they’re surprising and useful	🟡 Intermediate	9 min
04. The Math	The equations; worked examples with real numbers; compute-optimal frontier	🟡 Intermediate	11 min
05. Worked Example	Training three models (small, medium, large) and verifying the power law	🔴 Advanced	12 min
06. The Code	Simulating scaling laws; plotting on log-log axes	🟡 Intermediate	7 min
07. Limitations	Scaling laws break at extreme scales; compute-optimal ratio was later refined	🟢 Beginner	6 min
08. Impact	Chinchilla, LLaMA, GPT-4 all designed using scaling laws	🟢 Beginner	5 min
09. Summary	One-pager recap	🟢 Beginner	3 min

Before You Read: Math Tutorials You Need

Mean, Variance, and Standard Deviation (we’ll use variance to understand loss spread)
Cross-Entropy Loss (the metric we’re scaling)
Power Laws and Logarithms (essential for understanding the equations)

Architecture Diagram: Scaling Experiment Layout

The Scaling Laws Study: Hundreds of Experiments
═════════════════════════════════════════════════

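Vary 3 dimensions (illustrative grid; the paper’s own runs stop near 1.5B parameters, and the GPT-3 points are extrapolation targets):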

1. MODEL SIZE (N parameters)
   ├─ 1M params
   ├─ 10M params
   ├─ 100M params
   ├─ 1B params
   ├─ 10B params
   ├─ 100B params
   └─ 175B params (GPT-3)

2. DATASET SIZE (D tokens)
   ├─ 10M tokens
   ├─ 100M tokens
   ├─ 1B tokens
   ├─ 10B tokens
   ├─ 100B tokens
   └─ 300B tokens (GPT-3)

3. COMPUTE BUDGET (C FLOPs)
   ├─ 10^16 FLOPs
   ├─ 10^17 FLOPs
   ├─ 10^18 FLOPs
   ├─ 10^19 FLOPs
   ├─ 10^20 FLOPs
   └─ 10^21 FLOPs

For each configuration:
   Train model → Measure loss on test set
   
Result:
   L(N) follows power law: L ∝ N^(-0.076)
   L(D) follows power law: L ∝ D^(-0.103)
   L(C) follows power law: L ∝ C^(-0.050)  (with compute allocated optimally)
   
When plotted on log-log axes: straight lines
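
The straight-line behaviour is also how the exponents are estimated: taking logarithms turns L = (N_0 / N)^α into a line with slope -α. The sketch below fits that line with ordinary least squares in pure Python. The data points are synthetic and noise-free, generated from the power law itself, and `N_0` is a placeholder value; none of this reproduces measurements from the paper.

```python
import math

ALPHA_TRUE = 0.076  # exponent quoted in the text
N_0 = 1.0e14        # hypothetical reference scale (an assumption)


def fit_power_law(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns (slope, intercept)."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic "measurements" generated from the power law itself.
param_counts = [10.0 ** k for k in range(6, 10)]  # 1e6 .. 1e9 parameters
losses = [(N_0 / n) ** ALPHA_TRUE for n in param_counts]

slope, intercept = fit_power_law(param_counts, losses)
# log10 L = -alpha * log10 N + alpha * log10 N_0, so N_0 = 10^(-intercept / slope).
recovered_n0 = 10.0 ** (-intercept / slope)
print(f"fitted exponent: {-slope:.3f}, recovered N_0: {recovered_n0:.2e}")
```

On real training runs the points would scatter around the line, and the fitted slope would carry an uncertainty that this noise-free sketch does not show.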
