Scaling Laws for Neural Language Models
Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, and others
Venue: OpenAI Technical Report (arXiv)
Year: 2020
URL: https://arxiv.org/abs/2001.08361
What This Paper Did
The authors ran hundreds of language model training experiments at varying scales (different parameter counts, dataset sizes, and compute budgets). The result: test loss follows smooth power-law relationships in each of these quantities.
The core finding: As you increase model size (N parameters) or dataset size (D tokens), loss decreases predictably according to power laws. The paper reports that some of these trends span more than seven orders of magnitude. The models it actually trained went up to roughly 1.5 billion parameters, so anything at the hundred-billion scale is an extrapolation of the fitted curves.
The Key Equations:
Cross-entropy loss as a function of model size:
L(N) = (N_0 / N)^α_N
where:
L(N) = cross-entropy loss
N = number of model parameters
N_0 = a reference parameter count
α_N ≈ 0.076 (the exponent, empirically estimated)
Cross-entropy loss as a function of dataset size:
L(D) = (D_0 / D)^α_D
where:
D = number of training tokens
D_0 = a reference dataset size
α_D ≈ 0.103 (the exponent from the paper’s joint fit over N and D; the data-only fit is slightly lower, about 0.095)
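As a minimal sketch of the two single-variable laws above, the Python below evaluates L(N) and L(D) and shows what a constant exponent means in practice. The function names are hypothetical, and the reference constants N_0 and D_0 are placeholders chosen for illustration, not values taken from this page.

```python
ALPHA_N = 0.076  # model-size exponent quoted above
ALPHA_D = 0.103  # dataset-size exponent quoted above

# Placeholder reference scales (assumptions, for illustration only).
N_0 = 8.8e13
D_0 = 5.4e13


def loss_from_params(n_params: float) -> float:
    """L(N) = (N_0 / N) ** alpha_N, assuming data is not the bottleneck."""
    return (N_0 / n_params) ** ALPHA_N


def loss_from_tokens(n_tokens: float) -> float:
    """L(D) = (D_0 / D) ** alpha_D, assuming model size is not the bottleneck."""
    return (D_0 / n_tokens) ** ALPHA_D


if __name__ == "__main__":
    for n in (1e6, 1e7, 1e8, 1e9):
        print(f"N = {n:.0e}  ->  L(N) = {loss_from_params(n):.3f}")
    # A power law means every 10x increase in N multiplies the loss
    # by the same factor, 10 ** -ALPHA_N, wherever you start.
    ratio = loss_from_params(1e9) / loss_from_params(1e8)
    print(f"loss ratio per 10x parameters: {ratio:.3f}")
```

With these exponents, each tenfold increase in parameters cuts the loss by about 16% (a ratio of roughly 0.84), regardless of the placeholder value chosen for N_0.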
Compute-optimal frontier (the sweet spot):
Given a fixed compute budget C (in FLOPs), the optimal allocation is:
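N_opt ∝ C^0.73
D_opt ∝ C^0.27
The proportionality constants are fitted empirically. What matters for planning is the exponents: as C grows, most of the extra compute should go into a larger model rather than more data.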
Compute as a function of model and data:
C ≈ 6 * N * D (about 6 FLOPs per parameter per training token, counting the forward and backward passes)
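The compute approximation can be checked with quick arithmetic. The sketch below plugs in the GPT-3 figures this page uses elsewhere (175B parameters, 300B tokens); the helper names are hypothetical, and the formula is the rough C ≈ 6 * N * D estimate, not an exact FLOP count.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Invert C ~ 6 * N * D: tokens affordable at a given model size."""
    return flops / (6.0 * n_params)


if __name__ == "__main__":
    c = training_flops(175e9, 300e9)
    print(f"C = {c:.2e} FLOPs")  # C = 3.15e+23 FLOPs
    # With the same budget, a model half the size could see twice the tokens.
    print(f"tokens at 87.5B params: {tokens_for_budget(c, 87.5e9):.2e}")
```

Because N and D enter the estimate as a product, any fixed budget describes a trade-off curve: halving the model size doubles the number of tokens you can afford.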
Key Insight: The loss curves are smooth and predictable: straight lines on a log-log plot. Within the range studied there are no surprises and no phase transitions, so scale improves loss reliably.
What this meant for GPT-3:
- OpenAI could predict: “If we train a 175B parameter model on 300B tokens, we’ll get loss of X and benchmark performance of Y.”
- The scaling laws guided the design of GPT-3.
- Later research (Chinchilla, 2022) refined these laws, showing GPT-3 was compute-suboptimal: it had too many parameters for the number of tokens it was trained on, and a smaller model trained on more data would have used the same compute better.
The Indian Analogy
Imagine a factory producing clothes. You have a fixed budget to spend on:
- Workers (parameters): Each worker can do more or better work with training.
- Raw materials (data): More fabric, more patterns, more examples.
- Total budget (compute): You have a limited amount of money to spend.
The question: How should you split your budget between hiring workers and buying materials?
If you hire too many workers with too little material, they’re idle. Waste.
If you buy too much material with too few workers, it piles up. Waste.
There’s an optimal ratio, and it shifts as the budget grows. The exponents are not budget shares: the scaling laws say to scale workers in proportion to budget^0.73 and materials in proportion to budget^0.27.
If you double your budget, you should:
- Increase the workers by about 66% (N ∝ C^0.73, and 2^0.73 ≈ 1.66)
- Increase materials by about 21% (D ∝ C^0.27, and 2^0.27 ≈ 1.21)
The scaling laws are the factory optimization handbook.
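The budget arithmetic in the analogy can be reproduced in a few lines. This is a sketch under the assumption that the two exponents quoted above (0.73 and 0.27) apply exactly; the function name is hypothetical.

```python
N_EXPONENT = 0.73  # workers (parameters) scale as budget ** 0.73
D_EXPONENT = 0.27  # materials (data) scale as budget ** 0.27


def growth_factors(budget_multiplier: float) -> tuple[float, float]:
    """Return (parameter growth, data growth) for a budget scaled by the multiplier."""
    return budget_multiplier ** N_EXPONENT, budget_multiplier ** D_EXPONENT


if __name__ == "__main__":
    for k in (2, 10, 100):
        n_growth, d_growth = growth_factors(k)
        print(f"budget x{k}: parameters x{n_growth:.2f}, data x{d_growth:.2f}")
```

Doubling the budget gives factors of about 1.66 for parameters and 1.21 for data. The two factors always multiply back to the budget multiplier because the exponents sum to 1, which matches C ∝ N * D.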
Comparison: Before Scaling Laws vs. After
| Aspect | Before Scaling Laws (2019) | After Scaling Laws (2020) |
|---|---|---|
| Assumption | Larger models = diminishing returns | Scale reliably improves loss |
| Design question | “What architecture should we use?” | “How large should we go?” |
| Research focus | Architecture innovation | Scale experiments |
| Model size limit | ~2B parameters (conventional wisdom) | Hundreds of billions (justified by laws) |
| Confidence in scale | Low; risky | High; predictable |
| Planning tool | None | Power-law extrapolation |
| Field impact | Incremental improvements | Paradigm shift to scaling |
Read in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01. Context | The pre-2020 skepticism about scale; why the field thought bigger = worse returns | 🟢 Beginner | 7 min |
| 02. The Problem | How to allocate compute optimally? How much does scale help? | 🟡 Intermediate | 6 min |
| 03. The Idea | Power laws; log-linear relationships; why they’re surprising and useful | 🟡 Intermediate | 9 min |
| 04. The Math | The equations; worked examples with real numbers; compute-optimal frontier | 🟡 Intermediate | 11 min |
| 05. Worked Example | Training three models (small, medium, large) and verifying the power law | 🔴 Advanced | 12 min |
| 06. The Code | Simulating scaling laws; plotting on log-log axes | 🟡 Intermediate | 7 min |
| 07. Limitations | Scaling laws break at extreme scales; compute-optimal ratio was later refined | 🟢 Beginner | 6 min |
| 08. Impact | Chinchilla, LLaMA, GPT-4 all designed using scaling laws | 🟢 Beginner | 5 min |
| 09. Summary | One-pager recap | 🟢 Beginner | 3 min |
Before You Read: Math Tutorials You Need
- Mean, Variance, and Standard Deviation (we’ll use variance to understand loss spread)
- Cross-Entropy Loss (the metric we’re scaling)
- Power Laws and Logarithms (essential for understanding the equations)
Architecture Diagram: Scaling Experiment Layout
The Scaling Laws Study: Hundreds of Experiments
═════════════════════════════════════════════════
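Vary 3 dimensions (the values below are an illustrative grid; the GPT-3 entries are shown for reference and were not part of the paper’s own runs):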
1. MODEL SIZE (N parameters)
├─ 1M params
├─ 10M params
├─ 100M params
├─ 1B params
├─ 10B params
├─ 100B params
└─ 175B params (GPT-3)
2. DATASET SIZE (D tokens)
├─ 10M tokens
├─ 100M tokens
├─ 1B tokens
├─ 10B tokens
├─ 100B tokens
└─ 300B tokens (GPT-3)
3. COMPUTE BUDGET (C FLOPs)
├─ 10^16 FLOPs
├─ 10^17 FLOPs
├─ 10^18 FLOPs
├─ 10^19 FLOPs
├─ 10^20 FLOPs
└─ 10^21 FLOPs
For each configuration:
Train model → Measure loss on test set
Result:
L(N) follows power law: L ∝ N^(-0.076)
L(D) follows power law: L ∝ D^(-0.103)
L(C) follows power law: L ∝ C^(-0.050) (with compute allocated optimally)
When plotted on log-log axes: straight lines
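The "straight lines on log-log axes" claim also gives a way to recover an exponent from measurements: fit a line to (log x, log y) and read off the slope. The sketch below does this with a plain least-squares fit. The data points are synthetic, generated from the L(N) formula with a placeholder N_0, so they are not measurements from the paper, and the function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (exponent, prefactor)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)


# Synthetic losses from L = (N_0 / N) ** 0.076 (N_0 is a placeholder).
N_0 = 8.8e13
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [(N_0 / n) ** 0.076 for n in sizes]

exponent, prefactor = fit_power_law(sizes, losses)
print(f"fitted exponent: {exponent:.3f}")  # fitted exponent: -0.076
```

On real training runs the points would scatter around the line, and the quality of the fit is what tells you whether a power law is a reasonable description over that range.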