Paper 13
Intermediate

Scaling Laws for Neural Language Models


Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, and others
Venue: OpenAI Technical Report (arXiv)
Year: 2020
URL: https://arxiv.org/abs/2001.08361


What This Paper Did

The authors ran hundreds of language model training experiments at varying scales (different parameter counts, dataset sizes, and compute budgets). The result: smooth power-law relationships between each of these quantities and the loss.

The core finding: as you increase model size (N parameters) or dataset size (D tokens), loss decreases predictably according to power laws. Some of these trends hold across more than seven orders of magnitude. The models actually trained in the paper range from 768 to 1.5 billion non-embedding parameters; anything larger is an extrapolation of the fitted laws.

The Key Equations:

Cross-entropy loss as a function of model size:
L(N) = (N_0 / N)^α_N

where:
  L(N) = cross-entropy loss
  N = number of model parameters
  N_0 = a reference parameter count
  α_N ≈ 0.076 (the exponent, empirically estimated)

Cross-entropy loss as a function of dataset size:
L(D) = (D_0 / D)^α_D

where:
  D = number of training tokens
  D_0 = a reference dataset size
  α_D ≈ 0.103 (the exponent from the paper's joint fit over N and D; the single-variable fit in D gives ≈ 0.095)
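
To make the two power laws concrete, here is a minimal Python sketch. The function names `power_law_loss` and `loss_ratio` are illustrative, and `N_REF` is an assumed reference parameter count chosen for demonstration; substitute the constant from your own fit. The exponents are the ones quoted above. The useful property is that the reference constant cancels when you compare two scales, so the relative improvement depends only on the exponent.

```python
def power_law_loss(x: float, x_ref: float, alpha: float) -> float:
    """Loss predicted by L(x) = (x_ref / x) ** alpha."""
    if x <= 0 or x_ref <= 0:
        raise ValueError("x and x_ref must be positive")
    return (x_ref / x) ** alpha


def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Factor by which loss changes when x is multiplied by scale_factor.

    The reference constant cancels: L(k * x) / L(x) = k ** (-alpha).
    """
    return scale_factor ** (-alpha)


ALPHA_N = 0.076  # exponent for model size, as quoted above
ALPHA_D = 0.103  # exponent for dataset size, as quoted above
N_REF = 8.8e13   # assumed reference parameter count (illustrative)

if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10):
        print(f"N = {n:.0e}: L(N) = {power_law_loss(n, N_REF, ALPHA_N):.3f}")
    print(f"10x parameters multiplies loss by {loss_ratio(10, ALPHA_N):.3f}")
    print(f"10x tokens multiplies loss by {loss_ratio(10, ALPHA_D):.3f}")
```

Because the exponents are small, even a tenfold increase in parameters or tokens only trims the loss by a modest factor, which is why the paper works in orders of magnitude.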

Compute-optimal frontier (the sweet spot):
Given a fixed compute budget C (in FLOPs), the optimal allocation scales as:
  N_opt ∝ C^0.73
  D_opt ∝ C^0.27
The proportionality constants come from the empirical fit; the exponents are what determine how the split changes as the budget grows.

Compute as a function of model and data:
  C ≈ 6 * N * D  (approximately)
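
As a quick sanity check on C ≈ 6 * N * D, the sketch below estimates training compute for the GPT-3 configuration mentioned in this guide (175B parameters, 300B tokens) and inverts the relation to get the token count a budget affords for a given model size. The function names are illustrative, and the factor of 6 is the approximation stated above, not an exact FLOP count.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs, using C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Invert C ≈ 6 * N * D to get D for a fixed budget and model size."""
    return flops / (6.0 * n_params)


if __name__ == "__main__":
    c = training_flops(175e9, 300e9)
    print(f"175B params on 300B tokens: about {c:.2e} FLOPs")
    d = tokens_for_budget(1e21, 1e9)
    print(f"A 1B-param model at 1e21 FLOPs can see about {d:.2e} tokens")
```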

Key Insight: The loss curves are smooth and predictable, and they appear as straight lines on a log-log plot. No surprises. No phase transitions. Scale works reliably.
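
A power law is a straight line in log-log space because log L = α log x_ref - α log x. The sketch below uses that fact to recover an exponent with an ordinary least-squares line fit. It is a minimal illustration using only the standard library: the function name `fit_power_law` is hypothetical, and the data points are noise-free synthetic values generated from an assumed law, not measurements from the paper.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss ≈ (x_ref / x) ** alpha by least squares on (log x, log loss).

    Returns (alpha, x_ref).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx                    # equals -alpha
    intercept = mean_y - slope * mean_x  # equals alpha * log(x_ref)
    alpha = -slope
    x_ref = math.exp(intercept / alpha)
    return alpha, x_ref


# Synthetic, noise-free points from an assumed law (illustrative only).
sizes = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [(8.8e13 / n) ** 0.076 for n in sizes]

if __name__ == "__main__":
    alpha, x_ref = fit_power_law(sizes, synthetic_losses)
    print(f"recovered alpha = {alpha:.4f}, x_ref = {x_ref:.3e}")
```

With real training runs the points scatter around the line, and the quality of the straight-line fit is itself the evidence for a power law.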

What this meant for GPT-3:

  • OpenAI could predict: “If we train a 175B parameter model on 300B tokens, we’ll get loss of X and benchmark performance of Y.”
  • The scaling laws guided the design of GPT-3.
  • Later research (Chinchilla, 2022) refined these laws, showing GPT-3 was compute-suboptimal: it had more parameters than its training tokens justified, so a smaller model trained on more data would have reached lower loss for the same compute.
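
One way to see what the Chinchilla comparison is about is the ratio of training tokens to parameters. The sketch below computes it for the GPT-3 configuration quoted above and for the published Chinchilla configuration (70B parameters, 1.4T tokens). The helper name `tokens_per_parameter` is illustrative, and the ratio is only a rough summary of the allocation, not a full account of either paper's analysis.

```python
def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


# (parameters, training tokens)
configs = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

if __name__ == "__main__":
    for name, (n_params, n_tokens) in configs.items():
        ratio = tokens_per_parameter(n_params, n_tokens)
        print(f"{name}: {ratio:.1f} tokens per parameter")
```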

The Indian Analogy

Imagine a factory producing clothes. You have a fixed budget to spend on:

  • Workers (parameters): Each worker can do more or better work with training.
  • Raw materials (data): More fabric, more patterns, more examples.
  • Total budget (compute): You have a limited amount of money to spend.

The question: How should you split your budget between hiring workers and buying materials?

If you hire too many workers with too little material, they’re idle. Waste.
If you buy too much material with too few workers, it piles up. Waste.

There’s an optimal ratio, and the scaling laws tell you how it shifts as the budget grows: scale workers in proportion to budget^0.73 and materials in proportion to budget^0.27. These are exponents, not a 73%/27% split of the money.

If you double your budget, you should:

  • Increase workers by about 66% (N ∝ C^0.73, and 2^0.73 ≈ 1.66)
  • Increase materials by about 21% (D ∝ C^0.27, and 2^0.27 ≈ 1.21)

The scaling laws are the factory optimization handbook.
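
The arithmetic behind "what happens when the budget grows" can be sketched in a few lines. The function name `growth_factors` is illustrative; the exponents 0.73 and 0.27 are the ones quoted in this guide. Because the two exponents sum to 1, the two growth factors multiply back to the budget multiplier, which is consistent with C ≈ 6 * N * D.

```python
def growth_factors(budget_multiplier: float,
                   n_exp: float = 0.73,
                   d_exp: float = 0.27):
    """Return (N factor, D factor) when compute is scaled by budget_multiplier."""
    if budget_multiplier <= 0:
        raise ValueError("budget_multiplier must be positive")
    return budget_multiplier ** n_exp, budget_multiplier ** d_exp


if __name__ == "__main__":
    for k in (2, 10, 100):
        n_factor, d_factor = growth_factors(k)
        print(f"budget x{k}: workers x{n_factor:.2f}, materials x{d_factor:.2f}")
```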


Comparison: Before Scaling Laws vs. After

Aspect | Before Scaling Laws (2019) | After Scaling Laws (2020)
Assumption | Larger models = diminishing returns | Scale reliably improves loss
Design question | “What architecture should we use?” | “How large should we go?”
Research focus | Architecture innovation | Scale experiments
Model size limit | ~2B parameters (conventional wisdom) | Hundreds of billions (justified by laws)
Confidence in scale | Low; risky | High; predictable
Planning tool | None | Power-law extrapolation
Field impact | Incremental improvements | Paradigm shift to scaling

Read in This Order

Section | What You Will Learn | Difficulty | Time
01. Context | The pre-2020 skepticism about scale; why the field thought bigger = worse returns | 🟢 Beginner | 7 min
02. The Problem | How to allocate compute optimally? How much does scale help? | 🟡 Intermediate | 6 min
03. The Idea | Power laws; log-linear relationships; why they’re surprising and useful | 🟡 Intermediate | 9 min
04. The Math | The equations; worked examples with real numbers; compute-optimal frontier | 🟡 Intermediate | 11 min
05. Worked Example | Training three models (small, medium, large) and verifying the power law | 🔴 Advanced | 12 min
06. The Code | Simulating scaling laws; plotting on log-log axes | 🟡 Intermediate | 7 min
07. Limitations | Scaling laws break at extreme scales; compute-optimal ratio was later refined | 🟢 Beginner | 6 min
08. Impact | Chinchilla, LLaMA, GPT-4 all designed using scaling laws | 🟢 Beginner | 5 min
09. Summary | One-pager recap | 🟢 Beginner | 3 min


Architecture Diagram: Scaling Experiment Layout

The Scaling Laws Study: Hundreds of Experiments
═════════════════════════════════════════════════

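Vary 3 dimensions (illustrative grid; the paper's own runs cover 768 to 1.5B non-embedding parameters and 22M to 23B tokens, so the largest entries below are extrapolation targets):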

1. MODEL SIZE (N parameters)
   ├─ 1M params
   ├─ 10M params
   ├─ 100M params
   ├─ 1B params
   ├─ 10B params
   ├─ 100B params
   └─ 175B params (GPT-3)

2. DATASET SIZE (D tokens)
   ├─ 10M tokens
   ├─ 100M tokens
   ├─ 1B tokens
   ├─ 10B tokens
   ├─ 100B tokens
   └─ 300B tokens (GPT-3)

3. COMPUTE BUDGET (C FLOPs)
   ├─ 10^16 FLOPs
   ├─ 10^17 FLOPs
   ├─ 10^18 FLOPs
   ├─ 10^19 FLOPs
   ├─ 10^20 FLOPs
   └─ 10^21 FLOPs

For each configuration:
   Train model → Measure loss on test set
   
Result:
   L(N) follows power law: L ∝ N^(-0.076)
   L(D) follows power law: L ∝ D^(-0.103)
   L(C) follows power law: L ∝ C^(-0.050) (compute-optimal training)
   
When plotted on log-log axes: straight lines
