Further Reading: Paper 13 (Scaling Laws)

Deepen your understanding of scaling laws and their applications.

The Original Paper

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
Published: January 2020, arXiv
URL: https://arxiv.org/abs/2001.08361

The full technical report with all experiments and detailed results. Dense but authoritative. Focus on:

Section 2: “Background and Methods” (model parameterization, training procedure, and datasets)
Section 3: “Empirical Results and Basic Power Laws” (the power laws and exponents)
Section 8: “Discussion” (implications and limitations)

Essential Follow-Up Papers

1. Training Compute-Optimal Large Language Models (Chinchilla)

Authors: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. (DeepMind)
Published: March 2022, arXiv
URL: https://arxiv.org/abs/2203.15556

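Refit the scaling laws by training over 400 models across a wide range of sizes and token counts. Found that parameters and training tokens should grow in roughly equal proportion as compute increases, whereas the original paper favored growing parameters much faster than data. Critical for understanding modern LLM design.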

Key finding: For the same compute, use more data and fewer parameters than the original paper suggested.
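
To make the key finding concrete, here is a minimal sketch of a compute-optimal split. It rests on two assumptions that are not stated in this guide: the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens), and the rule of thumb of about 20 training tokens per parameter often quoted from the Chinchilla results. The function name and the default ratio are illustrative, not taken from either paper's code.

```python
import math


def compute_optimal_allocation(compute_flops, tokens_per_param=20.0):
    """Split a training budget into parameters N and tokens D.

    Assumptions (rules of thumb, not exact results):
      C ~= 6 * N * D              (training FLOPs)
      D ~= tokens_per_param * N   (Chinchilla-style ratio)
    Solving both for N gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 1e22, 1e23):
        n, d = compute_optimal_allocation(budget)
        print(f"C={budget:.0e} FLOPs -> N~{n:.2e} params, D~{d:.2e} tokens")
```

Under these assumptions, a 100x larger budget yields 10x more parameters and 10x more tokens, which is the "equal proportion" behavior described above. The fitted ratio varies with the dataset and fitting method, so treat the default of 20 as a starting point only.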

2. LLaMA: Open and Efficient Foundation Language Models

Authors: Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. (Meta)
Published: February 2023, arXiv
URL: https://arxiv.org/abs/2302.13971

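Built on the Chinchilla results but deliberately trained smaller models on more tokens than the compute-optimal point, accepting extra training cost in exchange for cheaper inference. Shows how the laws guide practical model design. Excellent example of the paper’s impact.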

3. Emergent Abilities of Large Language Models

Authors: Jason Wei, Yi Tay, Rishi Bommasani, et al. (Google Brain)
Published: June 2022, arXiv
URL: https://arxiv.org/abs/2206.07682

Studied which capabilities emerge at which scales. Shows that loss (which scales smoothly) doesn’t perfectly predict benchmark performance (which can jump). Complements the scaling laws.

Deeper Understanding

Blog Posts and Tutorials

“The Bitter Lesson” — Richard Sutton (Reinforcement Learning Perspectives)
URL: http://www.incompleteideas.net/IncompleteIdeas/BitterLesson.html

Not directly about scaling laws, but philosophically aligned. Argues that scale and computation, not hand-crafted features, drive progress in AI. Historical perspective.

“Large Language Models: A New Moore’s Law?” — Hugging Face Blog
URL: https://huggingface.co/blog/large-language-models

Accessible discussion of the rapid growth in model size and its practical costs. Useful context for why compute-efficient training matters.

“A Primer on Neural Network Models for Natural Language Processing” — Yoav Goldberg
URL: https://arxiv.org/abs/1510.00726

Older paper, but provides context on how neural language models worked before scaling laws became dominant. Good for understanding the shift in perspective.

Benchmarks and Datasets

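The paper trained its models on WebText2 and measured loss on held-out test sets, including text from other distributions. Key datasets:

Dataset	Role in the paper	Approximate size	Notes
WebText2	Training set and in-distribution test set	~23B tokens (as reported in the paper)	Extended version of WebText; a similar corpus was later used for GPT-3
Common Crawl	Generalization test	Petabytes (subset used)	Public web crawl
BookCorpus	Generalization test	~1B tokens	Collection of free, self-published books
Wikipedia	Generalization test	~4B tokens	English Wikipedia

Only WebText2 was used for pre-training in the scaling laws paper. The other datasets were used to check how well the loss trends transfer to different text distributions.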

Code and Models

Implementing Scaling Laws

Hugging Face Transformers:
URL: https://huggingface.co/transformers/
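Load pre-trained models of several sizes from a single model family, measure their loss on your test set, and plot loss against parameter count on log-log axes to check for a power-law trend. Staying within one family keeps the tokenizer and training data comparable.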

JAX / Flax:
URL: https://github.com/google/flax
Framework used by some labs for large-scale model training. Good for reproducing scaling law experiments.

LitGPT:
URL: https://github.com/Lightning-AI/litgpt
Fine-tune open-source language models. Useful for exploring scaling effects on downstream tasks.
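
The log-log check suggested under Hugging Face Transformers above can be sketched as a straight-line fit in log space: if loss ≈ a · size^(−α), then log(loss) is linear in log(size) with slope −α. The function name and the synthetic data below are placeholders for illustration only; substitute the sizes and losses you actually measure.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space.

    Returns (alpha, a). Assumes a pure power law with no constant
    offset (no irreducible loss term).
    """
    log_x = np.log(np.asarray(sizes, dtype=float))
    log_y = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_x, log_y, deg=1)
    return float(-slope), float(np.exp(intercept))


if __name__ == "__main__":
    # Synthetic placeholder data, NOT measurements from any real model.
    rng = np.random.default_rng(0)
    sizes = np.array([1e6, 1e7, 1e8, 1e9])
    noise = np.exp(rng.normal(0.0, 0.01, size=sizes.size))
    losses = 12.0 * sizes ** -0.076 * noise

    alpha, a = fit_power_law(sizes, losses)
    print(f"fitted alpha = {alpha:.3f}, prefactor a = {a:.2f}")
```

A fit like this only tests whether a straight line describes the points you have. With few model sizes, or with a noticeable loss floor, the estimated exponent can be misleading, so it is worth plotting the residuals as well.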

Open-Source Models Trained with Scaling Laws

LLaMA and LLaMA 2 (Meta):
https://github.com/facebookresearch/llama
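Openly released model weights (subject to Meta’s license terms). Trained on more tokens than the Chinchilla-optimal point so that smaller models perform well at inference time.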

BLOOM (BigScience):
https://huggingface.co/bigscience/bloom
Multilingual model trained using scaling-law principles.

Mistral (Mistral AI):
https://mistral.ai/
Smaller, efficient models using modern scaling insights.

Optimal Scaling Beyond Parameters and Data

More Than Capacity: Fairness and Calibration in Deep Image Classifiers
Studies scaling behavior in vision. Shows power laws appear across domains (not just NLP).

Chinchilla’s Wild Implications
Informal blog post analyzing implications of Chinchilla’s finding that models are often undertrained.

The Scaling Limits of Large Language Models
Explores what happens at extreme scales and whether power laws continue to hold.

Scaling Laws for Other Domains

Scaling Laws for Vision Transformers
Do Vision Transformers follow the same power laws as language models?

Scaling Laws for Multimodal Models
How do scaling laws apply to models combining text and images?

Key Insights to Carry Forward

Scale is predictable: Across the ranges studied, loss showed no plateau and followed simple power laws in parameters, data, and compute.
Compute allocation matters: More is not always better; optimal allocations exist.
Laws are tools, not prophecies: They guide planning but have limits (data quality, extreme scales, benchmark performance).
Interdisciplinary: Similar power-law trends have been reported in vision, speech, and RL, which suggests the pattern is broad, although the exponents differ by domain.
Enables competition: Smaller labs can be compute-efficient even without the largest budgets.

Questions for Deeper Exploration

Why power laws? What fundamental property of neural networks leads to power-law scaling?
Universal constants: Do α_N = 0.076 and α_D = 0.103 hold for all Transformer variants?
Data quality: How would you formally model data quality in the scaling laws?
Emergent abilities: Why does loss scale smoothly while benchmark performance jumps?
Inference scaling: Are there scaling laws for inference (decoding) cost?
Optimization: Does the optimization algorithm (SGD, Adam, etc.) affect the exponents?
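
To get a feel for the exponents in the second question, the short sketch below converts an exponent into the change in loss from scaling up one resource. It assumes a pure power law L ∝ X^(−α) with the other resource not acting as a bottleneck, which is a simplification; the helper name is hypothetical.

```python
ALPHA_N = 0.076  # parameter exponent quoted in the question above
ALPHA_D = 0.103  # data exponent quoted in the question above


def loss_ratio(scale_factor, alpha):
    """Factor by which loss changes when a resource grows by scale_factor.

    Assumes L is proportional to X**(-alpha), with nothing else limiting.
    """
    return scale_factor ** (-alpha)


if __name__ == "__main__":
    for name, alpha in (("parameters", ALPHA_N), ("tokens", ALPHA_D)):
        for factor in (2, 10):
            ratio = loss_ratio(factor, alpha)
            drop = (1.0 - ratio) * 100.0
            print(f"{factor}x {name}: loss x {ratio:.3f} ({drop:.1f}% lower)")
```

With these exponents, doubling the parameter count lowers loss by only about 5%, which is why small differences in an exponent across Transformer variants or optimizers would matter a great deal at large scale.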

Where to Find More Research

arXiv (arxiv.org): Search “scaling laws language models”
Papers with Code (paperswithcode.com): Scaling law benchmarks and reproductions
Hugging Face Hub (huggingface.co): Model cards often cite scaling laws in design decisions
OpenAI Blog: Technical insights from OpenAI’s scaling experiments
DeepMind Blog: DeepMind’s Chinchilla and follow-up work

Related Papers in This Series

Paper 10, GPT-1: The beginning
Paper 11, BERT: Pre-scaling-era architecture
Paper 12, GPT-3: The model that scaling laws justified
Paper 14, Chain-of-Thought Prompting (coming): How scaling enables reasoning

Historical Context

Before 2020, common assumptions in the field included:

“Transformers work, but scaling has limits.”
“After 2B parameters, gains plateau.”
“More data is not always better.”

After this paper and its follow-ups:

“Scale smoothly improves performance.”
“Use the compute-optimal frontier.”
“Data is as important as parameters.”

This shift is one of the most important in modern AI research.

Final Thought

Scaling laws are not exciting by themselves. No new architecture. No clever trick. Just: “If you plot loss vs. size on a log-log axis, you get a straight line.”

But that simplicity, measuring something obvious and finding that it holds reliably, helped justify billions in investment and years of rapid progress. Sometimes the most impactful research is the most straightforward.

Good luck with your exploration!