Context: Why the Field Doubted Scale
Before 2020, the dominant belief in AI was: scaling hits diminishing returns quickly.
The Conventional Wisdom (2019)
By 2019, the best-known language models ranged from a few hundred million parameters (BERT, RoBERTa) to 1.5 billion (GPT-2). The performance improvements were solid but appeared to be slowing. The narrative became:
“There’s an optimal model size for a given amount of data. Beyond that, you get diminishing returns. A 10B parameter model trained on the same data as a 1B model won’t be 10x better; maybe 10% better.”
Why people believed this:
- Classical learning theory often predicts diminishing returns: with a fixed amount of data, adding more features or parameters makes the model harder to fit well.
- Overfitting: the expectation was that a 10B model trained on only 10B tokens of data would overfit badly, memorizing the data instead of learning generalizable patterns.
- Compute was expensive and limited. It made sense to ask: “Is it worth it?”
The Paradox
But there was a paradox. In some domains, scale was clearly helping:
- Computer vision: ImageNet results showed that larger, deeper models (AlexNet, then VGG, then ResNets) beat smaller ones, even when trained on the same data.
- Speech recognition: Larger speech models trained on more audio improved steadily.
- Recommendation systems: Scaling embedding dimensions and hidden layers improved rankings.
Why did scale seem to work elsewhere but not in language modeling?
The honest answer: Nobody knew. There were different hypotheses:
Hypothesis 1: Language is different. LMs hit a wall; vision doesn’t.
Hypothesis 2: We’re not scaling correctly. We’re scaling model size without scaling data proportionally.
Hypothesis 3: Scale works, but we haven’t tried hard enough. We need billions of tokens and billions of parameters to see the trend.
The Pre-Scaling Era
Before 2020, scaling experiments were limited:
- Compute was expensive. Training a 2B parameter model on 300B tokens could cost $100,000 or more.
- Data was scarce. High-quality datasets had millions or tens of millions of examples, not billions.
- No clear playbook. If you doubled model size, did you double data size? Keep it the same? Researchers guessed.
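To see where a six-figure cost for a 2B parameter model can come from, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that training a transformer takes about 6 floating-point operations per parameter per token. The sustained throughput (30 TFLOP/s per GPU) and the price ($3 per GPU-hour) are hypothetical round numbers chosen for illustration, not measurements, and both function names are made up for this example.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token (rule of thumb)."""
    return 6.0 * n_params * n_tokens


def training_cost_usd(flops, flops_per_gpu_second, usd_per_gpu_hour):
    """Convert total FLOPs to dollars under an assumed throughput and price."""
    gpu_seconds = flops / flops_per_gpu_second
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours * usd_per_gpu_hour


# A 2B parameter model trained on 300B tokens.
flops = training_flops(2e9, 300e9)

# Hypothetical hardware assumptions: 30 TFLOP/s sustained, $3 per GPU-hour.
cost = training_cost_usd(flops, flops_per_gpu_second=30e12, usd_per_gpu_hour=3.0)

print(f"{flops:.2e} FLOPs, about ${cost:,.0f}")
```

Under these assumptions the estimate lands around $100,000, and it grows linearly with both model size and token count, which is why each doubling was a real budget decision.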
The field was stuck in a “local minimum”: researchers were not confident that scale would help, so they did not invest heavily in scaling, so they never produced the data that would show scale helps.
Enter OpenAI: The Scaling Hypothesis
By 2019–2020, OpenAI had:
- Significant compute resources (multiple GPU clusters)
- A data pipeline (WebText2, Common Crawl, books)
- A hypothesis: scale works in language modeling too, if you do it right
The team (Jared Kaplan, Sam McCandlish, Tom Henighan, and others) decided to run the experiment: train dozens of language models at different scales and measure precisely how performance changes with size.
They had no guarantee it would reveal smooth, predictable relationships. It could have been chaos. But it wasn’t.
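A "smooth, predictable relationship" here means something like a power law: loss that falls along a straight line when both axes are logarithmic. The sketch below fits such a line to synthetic data and extrapolates it. The exponent, constant, and noise level are made-up values for illustration, not results from the paper, and `fit_power_law` is a hypothetical helper.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss = c * size**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return -slope, float(np.exp(intercept))


# Synthetic data only: a made-up power law with small multiplicative noise.
rng = np.random.default_rng(0)
sizes = np.logspace(6, 9, 12)        # hypothetical sweep: 1M to 1B parameters
true_alpha, true_c = 0.08, 10.0      # illustrative values, not from any paper
noise = np.exp(rng.normal(0.0, 0.01, sizes.size))
losses = true_c * sizes ** (-true_alpha) * noise

alpha_hat, c_hat = fit_power_law(sizes, losses)

# Extrapolate the fitted trend one order of magnitude past the largest model.
predicted_loss_10b = c_hat * 1e10 ** (-alpha_hat)

print(f"alpha ~ {alpha_hat:.3f}, predicted loss at 10B params ~ {predicted_loss_10b:.3f}")
```

If the real measurements had been chaotic, a fit like this would have large residuals and its extrapolation would be worthless. A tight fit is what lets small, cheap runs forecast the payoff of a much larger one.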
What Makes This Risky and Interesting
Scaling experiments are expensive (millions of dollars to run). If the results had been chaotic, with no clear pattern, the money would have been wasted. But if the results were smooth and predictable, they would justify even larger experiments (like GPT-3).
Stakes: Before anyone would commit the resources needed for a 175B parameter model, this paper needed to show that scale is predictable and reliable.
That’s what made the findings so important.
Key Takeaways from This Section
- Pre-2020 belief: Scale hits diminishing returns; bigger models need proportionally more data to avoid overfitting.
- The paradox: Scale worked in vision and speech, but nobody had proven it for language.
- The question: Is language modeling special (scale doesn’t help much)? Or have we just not tried hard enough?
- OpenAI’s bet: Run systematic scaling experiments. Find the mathematical relationship. Prove scale is reliable.
Next: Section 02: The Problem