Context: The Reasoning Problem
By 2022, large language models had become remarkably fluent. GPT-3 (175 billion parameters) could write essays, answer trivia, summarize documents, and even write code. Scaling laws (Paper 13) suggested that performance improves predictably as models grow. So why couldn’t these giants reliably solve a problem a Class 5 student could solve in seconds?
The Failure Case
Give GPT-3 this question:
James has 15 apples. Mary gives him 8 more. James gives 5 apples to his friend. How many apples does James have now?
GPT-3 produces: 18 apples.
Correct answer: 18 apples (15 + 8 - 5 = 18). But this may have been a lucky guess.
But ask it a slightly harder variant:
James has 15 apples. Mary gives him 8 more. James gives half of what he now has to his friend. Then he eats 2 apples. How many apples does James have now?
GPT-3 might say: 15 apples. Then: 20 apples. Then: 25 apples. Three different answers, all wrong. Worked through step by step, the answer is (15 + 8) / 2 - 2 = 9.5 apples.
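The multi-step process the harder question demands can be traced explicitly. The sketch below is only an illustration of that arithmetic, not anything the model runs; the function name `trace_apples` is made up for this example, and it assumes "half" means an exact half (23 is odd, so the result is not a whole number).

```python
from fractions import Fraction


def trace_apples():
    """Trace the harder apple problem one step at a time.

    Returns a list of (step description, apples held after that step).
    Fractions are used so that halving 23 stays exact.
    """
    steps = []
    apples = Fraction(15)
    steps.append(("James starts with 15", apples))

    apples += 8  # Mary gives him 8 more
    steps.append(("Mary gives him 8 more", apples))

    apples -= apples / 2  # he gives away half of what he now has
    steps.append(("He gives half to his friend", apples))

    apples -= 2  # he eats 2
    steps.append(("He eats 2", apples))
    return steps


if __name__ == "__main__":
    for description, apples in trace_apples():
        print(f"{description}: {float(apples)}")
```

Each step depends on the value produced by the one before it, which is why a single slip anywhere changes the final answer.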
The problem wasn’t that GPT-3 lacked knowledge. It had seen arithmetic problems in its training data. The problem was that GPT-3 was a next-token predictor. It excelled at pattern-matching: “apple questions usually end with a number.” It could produce a plausible-looking number, often in the right range. But it couldn’t reason: it couldn’t reliably execute a multi-step logical process.
Why This Matters
Reasoning is not a luxury in AI. It’s essential for:
- Math problems: Every step must be correct; one mistake cascades.
- Logic puzzles: Must chain inferences together across sentences.
- Code: Must trace values through multiple lines.
- Common sense: Must apply rules such as “If A is true and B follows from A, then B is true.”
Without reasoning, an AI is a sophisticated autocomplete — good for finishing sentences, useless for solving problems.
The Prevailing Theory (2021–early 2022)
The prevailing view was that scaling alone would not solve reasoning. Many researchers believed that language models fundamentally lacked the architecture for step-by-step reasoning. You’d need:
- Symbolic reasoning engines (like old-school AI)
- Special training objectives (not just next-token prediction)
- Explicit problem-solving modules
GPT-3 and similar models were seen as having hit a ceiling: their multi-step reasoning appeared to plateau no matter how big you made them.
The Team and the Paper
In January 2022, a team at Google (Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, and Zhou) asked a simple question: What if the model just needs to see the reasoning?
They weren’t trying to change the architecture. They weren’t retraining with new objectives. They just modified the prompts — the few-shot examples shown to the model — to include intermediate reasoning steps.
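To make the difference concrete, here is a minimal sketch of how the two kinds of few-shot prompt could be assembled. It is not the paper's code: the helper `build_prompt`, the `EXEMPLAR` dictionary, and the exact "Q:/A:" layout are assumptions for illustration, and the exemplar reuses the apple question from earlier in this section. No model is called; the code only builds the prompt text.

```python
# Hypothetical exemplar, reusing the apple question from this section.
EXEMPLAR = {
    "question": (
        "James has 15 apples. Mary gives him 8 more. "
        "James gives 5 apples to his friend. "
        "How many apples does James have now?"
    ),
    "reasoning": (
        "James starts with 15 apples. Mary gives him 8 more, "
        "so he has 15 + 8 = 23. He gives 5 away, so he has 23 - 5 = 18."
    ),
    "answer": "18",
}


def build_prompt(exemplars, question, with_reasoning):
    """Assemble a few-shot prompt.

    with_reasoning=False: each exemplar shows only the final answer.
    with_reasoning=True:  each exemplar shows intermediate steps first.
    """
    blocks = []
    for ex in exemplars:
        answer = f"The answer is {ex['answer']}."
        if with_reasoning:
            answer = f"{ex['reasoning']} {answer}"
        blocks.append(f"Q: {ex['question']}\nA: {answer}")
    # The new question is left open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


NEW_QUESTION = (
    "James has 15 apples. Mary gives him 8 more. "
    "James gives half of what he now has to his friend. "
    "Then he eats 2 apples. How many apples does James have now?"
)

standard_prompt = build_prompt([EXEMPLAR], NEW_QUESTION, with_reasoning=False)
reasoning_prompt = build_prompt([EXEMPLAR], NEW_QUESTION, with_reasoning=True)

if __name__ == "__main__":
    print(reasoning_prompt)
```

The only difference between the two prompts is the text of the exemplar answer; the model, its weights, and the new question are unchanged.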
The result upended conventional wisdom: reasoning wasn’t missing in large models. It was sleeping. The model had learned to reason from its training data. It just needed permission — and a demonstration — to use that capability.
This was the breakthrough: not bigger models, not new algorithms, but better prompts. The reasoning was already there. You just had to ask the model to show its work.