Context: The Reasoning Problem

By 2022, large language models looked remarkably capable at language tasks. GPT-3 (175 billion parameters) could write essays, answer trivia, summarize documents, and even write code. Scaling laws (Paper 13) showed that performance improved predictably as models grew. So why couldn’t these giants reliably solve a problem that a Class 5 student could solve in seconds?

The Failure Case

Give GPT-3 this question:

James has 15 apples. Mary gives him 8 more. James gives 5 apples to his friend. How many apples does James have now?

GPT-3 produces: 18 apples.

Correct answer: 18 apples. Lucky guess.

But ask it a slightly harder variant:

James has 15 apples. Mary gives him 8 more. James gives half of what he now has to his friend. Then he eats 2 apples. How many apples does James have now?

Asked repeatedly, GPT-3 might say 15 apples, then 20 apples, then 25 apples. Three different answers, all wrong.
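
For reference, the variant can be worked through one step at a time. The short Python sketch below is only an illustration of that arithmetic (the variable name is ours, not from the paper). It uses exact fractions because half of 23 is not a whole number.

```python
from fractions import Fraction

# Step 1: James starts with 15 apples and Mary gives him 8 more.
apples = Fraction(15) + 8        # 23

# Step 2: he gives half of what he now has to his friend.
apples = apples - apples / 2     # 23/2, i.e. 11.5

# Step 3: he eats 2 apples.
apples = apples - 2              # 19/2, i.e. 9.5

print(float(apples))  # 9.5
```

Read literally, the problem leaves James with 9.5 apples, and each intermediate value depends on the one before it.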

The problem wasn’t that GPT-3 lacked knowledge; it had seen arithmetic problems in its training data. The problem was that GPT-3 was a next-token predictor. It excelled at pattern-matching (“apple questions usually end with a number”), so its answers looked plausible. But it couldn’t reason: it couldn’t reliably execute a multi-step logical process.

Why This Matters

Reasoning is not a luxury in AI. It’s essential for:

Math problems: Every step must be correct; one mistake cascades.
Logic puzzles: Ideas must be chained together across sentences.
Code: Values must be traced through multiple lines.
Common sense: “If A is true and B follows from A, then B is true.”

Without reasoning, an AI is a sophisticated autocomplete — good for finishing sentences, useless for solving problems.
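
The point about one mistake cascading can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that each step is correct independently with the same probability; the function name and the 0.9 figure are hypothetical and not measurements of any model.

```python
def chance_all_steps_correct(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every one of num_steps independent steps is correct."""
    return per_step_accuracy ** num_steps

# Hypothetical per-step accuracy of 0.9, chosen only to show the trend.
for steps in (1, 3, 5, 10):
    print(steps, round(chance_all_steps_correct(0.9, steps), 3))
```

Under these assumed numbers, ten steps at 90% accuracy each leave roughly a 35% chance of a fully correct answer.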

The Prevailing Theory (2021–early 2022)

A widely held view was that scaling alone wouldn’t solve reasoning. Many researchers believed that language models fundamentally lacked the architecture for step-by-step reasoning. On this view, you’d need:

Symbolic reasoning engines (like old-school AI)
Special training objectives (not just next-token prediction)
Explicit problem-solving modules

GPT-3 and similar models were seen as having hit a ceiling: they appeared to plateau on multi-step reasoning no matter how big you made them.

The Team and the Paper

In January 2022, a team at Google (Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, and Zhou) asked a simple question: what if the model just needs to see the reasoning?

They weren’t trying to change the architecture. They weren’t retraining with new objectives. They just modified the prompts — the few-shot examples shown to the model — to include intermediate reasoning steps.
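
The difference between the two prompt styles can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual prompts: the exemplar reuses the first apple problem from this section, and the names build_prompt, STANDARD_ANSWER, and REASONED_ANSWER are hypothetical. No model is called; the code only assembles the text a model would receive.

```python
# Hypothetical exemplar built from the first apple problem in this section;
# the wording is illustrative, not the paper's actual prompt text.
EXEMPLAR_QUESTION = (
    "James has 15 apples. Mary gives him 8 more. "
    "James gives 5 apples to his friend. How many apples does James have now?"
)

# Standard few-shot answer: the final result only.
STANDARD_ANSWER = "The answer is 18."

# Chain-of-thought answer: the same result, preceded by intermediate steps.
REASONED_ANSWER = (
    "James starts with 15 apples. Mary gives him 8 more, so he has 15 + 8 = 23. "
    "He gives 5 away, so he has 23 - 5 = 18. The answer is 18."
)

NEW_QUESTION = (
    "James has 15 apples. Mary gives him 8 more. "
    "James gives half of what he now has to his friend. Then he eats 2 apples. "
    "How many apples does James have now?"
)


def build_prompt(exemplars, new_question):
    """Join (question, answer) exemplars and append the unanswered question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


standard_prompt = build_prompt([(EXEMPLAR_QUESTION, STANDARD_ANSWER)], NEW_QUESTION)
cot_prompt = build_prompt([(EXEMPLAR_QUESTION, REASONED_ANSWER)], NEW_QUESTION)

print(cot_prompt)
```

The two prompts are identical except for the exemplar's answer. The model, its weights, and its training objective are untouched; only the demonstration changes.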

The result upended conventional wisdom: reasoning wasn’t missing in large models. It was sleeping. The models had already picked up the ability to reason from their training data; they just needed permission, and a demonstration, to use it.

This was the breakthrough: not bigger models, not new algorithms, but better prompts. The reasoning was already there. You just had to ask the model to show its work.