From Theory to Reality: The Reasoning Model Era — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters

This paper (August 2024) made the empirical case for an idea that reached commercial products just weeks later.

OpenAI o1: The First “Reasoning Model”

In September 2024, OpenAI released o1-preview, and the tech world noticed: the model spent significant time “thinking” before answering. It worked through the problem step by step in an internal chain-of-thought (CoT), showed users a summary of that reasoning rather than the raw trace, and then presented the final answer. On hard math, coding, and science problems, o1 substantially outperformed GPT-4o.

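What was o1 doing? In broad strokes, what this paper described: allocating more compute at test time. The paper studies two mechanisms for doing so: searching against a verifier (for example, best-of-N selection scored by a reward model) and letting the model iteratively revise its own answer. OpenAI has not published o1’s inference procedure in detail; its public materials describe a model trained with reinforcement learning to produce a long internal chain of thought, with accuracy improving as the model is given more time to think.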
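To make best-of-N selection concrete, here is a minimal sketch of the general technique. It is not a description of o1’s internals: `generate` and `score` are hypothetical stand-ins for a sampling call and a learned verifier, and the toy versions below exist only so the example runs.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidate answers and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str, rng: random.Random) -> str:
    # Pretend the model answers "17 + 25" with some noise around 42.
    return str(42 + rng.choice([-2, -1, 0, 0, 1]))


def toy_score(prompt: str, answer: str) -> float:
    # A stand-in "verifier" that prefers answers closer to the true value.
    return float(-abs(int(answer) - 42))


if __name__ == "__main__":
    print(best_of_n("17 + 25 = ?", toy_generate, toy_score, n=16))
```

Raising `n` spends more compute per question; the quality of the result depends on how well the scoring function separates good answers from bad ones.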

o1’s impact was immediate: it redefined user expectations. Suddenly, “thinking harder” before answering became a visible, valued feature, not a bug or waste.

DeepSeek R1: Open-Source Validation

In January 2025 (the same month as rStar-Math), DeepSeek released R1, an open-weights reasoning model built on a similar extended-thinking paradigm. R1 showed that the strategy works beyond OpenAI’s proprietary data and methods: a lab with sufficient compute can train a reasoning model, in R1’s case mainly through large-scale reinforcement learning on tasks with checkable answers.

The research directions are now clear to the entire community:

Use MCTS, similar search, or reinforcement learning to generate training data
Train models on high-quality reasoning traces
At inference, allow the model to “think long” on hard questions

Google Gemini 2.0 Flash Thinking

Google had released its own version, Gemini 2.0 Flash Thinking, as an experimental model in December 2024, continuing the pattern. By early 2025, several major AI labs had a “thinking model” variant, a public sign that test-time compute scaling was becoming standard practice.

System 2 Thinking: The Kahneman Connection

Daniel Kahneman’s “Thinking, Fast and Slow” distinguishes:

System 1: Fast, automatic, intuitive (immediate response)
System 2: Slow, deliberate, logical (deep reasoning)

Test-time compute is giving AI a System 2 mode. The base model (System 1) can solve easy questions instantly. For hard questions, the model engages System 2 — extended reasoning, internal verification, self-critique — before committing to an answer.
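One way to picture the switch between the fast and slow modes is a sampling budget that grows with estimated difficulty. The sketch below is a simplified illustration under stated assumptions: `estimate_difficulty` and `choose_budget` are hypothetical helpers, and using the mean verifier score of a few cheap probe samples as the difficulty signal is a simplification, not the paper’s exact procedure.

```python
from typing import Sequence


def estimate_difficulty(probe_scores: Sequence[float]) -> float:
    """Difficulty in [0, 1]: one minus the mean verifier score of a few probe samples."""
    if not probe_scores:
        raise ValueError("need at least one probe score")
    mean = sum(probe_scores) / len(probe_scores)
    return min(1.0, max(0.0, 1.0 - mean))


def choose_budget(difficulty: float, min_samples: int = 1, max_samples: int = 64) -> int:
    """Interpolate geometrically between min_samples (easy) and max_samples (hard)."""
    ratio = max_samples / min_samples
    return round(min_samples * ratio ** difficulty)


if __name__ == "__main__":
    for scores in ([0.95, 0.9, 1.0], [0.5, 0.4, 0.6], [0.05, 0.0, 0.1]):
        d = estimate_difficulty(scores)
        print(f"difficulty={d:.2f} -> samples={choose_budget(d)}")
```

Easy questions get a single pass, while hard ones receive most of the budget, which is the intuition behind allocating test-time compute per prompt rather than uniformly.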

This framing has been widely adopted in research and product announcements. It’s intuitive to users: “The AI thinks carefully about this.”

Research Directions Opened

The paper’s success unlocked entire research directions:

Monte Carlo Tree Search for LLMs: The next breakthrough (rStar-Math, Paper 24) applies MCTS explicitly to math reasoning, showing that structured search beats random sampling. If MCTS beats best-of-N, what about other search strategies?

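Beam search: Can you search over partial reasoning steps with beam search (keeping only the most promising partial solutions at each step, as scored by a process reward model) instead of sampling N complete answers independently, as best-of-N does? The paper’s math results suggest a qualified yes: beam search beats best-of-N at small generation budgets, but the advantage shrinks, and can reverse, as the budget grows. Speculative decoding, sometimes mentioned alongside it, is a separate technique for speeding up generation rather than a search strategy.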
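A step-level beam search guided by a scoring function might be sketched as follows. This is a toy under stated assumptions: `expand` stands in for a model proposing next reasoning steps, `step_score` stands in for a process reward model, and the arithmetic puzzle (reach 10 from 1) is invented purely so the example runs.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]


def beam_search(
    expand: Callable[[Path], List[str]],
    step_score: Callable[[Path], float],
    beam_width: int,
    depth: int,
) -> Path:
    """Keep the beam_width best partial paths at each step, scored by step_score."""
    beams: List[Path] = [()]
    for _ in range(depth):
        candidates = [path + (step,) for path in beams for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=step_score, reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=step_score)


# Toy problem (assumption): start at 1, apply operations, try to reach TARGET.
OPS = {"+1": lambda v: v + 1, "+2": lambda v: v + 2, "*2": lambda v: v * 2}
TARGET = 10


def value(path: Path) -> int:
    v = 1
    for step in path:
        v = OPS[step](v)
    return v


def toy_expand(path: Path) -> List[str]:
    return list(OPS)


def toy_score(path: Path) -> float:
    # Stand-in for a process reward model: closer to the target is better.
    return -abs(TARGET - value(path))


if __name__ == "__main__":
    for width in (1, 2):
        path = beam_search(toy_expand, toy_score, beam_width=width, depth=3)
        print(width, path, value(path))
```

In this toy, a width of 1 (greedy search) commits early to a locally attractive step and ends two away from the target, while a width of 2 keeps a second hypothesis alive and reaches it. Real step scorers are noisy, which is one reason wider search does not always pay off.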

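Self-critique loops: Can the model learn to criticize and revise its own reasoning? DeepSeek R1’s reasoning traces show this behavior emerging during reinforcement learning: mid-solution, the model flags a possible mistake and reconsiders, along the lines of “I think I made an error, let me reconsider…”.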
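When critique and revision are run as an explicit outer loop rather than learned behavior, the control flow can be sketched like this. It is a minimal illustration, not how any particular model implements self-correction: `critique` and `revise` are hypothetical callables, and the toy versions only check a simple addition.

```python
from typing import Callable, List, Optional, Tuple


def revise_until_accepted(
    draft: str,
    critique: Callable[[str], Optional[str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate critique and revision until the critic has no feedback or rounds run out."""
    answer = draft
    history = [answer]
    for _ in range(max_rounds):
        feedback = critique(answer)
        if feedback is None:  # the critic found nothing to fix
            break
        answer = revise(answer, feedback)
        history.append(answer)
    return answer, history


# Toy stand-ins (assumptions): answers look like "a + b = c".
def toy_critique(answer: str) -> Optional[str]:
    lhs, rhs = answer.split("=")
    a, b = (int(x) for x in lhs.split("+"))
    return None if a + b == int(rhs) else f"{a} + {b} should be {a + b}"


def toy_revise(answer: str, feedback: str) -> str:
    lhs = answer.split("=")[0].strip()
    return f"{lhs} = {feedback.rsplit(' ', 1)[1]}"


if __name__ == "__main__":
    print(revise_until_accepted("2 + 2 = 5", toy_critique, toy_revise))
```

The `max_rounds` cap matters in practice: each round costs another generation, so the loop is itself a test-time compute budget.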

Domain-specific verifiers: For tasks beyond math, can you train lightweight verifiers? Code is a natural candidate, since compilers and unit tests already provide an automatic correctness signal.
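For code, one lightweight verifier is simply running each candidate program against unit tests and using the pass rate as a score. The sketch below is an illustration under assumptions: `unit_test_score` is a hypothetical helper, and it calls `exec` directly, which is only acceptable for trusted toy input. Model-generated code would need a proper sandbox.

```python
from typing import Any, Sequence, Tuple

Case = Tuple[Tuple[Any, ...], Any]  # (arguments, expected result)


def unit_test_score(source: str, func_name: str, cases: Sequence[Case]) -> float:
    """Return the fraction of test cases a candidate implementation passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: unsafe for untrusted code
        func = namespace[func_name]
    except Exception:
        return 0.0  # does not even define the function
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


if __name__ == "__main__":
    cases = [((1, 2), 3), ((0, 0), 0)]
    candidates = [
        "def add(a, b):\n    return a - b",
        "def add(a, b):\n    return a + b",
    ]
    best = max(candidates, key=lambda src: unit_test_score(src, "add", cases))
    print(best)
```

A score like this can be used as the ranking function in best-of-N selection over candidate programs.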

The Bigger Shift

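The industry is moving from “bigger model = better” to “smarter compute allocation = better.” In the paper’s FLOPs-matched comparison, a smaller model given extra test-time compute outperformed a model 14x larger on problems where the smaller model already had a non-trivial success rate; on the hardest problems, additional pretraining still helped more. This has profound implications: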

Democratisation: Smaller labs and universities can build competitive reasoning models without unlimited resources
Efficiency: Spend compute where you need it (hard problems get more thinking time)
Latency-aware design: Users wait longer only for genuinely hard questions
Iterative improvement: Models improve through data self-generation, not just scale
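The trade-off between model size and thinking time can be made concrete with back-of-envelope arithmetic. The sketch below uses the common approximation that a forward pass costs about 2 FLOPs per parameter per generated token; the 7B and 70B sizes are hypothetical, and attention cost, memory bandwidth, and pretraining cost are all ignored.

```python
def inference_flops(params: int, tokens: int) -> int:
    """Approximate inference cost: about 2 FLOPs per parameter per generated token."""
    return 2 * params * tokens


def samples_at_equal_cost(small_params: int, large_params: int) -> int:
    """How many same-length samples the small model can draw for one large-model sample."""
    return large_params // small_params


if __name__ == "__main__":
    small, large = 7_000_000_000, 70_000_000_000  # hypothetical model sizes
    print(inference_flops(small, 1_000))          # cost of one 1,000-token answer
    print(samples_at_equal_cost(small, large))    # samples at equal inference cost
```

Under these assumptions the smaller model can draw ten answers, or think ten times longer, for the cost of one answer from the larger model. Whether that extra budget closes the quality gap depends on the difficulty of the question.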

What This Means for You

If you’re building AI systems, the lesson is clear: don’t stop at base model performance. Invest in inference-time compute scaling: verifiers, search strategies, self-critique. Most frontier labs now ship models that do this.

If you’re studying AI, understand that reasoning isn’t just about model size anymore. It’s about how intelligently you allocate compute during the reasoning process.

The next leap (Paper 24 — rStar-Math) will show that this principle is even more powerful than we initially thought.