This paper (August 2024) laid out the principles behind something the world saw commercialized just weeks later.
OpenAI o1: The First “Reasoning Model”
In September 2024, OpenAI released o1 (initially as o1-preview), and the tech world noticed: the model was spending significant time “thinking” before answering. It works through a problem step by step in an internal chain of thought (CoT), then presents the final answer; users see a summary of that reasoning rather than the raw trace. On hard math, coding, and science benchmarks, OpenAI reported large gains over GPT-4o.
What was o1 doing? OpenAI has not published the full method, but the visible behavior is consistent with what this paper described: allocating more compute at test time by generating longer internal reasoning sequences. The o1 system card states that the model was trained with large-scale reinforcement learning to reason using chain of thought. It does not describe best-of-N selection or reward-model-guided search at inference, so any claim about those mechanisms inside o1 is speculation.
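Best-of-N selection, one of the simplest test-time compute strategies, can be sketched as follows. This is a minimal illustration and not o1's actual method: `toy_generate` and `toy_score` are hypothetical stand-ins for a sampling model and a learned reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidate answers and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins: a noisy "model" that guesses 17 + 25,
# and a "verifier" that prefers answers close to the true sum.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return str(42 + rng.choice([-2, -1, 0, 0, 1, 2]))


def toy_score(prompt: str, answer: str) -> float:
    return -abs(int(answer) - 42)
```

Raising `n` spends more compute per question; with a fixed seed, the best score can only stay the same or improve, because the larger sample contains the smaller one.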
o1’s impact was immediate: it redefined user expectations. Suddenly, “thinking harder” before answering became a visible, valued feature, not a bug or waste.
DeepSeek R1: Open-Source Validation
In January 2025 (the same month as rStar-Math), DeepSeek released R1, an open-weight reasoning model built on a similar extended-thinking paradigm. R1 showed that the strategy works beyond OpenAI’s proprietary data and methods: a lab with sufficient compute can train a reasoning model around test-time compute principles.
Taken together with rStar-Math, the research directions became clear to the community:
- Use search (MCTS, as in rStar-Math) or reinforcement learning (as in R1) to generate training signal
- Train models on high-quality reasoning traces
- At inference, allow the model to “think long” on hard questions
Google Gemini 2.0 Flash Thinking
Google released its own version, Gemini 2.0 Flash Thinking, as an experimental model in December 2024, continuing the pattern. By early 2025, several major AI labs had a “thinking model” variant, a public acknowledgment that test-time compute scaling was becoming standard practice.
System 2 Thinking: The Kahneman Connection
Daniel Kahneman’s “Thinking, Fast and Slow” distinguishes:
- System 1: Fast, automatic, intuitive (immediate response)
- System 2: Slow, deliberate, logical (deep reasoning)
Test-time compute is giving AI a System 2 mode. The base model (System 1) can solve easy questions instantly. For hard questions, the model engages System 2 — extended reasoning, internal verification, self-critique — before committing to an answer.
This framing has been widely adopted in research and product announcements. It’s intuitive to users: “The AI thinks carefully about this.”
Research Directions Opened
The paper’s success unlocked entire research directions:
Monte Carlo Tree Search for LLMs: The next breakthrough (rStar-Math, Paper 24) applies MCTS explicitly to math reasoning, showing that structured search beats random sampling. If MCTS beats best-of-N, what about other search strategies?
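Beam search over reasoning steps: Can you use beam search (keeping several partial reasoning paths alive and pruning them with a step-level scorer) instead of sampling N complete answers independently, as best-of-N does? Early results are promising but mixed: step-level search tends to help most at small sampling budgets, and its advantage over best-of-N can shrink as the budget grows.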
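To make the contrast with best-of-N concrete, here is a minimal sketch of beam search over reasoning steps. The arithmetic puzzle, `toy_expand`, and `toy_score` are hypothetical: in a real system the expander would be a language model proposing next steps and the scorer would be a learned step-level verifier.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def beam_search(
    expand: Callable[[Path], Sequence[str]],
    score: Callable[[Path], float],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Path:
    """Grow partial paths step by step, keeping only the top beam_width."""
    beam: List[Path] = [()]
    for _ in range(max_depth):
        candidates = [path + (step,) for path in beam for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)  # best partial paths first
        beam = candidates[:beam_width]
    return max(beam, key=score)


# Toy task (hypothetical): start at 1 and reach TARGET using "+1", "+3", "*2".
TARGET = 19
MOVES = {"+1": lambda v: v + 1, "+3": lambda v: v + 3, "*2": lambda v: v * 2}


def value_of(path: Path) -> int:
    value = 1
    for step in path:
        value = MOVES[step](value)
    return value


def toy_expand(path: Path) -> Sequence[str]:
    return tuple(MOVES)


def toy_score(path: Path) -> float:
    return -abs(TARGET - value_of(path))
```

Calling `beam_search(toy_expand, toy_score)` scores 12 candidate paths per level after the first instead of enumerating all 81 four-step paths. The pruning is what saves compute, and it is also the risk: a narrow beam can discard a path that only looks good later.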
Self-critique loops: Can the model learn to criticize and revise its own reasoning? DeepSeek R1’s reasoning traces show this kind of behavior (“I think I made an error, let me reconsider…”), which its authors report emerged during reinforcement learning rather than being added as a separate phase.
Domain-specific verifiers: For tasks beyond math, can you train lightweight verifiers? Code is a natural candidate, since unit tests and compilers already provide a correctness signal.
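An explicit critique-and-revise loop, run outside the model, might look like the sketch below. It illustrates the general idea only and says nothing about how R1 works internally; `toy_draft`, `toy_critique`, and `toy_revise` are hypothetical stand-ins for model calls.

```python
from typing import Callable, Optional, Tuple


def revise_until_accepted(
    question: str,
    draft: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Return (answer, number_of_revisions). critique returns None to accept."""
    answer = draft(question)
    for round_idx in range(max_rounds):
        feedback = critique(question, answer)
        if feedback is None:
            return answer, round_idx
        answer = revise(question, answer, feedback)
    return answer, max_rounds  # budget exhausted; last revision is unchecked


# Hypothetical stand-ins for the three model calls.
def toy_draft(question: str) -> str:
    return "124"  # deliberately wrong first attempt at 12 * 12


def toy_critique(question: str, answer: str) -> Optional[str]:
    return None if int(answer) == 144 else "Recheck the multiplication."


def toy_revise(question: str, answer: str, feedback: str) -> str:
    return "144"
```

The `max_rounds` cap is the test-time compute knob here: each extra round costs one critique call and one revision call.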
The Bigger Shift
The industry is moving from “bigger model = better” to “smarter compute allocation = better.” On some reasoning tasks, a 7B model with well-allocated test-time compute can rival a 70B model, though the gain depends on how hard the problems are. This has profound implications:
- Democratisation: Smaller labs and universities can build competitive reasoning models without unlimited resources
- Efficiency: Spend compute where you need it (hard problems get more thinking time)
- Latency-aware design: Users wait longer only for genuinely hard questions
- Iterative improvement: Models improve through data self-generation, not just scale
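A back-of-the-envelope calculation shows why the small-model trade can pay off. The sketch below assumes the common rule of thumb of roughly 2 FLOPs per parameter per generated token (it ignores attention cost), and `samples_for` is a hypothetical difficulty-to-budget schedule, not one taken from the paper.

```python
from typing import Sequence


def inference_flops(params: float, tokens: int, samples: int = 1) -> float:
    """Approximate generation cost: ~2 FLOPs per parameter per token."""
    return 2.0 * params * tokens * samples


def samples_for(estimated_difficulty: float, max_samples: int = 16) -> int:
    """Map a difficulty estimate in [0, 1] to a sample budget (hypothetical)."""
    if not 0.0 <= estimated_difficulty <= 1.0:
        raise ValueError("estimated_difficulty must be in [0, 1]")
    return max(1, round(estimated_difficulty * max_samples))


def adaptive_cost(params: float, tokens: int, difficulties: Sequence[float]) -> float:
    """Total cost when each question gets a budget matched to its difficulty."""
    return sum(inference_flops(params, tokens, samples_for(d)) for d in difficulties)


# Under this approximation, ten 512-token samples from a 7B model cost
# the same as one 512-token sample from a 70B model.
small_x10 = inference_flops(7e9, 512, samples=10)
large_x1 = inference_flops(70e9, 512, samples=1)
```

With difficulties `[0.0, 0.5, 1.0]`, the adaptive schedule draws 1, 8, and 16 samples (25 in total) instead of 48 under a uniform budget of 16 per question. That is the “spend compute where you need it” point in numbers.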
What This Means for You
If you’re building AI systems, the lesson is clear: don’t stop at base model performance. Invest in inference-time compute scaling: verifiers, search strategies, self-critique. Most frontier models now do some form of this.
If you’re studying AI, understand that reasoning isn’t just about model size anymore. It’s about how intelligently you allocate compute during the reasoning process.
The next leap (Paper 24 — rStar-Math) will show that this principle is even more powerful than we initially thought.