Paper 24

Further Reading: MCTS, Self-Evolution, and Beyond

The Original Paper

“rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking”

  • Authors: Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Ruofei Zhang, Yin Zhang, Mao Yang, Weizhu Chen
  • Organisation: Microsoft Research Asia
  • Published: arXiv 2501.04519 (January 2025)
  • Link: https://arxiv.org/abs/2501.04519
  • Key results: Qwen2.5-Math-7B improves from 58.8% to 90.0% on MATH through 4 rounds of self-evolution

Essential Prerequisites and Companions

Let’s Verify Step by Step (Lightman et al., 2023; Paper 16)

  • The foundation for Process Reward Models (PRMs)
  • Crucial to understand before reading rStar-Math
  • Shows how to score intermediate reasoning steps

Scaling LLM Test-Time Compute Optimally (Paper 23)

  • Directly precedes this paper in the ainiketan series
  • Explains why inference-time computation matters
  • Sets up the motivation for rStar-Math’s approach

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)

  • arXiv:2201.11903
  • Foundation for understanding why step-by-step reasoning works
  • Paper 14 in this series

Parallel and Competing Work

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, January 2025)

  • arXiv:2501.12948
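  • Independent evidence that reasoning improves when a model trains on its own generated solutions, here via large-scale reinforcement learning rather than MCTS
  • Shows that the broader idea is not specific to Microsoft’s setup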
  • Key insight: open-source reasoning models can compete with proprietary o1

OpenAI o1 System Card (OpenAI, September 2024)

  • First widely available model series trained to reason at length before answering
  • Describes reasoning via chain-of-thought without exposing full methods
  • Reference point for comparing rStar-Math results

Anthropic Constitutional AI (Paper 22 in this series)

  • Appears earlier in the series and sets the stage for training on feedback
  • Relevant context: how to train models using feedback signals instead of direct human labels

Technical Foundations

Monte Carlo Tree Search: Bandit Based Monte-Carlo Planning (Kocsis & Szepesvári, 2006)

  • arXiv:cs/0611159
  • Introduces UCT, which applies the UCB bandit rule to tree search
  • Dense but worth reading for a deep understanding of UCB and selection
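
The selection rule that UCT borrows from UCB1 can be sketched in a few lines. This is a minimal illustration, not code from Kocsis & Szepesvári or from rStar-Math: the names `ucb1_score` and `select_child`, the dictionary layout, and the exploration constant c = sqrt(2) are assumptions made for the example.

```python
import math

def ucb1_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children):
    """Return the name of the child with the highest UCB1 score.

    `children` maps a child name to (total_value, visits).
    This layout is hypothetical, chosen only for the example.
    """
    parent_visits = sum(visits for _, visits in children.values())
    return max(
        children,
        key=lambda name: ucb1_score(*children[name], parent_visits),
    )

children = {"a": (9.0, 10), "b": (3.0, 4), "c": (0.0, 0)}
print(select_child(children))  # the unvisited child "c" is selected
```

Once every child has been visited, the bonus term shrinks for frequently visited children, so a less-explored child with a slightly lower average can still win the selection.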

UCB Bandit Algorithm: Finite-time Analysis of the Multiarmed Bandit Problem (Auer et al., 2002)

AlphaZero: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (Silver et al., 2017)

  • arXiv:1712.01815
  • Earlier example of MCTS + self-play in AI
  • Shows the power of bootstrapping without human data

Outcome Reward Models (ORMs) for Process Supervision (OpenAI, 2023)

  • Contrasts with PRMs
  • Useful for understanding the difference: outcome vs. process
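
The outcome/process distinction can be made concrete with a toy scorer. The sketch below is an illustration under stated assumptions, not the method of any paper listed here: `outcome_score`, `process_score`, and the hand-written per-step scores are hypothetical, and taking the minimum step score is only one possible way to aggregate.

```python
def outcome_score(final_answer, reference):
    """Outcome-style signal: looks only at the final answer."""
    return 1.0 if final_answer == reference else 0.0

def process_score(step_scores):
    """Process-style signal: one score per step, aggregated by the minimum.

    Under this aggregation a solution is only as good as its weakest step.
    """
    return min(step_scores) if step_scores else 0.0

# A solution that reaches the right answer through a flawed middle step.
step_scores = [0.95, 0.20, 0.90]  # hypothetical per-step scores from a PRM
print(outcome_score("42", "42"))   # 1.0
print(process_score(step_scores))  # 0.2
```

The outcome signal accepts this solution, while the process signal flags the weak second step. That gap is the reason step-level scoring matters for search.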

Towards Measuring the Semantics of Language Models (Various papers)

  • Understanding what models learn from step-by-step data
  • Why training on reasoning traces is powerful

Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)

  • arXiv:2203.11171
  • Precursor to modern test-time compute (generates multiple paths, votes)
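
The "generate multiple paths, vote" idea reduces to a majority vote over final answers. The following is a minimal sketch, assuming the final answers have already been extracted from each sampled reasoning path; the function name `majority_vote` and the sample values are made up for the example.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Final answers extracted from five sampled reasoning paths (made-up values).
sampled = ["18", "18", "20", "18", "17"]
print(majority_vote(sampled))  # ('18', 0.6)
```

The vote share gives a rough confidence signal: a low share suggests the sampled paths disagree and more samples or a verifier may be needed.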

Mathematics Benchmarks and Datasets

MATH Dataset (Hendrycks et al., 2021)

  • arXiv:2103.03874
  • The benchmark used in rStar-Math
  • 12,500 competition-level problems
  • Standard evaluation for math reasoning

GSM8K: Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021)

  • arXiv:2110.14168
  • Simpler benchmark, good for early-stage model development
  • Good sanity check before tackling MATH

Measuring Mathematical Problem Solving With the MATH Dataset (Hendrycks et al., 2021)

  • The paper behind the MATH dataset entry above; read it for the extended analysis of the benchmark
  • Difficulty levels (1 to 5) and problem-type breakdowns across seven subjects

AIME and AMC Competitions

  • Official source: https://www.maa.org/
  • AIME problems are generally harder than the typical MATH problem
  • Used as evaluation benchmarks for very strong models

Implementation Resources

Open-source rStar Implementations

  • Official code and data: the paper points to https://github.com/microsoft/rStar
  • Look for Microsoft Research Asia repositories
  • Community implementations will follow

MCTS Libraries in Python

  • mcts package (PyPI)
  • planner module in RL libraries
  • Reference implementations in game-playing AI

LLM APIs for Experimentation

  • OpenAI API (o1-preview for comparison)
  • Anthropic API (Claude)
  • Open-source models: Qwen, Llama, Mistral (HuggingFace)

Broader Context: The Reasoning Revolution

A Survey on Self-Evolving AI Systems

  • Emerging area of research
  • Examines bootstrapping and self-improvement mechanisms
  • Future direction for the field

Reasoning Models and Their Applications

  • How to use reasoning models in production systems
  • Latency-accuracy trade-offs
  • Cost considerations

Open Research Questions

After reading rStar-Math, consider exploring:

  1. Self-evolution beyond math: Can similar approaches work for code, science, logic puzzles?

  2. Better verifiers: Can you train PRMs more efficiently? Do weak PRMs degrade self-evolution?

  3. Scaling laws for self-evolution: How many rounds are needed for different model sizes? Is there a formula?

  4. Multi-task self-evolution: Can a single self-evolved model handle multiple domains (math + code + science)?

  5. Human-in-the-loop: What if humans provide weak feedback instead of automatic verification? How does this change the approach?

  6. Latency optimization: Can parallel MCTS reduce wall-clock time? How do you generate training data faster?

  7. Transfer learning: Does a self-evolved model on MATH transfer well to AIME or IMO problems?


Related Concepts

Reinforcement Learning from Human Feedback (RLHF)

  • Used to align models with human preferences
  • Complementary to rStar-Math’s automatic verification approach

Curriculum Learning

  • Training on easy problems first, then hard ones
  • rStar-Math’s 4 rounds are a form of implicit curriculum

Active Learning

  • Selecting which examples to label / train on
  • MCTS naturally generates “hard examples” worth training on

Meta-Learning

  • Learning to learn across rounds
  • rStar-Math has a meta aspect: each round yields a stronger policy model and reward model, which in turn generate better training data for the next round

Community and Discussions

OpenAI Research Blog: Updates on reasoning model developments

DeepSeek/Microsoft Research: Papers and technical reports on self-play and MCTS

Anthropic research: Constitutional AI and reasoning work

Twitter/X discussions: Real-time commentary from AI researchers on new papers

Alignment Research Center (ARC): Work on interpretability and process-based verification


Closing Message

You have finished the ainiketan.in paper series on AI reasoning. Starting from Turing’s 1950 question “Can machines think?” you traced the path through:

  • Chain-of-Thought (2022)
  • Verification (2023)
  • Test-Time Compute (2024)
  • Self-Evolution (2025)

The frontier is moving fast. By the time you read this, there will be new papers, new benchmarks, new methods. But the principles you’ve learned will persist:

Reason step-by-step. Verify each step. Allocate compute wisely. Learn from your own search.

These principles will guide the next generation of reasoning models.

Congratulations on completing the series. The field needs thoughtful practitioners who understand not just the latest method, but the underlying principles. That’s you.

Keep reading. Keep building. The frontier awaits.