Section 04

The Mathematics Behind MCTS and Self-Evolution

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking 2025

Prerequisites

Read first: Paper 23 (Test-Time Compute), which explains Best-of-N and the basic idea of test-time compute. Also: Upper Confidence Bound.


UCB Formula for MCTS Node Selection

In Monte Carlo Tree Search, at each step you must decide: which partial solution should I explore next?

The Upper Confidence Bound (UCB) formula balances two goals:

  • Exploitation: Go back to branches that have worked well (high average reward)
  • Exploration: Try branches that haven’t been visited much (might be better!)

The formula is:

$$\text{UCB}(v) = \frac{Q(v)}{N(v)} + C \sqrt{\frac{\ln(N(\text{parent}))}{N(v)}}$$

Where:

  • $Q(v)$ = total reward accumulated at node $v$ (sum of all rollout rewards)
  • $N(v)$ = number of times node $v$ has been visited
  • $N(\text{parent})$ = number of times the parent node has been visited
  • $C$ = exploration constant (typically $\sqrt{2} \approx 1.414$)

Interpretation:

  • First term $\frac{Q(v)}{N(v)}$ = average reward at this node. Exploitation: favor high-reward branches
  • Second term $C \sqrt{\frac{\ln(N(\text{parent}))}{N(v)}}$ = exploration bonus. Penalizes nodes with many visits, rewards nodes with few visits

Worked Example: UCB in Action

Scenario: A MCTS search for a math problem. The parent node has been visited $N(\text{parent}) = 30$ times. Three child nodes represent different solution strategies:

Node A (cautious approach):

  • $Q(A) = 8.4, N(A) = 12$
  • $\text{UCB}(A) = \frac{8.4}{12} + 1.414 \times \sqrt{\frac{\ln(30)}{12}}$
  • $= 0.700 + 1.414 \times \sqrt{\frac{3.401}{12}}$
  • $= 0.700 + 1.414 \times \sqrt{0.283}$
  • $= 0.700 + 1.414 \times 0.532$
  • $= 0.700 + 0.752 = \mathbf{1.452}$

Node B (moderate approach):

  • $Q(B) = 2.1, N(B) = 3$
  • $\text{UCB}(B) = \frac{2.1}{3} + 1.414 \times \sqrt{\frac{\ln(30)}{3}}$
  • $= 0.700 + 1.414 \times \sqrt{\frac{3.401}{3}}$
  • $= 0.700 + 1.414 \times \sqrt{1.134}$
  • $= 0.700 + 1.414 \times 1.065$
  • $= 0.700 + 1.507 = \mathbf{2.207}$

Node C (bold approach):

  • $Q(C) = 0.5, N(C) = 1$
  • $\text{UCB}(C) = \frac{0.5}{1} + 1.414 \times \sqrt{\frac{\ln(30)}{1}}$
  • $= 0.500 + 1.414 \times \sqrt{3.401}$
  • $= 0.500 + 1.414 \times 1.844$
  • $= 0.500 + 2.608 = \mathbf{3.108}$

MCTS selects Node C (highest UCB = 3.108).

Why? Even though Node A has the highest average reward (8.4/12 = 0.70), it’s been heavily explored. Node C barely touched (1 visit) deserves more attention. The exploration bonus for C overwhelms its lower average reward.

Key insight: MCTS prevents getting stuck in local optima by forcing exploration of under-visited branches.


PRM Step Scoring

The Process Reward Model outputs a score $p_i \in [0, 1]$ for each solution step $i$.

Step-level accuracy: How likely is step $i$ to be on the path to the correct answer?

The overall solution score can be computed as:

Product score: $$\text{score}{\text{product}} = \prod{i=1}^{n} p_i$$

This penalizes heavily if any step is weak. Example: if steps are [0.90, 0.85, 0.92], product score = 0.706.

Minimum score: $$\text{score}_{\text{min}} = \min_i p_i$$

This identifies the weakest link. Example: min([0.90, 0.85, 0.92]) = 0.85.

rStar-Math uses these scores to filter solutions in the MCTS rollout:

  • Prune branches where $\text{score}_{\text{min}} < 0.5$ (weakest step is unreliable)
  • Expand branches where $\text{score}_{\text{min}} > 0.8$ (all steps look good)

Round-by-Round Improvement Analysis

The paper reports accuracy across 4 rounds:

RoundAccuracyImprovementImprovement %
0 (base)42.0%
168.4%+26.4 pp+62.9% relative
278.2%+9.8 pp+14.3% relative
385.3%+7.1 pp+9.1% relative
490.0%+4.7 pp+5.5% relative

Observation: Diminishing Returns

The improvement per round is decreasing: 26.4 → 9.8 → 7.1 → 4.7. This is characteristic of a bootstrapping process hitting saturation.

Analysis:

If we model the ceiling at ~92% (frontier performance) and assume geometric decay, the remaining improvement is:

  • Round 0 remaining: 92% - 42% = 50 pp
  • Round 1 remaining: 92% - 68.4% = 23.6 pp (reduction factor ≈ 0.47)
  • Round 2 remaining: 92% - 78.2% = 13.8 pp (reduction factor ≈ 0.58)
  • Round 3 remaining: 92% - 85.3% = 6.7 pp (reduction factor ≈ 0.49)
  • Round 4 remaining: 92% - 90% = 2 pp

The reduction factor is roughly constant (~0.5), suggesting an exponential decay:

$$\text{Accuracy}_{\text{round } n} = 92 - 50 \times 0.5^n$$

Verify:

  • Round 0: 92 - 50 = 42 ✓
  • Round 1: 92 - 25 = 67 (actual: 68.4, close!)
  • Round 2: 92 - 12.5 = 79.5 (actual: 78.2, close!)
  • Round 4: 92 - 3.125 = 88.875 (actual: 90, reasonable)

Why diminishing returns?

  • Early rounds: the model and PRM both improve dramatically
  • Late rounds: MCTS has already found most of the good solutions in the dataset; further search yields redundant data
  • Model capacity: a 7B model can’t reason perfectly; there’s a ceiling

Supervised Fine-Tuning (SFT) Loss

For each solution collected in the MCTS phase, we fine-tune the model using standard cross-entropy loss:

$$L_{\text{SFT}} = -\frac{1}{T} \sum_{t=1}^{T} \log P(\text{token}t | \text{tokens}{<t}, \text{problem})$$

Where:

  • $T$ = total number of tokens in the solution
  • $\text{token}_t$ = the $t$-th token in the correct solution
  • $P(\text{token}_t | \ldots)$ = probability the model assigns to that token

Key: We only train on tokens from solutions that:

  1. Executed correctly (Python verification passed)
  2. Had high PRM scores (good reasoning quality)

This biases the model toward high-quality solutions.


Comparison to Best-of-N (Paper 23)

Recall from Paper 23: $P(\text{success with } N \text{ samples}) = 1 - (1-p)^N$

Best-of-N approach:

  • Generate N=10 solutions from 42%-accurate base model
  • Expected successes in N samples: $1 - 0.58^{10} \approx 99.8%$… sounds great!
  • But: need a reliable PRM to pick the right one; many solutions are low-quality reasoning

MCTS approach:

  • Generate fewer total samples, but guided by MCTS
  • All selected solutions have high reasoning quality
  • Training on these solutions directly improves the base model
  • Result: model becomes 68% accurate on its own (doesn’t need test-time compute anymore)

rStar-Math is more ambitious: it’s not content to just improve inference accuracy; it improves the base model itself.


Why Four Rounds?

Why not 10 rounds? Or 2?

The paper chose 4 because:

  • Round 1: Biggest improvements (data quality jump is largest)
  • Rounds 2–3: Good improvements with reasonable compute
  • Round 4: Hits diminishing returns; Round 5 would be minimal additional gain
  • Computational budget: Each round requires running MCTS on 12,500 problems, which is expensive

A practical tradeoff: 4 rounds gives 90% accuracy with reasonable resources.


Parameter Counts and Compute

  • Qwen-7B: 7 billion parameters
  • PRM training: ~100–200 million parameters (much smaller)
  • MCTS compute: Running MCTS search on 12,500 problems is the main bottleneck

Total training time: Estimated 1–2 GPU-months of compute per round, so 4–8 GPU-months total. For comparison, training a model from scratch is 100s of GPU-months.


Convergence and Stopping

The self-evolution loop stops when:

  1. Accuracy plateaus: Improvement per round falls below a threshold
  2. Compute budget exhausted: Resources run out
  3. PRM quality degrades: If PRM becomes less reliable, it might select worse solutions, and performance drops

In practice, 4 rounds is empirically optimal for the MATH dataset.