Upper Confidence Bound (UCB)

Appears in 1 paper

A formula that balances exploitation (choosing nodes with high average reward) and exploration (trying under-explored nodes).

As used in Paper 24, rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

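During MCTS node selection, each candidate node gets a score: UCB = average reward + C × exploration bonus. The first term rewards exploitation (nodes that have paid off so far); the second rewards exploration (nodes with few visits). The constant C sets how much weight exploration receives. Because the exploration bonus shrinks as a node accumulates visits, the average-reward term increasingly dominates, and MCTS tends to settle on the best path.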
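As a rough illustration, the score can be sketched in Python. The text gives only the generic shape (average reward plus C times a bonus), so the specific bonus used here, sqrt(ln(parent visits) / node visits), is an assumption borrowed from the common UCT variant. The function names, the dictionary fields, and the default C = 1.4 are likewise hypothetical choices made for this example.

```python
import math


def ucb_score(total_reward, visits, parent_visits, c=1.4):
    """Score one node: average reward + c * exploration bonus.

    Assumes the common UCT bonus sqrt(ln(parent_visits) / visits);
    the glossary text does not specify the exact bonus term.
    Expects parent_visits >= 1.
    """
    if visits == 0:
        # An unvisited node has no average yet; treat it as maximally attractive.
        return math.inf
    average_reward = total_reward / visits
    exploration_bonus = math.sqrt(math.log(parent_visits) / visits)
    return average_reward + c * exploration_bonus


def select_child(children, parent_visits, c=1.4):
    """Return the child with the highest UCB score.

    `children` is a list of dicts with hypothetical keys
    "total_reward" and "visits".
    """
    return max(
        children,
        key=lambda ch: ucb_score(ch["total_reward"], ch["visits"], parent_visits, c),
    )
```

With a fixed average reward, the score falls as a node's visit count grows, which is the mechanism behind the shift from exploration to exploitation. Setting c to 0 removes the bonus entirely and makes selection purely greedy on average reward.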