Policy Model

Appears in 1 paper

The language model being trained and improved across rounds.

As used in Paper 24 (rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking):

In rStar-Math, the policy model starts at 42% accuracy and improves to 90% through self-evolution.
