One-Sentence Summary
Reward each reasoning step, not just the final answer, and your AI will learn to think better.
The Full Summary
Problem
Large language models (LLMs) can sample many candidate solutions to a problem, but the standard way to pick the best one, outcome supervision (judging only whether the final answer is right), is noisy. A solution can reach the right answer through wrong reasoning (a lucky guess), or reason correctly and then slip on the final answer. An outcome-supervised model gets the same training signal for a lucky guess as for a sound solution, so it ends up rewarding luck as well as thinking.
Idea
Instead of judging only the final answer, judge each step of the reasoning. Train a process reward model (PRM) on step-level labels collected from humans, who mark each step as positive (correct), negative (incorrect), or neutral (ambiguous). The PRM learns to predict, for every step, how likely it is to be correct. At inference, use the PRM to rank candidate solutions by the quality of their reasoning, not just the correctness of the answer.
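The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: `Candidate`, `prm_score`, and `select_best` are hypothetical names, and the hard-coded step probabilities stand in for the outputs of a trained PRM.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Candidate:
    """One sampled solution with hypothetical PRM outputs for each step."""
    answer: str
    step_probs: list[float]  # assumed P(step is correct), one value per step


def prm_score(candidate: Candidate) -> float:
    """Score a solution as the product of its per-step probabilities."""
    return prod(candidate.step_probs)


def select_best(candidates: list[Candidate]) -> Candidate:
    """Best-of-N selection: return the candidate with the highest PRM score."""
    return max(candidates, key=prm_score)


# Made-up probabilities for three sampled solutions to the same problem.
candidates = [
    Candidate(answer="42", step_probs=[0.9, 0.2, 0.9]),   # one shaky step
    Candidate(answer="41", step_probs=[0.95, 0.9, 0.9]),  # sound throughout
    Candidate(answer="42", step_probs=[0.6, 0.6, 0.6]),   # mediocre throughout
]
best = select_best(candidates)
print(best.answer)
```

In this toy example the selector prefers the solution whose every step looks sound, even though a different answer appears more often among the samples, which is where PRM ranking can differ from majority voting.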
Key Numbers
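- About 800,000 step-level labels across roughly 75,000 solutions to about 12,000 problems (PRM800K dataset)
- 500 math problems from the MATH benchmark used for evaluation
- Roughly 10 labeled steps per solution on average (800,000 labels over about 75,000 solutions)
- Best-of-N selection with the PRM solves 78.2% of test problems, versus 72.4% for the ORM and 69.6% for majority voting (N = 1860)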
Indian Analogy
Think of a teacher who can grade homework in two ways: (1) check only whether the final answer is right (ORM), or (2) check each line of working to see whether the reasoning is sound (PRM). The second approach shows the teacher where the student went wrong, so the feedback is more useful.
The Formula
ORM: R = 1 if answer correct, else 0 [binary training label, outcome only]
PRM: R = p₁ × p₂ × ... × p_T [solution score: product of per-step probabilities]
Where p_i is the PRM's predicted probability that step i is correct, and T is the number of steps in the solution.
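To see why the two signals differ, here is a small sketch that applies both to a "lucky guess" and to a sound solution. The function names and the probabilities are made up for illustration; the product is computed in log space, a common choice for numerical stability that is an assumption here and not something the summary specifies.

```python
import math


def orm_label(final_answer: str, reference: str) -> int:
    """Outcome label: 1 if the final answer matches the reference, else 0."""
    return 1 if final_answer.strip() == reference.strip() else 0


def prm_solution_score(step_probs: list[float]) -> float:
    """Product of per-step probabilities, computed in log space."""
    if any(p <= 0.0 for p in step_probs):
        return 0.0  # a step with zero probability zeroes the whole product
    return math.exp(sum(math.log(p) for p in step_probs))


# Hypothetical "lucky guess": right answer, but step 2 is almost surely wrong.
lucky_probs = [0.9, 0.05, 0.8]
# Hypothetical sound solution: same answer, every step looks correct.
sound_probs = [0.9, 0.9, 0.8]

lucky_outcome = orm_label("7", "7")
sound_outcome = orm_label("7", "7")
lucky_score = prm_solution_score(lucky_probs)
sound_score = prm_solution_score(sound_probs)

print(lucky_outcome, sound_outcome)
print(round(lucky_score, 3), round(sound_score, 3))
```

Both solutions get the same outcome label, while the step-level score separates them. One property of the product worth keeping in mind: every extra step multiplies in another value at most 1, so longer solutions tend to score lower.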
What Comes Next
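This technique is widely seen as a precursor to reasoning-focused systems such as OpenAI o1 and to the broader “test-time compute” paradigm (Paper 23), where you allocate more computation at inference to pick better solutions. Checking intermediate steps is also central to proof systems such as AlphaProof, although the training details of these systems are not fully public, so how directly they build on process supervision is unclear.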
Read More
- Next paper: Paper 17: LLaMA — Open-Source Foundation Models
- Previous paper: Paper 15: Training Language Models to Follow Instructions from Human Feedback (InstructGPT)
- Related: Paper 14: Chain-of-Thought Prompting — CoT creates the steps that PRMs evaluate
- Related: Paper 23: Scaling Test-Time Compute — builds on process supervision to scale reasoning
Key Takeaway
Process supervision is the right way to reward reasoning. When you want an AI to think well, don’t reward it only for the final answer. Reward the intermediate steps. This is how humans learn, and it’s how AIs should be trained.