Summary: The One-Sentence Version — Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling

One-Sentence Summary

Reward each reasoning step, not just the final answer, and your AI will learn to think better.

The Full Summary

Problem

Large language models (LLMs) can generate many candidate solutions to a problem, but the standard way to pick the best one, outcome supervision (judging only whether the final answer is right), gives a noisy signal. A solution can reach the right answer through wrong reasoning (a lucky guess), or reason correctly throughout and then slip on the final answer. An outcome-supervised reward model sees only the result, so it rewards the first case and penalizes the second: it rewards luck rather than sound reasoning.
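To make the "lucky guess" problem concrete, here is a minimal sketch of outcome labeling. The `Solution` class, the `outcome_label` helper, and the toy fraction problem are illustrative assumptions, not code or data from the paper.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]      # reasoning steps, in order
    final_answer: str     # what the solution reports at the end

def outcome_label(solution: Solution, reference_answer: str) -> int:
    """Outcome supervision: 1 if the final answer matches the reference, else 0."""
    return int(solution.final_answer.strip() == reference_answer.strip())

# Toy problem (hypothetical): simplify 16/64. Reference answer: 1/4.
sound = Solution(
    steps=["16/64 = (16*1)/(16*4)", "Cancel the common factor 16", "Result: 1/4"],
    final_answer="1/4",
)
lucky = Solution(
    steps=["16/64: cross out the 6 in both numbers", "Result: 1/4"],
    final_answer="1/4",
)

labels = [outcome_label(s, "1/4") for s in (sound, lucky)]
print(labels)  # [1, 1]: the invalid "cancellation" gets the same reward
```

Both solutions receive the same label even though only one of them uses a valid method, which is exactly the noise described above.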

Idea

Instead of judging only the final answer, judge each step of the reasoning in the context of the steps before it. Train a process reward model (PRM) on step-level labels collected from human labelers, who mark each step as positive (correct), negative (incorrect), or neutral (ambiguous). The PRM learns to predict the probability that each step is correct. At inference, use the PRM to rank candidate solutions by the quality of their reasoning, not just the correctness of the answer.
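The inference-time ranking can be sketched as follows. This is a minimal illustration that assumes the solution-level score is the product of per-step probabilities; the `step_probs` values are made-up numbers standing in for real PRM outputs, and `prm_score` and `best_of_n` are hypothetical helper names.

```python
import math

def prm_score(step_probs):
    """Solution-level score: product of per-step correctness probabilities."""
    return math.prod(step_probs)

def best_of_n(candidates):
    """Return the candidate whose reasoning the (stand-in) PRM rates highest."""
    return max(candidates, key=lambda c: prm_score(c["step_probs"]))

# Hypothetical candidates; step_probs are invented, not real model outputs.
candidates = [
    {"answer": "42", "step_probs": [0.98, 0.15, 0.97]},  # one doubtful step
    {"answer": "42", "step_probs": [0.95, 0.93, 0.96]},  # consistently sound
    {"answer": "41", "step_probs": [0.90, 0.60, 0.40]},  # reasoning drifts
]

best = best_of_n(candidates)
print(best["answer"], round(prm_score(best["step_probs"]), 3))  # 42 0.848
```

Note that the first and second candidates give the same final answer, yet the ranking separates them because one step in the first looks unreliable.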

Key Numbers

800,000 step-level labels across 75,000 solutions to 12,000 problems (PRM800K dataset)
500 held-out problems from the MATH benchmark used for evaluation
About 10 to 11 labeled steps per solution on average (800,000 / 75,000)
78.2% of test problems solved with PRM best-of-N selection (N = 1,860), versus 72.4% for the outcome reward model (ORM) and 69.6% for majority voting

Indian Analogy

Picture two teachers grading homework: the first only checks whether the final answer is right or wrong (ORM); the second checks each line of working to see whether the reasoning is sound (PRM). The second teacher understands the student better and can give better feedback.

The Formula

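ORM: training label y = 1 if the final answer is correct, else 0; the model's score R is its predicted probability that y = 1  [outcome only]

PRM: R = p_1 × p_2 × ... × p_T  [product of per-step probabilities]

Where p_i is the PRM's predicted probability that step i is correct and T is the number of steps in the solution.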
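A short worked example shows how the product behaves. The probabilities below are illustrative numbers chosen for the arithmetic, not values from the paper; the log form is a standard equivalent that avoids numerical underflow on long solutions.

```latex
% PRM solution score and its equivalent log form
\[
R_{\mathrm{PRM}} = \prod_{i=1}^{T} p_i
\quad\Longleftrightarrow\quad
\log R_{\mathrm{PRM}} = \sum_{i=1}^{T} \log p_i
\]

% Four consistently strong steps (illustrative values)
\[
0.95 \times 0.95 \times 0.95 \times 0.95 = 0.81450625
\]

% Same solution with one doubtful step, p_3 = 0.20
\[
0.95 \times 0.90 \times 0.20 \times 0.95 = 0.16245
\]
```

A single low-probability step pulls the whole score down, which is why a solution with a right answer but one unsound step ranks below a solution that is sound throughout.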

What Comes Next

This technique fed into the broader “test-time compute” paradigm (Paper 23), where you allocate more computation at inference to pick better solutions. It is also often linked to reasoning-focused systems such as OpenAI o1 and to AlphaProof, which checks proof steps formally, although how much those systems rely on process supervision has not been publicly detailed.

Next paper: Paper 17: LLaMA — Open-Source Foundation Models
Previous paper: Paper 15: Training Language Models to Follow Instructions from Human Feedback (InstructGPT)
Related: Paper 14: Chain-of-Thought Prompting — CoT creates the steps that PRMs evaluate
Related: Paper 23: Scaling Test-Time Compute — builds on process supervision to scale reasoning

Key Takeaway

Process supervision gives a more reliable reward signal for reasoning than outcome supervision. When you want an AI to think well, don’t reward it only for the final answer; reward the intermediate steps too. This mirrors how a good teacher grades: by checking the working, not just the result.