The insight is simple but powerful: Grade the steps, not just the final answer.
What Is Process Supervision?
A process reward model (PRM) is trained to score individual steps in a multi-step solution, not the entire solution at once.
Here’s the workflow:
- Generate many solutions to a problem using a base LLM (e.g., GPT-4). Each solution is a chain of reasoning steps.
- Have human annotators mark each step. For each step in each solution, they answer: “Is this step mathematically correct, given the steps before it?” The answer is binary: yes (1) or no (0).
- Aggregate into a dataset. Collect these step-level labels. This becomes PRM800K: roughly 800,000 step-level labels across about 75,000 solutions to about 12,000 problems.
- Train a neural network (the PRM) to predict step correctness. The PRM learns to look at a partial solution (steps 1 through i) and predict whether step i+1 is correct.
- At inference, use the PRM for best-of-N selection. Generate N candidate solutions. For each solution, run it through the PRM to get per-step scores. Combine these scores (multiply them, take the minimum, or average) to get an overall solution score. Pick the solution with the highest score.
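The selection step at the end of this workflow can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the per-step scores are hard-coded, made-up numbers standing in for PRM outputs, and the names `aggregate` and `best_of_n` are hypothetical.

```python
import math
from typing import Dict, List


def aggregate(step_scores: List[float], method: str = "product") -> float:
    """Combine per-step correctness probabilities into one solution score."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if method == "product":
        return math.prod(step_scores)
    if method == "min":
        return min(step_scores)
    if method == "mean":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown method: {method}")


def best_of_n(candidates: Dict[str, List[float]], method: str = "product") -> str:
    """Return the name of the candidate with the highest aggregated score."""
    return max(candidates, key=lambda name: aggregate(candidates[name], method))


# Hypothetical per-step scores, as if a PRM had scored three candidates.
candidates = {
    "solution_1": [0.99, 0.98, 0.30, 0.95],  # one very shaky step
    "solution_2": [0.90, 0.90, 0.90, 0.90],  # uniformly solid
    "solution_3": [0.99, 0.99, 0.99, 0.70],  # strong start, weaker last step
}

for method in ("product", "min", "mean"):
    print(method, "->", best_of_n(candidates, method))
```

With these made-up scores, the product and the mean both pick `solution_3`, while the minimum picks `solution_2`. The choice of aggregation changes which solution wins: the minimum punishes a single weak step hardest, and the mean is the most forgiving.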
The PRM Architecture
In practice, the PRM is a transformer-based language model (similar to GPT-4 but smaller). It takes as input:
- The original problem
- The solution steps generated so far
- The candidate next step
And outputs a probability: “How likely is this step correct?”
This is trained like a binary classification task (correct/incorrect step) using standard ML loss functions, like cross-entropy loss.
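To make the training setup concrete, here is a rough sketch of how one labeled solution might be turned into training examples, and how cross-entropy scores a prediction. The helper names (`make_examples`, `binary_cross_entropy`), the toy problem, and the predicted probabilities are all assumptions for illustration; a real PRM would produce the probabilities from a transformer.

```python
import math
from typing import List, Tuple


def make_examples(
    problem: str, steps: List[str], labels: List[int]
) -> List[Tuple[str, str, int]]:
    """Turn one labeled solution into (context, candidate_step, label) triples.

    The context is the problem plus all steps before the candidate step.
    """
    if len(steps) != len(labels):
        raise ValueError("need exactly one label per step")
    examples = []
    for i, (step, label) in enumerate(zip(steps, labels)):
        context = "\n".join([problem] + steps[:i])
        examples.append((context, step, label))
    return examples


def binary_cross_entropy(p: float, label: int, eps: float = 1e-12) -> float:
    """Cross-entropy loss for one predicted probability p of 'step is correct'."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy solution: the last step is wrong (x should be 2).
problem = "Solve 2x + 6 = 10."
steps = ["2x = 4", "x = 4 / 2", "x = 3"]
labels = [1, 1, 0]
examples = make_examples(problem, steps, labels)

# Hypothetical model outputs: probability that each step is correct.
predicted = [0.9, 0.8, 0.4]
losses = [binary_cross_entropy(p, y) for p, (_, _, y) in zip(predicted, examples)]
mean_loss = sum(losses) / len(losses)
print(f"mean loss over {len(losses)} steps: {mean_loss:.4f}")
```

Each step contributes its own loss term, so a single solution produces as many training signals as it has steps.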
Outcome Supervision vs. Process Supervision: Side by Side
| Aspect | Outcome Supervision (ORM) | Process Supervision (PRM) |
|---|---|---|
| What gets labeled | Final answer only | Every intermediate step |
| Label type | Binary: correct or incorrect | Binary per step: correct or incorrect |
| Data source | “Is the final answer right?” | “Is each step correct?” |
| Information per solution | 1 bit | T bits (T = number of steps) |
| Noise in labels | High: final answer can be right by coincidence | Lower: step-level errors are exposed |
| Cost to build training data | Low: one label per solution | Higher: one label per step |
| Quality of trained model | Noisy, easily fooled | More robust, detects reasoning errors |
| Typical annotation time | ~1 minute per solution | ~5-10 minutes per solution (but richer signal) |
The Key Insight
Here’s the core intuition: Humans naturally read through solutions step by step. A human evaluator doesn’t teleport to the final answer — they follow the reasoning from the top. When they read step 3, they’re already thinking “is this consistent with steps 1 and 2?”
Outcome supervision throws away this rich human understanding and keeps only the final verdict. Process supervision captures what the human actually sees.
Moreover, step-level signals are more discriminative. Consider two solutions that both arrive at the wrong final answer:
- Solution A: steps 1, 2, 3, 4 are all correct, but step 5 is wrong → ORM: 0 (wrong answer)
- Solution B: steps 1, 2 are correct, steps 3, 4, 5 are all wrong → ORM: 0 (wrong answer)
Both get the same ORM score, but they’re very different in quality. Solution A is 80% correct reasoning; Solution B is 40%. A PRM can distinguish them: Solution A gets 4/5 steps correct; Solution B gets 2/5. This distinction matters if you later want to improve the model.
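The comparison between Solution A and Solution B can be written out directly. This is a small sketch using the step labels from the example above; `orm_label`, `fraction_correct`, and `first_error` are hypothetical helper names, and the fraction of correct steps is only one simple way to summarize step-level labels.

```python
from typing import List, Optional


def orm_label(final_answer_correct: bool) -> int:
    """Outcome supervision: one label for the whole solution."""
    return int(final_answer_correct)


def fraction_correct(step_labels: List[int]) -> float:
    """Process supervision summary: share of steps labeled correct."""
    return sum(step_labels) / len(step_labels)


def first_error(step_labels: List[int]) -> Optional[int]:
    """1-based index of the first incorrect step, or None if all are correct."""
    for i, label in enumerate(step_labels, start=1):
        if label == 0:
            return i
    return None


# Step labels from the example: 1 = correct, 0 = incorrect.
solution_a = [1, 1, 1, 1, 0]
solution_b = [1, 1, 0, 0, 0]

for name, labels in (("A", solution_a), ("B", solution_b)):
    print(
        f"Solution {name}: ORM={orm_label(False)}, "
        f"steps correct={fraction_correct(labels):.0%}, "
        f"first error at step {first_error(labels)}"
    )
```

Both solutions get an ORM label of 0, but the step labels separate them: A is 80% correct with its first error at step 5, while B is 40% correct with its first error at step 3.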
Indian Analogy: The Teacher vs. The Exam
Imagine two ways of teaching in a high school:
Outcome Supervision (ORM): The teacher gives a final exam. Students solve 10 problems. At the end of the day, the teacher only looks at the answer key: “Student A got 7 answers right, Student B got 7 answers right.” Both get the same grade. But maybe Student A understood calculus perfectly and lost three problems to small arithmetic slips; maybe Student B guessed on all 7 and happened to be right.
Process Supervision (PRM): The teacher assigns the same 10 problems but asks students to show all working. The teacher reads through each student’s work step by step. “Student A: problem 1, steps are correct; problem 2, step 2 is wrong; problem 3, all steps correct…” The teacher builds a detailed picture of each student’s understanding.
Later, if a student comes to the teacher saying “I don’t understand integration,” the PRM teacher can review: “In problems 5, 6, 7, your integration steps were correct, but in problem 8, you made a sign error. Let me explain…” The ORM teacher can only say “You got 7 right and 3 wrong. Study harder.”
The PRM teacher is more useful for learning. Similarly, a PRM is more useful for training LLMs.
The PRM800K Dataset
To validate this idea, OpenAI created a massive dataset:
- ~75,000 math problem solutions generated by GPT-4
- Problem source: the MATH benchmark (competition-level problems from contests such as AMC and AIME), roughly 12,000 problems, with a 500-problem subset held out for evaluation
- Solutions per problem: about 6 on average (~75,000 solutions over ~12,000 problems)
- Annotation: For each solution, human annotators marked each step as correct or incorrect, with a neutral option for ambiguous steps
- Total: ~800,000 step-level labels
This was a massive effort, but it proved the concept: step-level supervision at scale is feasible and useful.