Section 05

Worked Example: Step-by-Step Trace of ORM vs. PRM

Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling 2023

Let’s walk through a complete example from start to finish.

The Problem

A geometry problem from the MATH benchmark:

A rectangle has width 4 and height 3. What is the length of the diagonal?

Ground truth: Using the Pythagorean theorem, $d = \sqrt{4^2 + 3^2} = \sqrt{16 + 9} = \sqrt{25} = 5$.


Three Candidate Solutions Generated by GPT-4

The model is asked to solve this problem three times and produces three different solution attempts.

Solution 1: Clear and Correct

Step 1: A diagonal divides the rectangle into two right triangles.
Step 2: By Pythagorean theorem: d² = width² + height²
Step 3: d² = 4² + 3² = 16 + 9 = 25
Step 4: d = √25 = 5
Final Answer: 5

Solution 2: Correct but Verbose

Step 1: We need to find the diagonal of a rectangle with width=4 and height=3.
Step 2: The diagonal of a rectangle can be found using the Pythagorean theorem.
Step 3: Let me first check: is 4² + 3² = 16 + 9 = 25? Yes.
Step 4: So the diagonal is √25.
Step 5: Let me compute: √25 = 5. Let me verify: 5² = 25? Yes.
Step 6: Final answer is 5.
Final Answer: 5

Solution 3: Wrong Due to Arithmetic Error

Step 1: For a rectangle with width and height, the diagonal uses Pythagorean theorem.
Step 2: d² = 4² + 3² = 16 + 9 = 24  [ERROR: Should be 25]
Step 3: d = √24 ≈ 4.899
Step 4: Rounding to a nice number: d ≈ 5
Final Answer: 5 (approximately)

Human Annotation: Step-Level Labels

A human annotator goes through each solution and marks each step.

Solution 1 Annotations

| Step | Content | Correct? | Confidence |
|---|---|---|---|
| 1 | Diagonal divides rectangle into right triangles | ✓ | 0.99 |
| 2 | Pythagorean theorem d² = w² + h² | ✓ | 0.98 |
| 3 | Arithmetic: 16 + 9 = 25 | ✓ | 0.99 |
| 4 | √25 = 5 | ✓ | 0.99 |

Human notes: “Clear, direct, mathematically rigorous. No errors.”

Solution 2 Annotations

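| Step | Content | Correct? | Confidence |
|---|---|---|---|
| 1 | Problem setup | ✓ | 0.99 |
| 2 | Mentions Pythagorean theorem | ✓ | 0.98 |
| 3 | Verification: 16 + 9 = 25 | ✓ | 0.99 |
| 4 | d = √25 | ✓ | 0.99 |
| 5 | Verification: 5² = 25 | ✓ | 0.95 |
| 6 | Final answer stated | ✓ | 0.99 |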

Human notes: “All correct, but verbose and repetitive. More steps than necessary.”

Solution 3 Annotations

| Step | Content | Correct? | Confidence |
|---|---|---|---|
| 1 | Sets up Pythagorean theorem correctly | ✓ | 0.98 |
| 2 | Arithmetic: 4² + 3² = 24 | ✗ | 0.02 |
| 3 | d = √24 ≈ 4.899 | ✗ | 0.05 |
| 4 | Rounds to 5 | ✗ | 0.10 |

Human notes: “Step 2 contains an arithmetic error (16 + 9 = 25, not 24). This cascades through the rest of the solution.”
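The annotator's note pins the failure on one step, and everything after it inherits the mistake. A minimal sketch of locating that first bad step follows; the label list and the `first_error_step` helper are illustrative assumptions, not part of any published annotation tooling.

```python
from typing import List, Optional

# Hypothetical per-step labels for Solution 3 (True = step judged correct).
# Steps 3 and 4 are treated as incorrect because they build on the bad sum.
solution_3_labels = [True, False, False, False]

def first_error_step(labels: List[bool]) -> Optional[int]:
    """Return the 1-based index of the first incorrect step, or None if all pass."""
    for index, is_correct in enumerate(labels, start=1):
        if not is_correct:
            return index
    return None

print(first_error_step(solution_3_labels))          # 2
print(first_error_step([True, True, True, True]))   # None
```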


Scoring with Outcome Reward Model (ORM)

The ORM only looks at the final answer.

| Solution | Final Answer | Ground Truth | ORM Score |
|---|---|---|---|
| 1 | 5 | 5 | 1 |
| 2 | 5 | 5 | 1 |
| 3 | 5 (approx) | 5 | 1 |

ORM verdict: All three solutions are equally good. ORM score = 1 for all.

Critique: The ORM cannot distinguish Solution 1 (clean, rigorous) from Solution 2 (verbose) from Solution 3 (contains an error that happened to round back to the correct answer). From the ORM’s perspective, they’re all identical.
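To make the blind spot concrete, here is a minimal sketch of the simplified ORM used in this example: a binary match on the final answer. The `orm_score` function and its tolerance are assumptions for illustration (a trained ORM would output a learned probability instead), and Solution 3's "5 (approximately)" is assumed to parse as 5.0.

```python
def orm_score(final_answer: float, ground_truth: float, tol: float = 1e-9) -> int:
    """Return 1 if the final answer matches the ground truth, else 0.

    Only the final answer is inspected; the reasoning steps are never seen.
    """
    return 1 if abs(final_answer - ground_truth) <= tol else 0

# Final answers from the three candidates (Solution 3's rounded "5" assumed to be 5.0).
final_answers = {"Solution 1": 5.0, "Solution 2": 5.0, "Solution 3": 5.0}

orm_scores = {name: orm_score(answer, 5.0) for name, answer in final_answers.items()}
print(orm_scores)  # {'Solution 1': 1, 'Solution 2': 1, 'Solution 3': 1}
```

Had Solution 3 reported its unrounded 4.899, this scorer would have returned 0; the rounding in Step 4 is what hides the error.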


Scoring with Process Reward Model (PRM)

The PRM looks at each step and multiplies the per-step correctness probabilities.
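Written out for a solution $s$ with $n$ steps, where $p_i$ is the probability that step $i$ is correct, the scoring rule used in the calculations that follow is:

$$R_{\text{PRM}}(s) = \prod_{i=1}^{n} p_i$$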

Solution 1: Product Score

Per-step probabilities: $p_1 = 0.99, p_2 = 0.98, p_3 = 0.99, p_4 = 0.99$

$$R_{\text{PRM}}(\text{Solution 1}) = 0.99 \times 0.98 \times 0.99 \times 0.99 \approx 0.951$$

Solution 2: Product Score

Per-step probabilities: $p_1 = 0.99, p_2 = 0.98, p_3 = 0.99, p_4 = 0.99, p_5 = 0.95, p_6 = 0.99$

$$R_{\text{PRM}}(\text{Solution 2}) = 0.99 \times 0.98 \times 0.99 \times 0.99 \times 0.95 \times 0.99 \approx 0.894$$

Note: Solution 2 scores lower than Solution 1 even though all of its steps are correct. It has more steps (6 vs. 4), and every per-step probability below 1 shrinks the product. Most of the drop comes from Step 5, the verification step, which was rated at 0.95.
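The size of the gap can be read off directly, because Solution 2's first four probabilities are the same as Solution 1's and only $p_5$ and $p_6$ are new:

$$R_{\text{PRM}}(\text{Solution 2}) = R_{\text{PRM}}(\text{Solution 1}) \times p_5 \times p_6 = R_{\text{PRM}}(\text{Solution 1}) \times 0.95 \times 0.99 = 0.9405 \times R_{\text{PRM}}(\text{Solution 1})$$

So the two extra steps cost roughly 6% of the score.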

Solution 3: Product Score

Per-step probabilities: $p_1 = 0.98, p_2 = 0.02, p_3 = 0.05, p_4 = 0.10$

$$R_{\text{PRM}}(\text{Solution 3}) = 0.98 \times 0.02 \times 0.05 \times 0.10 = 0.000098$$

The score is tiny because Step 2 is marked incorrect (p₂ = 0.02), and multiplying by 0.02 makes the entire product collapse.
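The three products are easy to reproduce. The sketch below assumes nothing beyond the per-step probabilities listed above; the `prm_score` name is an illustrative choice, and `math.prod` requires Python 3.8 or later.

```python
import math

# Per-step correctness probabilities, copied from the annotations above.
step_probs = {
    "Solution 1": [0.99, 0.98, 0.99, 0.99],
    "Solution 2": [0.99, 0.98, 0.99, 0.99, 0.95, 0.99],
    "Solution 3": [0.98, 0.02, 0.05, 0.10],
}

def prm_score(probs):
    """Product of per-step probabilities (the scoring rule used in this example)."""
    return math.prod(probs)

prm_scores = {name: prm_score(probs) for name, probs in step_probs.items()}
for name, score in prm_scores.items():
    print(f"{name}: {score:.6f}")
# Solution 1: 0.950893
# Solution 2: 0.894315
# Solution 3: 0.000098
```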


Summary: PRM Scores

| Solution | ORM Score | PRM Score | Ranking |
|---|---|---|---|
| Solution 1 | 1.000 | 0.951 | 1st (best) |
| Solution 2 | 1.000 | 0.894 | 2nd |
| Solution 3 | 1.000 | 0.000098 | 3rd (worst) |

Best-of-N Selection Results

Now suppose we use each reward model to pick the best solution from these 3 candidates.

With ORM:

Scores: [1, 1, 1]
Best solution: Solution 1 (tie-break: first in list)
Verdict: ORM picked a correct solution, but only via the tie-break;
        Solutions 2 and 3 received the same score. ORM got lucky and
        cannot explain why Solution 1 is better.

With PRM:

Scores: [0.951, 0.894, 0.000098]
Best solution: Solution 1
Verdict: PRM clearly ranks solution 1 as best. It also distinguishes
         solution 2 (verbose but correct) from solution 3 (contains error).
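Best-of-N selection itself is just an argmax over the scores. The sketch below is one way to write it, with `best_of_n` and `has_tie_for_best` as hypothetical helper names; it relies on Python's `max` returning the first maximal element, which matches the "first in list" tie-break described above. The PRM scores are the products rounded to three significant digits.

```python
def best_of_n(scores):
    """Return the 0-based index of the highest score; ties go to the earliest entry."""
    return max(range(len(scores)), key=lambda i: scores[i])

def has_tie_for_best(scores):
    """True if more than one candidate shares the top score."""
    return scores.count(max(scores)) > 1

orm_scores = [1, 1, 1]
prm_scores = [0.951, 0.894, 0.000098]  # rounded per-step products

for name, scores in [("ORM", orm_scores), ("PRM", prm_scores)]:
    winner = best_of_n(scores) + 1  # convert to 1-based solution number
    print(f"{name}: picks Solution {winner}, tie for best = {has_tie_for_best(scores)}")
# ORM: picks Solution 1, tie for best = True
# PRM: picks Solution 1, tie for best = False
```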

The Deeper Insight

In this toy example, ORM and PRM both happen to pick Solution 1. But PRM’s choice is justified: Solution 1 is genuinely the best (cleanest reasoning, no errors, no verbosity). PRM “knows” this because it looked at the steps.

ORM’s choice is unjustified: ORM cannot explain why Solution 1 is better than Solutions 2 or 3 — it only knows they all have the right final answer.

Now scale this up: imagine 100 solutions. An ORM with binary scores will tie on every solution that reaches the correct final answer, while a PRM ranks those solutions by the quality of their reasoning. This is why a PRM gives a more useful signal, both for training and for selecting among candidates.
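The tie problem is easy to see on a synthetic pool. The sketch below builds 100 made-up candidates (the step counts, probabilities, and 80% correct-answer rate are arbitrary assumptions chosen for illustration, not measurements) and counts how many share the top score under each model.

```python
import math

# Synthetic pool of 100 candidates; all numbers here are illustrative assumptions.
candidates = []
for k in range(100):
    final_correct = (k % 5 != 0)       # 80 of 100 reach the right final answer
    n_steps = 4 + (k % 3)              # between 4 and 6 steps
    weakest = 0.50 + 0.005 * k         # probability assigned to the weakest step
    probs = [0.99] * (n_steps - 1) + [weakest]
    candidates.append((final_correct, probs))

orm = [1 if correct else 0 for correct, _ in candidates]
prm = [math.prod(probs) for _, probs in candidates]

tied_at_top_orm = orm.count(max(orm))
tied_at_top_prm = prm.count(max(prm))

print(tied_at_top_orm)  # 80 candidates share the top ORM score
print(tied_at_top_prm)  # 1 candidate holds the top PRM score
```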


Verification: Manual Check of All Arithmetic

Solution 1, Step 3:

  • 4² = 4 × 4 = 16 ✓
  • 3² = 3 × 3 = 9 ✓
  • 16 + 9 = 25 ✓

Solution 2, Step 3:

  • Same as above: 16 + 9 = 25 ✓

Solution 3, Step 2 (ERROR):

  • 4² = 16 ✓
  • 3² = 9 ✓
  • 16 + 9 = 25, NOT 24 ✗

Solution 3, Step 3:

  • √24 ≈ 4.899 ✓ (given the wrong input 24)
  • But the input is wrong, so this step is marked incorrect
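The same checks can be run mechanically. This is a small sketch using only Python's standard `math` module; the variable names are illustrative.

```python
import math

width, height = 4, 3

sum_of_squares = width**2 + height**2   # 16 + 9
print(sum_of_squares)                   # 25, so Solution 3's "24" is wrong
print(math.isqrt(sum_of_squares))       # 5 (exact integer square root)

sqrt_24 = round(math.sqrt(24), 3)
print(sqrt_24)                          # 4.899, the value in Solution 3, Step 3

diagonal = math.hypot(width, height)
print(diagonal)                         # 5.0
```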