Section 04

The Math: Formalizing ORM and PRM

Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling 2023

Prerequisite Tutorials


Outcome Reward Model (ORM)

The simplest reward model just judges the final answer:

$$R_{\text{ORM}}(x, y) = \begin{cases} 1 & \text{if } \text{final_answer}(y) = \text{correct answer} \ 0 & \text{otherwise} \end{cases}$$

Where:

  • $x$ is the problem (e.g., “Solve $2x + 3 = 11$”)
  • $y$ is the full solution (all steps + final answer)
  • The output is binary: 1 for correct, 0 for incorrect

Training the ORM: Given a dataset of problems $x_i$ and solutions $y_i$ with binary labels $\ell_i \in {0, 1}$ (is the solution correct?), we train a neural network $f_{\theta}$ to predict the correctness:

$$\mathcal{L}{\text{ORM}} = -\sum_i \left[ \ell_i \log f{\theta}(x_i, y_i) + (1 - \ell_i) \log(1 - f_{\theta}(x_i, y_i)) \right]$$

This is binary cross-entropy loss — standard binary classification.

Process Reward Model (PRM)

Now, instead of judging the full solution, we judge each step:

$$R_{\text{PRM}}(x, y) = \prod_{t=1}^{T} p_t$$

Where:

  • $y = (s_1, s_2, \ldots, s_T)$ is a solution decomposed into $T$ steps
  • $p_t = P(\text{step}t \text{ is correct} \mid x, s{1:t-1})$ is the probability that step $t$ is correct, given the problem and the steps before it
  • The overall score is the product of per-step correctness probabilities

Intuitively: if a solution has steps that are 95%, 88%, 72%, and 91% likely to be correct, the overall solution score is $0.95 \times 0.88 \times 0.72 \times 0.91 = 0.547$ (about 55% likely to be fully correct).

Alternative formulation (minimum score): Sometimes papers use:

$$R_{\text{PRM}}(x, y) = \min(p_1, p_2, \ldots, p_T)$$

The intuition: the solution is only as strong as its weakest step. If even one step is wrong, the whole solution fails.

Training the PRM: Given step-level labels, we train a neural network $g_{\theta}$ to predict step correctness:

$$\mathcal{L}{\text{PRM}} = -\sum_i \sum_t \left[ \ell{i,t} \log g_{\theta}(x_i, s_{i,1:t}) + (1 - \ell_{i,t}) \log(1 - g_{\theta}(x_i, s_{i,1:t})) \right]$$

Where $\ell_{i,t} \in {0, 1}$ is the label for step $t$ in solution $i$.

This is also binary cross-entropy, but applied to each step independently.

Best-of-N Selection

After training a reward model (ORM or PRM), we use it to select the best solution from $N$ candidates:

$$y^* = \arg\max_{i \in 1..N} R(x, y_i)$$

In other words:

  1. Sample $N$ solutions $y_1, y_2, \ldots, y_N$ from the LLM for problem $x$
  2. Score each solution using the reward model: $R(x, y_1), R(x, y_2), \ldots, R(x, y_N)$
  3. Return the solution with the highest score

Why this matters: The quality of best-of-N selection depends on the quality of the reward model. If the reward model is noisy (ORM), it picks the wrong solution. If it’s accurate (PRM), it consistently picks good solutions.


Worked Example 1: Computing PRM Scores

Problem: Solve $3x - 5 = 10$.

Solution:

Step 1: Add 5 to both sides: 3x = 15
Step 2: Divide by 3: x = 5
Step 3: Check: 3(5) - 5 = 15 - 5 = 10. Correct.

Human annotations: Each step is marked correct (1) or incorrect (0).

  • Step 1: Correct. PRM learns $p_1 = 0.98$ (high confidence)
  • Step 2: Correct. PRM learns $p_2 = 0.95$ (high confidence)
  • Step 3: Correct. PRM learns $p_3 = 0.92$ (high confidence)

PRM product score: $$R_{\text{PRM}} = p_1 \times p_2 \times p_3 = 0.98 \times 0.95 \times 0.92 = 0.8554$$

Minimum score: $$R_{\text{PRM (min)}} = \min(p_1, p_2, p_3) = \min(0.98, 0.95, 0.92) = 0.92$$

ORM score: $$R_{\text{ORM}} = 1 \quad \text{(final answer is correct)}$$


Worked Example 2: Comparing ORM and PRM on Different Solutions

Problem: Integrate $2x$ from $0$ to $2$.

Solution A (Good reasoning, right answer):

Step 1: Antiderivative of 2x is x²: [CORRECT]
Step 2: Evaluate at x=2: 2² = 4: [CORRECT]
Step 3: Evaluate at x=0: 0² = 0: [CORRECT]
Step 4: Result: 4 - 0 = 4: [CORRECT]

Per-step scores from PRM: $p_1 = 0.96, p_2 = 0.97, p_3 = 0.99, p_4 = 0.98$

$$R_{\text{PRM}}(A) = 0.96 \times 0.97 \times 0.99 \times 0.98 = 0.902$$ $$R_{\text{ORM}}(A) = 1 \quad \text{(final answer 4 is correct)}$$

Solution B (One arithmetic error, wrong final answer):

Step 1: Antiderivative of 2x is x²: [CORRECT]
Step 2: Evaluate at x=2: 2² = 4: [CORRECT]
Step 3: Evaluate at x=0: 0² = 0: [CORRECT]
Step 4: Result: 4 - 1 = 3: [WRONG — should be 4 - 0 = 4]

Per-step scores from PRM: $p_1 = 0.96, p_2 = 0.97, p_3 = 0.99, p_4 = 0.05$ (step 4 is wrong)

$$R_{\text{PRM}}(B) = 0.96 \times 0.97 \times 0.99 \times 0.05 = 0.046$$ $$R_{\text{ORM}}(B) = 0 \quad \text{(final answer 3 is incorrect)}$$

Analysis:

  • ORM: Both solutions get different scores (1 vs. 0), so ORM correctly ranks A > B.
  • PRM: PRM also ranks A > B, but with more nuance. PRM sees that A is 90% correct overall, while B is only 4.6% correct overall. Moreover, PRM pinpoints that Step 4 is the problem in B (score 0.05), which ORM cannot do.

Worked Example 3: The Case Where ORM Misleads

Problem: Solve $x^2 - 4 = 0$.

Solution C (Right answer, but lucky reasoning):

Step 1: x² = 4: [CORRECT]
Step 2: x = 2: [INCOMPLETE — should also mention x = -2, but it is mathematically a valid step]
Step 3: Verify: 2² - 4 = 0. Correct.: [CORRECT but incomplete — didn't check x = -2]

Per-step scores: $p_1 = 0.98, p_2 = 0.60$ (Step 2 is incomplete), $p_3 = 0.70$ (Step 3 is incomplete)

$$R_{\text{PRM}}(C) = 0.98 \times 0.60 \times 0.70 = 0.4116$$ $$R_{\text{ORM}}(C) = 1 \quad \text{(if you only mark “x=2” as the answer, it’s partially correct, but let’s say ORM gives 1 for “correct enough”)}$$

Solution D (Complete, rigorous reasoning, but longer):

Step 1: x² - 4 = 0 can be factored as (x-2)(x+2) = 0: [CORRECT]
Step 2: So x = 2 or x = -2: [CORRECT]
Step 3: Verify x = 2: 2² - 4 = 0. Correct.: [CORRECT]
Step 4: Verify x = -2: (-2)² - 4 = 0. Correct.: [CORRECT]

Per-step scores: $p_1 = 0.99, p_2 = 0.98, p_3 = 0.99, p_4 = 0.99$

$$R_{\text{PRM}}(D) = 0.99 \times 0.98 \times 0.99 \times 0.99 = 0.950$$ $$R_{\text{ORM}}(D) = 1 \quad \text{(correct final answer)}$$

The insight: ORM ranks C and D equally (both 1). But PRM strongly prefers D (0.950 vs. 0.4116), because D is more rigorous and complete. If you’re training an LLM to be mathematically precise, you want to reward D, not C.


Summary: Why Product Matters

The product formulation $R_{\text{PRM}} = \prod_t p_t$ has an important property:

  • If any step has low probability $p_t \approx 0$, the product becomes small
  • This makes solutions with even one weak step get lower scores
  • This discourages “1 error ruins everything” scenarios

The minimum formulation $R_{\text{PRM}} = \min_t p_t$ is more conservative:

  • The overall score is capped by the weakest step
  • If steps 1, 2, 4 are perfect but step 3 is 50% likely correct, the score is 0.50
  • This directly reflects that the solution is only as strong as its weakest link

In practice, the paper uses both, but the product score is more mathematically standard for probabilities.