Outcome Reward Model (ORM)

Appears in 2 papers

A model that scores only the final output (right or wrong), without evaluating intermediate steps.
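To make the definition concrete, here is a minimal sketch of outcome-only scoring used for best-of-N selection. It is an illustration under stated assumptions, not the method of either paper: `toy_orm_score` and `best_of_n` are hypothetical names, the "Answer: X" final-line format is invented for the example, and the rule-based scorer stands in for a learned model (a trained ORM predicts correctness without a reference answer at inference time).

```python
from typing import Callable, List


def toy_orm_score(solution: str, reference_answer: str = "42") -> float:
    """Hypothetical stand-in for a learned ORM.

    Assumes the solution ends with a line of the form "Answer: X".
    Only that final line is inspected; intermediate steps are ignored.
    """
    lines = solution.strip().splitlines()
    if not lines:
        return 0.0  # empty solution: nothing to score
    final_answer = lines[-1].rsplit(":", 1)[-1].strip()
    return 1.0 if final_answer == reference_answer else 0.0


def best_of_n(candidates: List[str], orm_score: Callable[[str], float]) -> str:
    """Return the candidate with the highest outcome score (first one on ties)."""
    return max(candidates, key=orm_score)


# Same final answer, but the second solution contains a faulty step.
correct_steps = "Step 1: 6 * 7 = 42\nAnswer: 42"
faulty_steps = "Step 1: 6 * 7 = 36\nAnswer: 42"
wrong_answer = "Step 1: 6 * 7 = 42\nAnswer: 41"

# The outcome-only scorer cannot tell the first two apart.
same_score = toy_orm_score(correct_steps) == toy_orm_score(faulty_steps)
chosen = best_of_n([wrong_answer, correct_steps], toy_orm_score)
```

Because the scorer looks only at the final line, a solution that reaches the right answer through a faulty step receives the same score as a fully correct one. A step-level scorer is meant to close that gap.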

As used in Paper 23 — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters →

A model that scores only the final output (right or wrong), without evaluating intermediate steps. Less informative than a process reward model (PRM), which scores each intermediate step, but simpler to train.

As used in Paper 24 — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking →

A model that scores only the final output (right or wrong), not intermediate steps. Less informative than a PRM, but simpler to train.