Further Reading — Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling

The Original Paper

Let’s Verify Step by Step (OpenAI, 2023)
Authors: Lightman, Kosaraju, Burda, Edwards, Baker, Lee, Leike, Schulman, Sutskever, Cobbe
arXiv: https://arxiv.org/abs/2305.20050
Venue: ICLR 2024

The foundational paper describing process reward models, the PRM800K dataset, and experimental results on the MATH benchmark. This is the core reference.

Paper 14: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
Link: Chain-of-Thought Prompting
Why read it: PRMs evaluate chain-of-thought steps. Understanding CoT is essential background for understanding what PRMs are evaluating.

Paper 15: Training Language Models to Follow Instructions from Human Feedback (Ouyang et al., 2022)
Link: InstructGPT / RLHF
Why read it: RLHF uses reward models. This paper explains how RLHF works and why better reward models (like PRMs) matter.

Paper 13: Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
Link: Chinchilla Scaling Laws
Why read it: Process supervision is independent of scaling laws, but understanding how to allocate compute efficiently provides context for why PRMs matter (better use of test-time compute).

Immediate Follow-Ups and Applications

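OpenAI o1: Learning to Reason with LLMs (OpenAI, 2024)
https://openai.com/index/learning-to-reason-with-llms/
OpenAI's announcement of o1, a model trained with large-scale reinforcement learning to produce a long chain of thought before answering. OpenAI has not published the training recipe, so any role for a PRM-like reward model is informed speculation rather than a documented application of this paper. The announcement reports roughly 95% accuracy on the MATH benchmark, versus roughly 60% for GPT-4o.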

AlphaProof: Solving IMO Problems with AI (DeepMind, 2024)
https://deepmind.google/research/publications/alphaproof/
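Trains with reinforcement learning against the Lean proof assistant, which formally checks every step of a proof. Together with AlphaGeometry 2, it solved four of the six problems at the 2024 International Mathematical Olympiad, a silver-medal-level score. A useful contrast with PRMs: here the step-level signal comes from a formal verifier rather than a learned reward model.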

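Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023)
Meta's Llama 2 technical report describes reward modeling for an openly released model, with separate helpfulness and safety reward models used in RLHF. Both score whole responses, so the report is best read as a detailed outcome-style baseline to compare against process supervision.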

Techniques and Extensions

Scalable Agent Alignment via Reward Modeling: A Research Direction (Leike et al., 2018)
arXiv: https://arxiv.org/abs/1811.07871
Foundational work on reward models in the context of RLHF. Sets up the framework that this paper builds on. Good for understanding what makes reward models hard.

Scaling Laws for Reward Model Overoptimization (Gao et al., 2023)
arXiv: https://arxiv.org/abs/2210.10760
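Measures how optimizing against a learned reward model, whether by RL or by best-of-N sampling, eventually lowers true performance as the policy exploits the model's errors. Relevant because a PRM is also an imperfect proxy, and this paper discusses how to detect when a model is gaming the reward signal rather than actually improving.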

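Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)
arXiv: https://arxiv.org/abs/2203.11171
Samples many reasoning chains and takes a majority vote over their final answers. This is the majority-voting baseline that PRM-based reranking is compared against, so it shows what step-level scores add beyond agreement on the final answer.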

Datasets and Benchmarks

MATH Dataset (Hendrycks et al., 2021)
https://github.com/hendrycks/math
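The benchmark used to evaluate PRMs in this paper. 12,500 challenging competition-level math problems (7,500 training, 5,000 test); the paper reports results on a held-out subset of 500 test problems. Essential for understanding the experimental setup.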

PRM800K Dataset (OpenAI, 2023)
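Released alongside the paper on OpenAI's GitHub; check the repository for license terms.
The step-level annotation dataset built on MATH: about 800,000 step-level correctness labels on roughly 75,000 model-generated solutions to 12,000 problems. Starting point for PRM training in many follow-up projects.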

Gemini 1.5 Reasoning Benchmark
Google’s benchmark for reasoning tasks, which uses process-level evaluation similar to this paper’s approach.

Code and Implementation Resources

OpenAI’s PRM Repository
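GitHub: https://github.com/openai/prm800k
Hosts the PRM800K data together with labeler instructions and answer-grading and evaluation scripts. It does not appear to include PRM training code or model weights, so treat it as a data and evaluation reference rather than a full training implementation.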

Hugging Face: Reward Modeling for LLMs
Multiple implementations of reward models (both outcome and process) available on Hugging Face.
https://huggingface.co/
Search for “reward model” or “mathematical reasoning” to find community implementations.

TransformerLens
A mechanistic interpretability library that can be used to understand what PRMs are learning about reasoning steps.
https://github.com/TransformerLensOrg/TransformerLens

Blog Posts and Discussions

“Process Supervision: Why Rewarding Steps Matters” (lesswrong.com, 2023)
Community discussion of the implications of process supervision for AI alignment and reasoning.

“OpenAI o1 and the Future of Reasoning Models” (Various AI blogs, 2024)
Multiple blog posts analyzing o1 and its connection to this paper’s insights.

“Step-by-Step Reasoning in Language Models” (Anthropic Blog)
Anthropic’s perspective on reasoning, process supervision, and alignment.

What to Read Next

Paper 17: LLaMA — Open and Efficient Foundation Language Models (Touvron et al., 2023)
Link: LLaMA
Why next: Introduces an open model family that is often used as the base model in open PRM work. Shows the broader context of scaling language models.

Paper 23: Scaling Test-Time Compute (Future Paper)
Expected topic: Using process supervision to efficiently allocate compute at inference time (best-of-N, beam search with step-level evaluation).
Why: Direct follow-up to this paper. Explores the implications of process supervision for inference-time efficiency.

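Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
Explores feedback from a different angle: the model critiques and revises its own outputs against written principles, and AI-generated preference labels replace human ones. The feedback is principle-based rather than step-level, which complements this paper's approach.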
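
The step-level beam search mentioned under Paper 23 can be sketched as follows. This is an illustration of the idea only, not a method taken from the paper, whose experiments use best-of-N reranking. The callables `propose_steps`, `score_step`, and `is_complete`, and the toy problem data, are hypothetical stand-ins for a generator and a PRM.

```python
import math
from typing import Callable, List, Sequence, Tuple

Partial = Tuple[str, ...]  # reasoning steps written so far


def step_beam_search(
    propose_steps: Callable[[Partial], Sequence[str]],  # hypothetical generator
    score_step: Callable[[Partial, str], float],        # hypothetical PRM: P(step is correct)
    is_complete: Callable[[Partial], bool],             # hypothetical stop test
    beam_width: int = 2,
    max_steps: int = 8,
) -> Partial:
    """Keep the `beam_width` best partial solutions after every step."""
    beams: List[Tuple[float, Partial]] = [(0.0, ())]  # (sum of log-probs, steps)
    for _ in range(max_steps):
        expanded: List[Tuple[float, Partial]] = []
        for log_score, partial in beams:
            if partial and is_complete(partial):
                expanded.append((log_score, partial))  # carry finished solutions forward
                continue
            for step in propose_steps(partial):
                p = max(score_step(partial, step), 1e-12)  # avoid log(0)
                expanded.append((log_score + math.log(p), partial + (step,)))
        if not expanded:
            break
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
        if all(is_complete(partial) for _, partial in beams):
            break
    return beams[0][1]


# Toy problem (made-up data): solve 2x = 8, with one wrong first step on offer.
TOY_TREE = {
    (): ["2x = 8", "2x = 12"],
    ("2x = 8",): ["x = 4"],
    ("2x = 12",): ["x = 6"],
}
TOY_PROBS = {"2x = 8": 0.9, "2x = 12": 0.2, "x = 4": 0.95, "x = 6": 0.9}


def toy_propose(partial: Partial) -> Sequence[str]:
    return TOY_TREE.get(partial, [])


def toy_score(partial: Partial, step: str) -> float:
    return TOY_PROBS[step]


def toy_done(partial: Partial) -> bool:
    return partial[-1].startswith("x =")


result = step_beam_search(toy_propose, toy_score, toy_done)
print(result)  # ('2x = 8', 'x = 4')
```

Summing log-probabilities is equivalent to multiplying per-step probabilities, so a single low-scoring step pushes a partial solution out of the beam before more compute is spent extending it.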

Videos and Talks

OpenAI Research Talks (YouTube)
OpenAI researchers have given talks on reward modeling and process supervision. Search “OpenAI process reward” on YouTube.

DeepMind AlphaProof Talk (DeepMind YouTube Channel)
Discussion of how process supervision enables formal reasoning at scale.

Key Takeaways for Further Learning

Outcome vs. Process: The core insight (reward intermediate steps, not just final answers) applies beyond math. Look for other domains where this might help.
Generalization Questions: The field is still exploring whether PRMs work for reasoning outside math (medical, legal, commonsense reasoning). This is an open research area.
Data Efficiency: PRMs use richer annotations (step-level) but on fewer solutions. Compare this trade-off to other approaches (outcome supervision on more solutions, or few-shot learning).
Interpretability: PRMs can provide step-level feedback, which may be useful for interpretability and explanation (beyond just performance). This is relevant for AI safety and alignment.
Test-Time Scaling: In the paper's best-of-N experiments, the PRM is a stronger reranker than the ORM or majority voting, and its advantage grows as N increases. As compute becomes cheaper, best-of-N with PRMs becomes more attractive.
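
To make the best-of-N takeaway concrete, here is a minimal sketch of PRM-based reranking. It assumes a solution is scored as the product of its per-step correctness probabilities, which is how the paper describes turning step scores into a solution score. The names `prm_solution_score`, `best_of_n`, and `toy_step_scorer`, and the probabilities in the lookup table, are hypothetical; a real PRM would replace the lookup.

```python
from math import prod
from typing import Callable, List, Sequence

Solution = List[str]  # one string per reasoning step


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Probability that every step is correct: product of per-step probabilities."""
    return prod(step_probs)


def best_of_n(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], Sequence[float]],  # hypothetical PRM interface
) -> Solution:
    """Return the candidate with the highest PRM solution score."""
    return max(candidates, key=lambda sol: prm_solution_score(score_steps(sol)))


# Toy stand-in for a PRM: made-up per-step probabilities, 0.5 when unknown.
TOY_STEP_PROBS = {"x + 3 = 7": 0.99, "x = 4": 0.98, "x = 10": 0.05}


def toy_step_scorer(solution: Solution) -> List[float]:
    return [TOY_STEP_PROBS.get(step, 0.5) for step in solution]


candidates = [
    ["x + 3 = 7", "x = 10"],  # second step is wrong
    ["x + 3 = 7", "x = 4"],
]
best = best_of_n(candidates, toy_step_scorer)
print(best)  # ['x + 3 = 7', 'x = 4']
```

Because the score is a product, one low-probability step is enough to sink a candidate, which is the behavior that distinguishes step-level scoring from judging only the final answer.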

Discussion Questions for Study Groups

How would you extend process supervision to a domain like medical diagnosis? What are the obstacles?
Compare the data requirements: ORM needs N outcome labels; PRM needs ~T×N/k step-level labels (where T is avg steps, k is problems sampled). When is PRM more efficient?
Can PRMs be gamed? If a model learns to write steps that look correct but are subtly wrong, would a PRM catch it?
Why might process supervision be better for alignment? Does knowing the reasoning steps help detect harmful outputs?
What happens if you train a PRM on solutions from one domain (math) and test it on another (coding)? Does it generalize?