Further Reading: RLHF and InstructGPT
Dive deeper into alignment, RLHF variants, and the products built on this paper.
The Original Paper
Training Language Models to Follow Instructions with Human Feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
NeurIPS 2022 | March 2022
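The foundational paper. Presents the three-stage RLHF pipeline (supervised fine-tuning, reward model training, and PPO against the reward model), shows that labelers prefer outputs from a 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3, and introduces InstructGPT. Essential reading for understanding modern aligned LLMs.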
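To make the third stage concrete, here is a minimal sketch of the KL-penalized reward that the RL step optimizes: the reward model's score minus a penalty for drifting away from the supervised (SFT) model. The function name, argument names, and the default `beta` are assumptions made for illustration, not the paper's code.

```python
def kl_penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Reward-model score minus a penalty for drifting from the SFT model.

    policy_logprobs / sft_logprobs: per-token log-probabilities of the SAME
    sampled response under the RL policy and under the frozen SFT model.
    beta: penalty strength (illustrative value, an assumption of this sketch).
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("both models must score the same tokens")
    # log(pi_RL(y|x) / pi_SFT(y|x)), summed over the response tokens
    log_ratio = sum(policy_logprobs) - sum(sft_logprobs)
    return rm_score - beta * log_ratio
```

When the policy assigns the response the same probability as the SFT model, the penalty vanishes and the reward equals the reward-model score. The more the policy inflates a response's probability relative to the SFT model, the more reward it gives up.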
Foundational Work on Preference Learning
Learning from Human Preferences: The Original Idea
Deep Reinforcement Learning from Human Preferences
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
NeurIPS 2017 | June 2017
An early and influential paper on training RL agents from human feedback. Predates this paper by 5 years but uses the same core insight: humans can provide preference comparisons, and RL can optimize based on them. Much smaller scale (Atari games and simulated robot locomotion), but the conceptual foundation.
Why read it: Understand the original vision and see how the idea scaled from games to language models.
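The core insight above, turning pairwise human comparisons into a training signal, is usually written as a Bradley-Terry style loss on a learned reward function. The sketch below assumes scalar rewards are already available for the two candidates; the function names are hypothetical and chosen for illustration.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def preference_probability(reward_a, reward_b):
    """Modelled probability that a human prefers A over B: sigmoid(r_A - r_B)."""
    return math.exp(-softplus(reward_b - reward_a))

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of the human's choice under that model."""
    # -log(sigmoid(d)) == softplus(-d), with d = r_preferred - r_rejected
    return softplus(-(reward_preferred - reward_rejected))
```

Equal rewards give a preference probability of 0.5 and a loss of log 2. The loss shrinks as the reward model scores the human-preferred candidate further above the rejected one.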
Fine-Tuning Language Models from Human Preferences
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving
arXiv 2019 | September 2019
Earlier application of preference learning to language models: fine-tunes a 774M-parameter GPT-2 on stylistic continuation and summarization tasks. Smaller scale, but it demonstrates that the concept works for text. This paper (InstructGPT) scales it dramatically.
Why read it: See the precursor; understand how the technique evolved.
Key Follow-Ups: Improving RLHF
Constitutional AI: AI Feedback Instead of Human Feedback
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al.
Anthropic | December 2022
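Key innovation: Instead of humans rating outputs for harmlessness, a language model critiques, revises, and compares responses against a written set of principles (the "constitution").
Why relevant: Addresses the scalability problem of RLHF (human annotation is expensive) by replacing human harmlessness labels with AI-generated ones. Anthropic uses CAI in training Claude.
Results: The paper reports that models trained with AI feedback are rated as more harmless than RLHF baselines at comparable helpfulness, and are less evasive when declining harmful requests.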
Direct Preference Optimization (DPO)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Stanford | May 2023
Key innovation: Eliminate the separate reward model training stage. Train the policy directly on preference pairs using a closed-form objective.
Advantages:
- Simpler (2 stages instead of 3)
- More stable (fewer hyperparameters)
- Matches or exceeds RLHF performance
- Faster training
Why relevant: DPO is faster and simpler than PPO-based RLHF while achieving comparable results in the paper's experiments. Many open models, such as Zephyr, use DPO in place of PPO.
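The DPO objective for a single preference pair can be sketched in a few lines. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the trainable policy and a frozen reference model; the argument names and the default `beta` are illustrative assumptions.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # Implicit rewards: how far the policy has moved from the reference model
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

No reward model appears anywhere: the log-probability shifts relative to the reference model play that role. At initialization, when the policy equals the reference, the loss is log 2 for every pair. It falls as the policy raises the chosen response's probability relative to the rejected one.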
ORPO: Odds Ratio Preference Optimization
ORPO: Monolithic Preference Optimization without Reference Model
Jiwoo Hong, Noah Lee, James Thorne
arXiv | March 2024
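Key innovation: Removes the reference model entirely and folds preference optimization into supervised fine-tuning by adding an odds-ratio penalty on rejected responses to the standard SFT loss.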
Why relevant: Even simpler than DPO, showing the direction of optimization: from complex pipelines (RLHF) to streamlined end-to-end methods.
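A rough sketch of the ORPO objective for one preference pair follows. It assumes each response is summarized by its length-normalized (average per-token) log-probability under the single model being trained. The function names and the weight `lam` are assumptions for illustration, not values taken from the paper.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_odds(avg_logp):
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    if avg_logp >= 0.0:
        raise ValueError("average log-probability must be negative")
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    sft_term = -avg_logp_chosen  # ordinary negative log-likelihood
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = softplus(-log_odds_ratio)  # -log(sigmoid(log odds ratio))
    return sft_term + lam * odds_ratio_term
```

Only one model's log-probabilities appear, which is the point: there is no reference model and no separate SFT stage. The first term teaches the chosen response, and the second pushes its odds above those of the rejected response.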
Related Work on Alignment
On the Measurement and Control of Bias in Language Generation
On the Measurement and Mitigation of Unintended Bias in Text Generation
Su Lin Blodgett, Solon Barocas, Hal Daumé III, Suresh Venkatasubramanian
AIES 2020
Addresses bias in language models and measurement challenges in alignment.
Learning to Summarize with Human Feedback
Learning to summarize from human feedback
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
OpenAI | NeurIPS 2020
Earlier OpenAI work applying preference learning to summarization (before InstructGPT). Shows the technique works for specific tasks.
Products and Deployments
ChatGPT: Bringing InstructGPT to Millions
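ChatGPT (launched November 2022, about nine months after this paper) is described by OpenAI as a sibling model to InstructGPT: it was fine-tuned from a GPT-3.5 series model with the same RLHF methods, with slight differences in data collection.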
Why relevant: See how the paper’s techniques became the world’s most popular AI product.
Claude: Anthropic’s RLHF Alternative
Claude is trained with Constitutional AI, a variant of RLHF that replaces human harmlessness labels with AI feedback (RLAIF) and therefore scales better.
Why relevant: See how Constitutional AI improves on RLHF’s data cost problem.
GPT-4 with Improved RLHF
GPT-4 Technical Report
OpenAI | March 2023
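Reports that GPT-4 was fine-tuned with RLHF and adds rule-based reward models (RBRMs) as an extra safety signal during that stage, though it discloses few other training details. Shows iteration and refinement of the InstructGPT approach.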
LLaMA-2-Chat: Open-Source RLHF
Llama 2: Open Foundation and Fine-Tuned Chat Models
Meta | July 2023
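Demonstrates RLHF applied to openly released models. Details the preference data collection, the separate helpfulness and safety reward models, and iterative fine-tuning with rejection sampling and PPO.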
Why relevant: Shows RLHF is a general technique, not specific to OpenAI models.
Deeper Dives: Theory and Challenges
Reward Model Uncertainty and Distributional Shift
Reward Modeling for Faster Actual-Outcome Prediction in Reinforcement Learning
Daniel Dewey, et al.
Explores theoretical properties of reward models and distributional shift — a key challenge mentioned in this paper’s limitations.
Mechanistic Interpretability of Alignment
Interpretability in the Wild: Circuit Discovery, Reverse Engineering, and Distillation in the WILD
Various (MIRI, Anthropic, etc.)
Investigates how alignment objectives get encoded in neural networks.
Scaling Alignment
Beyond Preference Learning: Debate and Recursive Oversight
Scalable oversight of AI systems by humans using generative models
Paul Christiano, et al.
Explores how to extend preference learning to more complex forms of human feedback (debate, recursive oversight). Relevant for aligning more capable models.
Benchmarks and Evaluation
Towards Human-Level Performance on Automatic GLUE Score Prediction
Human-Level Performance in Large Language Models on Instruction-Following Tasks
Measures instruction-following quality (what InstructGPT improved).
TruthfulQA: Measuring Factuality in QA
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, Owain Evans
ACL 2022 | September 2021
Benchmark of 817 questions across 38 categories for measuring truthfulness, one dimension of alignment.
Implementation and Tools
Hugging Face TRL: Transformer Reinforcement Learning
Library with trainers for the RLHF stages (supervised fine-tuning, reward modeling, PPO) and for DPO. Handles much of the engineering complexity.
Why use it: If you’re implementing RLHF, TRL does the heavy lifting.
DeepSpeed-Chat: Scalable RLHF
Microsoft’s framework for distributed RLHF training. Handles multi-GPU/multi-node scaling.
TensorFlow RL Suite
A TensorFlow-based alternative to the PyTorch tooling above for RL implementation.
Safety and Alignment Research
Center for AI Safety
Ongoing research on scalable oversight, value learning, and alignment techniques.
Anthropic Research
Extensive research on Constitutional AI, interpretability, and alignment scaling.
Open Questions and Future Directions
What Remains Unsolved
- Scalable oversight: How do we stay in control of superhuman models?
- Value learning: Can models learn complex human values beyond preferences?
- Adversarial robustness: Can aligned models be tricked into misalignment?
- Multi-objective alignment: How do we balance safety with capability?
Papers on These Questions
The Alignment Problem: Machine Learning and Human Values
Brian Christian | 2020 | Book
Comprehensive overview of alignment challenges and solutions.
AI Safety and Reproducibility: Case Studies and Suggestions
Liane Lovitt, et al.
Recent work on reproducibility in alignment research.
Quick Reference: RLHF Evolution (2017–2025)
2017 Jun: Learning from Human Preferences (Christiano et al.)
↓
2019 Sep: Fine-Tuning Language Models from Human Preferences (Ziegler et al.)
↓
2020 Sep: Learning to Summarize from Human Feedback (Stiennon et al.)
↓
2022 Mar: InstructGPT / RLHF (this paper) ← You are here
↓
2022 Nov: ChatGPT launches
↓
2022 Dec: Constitutional AI (Bai et al.)
↓
2023 Mar: GPT-4 with improved RLHF
↓
2023 May: DPO (Direct Preference Optimization) - simpler alternative
↓
2024 Mar: ORPO - even simpler
↓
2025+: Continued refinement and new approaches
Key Papers to Read in Order
- This paper: InstructGPT — Foundation
- Constitutional AI — Scalable feedback (RLAIF)
- DPO — Simpler pipeline
- ORPO — Further simplification
- ChatGPT Blog Post — Product deployment
Then, depending on interest:
- Alignment: Read CAIS and Anthropic papers on scalable oversight
- Safety: Read papers on adversarial robustness and value learning
- Implementation: Work through HuggingFace TRL tutorials
Resources for Learning
Free Courses
- Fast.ai: Practical Deep Learning — Includes RLHF chapters (newer versions)
- Stanford CS224N — NLP course, covers alignment
Blogs and Tutorials
- HuggingFace Blog: RLHF — Clear tutorial with code
- DeepLearning.AI: Short Course on RLHF — If available
Textbooks
- Reinforcement Learning: An Introduction — Sutton & Barto | RL fundamentals