Reinforcement Learning from Human Feedback (RLHF)
A three-stage training pipeline for aligning language models: (1) Supervised Fine-Tuning on human demonstrations, (2) training a Reward Model on human preference comparisons, (3) using Reinforcement Learning (PPO) to optimize the policy aga
A three-stage training pipeline for aligning language models: (1) Supervised Fine-Tuning on human demonstrations, (2) training a Reward Model on human preference comparisons, (3) using Reinforcement Learning (PPO) to optimize the policy against the reward model. This paper is the seminal work on RLHF at scale.