Reinforcement Learning from Human Feedback (RLHF)

Appears in 1 paper

As used in Paper 15 — Training Language Models to Follow Instructions with Human Feedback →

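A three-stage training pipeline for aligning language models with human intent: (1) Supervised Fine-Tuning (SFT) of a pretrained model on human demonstrations, (2) training a Reward Model (RM) on human preference comparisons between model outputs, and (3) using Reinforcement Learning, specifically Proximal Policy Optimization (PPO), to optimize the fine-tuned policy against the reward model. This paper is the seminal work on applying RLHF to language models at scale.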
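
The two learned objectives in stages (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `pairwise_rm_loss` and `kl_shaped_reward` and the default `beta` value are hypothetical, the rewards and log-probabilities are plain floats rather than model outputs, and the KL term is applied to whole-sequence log-probabilities for simplicity. The pairwise log-sigmoid loss is a common way to train a reward model from comparisons, and penalizing divergence from a reference policy is a common way to shape the reward that PPO maximizes.

```python
import math


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 (sketch): loss for one human preference comparison.

    Computes -log(sigmoid(r_chosen - r_rejected)), which is small when the
    reward model scores the human-preferred output higher than the other one.
    """
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_shaped_reward(
    rm_score: float,
    logp_policy: float,
    logp_ref: float,
    beta: float = 0.02,  # hypothetical penalty weight, for illustration only
) -> float:
    """Stage 3 (sketch): reward signal passed to the RL optimizer.

    Subtracts beta * (log pi(y|x) - log pi_ref(y|x)) from the reward model
    score, so the policy is discouraged from drifting far from the
    reference (SFT) model while it chases a higher reward.
    """
    return rm_score - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    # The reward model agrees with the human label: lower loss.
    print(pairwise_rm_loss(2.0, 0.0))
    # The reward model disagrees with the human label: higher loss.
    print(pairwise_rm_loss(0.0, 2.0))
    # Policy assigns more probability to the sample than the reference does,
    # so the shaped reward is slightly below the raw reward model score.
    print(kl_shaped_reward(1.0, logp_policy=-4.0, logp_ref=-5.0, beta=0.1))
```

In a full pipeline these scalars would come from neural networks and the shaped reward would feed a PPO update; the sketch only shows how the preference data and the reward model enter the objective.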