The Idea: Three-Stage Alignment with Human Feedback
Overview: The RLHF Pipeline
The paper’s core innovation is a three-stage pipeline:
- Supervised Fine-Tuning (SFT): Learn from human examples
- Reward Model (RM): Learn what humans prefer
- Reinforcement Learning (RL): Optimize using the reward model
Each stage solves a different sub-problem of alignment.
Stage 1: Supervised Fine-Tuning (SFT)
What It Does
Take GPT-3 (trained on internet text) and fine-tune it on human-written examples of good behavior.
Example:
Prompt: "Explain photosynthesis for a 10-year-old."
Human-written ideal response:
"Plants are like tiny factories. They take in sunlight, water, and air,
and make food for themselves. It's like they eat sunshine! The green
part of the leaf (called chlorophyll) catches the sunlight and
uses the energy to combine water and air into sugar, which the plant
uses to grow. Oxygen (the air we breathe) is a leftover product!"
We fine-tune GPT-3 on thousands of such (prompt, ideal_response) pairs.
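In training terms, fine-tuning on a (prompt, ideal_response) pair usually means minimizing the negative log-likelihood of the human-written tokens. The sketch below is illustrative only: the function name `sft_loss`, the per-token probabilities, and the choice to exclude prompt tokens from the loss are assumptions, not details stated in the text.

```python
import math

def sft_loss(token_probs, response_mask):
    """Average negative log-likelihood over response tokens only.

    token_probs[i]   : probability the model assigned to the i-th reference token
    response_mask[i] : 1 if token i belongs to the ideal response, 0 if it is prompt
    """
    nll = [-math.log(p) for p, m in zip(token_probs, response_mask) if m]
    return sum(nll) / len(nll)

# Hypothetical probabilities for a 3-token prompt followed by a 4-token response.
probs = [0.9, 0.8, 0.7, 0.5, 0.25, 0.5, 1.0]
mask = [0, 0, 0, 1, 1, 1, 1]
loss = sft_loss(probs, mask)
print(round(loss, 4))
```

The loss falls as the model assigns higher probability to the human-written response, which is all that "fine-tune on examples" means at this stage.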
Why This Stage Matters
- Quick start: GPT-3 already knows a lot; we’re just steering it toward better behavior
- Cheap: Fine-tuning is much cheaper than retraining from scratch
- Bootstrapping: The RM and RL stages need a decent starting point, which SFT provides
The Limitation
We can’t write enough examples to cover all cases. If we fine-tune the SFT model for too long, it overfits to the specific examples and loses generality.
Solution: Use SFT as a warm start. Then train the reward model to generalize preferences.
Stage 2: Reward Model (RM): Learning Human Preferences
What It Does
Train a classifier that predicts: “Which of two outputs is better?”
Example:
Prompt: "What is 2+2?"
Output A: "2+2=4"
Output B: "2+2=5"
Human labels: A is better (obviously).
Reward Model learns: r(A) > r(B)
But most cases are subtler:
Prompt: "Tell me a joke."
Output A: "Why did the chicken cross the road?
To get to the other side. It's a classic!"
Output B: "A man walks into a bar. Ouch!"
Human rater: Hmm, A is more complete but B is more original.
A might get labeled as preferred 60% of the time.
Reward Model learns to assign higher scores to A.
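How much higher should the score be? Under the logistic (Bradley-Terry) model described in the next subsection, a label frequency maps to a reward gap through the log-odds. The snippet below is a small sketch of that arithmetic; the helper name `implied_reward_gap` and the 99% figure for the arithmetic prompt are hypothetical.

```python
import math

def implied_reward_gap(p_a_preferred):
    """Reward gap r(A) - r(B) at which sigmoid(gap) equals the label frequency."""
    return math.log(p_a_preferred / (1.0 - p_a_preferred))

gap_joke = implied_reward_gap(0.60)  # subtle case: raters split 60/40
gap_math = implied_reward_gap(0.99)  # near-unanimous case (hypothetical rate)
print(round(gap_joke, 3), round(gap_math, 3))
```

A 60/40 split implies only a small gap, while near-unanimous agreement implies a large one, so the reward model encodes how strongly raters agree, not just which output won.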
How It Works: Bradley-Terry Model
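The reward model outputs a single scalar score $r_\theta(x, y)$ for a prompt $x$ and a response $y$. The score has no absolute meaning on its own; only the difference between two scores matters. The Bradley-Terry model treats that difference as the log-odds of one response being preferred:
For a prompt $x$ and a pair of responses $(y_A, y_B)$, the modeled probability that A is preferred is:
$$P(A \text{ preferred} \mid x, y_A, y_B) = \sigma(r_\theta(x, y_A) - r_\theta(x, y_B))$$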
Where $\sigma$ is the sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Loss function:
$$L = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$
Where $y_w$ is the preferred response and $y_l$ is the less-preferred response.
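The two formulas above translate almost directly into code. The following is a minimal sketch in plain Python; the function names are illustrative, and the loss is written in its softplus form purely for numerical stability (an implementation choice, not something the text specifies).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return sigmoid(r_a - r_b)

def pairwise_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l), computed as softplus(-(r_w - r_l))."""
    d = r_w - r_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

print(preference_prob(2.0, 0.0), pairwise_loss(2.0, 0.0))
```

Adding the same constant to both scores changes neither function, which is why the reward scale is only defined up to a shift.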
Interpretation:
- If $r(y_w) \gg r(y_l)$, the sigmoid is close to 1, so $-\log \sigma \approx 0$: low loss (good prediction)
- If $r(y_w) \ll r(y_l)$, the sigmoid is close to 0, so $-\log \sigma \to \infty$: high loss (bad prediction)
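A few worked values make the interpretation concrete. Writing $\Delta = r_\theta(x, y_w) - r_\theta(x, y_l)$ for the reward gap (a shorthand introduced here, not in the original), the loss and its gradient with respect to the gap are:

$$L(\Delta) = -\log \sigma(\Delta), \qquad \frac{\partial L}{\partial \Delta} = \sigma(\Delta) - 1$$

$$\Delta = 3: \quad \sigma(\Delta) \approx 0.953, \quad L \approx 0.049, \quad \frac{\partial L}{\partial \Delta} \approx -0.047$$

$$\Delta = 0: \quad \sigma(\Delta) = 0.5, \quad L = \log 2 \approx 0.693, \quad \frac{\partial L}{\partial \Delta} = -0.5$$

$$\Delta = -3: \quad \sigma(\Delta) \approx 0.047, \quad L \approx 3.049, \quad \frac{\partial L}{\partial \Delta} \approx -0.953$$

The gradient is largest in magnitude when the pair is ranked the wrong way round, so training effort concentrates on the comparisons the model currently gets wrong.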
Why Bradley-Terry?
The Bradley-Terry model comes from the statistics of paired comparisons. It fits this problem well:
- It doesn’t require absolute scores (which would require more annotation)
- It only requires comparisons: “A or B?”
- Humans are better at relative judgments than absolute ratings
- Comparisons are cheaper to collect at scale
The Generalization Power
Once trained, the RM can score any (prompt, response) pair, not just the ones it was trained on. This is crucial because:
- We can’t manually rate every possible (prompt, response) pair
- The RL stage will generate novel responses the RM hasn’t seen
- A well-trained RM can assign sensible scores to novel responses, although imperfectly, which is a weakness Stage 3 has to guard against
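Generalization can be illustrated with a toy reward model. The sketch below is a deliberately tiny stand-in, not the paper's setup: a real reward model is a large neural network over text, whereas here each response is reduced to two hypothetical hand-made features, `[is_correct, is_polite]`, and the reward is linear in them. It is trained on three comparisons and then ranks a pair it never saw.

```python
import math

def score(w, feats):
    """Linear reward: dot product of weights and response features."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def train_rm(pairs, dim, lr=0.5, epochs=200):
    """Fit weights by SGD on the pairwise loss -log sigmoid(r_w - r_l)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for f_w, f_l in pairs:
            d = score(w, f_w) - score(w, f_l)
            grad_d = 1.0 / (1.0 + math.exp(-d)) - 1.0  # dL/dd = sigmoid(d) - 1
            for i in range(dim):
                w[i] -= lr * grad_d * (f_w[i] - f_l[i])
    return w

# Each pair is (preferred features, less-preferred features).
pairs = [
    ([1, 0], [0, 0]),  # correct beats incorrect
    ([1, 1], [1, 0]),  # polite beats blunt when both are correct
    ([1, 0], [0, 1]),  # correct-but-blunt beats polite-but-wrong
]
w = train_rm(pairs, dim=2)

# A comparison that never appeared in training.
unseen_better, unseen_worse = [1, 1], [0, 1]
print(score(w, unseen_better) > score(w, unseen_worse))
```

The model was never shown "correct and polite" against "wrong but polite", yet it can rank them because it learned reusable preferences rather than memorizing pairs.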
Stage 3: Reinforcement Learning (RL): Optimizing Against the Reward Model
What It Does
Use the reward model as an objective function. Apply policy gradient RL to maximize reward.
Algorithm: PPO (Proximal Policy Optimization)
For each episode:
1. Prompt: x
2. Generate response: y ~ π_RL(·|x) [sample from current policy]
3. Get reward from RM: r(x, y)
4. Compute advantage: A = r(x, y) - baseline
5. Update policy to increase probability of high-reward actions
6. But constrain: don't diverge too far from SFT model (KL penalty)
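The six steps above can be sketched on a toy problem. To keep it short, the example below swaps in plain REINFORCE with a running-mean baseline rather than PPO, uses a single fixed prompt, and treats the policy as a softmax over three candidate responses. The reward values, learning rate, and β are all hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_policy(rm_scores, sft_logits, beta=0.1, lr=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    logits = list(sft_logits)            # the RL policy starts from the SFT policy
    pi_sft = softmax(sft_logits)         # frozen reference distribution
    baseline = 0.0
    for _ in range(steps):               # step 1: the prompt is fixed in this toy
        pi = softmax(logits)
        y = rng.choices(range(len(pi)), weights=pi)[0]   # step 2: sample a response
        reward = rm_scores[y]                            # step 3: reward from the RM
        reward -= beta * math.log(pi[y] / pi_sft[y])     # step 6: KL penalty
        advantage = reward - baseline                    # step 4: advantage
        baseline += 0.05 * (reward - baseline)           # running-mean baseline
        for k in range(len(logits)):                     # step 5: policy update
            grad_logp = (1.0 if k == y else 0.0) - pi[k]
            logits[k] += lr * advantage * grad_logp
    return softmax(logits)

# Stand-in reward model scores for three candidate responses.
final_probs = train_policy(rm_scores=[0.0, 0.5, 1.0], sft_logits=[0.0, 0.0, 0.0])
print([round(p, 3) for p in final_probs])
```

Starting from a uniform policy, probability mass shifts toward the highest-reward response, while the KL term stops the other responses from being driven to exactly zero.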
The KL Divergence Penalty: Staying True to SFT
A key ingredient is a penalty term that prevents the RL model from diverging too far from the SFT model.
RL Objective:
$$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$
Two terms:
- Reward maximization: $-E[r_\theta(x, y)]$ — RL pushes toward high-reward responses
- KL penalty: $\beta \cdot KL[\pi_{\text{RL}} || \pi_{\text{SFT}}]$ — RL stays close to SFT
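It can help to expand the KL term as an expectation over samples from the RL policy. The lines below apply the standard definition of KL divergence to the objective above for a fixed prompt $x$; they are a sketch of the idea and may differ in detail from the paper's exact formulation.

$$KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)] = E_{y \sim \pi_{\text{RL}}(\cdot|x)}\left[\log \frac{\pi_{\text{RL}}(y|x)}{\pi_{\text{SFT}}(y|x)}\right]$$

$$L_{\text{RL}} = -E_{y \sim \pi_{\text{RL}}(\cdot|x)}\left[r_\theta(x, y) - \beta \log \frac{\pi_{\text{RL}}(y|x)}{\pi_{\text{SFT}}(y|x)}\right]$$

Read this way, the penalty acts as a per-sample deduction from the reward: a response the RL policy finds much more likely than the SFT policy did has to earn a correspondingly higher reward to be worth it.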
Why both?
- Reward-only RL: The model optimizes solely for the RM. But the RM is imperfect (trained on limited human feedback). The model might exploit flaws in the RM or forget useful knowledge from pretraining.
- KL penalty: Keeps the RL model anchored to the SFT model’s knowledge. The model can’t make wild changes just to game the RM.
Hyperparameter β:
- β = 0: Pure reward-seeking (dangerous, can exploit RM)
- β large: Stay very close to SFT (little improvement)
- β ≈ 0.01–0.1: Often a workable range in practice, though the best value depends on the scale of the reward
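The effect of β can be seen exactly on a small example. For a finite set of candidate responses, the policy that maximizes expected reward minus β times the KL to the SFT policy has a standard closed form, $\pi^*(y|x) \propto \pi_{\text{SFT}}(y|x) \exp(r(x, y)/\beta)$. That result comes from the general theory of KL-regularized objectives rather than from the text above, and the probabilities and rewards below are hypothetical.

```python
import math

def kl_regularized_optimum(pi_sft, rewards, beta):
    """Maximizer of E[r] - beta * KL(pi || pi_sft) over a finite response set."""
    logits = [math.log(p) + r / beta for p, r in zip(pi_sft, rewards)]
    m = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

pi_sft = [0.7, 0.2, 0.1]     # SFT policy favors the first response
rewards = [0.0, 0.5, 1.0]    # reward model favors the last response

for beta in (0.01, 0.1, 1.0, 100.0):
    probs = kl_regularized_optimum(pi_sft, rewards, beta)
    print(beta, [round(p, 3) for p in probs])
```

With a very small β the optimum collapses onto the highest-reward response, and with a very large β it is almost indistinguishable from the SFT policy, matching the two extremes in the list above.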
Why PPO?
PPO is a stable RL algorithm that:
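- Limits how far each update can move the policy by clipping the probability ratio between the new and old policy (a cheap approximation to a trust region)
- Tends to train stably, even with large models
- Is generally easier to tune than many other policy gradient methods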
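The clipping mentioned above is easy to state in code. This is a minimal sketch of PPO's clipped surrogate objective for a single sample, where `ratio` is the new policy's probability of the sampled action divided by the old policy's. The default `eps=0.2` is a commonly used value and is an assumption here, not a setting taken from the text.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound, so the
    # policy gains nothing from pushing the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clip_objective(1.5, 1.0))   # good action, ratio already too high
print(ppo_clip_objective(0.5, -1.0))  # bad action, ratio already too low
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective goes flat and the gradient for that sample vanishes, which is what keeps individual updates small.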
The Indian Analogy (Expanded)
Imagine teaching a student to be a good doctor:
Stage 1 (SFT): Learning from a mentor doctor
- The mentor shows the student how to diagnose, how to write prescriptions, how to communicate with patients
- After months of learning, the student can handle basic cases reasonably well
- But the student is only good at cases similar to what the mentor taught
Stage 2 (RM): Learning from a committee
- Bring in a committee of experienced doctors
- Present pairs of diagnoses or treatment plans
- The committee judges: “Approach A is better because it’s more thorough” or “B is better because it’s faster”
- An assistant (the reward model) learns to mimic the committee’s preferences
- The assistant hasn’t seen every possible case, but can generalize
Stage 3 (RL): Practicing and improving
- The student now practices on new cases
- After each diagnosis, the committee evaluates (via the assistant’s prediction)
- The student adjusts their approach to match the committee’s standards
- But: The student doesn’t abandon everything the mentor taught
- The student balances: follow the committee’s guidance, but don’t forget the mentor’s core principles
- That balance is the KL penalty — stay somewhat true to the mentor
Why This Three-Stage Approach?
Not just SFT?
SFT alone is limited. We can’t write enough examples. And the model can diverge from good behavior on out-of-distribution prompts.
Not just RL from human annotations?
If humans have to rate every intermediate RL rollout, it’s prohibitively expensive. The RM lets humans rate just a diverse set of comparisons, then the RM scales up those ratings.
Why train the RM separately?
If we did SFT → RL all in one step, the RL would have no learned notion of preference (no reward model). We’d have to ask humans to rate RL rollouts directly, which is expensive and slow.
By training the RM first, we create a fast, learned approximation of human judgment. Then RL can leverage that.
Key Insight: Alignment is Learnable
Before this paper, some researchers believed alignment required:
- Built-in safety mechanisms
- New architectures
- Different training objectives entirely
This paper shows: Alignment is learnable via standard techniques.
- SFT = supervised learning
- RM = classification
- RL = policy gradient
All standard ML techniques, but applied strategically to the alignment problem.
Summary: The Pipeline
GPT-3 (175B)
↓ [SFT: fine-tune on human examples]
π_SFT (instruction-following model)
↓ [RM: train reward model on human comparisons]
r_θ (learns what humans prefer)
↓ [RL: optimize π_SFT with PPO against r_θ, with KL penalty]
↓
Result: InstructGPT (1.3B, highly aligned)
72% preferred over GPT-3 by humans
3.5× higher rating despite 130× smaller
This pipeline became the foundation for ChatGPT and has influenced the training of Claude and many other aligned language models.