The Idea: Three-Stage Alignment with Human Feedback

Overview: The RLHF Pipeline

The paper’s core innovation is a three-stage pipeline:

Supervised Fine-Tuning (SFT): Learn from human examples
Reward Model (RM): Learn what humans prefer
Reinforcement Learning (RL): Optimize using the reward model

Each stage solves a different sub-problem of alignment.

Stage 1: Supervised Fine-Tuning (SFT)

What It Does

Take GPT-3 (trained on internet text) and fine-tune it on human-written examples of good behavior.

Example:

Prompt: "Explain photosynthesis for a 10-year-old."

Human-written ideal response:
"Plants are like tiny factories. They take in sunlight, water, and air,
and make food for themselves. It's like they eat sunshine! The green
part of the leaf (called chlorophyll) catches the sunlight and
uses the energy to combine water and air into sugar, which the plant
uses to grow. Oxygen (the air we breathe) is a leftover product!"

We fine-tune GPT-3 on thousands of such (prompt, ideal_response) pairs.
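As a rough illustration of what "fine-tune on (prompt, ideal_response) pairs" means numerically, here is a minimal sketch of a supervised loss. It assumes the common setup of token-level cross-entropy averaged over response tokens only. The function name `sft_loss` and the toy probabilities are hypothetical and not taken from the paper.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs[i] -- probability the model assigned to the i-th correct token
    is_response[i] -- True if token i belongs to the human-written response
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(losses) / len(losses)

# Hypothetical numbers: 2 prompt tokens followed by 3 response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.1]
mask = [False, False, True, True, True]

loss = sft_loss(probs, mask)
print(round(loss, 4))
```

Masking out the prompt tokens is an assumption of this sketch. The idea is that the model is trained to produce the response, not to reproduce the prompt.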

Why This Stage Matters

Quick start: GPT-3 already knows a lot; we’re just steering it toward better behavior
Cheap: Fine-tuning is much cheaper than retraining from scratch
Bootstrapping: The RM and RL stages need a decent starting point, which SFT provides

The Limitation

We can’t write enough examples to cover all cases. If we fine-tune the SFT model for too long, it overfits to the specific examples and loses generality.

Solution: Use SFT as a warm start. Then train the reward model to generalize preferences.

Stage 2: Reward Model (RM): Learning Human Preferences

What It Does

Train a classifier that predicts: “Which of two outputs is better?”

Example:

Prompt: "What is 2+2?"

Output A: "2+2=4"
Output B: "2+2=5"

Human labels: A is better (obviously).

Reward Model learns: r(A) > r(B)

But most cases are subtler:

Prompt: "Tell me a joke."

Output A: "Why did the chicken cross the road?
           To get to the other side. It's a classic!"

Output B: "A man walks into a bar. Ouch!"

Human rater: Hmm, A is more complete but B is more original.
A might get labeled as preferred 60% of the time.

Reward Model learns to score A only slightly higher than B, reflecting the 60/40 split.

How It Works: Bradley-Terry Model

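The reward model outputs a single scalar score $r_\theta(x, y)$ for each (prompt, response) pair. The score has no absolute meaning on its own. Under the Bradley-Terry model, the difference between two scores is treated as a “logit” (log-odds) that one response is preferred over the other.

For two responses (y_A, y_B) to the same prompt x:

$$P(A \text{ preferred} \mid x, y_A, y_B) = \sigma(r_\theta(x, y_A) - r_\theta(x, y_B))$$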

Where $\sigma$ is the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Loss function:

$$L = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

Where $y_w$ is the preferred response and $y_l$ is the less-preferred response.

Interpretation:

If $r(y_w) \gg r(y_l)$, the sigmoid is close to 1, and $-\log(1) = 0$: low loss (good prediction)
If $r(y_w) \ll r(y_l)$, the sigmoid is close to 0, and $-\log \sigma \to \infty$: high loss (bad prediction)
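The loss and its interpretation can be checked numerically. This is a minimal sketch of the Bradley-Terry loss using the formulas above. The helper names and the reward values are made up for illustration. The last helper inverts the sigmoid to show what reward gap corresponds to the joke example, where A is preferred about 60% of the time.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r_w, r_l):
    """Bradley-Terry loss for one comparison: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))

def gap_for_preference(p):
    """Reward gap r_A - r_B for which sigma(gap) equals p (the logit of p)."""
    return math.log(p / (1.0 - p))

confident_right = preference_loss(5.0, 0.0)  # preferred response scored much higher
undecided = preference_loss(0.0, 0.0)        # equal scores, a coin flip
confident_wrong = preference_loss(0.0, 5.0)  # preferred response scored much lower
joke_gap = gap_for_preference(0.6)           # gap implied by a 60/40 preference

print(confident_right, undecided, confident_wrong, joke_gap)
```

A 60/40 preference corresponds to a gap of log(1.5), well under one reward unit, so a well-calibrated reward model should rate such near-ties close together rather than far apart.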

Why Bradley-Terry?

This model is from ranking theory. It’s elegant:

It doesn’t require absolute scores (which would require more annotation)
It only requires comparisons: “A or B?”
Humans are better at relative judgments than absolute ratings
Comparisons are cheaper to collect at scale

The Generalization Power

Once trained, the RM can score any (prompt, response) pair, not just the ones it was trained on. This is crucial because:

We can’t manually rate every possible (prompt, response) pair
The RL stage will generate novel responses the RM hasn’t seen
The RM is meant to generalize to novel, creative, helpful responses, though its scores become less reliable the further a response is from its training data

Stage 3: Reinforcement Learning (RL): Optimizing Against the Reward Model

What It Does

Use the reward model as an objective function. Apply policy gradient RL to maximize reward.

Algorithm: PPO (Proximal Policy Optimization)

For each episode:
  1. Prompt: x
  2. Generate response: y ~ π_RL(·|x)  [sample from current policy]
  3. Get reward from RM: r(x, y)
  4. Compute advantage: A = r(x, y) - baseline
  5. Update policy to increase probability of high-reward actions
  6. But constrain: don't diverge too far from SFT model (KL penalty)
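The six steps can be sketched on a toy problem. The code below is not PPO and not the paper's implementation. It is a plain REINFORCE-style policy gradient on a single prompt with three canned responses, with the KL penalty folded into the reward. All numbers (`rm_reward`, `pi_sft`, `beta`, `lr`, the baseline rate) are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup (all values hypothetical): one prompt, three candidate responses.
rm_reward = [1.0, 0.2, -0.5]            # stand-in for r(x, y) from the reward model
pi_sft = [0.2, 0.5, 0.3]                # frozen SFT policy
logits = [math.log(p) for p in pi_sft]  # the RL policy starts as a copy of SFT
beta, lr = 0.1, 0.1
rng = random.Random(0)
baseline = 0.0

for step in range(3000):
    pi = softmax(logits)
    y = rng.choices(range(3), weights=pi)[0]   # step 2: sample a response
    kl_term = math.log(pi[y] / pi_sft[y])      # step 6: single-sample KL estimate
    shaped = rm_reward[y] - beta * kl_term     # step 3: RM reward minus KL penalty
    advantage = shaped - baseline              # step 4: advantage
    baseline += 0.05 * (shaped - baseline)     # running-mean baseline
    for i in range(3):                         # step 5: REINFORCE update on the logits
        grad_logp = (1.0 if i == y else 0.0) - pi[i]
        logits[i] += lr * advantage * grad_logp

pi_rl = softmax(logits)
print([round(p, 3) for p in pi_rl])
```

In this sketch the policy shifts probability toward the highest-reward response, while the penalty term discourages it from moving further from `pi_sft` than the reward justifies.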

The KL Divergence Penalty: Staying True to SFT

Here’s the key safeguard: we add a penalty term that prevents the RL model from diverging too far from the SFT model.

RL Objective:

$$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$

Two terms:

Reward maximization: $-E[r_\theta(x, y)]$ — RL pushes toward high-reward responses
KL penalty: $\beta \cdot KL[\pi_{\text{RL}} || \pi_{\text{SFT}}]$ — RL stays close to SFT

Why both?

Reward-only RL: The model optimizes solely for the RM. But the RM is imperfect (trained on limited human feedback). The model might exploit flaws in the RM or forget useful knowledge from pretraining.
KL penalty: Keeps the RL model anchored to the SFT model’s knowledge. The model can’t make wild changes just to game the RM.

Hyperparameter β:

β = 0: Pure reward-seeking (dangerous, can exploit RM)
β large: Stay very close to SFT (little improvement)
β ≈ 0.01–0.1: Sweet spot in practice
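The effect of β can be seen on a toy distribution. For a discrete set of responses, the policy that maximizes expected reward minus β times the KL term has a known closed form: π_SFT(y) · exp(r(y)/β), normalized. This is a standard result for KL-regularized objectives rather than something stated in the text above, and the rewards and probabilities below are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL[p || q] for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def optimal_policy(pi_sft, rewards, beta):
    """Maximizer of E[r] - beta * KL[pi || pi_sft]: pi_sft * exp(r / beta), normalized."""
    top = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - top) / beta) for p, r in zip(pi_sft, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

pi_sft = [0.2, 0.5, 0.3]     # hypothetical SFT probabilities for three responses
rewards = [1.0, 0.2, -0.5]   # hypothetical reward model scores

results = {}
for beta in (0.05, 1.0, 100.0):
    pi = optimal_policy(pi_sft, rewards, beta)
    results[beta] = (pi, kl_divergence(pi, pi_sft))
    print(beta, [round(p, 3) for p in pi], round(results[beta][1], 3))
```

With a small β, almost all probability collapses onto the top-scored response. With a very large β, the policy is nearly indistinguishable from SFT. Useful values depend on the scale of the reward model's scores, so the numeric ranges here should not be read as recommendations.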

Why PPO?

PPO is a stable RL algorithm that:

Avoids overshooting policy updates (it clips the probability ratio between the new and old policy, which acts like a trust region)
Works with large models without gradient instability
Is easier to tune than other policy gradient methods
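The clipping idea fits in a few lines. This is a minimal per-sample sketch of PPO's clipped surrogate objective, where `ratio` is π_new(y|x) / π_old(y|x). The default `epsilon=0.2` is a commonly used value and is an assumption here, not a setting taken from the text above.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0): the gain from raising its probability is capped at 1 + eps.
capped_gain = ppo_clip_objective(1.5, 1.0)
# Bad action (A < 0) whose probability went up: not clipped, so fully penalized.
full_penalty = ppo_clip_objective(1.5, -1.0)
# Bad action (A < 0) already pushed down past 1 - eps: no extra credit for going further.
capped_credit = ppo_clip_objective(0.5, -1.0)

print(capped_gain, full_penalty, capped_credit)
```

Taking the minimum makes the objective pessimistic: once the ratio leaves the interval [1 − ε, 1 + ε] in the direction that would improve the objective, the gradient for that sample vanishes, which limits how far a single update can move the policy.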

The Indian Analogy (Expanded)

Imagine teaching a student to be a good doctor:

Stage 1 (SFT): Learning from a mentor doctor

The mentor shows the student how to diagnose, how to write prescriptions, how to communicate with patients
After months of learning, the student can handle basic cases reasonably well
But the student is only good at cases similar to what the mentor taught

Stage 2 (RM): Learning from a committee

Bring in a committee of experienced doctors
Present pairs of diagnoses or treatment plans
The committee judges: “Approach A is better because it’s more thorough” or “B is better because it’s faster”
An assistant (the reward model) learns to mimic the committee’s preferences
The assistant hasn’t seen every possible case, but can generalize

Stage 3 (RL): Practicing and improving

The student now practices on new cases
After each diagnosis, the committee evaluates (via the assistant’s prediction)
The student adjusts their approach to match the committee’s standards
But: The student doesn’t abandon everything the mentor taught
The student balances: follow the committee’s guidance, but don’t forget the mentor’s core principles
That balance is the KL penalty — stay somewhat true to the mentor

Why This Three-Stage Approach?

Not just SFT?

SFT alone is limited. We can’t write enough examples. And the model can diverge from good behavior on out-of-distribution prompts.

Not just RL from human annotations?

If humans have to rate every intermediate RL rollout, it’s prohibitively expensive. The RM lets humans rate just a diverse set of comparisons, then the RM scales up those ratings.

Why train the RM separately?

If we did SFT → RL all in one step, the RL would have no learned notion of preference (no reward model). We’d have to ask humans to rate RL rollouts directly, which is expensive and slow.

By training the RM first, we create a fast, learned approximation of human judgment. Then RL can leverage that.

Key Insight: Alignment is Learnable

Before this paper, some researchers believed alignment required:

Built-in safety mechanisms
New architectures
Different training objectives entirely

This paper shows: Alignment is learnable via standard techniques.

SFT = supervised learning
RM = classification
RL = policy gradient

All standard ML techniques, but applied strategically to the alignment problem.

Summary: The Pipeline

GPT-3 (pretrained)
    ↓ [SFT: fine-tune on human examples]
π_SFT (instruction-following model)
    ↓ [RM: train reward model on human comparisons of model outputs]
r_θ (learns what humans prefer)
    ↓ [RL: optimize π_SFT against r_θ with PPO + KL penalty]
Result: InstructGPT
         1.3B InstructGPT outputs preferred by human labelers over
         175B GPT-3 outputs, despite over 100× fewer parameters
         175B InstructGPT outputs preferred over 175B GPT-3 outputs
         85 ± 3% of the time

This pipeline became the foundation for ChatGPT and has strongly influenced how later aligned language models, including Claude, are trained.