The Code: Reward Model Training and RL Setup
This section demonstrates the core components of RLHF: reward model training and the RL loss setup. This code runs on Google Colab with PyTorch and Transformers.
Code 1: Reward Model Training (Bradley-Terry)
# Reward Model Training with Bradley-Terry Loss
# Runs on Google Colab
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import AutoTokenizer, AutoModel
# Load a small pretrained model as the reward model base
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModel.from_pretrained(model_name)
# Create reward head: [CLS] token → scalar reward
class RewardModel(nn.Module):
def __init__(self, base_model, hidden_size=768):
super().__init__()
self.model = base_model
self.reward_head = nn.Linear(hidden_size, 1) # Output scalar
def forward(self, input_ids, attention_mask):
# Get [CLS] token representation
outputs = self.model(input_ids, attention_mask)
cls_hidden = outputs.last_hidden_state[:, 0, :] # [batch, hidden]
# Map to scalar reward
reward = self.reward_head(cls_hidden).squeeze(-1) # [batch]
return reward
# Initialize model and optimizer
rm = RewardModel(base_model)
optimizer = optim.Adam(rm.parameters(), lr=5e-5)
criterion = nn.BCEWithLogitsLoss() # Binary cross-entropy (sigmoid built-in)
# Mock comparison data: (prompt + response_A, prompt + response_B, label)
# In reality, these come from human annotations
training_data = [
# (input_text_A, input_text_B, label_A_better)
("What is 2+2? Answer: 4", "What is 2+2? Answer: 5", 1),
("Explain gravity. It's a force. Explanation: ✓",
"Explain gravity. It's complicated.", 0),
("What is Python? It's a snake.",
"What is Python? A programming language.", 1),
]
# Training loop (simplified for demonstration)
rm.train()
for epoch in range(3): # 3 epochs for this demo
total_loss = 0
for text_a, text_b, label_a_better in training_data:
# Tokenize both responses
tokens_a = tokenizer(text_a, return_tensors="pt",
padding=True, truncation=True)
tokens_b = tokenizer(text_b, return_tensors="pt",
padding=True, truncation=True)
# Forward pass: get rewards
reward_a = rm(tokens_a["input_ids"], tokens_a["attention_mask"])
reward_b = rm(tokens_b["input_ids"], tokens_b["attention_mask"])
# Bradley-Terry: log(sigmoid(reward_winner - reward_loser))
# This is equivalent to BCEWithLogitsLoss
if label_a_better == 1:
logits = reward_a - reward_b # A better → positive
else:
logits = reward_b - reward_a # B better → positive
# Sigmoid of logits should be close to 1
loss = criterion(logits.unsqueeze(-1),
torch.ones(1, 1)) # Target: sigmoid = 1
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(training_data)
print(f"Epoch {epoch+1}: Loss = {avg_loss:.4f}")
# After training, evaluate on new responses
rm.eval()
with torch.no_grad():
test_text = "Photosynthesis is the process where plants convert light to energy."
test_tokens = tokenizer(test_text, return_tensors="pt",
padding=True, truncation=True)
reward_score = rm(test_tokens["input_ids"],
test_tokens["attention_mask"])
print(f"\nReward for test response: {reward_score.item():.4f}")
What this code does:
- Loads a pretrained language model (DistilBERT)
- Adds a scalar reward head on top (reward = MLP([CLS] token))
- Trains on preference pairs using Bradley-Terry loss
- Computes gradient to increase reward for preferred responses
- Outputs a scalar reward for any (prompt, response) pair
Key insight: The reward model is just a classifier trained on comparisons, reused with a small reward head.
Code 2: RL Loss with KL Penalty
# RL Objective with KL Divergence Penalty
# Demonstrating the PPO-style loss with KL constraint
import torch
import torch.nn.functional as F
from torch.distributions import Categorical
# Mock scenario:
# - RL policy (π_RL) generates a response
# - Reward model rates it
# - KL penalty keeps policy close to SFT baseline (π_SFT)
# Suppose we have log probabilities from both models
batch_size = 2
seq_len = 10
# Mock log probabilities over tokens (in practice, from model.logits)
log_probs_rl = torch.randn(batch_size, seq_len) # π_RL log-probs
log_probs_sft = torch.randn(batch_size, seq_len) # π_SFT log-probs
# Mock rewards from the reward model
rewards = torch.tensor([2.5, -0.3]) # One good response, one bad
# KL divergence penalty coefficient
beta = 0.02
# Compute KL divergence (per example)
# KL[π_RL || π_SFT] ≈ E_y[log π_RL(y) - log π_SFT(y)]
kl_divergence_per_token = log_probs_rl - log_probs_sft # [batch, seq]
kl_per_example = kl_divergence_per_token.mean(dim=1) # [batch]
# RL Loss: maximize reward, minimize KL divergence
# L = -reward + beta * KL
rl_loss = -rewards + beta * kl_per_example
print("Rewards from RM:", rewards.numpy())
print("KL divergence per example:", kl_per_example.detach().numpy())
print("RL Loss (raw):", rl_loss.detach().numpy())
# Average loss across batch
total_loss = rl_loss.mean()
print(f"\nTotal RL Loss: {total_loss.item():.4f}")
# In practice, this would be backpropagated:
# optimizer.zero_grad()
# total_loss.backward()
# optimizer.step()
What this code does:
- Simulates log probabilities from RL policy and SFT baseline
- Computes KL divergence as the difference in log-probs
- Combines reward maximization and KL penalty
- Shows that loss = -reward + beta * KL (trade-off)
Key insight: The RL loss balances two objectives:
- Numerator: Maximize reward (RL wants high-reward responses)
- Denominator: Stay close to SFT (KL penalty prevents divergence)
Code 3: Full RLHF Training Loop (Simplified)
# Simplified RLHF Training Loop
# (In production, this would use PPO with advantage estimation)
import torch
import torch.optim as optim
# Pretend we have three components
class DummyModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.weight = torch.nn.Parameter(torch.tensor(0.5))
def log_prob(self, x):
# Dummy: output a learnable log-probability
return -0.1 * ((x - self.weight)**2) # Higher prob near weight
rl_policy = DummyModel()
sft_policy = DummyModel()
sft_policy.weight.data = torch.tensor(0.0) # Baseline: weight=0
reward_model = lambda x: 2.0 * (x - 1.0)**2 # Reward for outputs near 1.0
optimizer = optim.Adam(rl_policy.parameters(), lr=0.1)
beta = 0.05
# RL training loop
for epoch in range(10):
optimizer.zero_grad()
# Sample from RL policy (for diversity)
x_sampled = rl_policy.weight + torch.randn(5) # Sample around mean
# Compute rewards and log probabilities
rewards = reward_model(x_sampled) # [5]
log_probs_rl = rl_policy.log_prob(x_sampled) # [5]
log_probs_sft = sft_policy.log_prob(x_sampled) # [5]
# KL divergence
kl = (log_probs_rl - log_probs_sft).mean()
# RL loss
loss = -rewards.mean() + beta * kl
# Backprop
loss.backward()
optimizer.step()
if epoch % 3 == 0:
print(f"Epoch {epoch}: Loss={loss.item():.4f}, "
f"Reward={rewards.mean().item():.4f}, "
f"KL={kl.item():.4f}, "
f"Policy param={rl_policy.weight.item():.4f}")
print("\nAfter RL training:")
print(f"RL policy learned weight: {rl_policy.weight.item():.4f}")
print(f"SFT baseline weight: {sft_policy.weight.item():.4f}")
print("(RL policy moved toward reward peak at 1.0, but constrained by KL penalty)")
What this code does:
- Simulates an RL policy trying to maximize a reward
- Shows how KL penalty keeps the policy close to SFT baseline
- Demonstrates the trade-off: RL can improve, but not infinitely
- Shows that without KL penalty (beta=0), policy diverges completely
Key behavior:
- With beta=0.05: Policy moves toward reward peak (~1.0) but stays close to baseline (~0.0)
- With beta=0: Policy would move all the way to 1.0 (unconstrained)
- With beta large: Policy would barely move (constrained too much)
Practical Notes on Implementation
1. Batch Size for RL
RL training is expensive because:
- Generate response from policy: O(seq_len) tokens
- Compute reward: O(seq_len) through RM
- Compute log-probs: O(seq_len) through policy
Use smaller batch sizes (~16) for RL, larger (~128) for RM and SFT.
2. PPO Clipping (Not Shown)
The paper uses PPO, which clips gradients to prevent overshoots:
# Simplified PPO clip
ratio = exp(log_prob_new - log_prob_old)
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
loss = -torch.min(ratio * advantage, clipped_ratio * advantage)
This prevents one batch from causing huge policy updates.
3. Value Function for Advantage (Not Shown)
In practice, use:
advantage = reward - V(x)
where V(x) is a learned baseline that estimates expected reward given prompt x. This reduces variance in gradient estimates.
4. Data Flow in Production
SFT Model (trained)
↓
├→ Generates responses on diverse prompts
│
├→ Human raters compare pairs (33k comparisons)
│
└→ Reward Model (trained on comparisons)
↓
├→ Scores RL rollouts (cheap, fast)
│
└→ RL Training Loop (PPO)
↓
InstructGPT (aligned model)
Colab-Ready Code Summary
The three code blocks above are self-contained and run on free Google Colab:
- Code 1: Reward model training (~5 min)
- Code 2: RL loss computation (~1 min)
- Code 3: Full RL loop (~2 min)
For production systems:
- Use HuggingFace’s
trllibrary (Text RL) for PPO - Scale to 13B+ parameter models
- Collect tens of thousands of human preference pairs
- Train on 8× GPU setups
The paper’s key contribution isn’t new algorithms (SFT, RM, RL all existed before), but showing how to combine them effectively at scale.