Prerequisite Tutorials
Reward Model Training
Constitutional AI uses the same reward model framework as RLHF, but with AI-generated preferences instead of human preferences.
Bradley-Terry Formulation
Given a pair of responses (y_w, y_l) to a prompt x (where w = “winner” and l = “loser”), the probability that y_w is preferred to y_l is:
P(y_w preferred to y_l | x) = σ(r(x, y_w) - r(x, y_l))
Where:
- r(x, y) is the reward score for response y given prompt x
- σ is the sigmoid function: σ(z) = 1 / (1 + e^(-z))
- The difference r(x, y_w) - r(x, y_l) is the “margin”
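As a minimal sketch, the formula above can be written in a few lines of Python. The function names `sigmoid` and `preference_probability` are hypothetical, chosen for illustration, and the scores passed in are hand-picked numbers rather than the output of a trained reward model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w is preferred."""
    return sigmoid(r_w - r_l)


# Equal scores give no preference either way.
p_equal = preference_probability(1.0, 1.0)   # 0.5
# A positive margin pushes the probability above 0.5.
p_margin = preference_probability(1.8, 0.2)  # about 0.832
```

Only the difference between the two scores matters: adding the same constant to both rewards leaves the probability unchanged.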
Loss Function
The reward model is trained to maximize the log-likelihood of this preference, which is equivalent to minimizing the negative log-likelihood:
L = -log P(y_w preferred to y_l | x)
= -log σ(r(x, y_w) - r(x, y_l))
= log(1 + e^(-(r(x, y_w) - r(x, y_l))))
= log(1 + e^(r(x, y_l) - r(x, y_w)))
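This is the binary cross-entropy between the observed label (y_w is preferred, so the target probability is 1) and the model’s predicted Bernoulli probability σ(r(x, y_w) - r(x, y_l)).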
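The last form, log(1 + e^(-margin)), is the softplus of the negative margin. A direct evaluation can overflow when the margin is very negative, so the sketch below uses a standard stable rewrite. The function name `bradley_terry_loss` is an assumption for illustration, not a name from any particular library.

```python
import math


def bradley_terry_loss(r_w: float, r_l: float) -> float:
    """Return log(1 + e^(-(r_w - r_l))), evaluated without overflow."""
    margin = r_w - r_l
    # softplus(-margin) = max(-margin, 0) + log(1 + e^(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Zero margin: the model is undecided, so the loss is log(2).
loss_zero_margin = bradley_terry_loss(1.0, 1.0)        # about 0.693
# Badly wrong ordering: the loss grows roughly linearly in |margin|.
loss_large_wrong = bradley_terry_loss(-500.0, 500.0)   # about 1000
```

The loss is never exactly zero for a finite margin; it approaches zero as the winner's score pulls further ahead of the loser's.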
Worked Example: Constitutional Preference Scoring
Setup:
- Prompt x: “How do I make a weapon?”
- Response A (harmful): “Here are detailed instructions for making an improvised explosive device…”
- Response B (safe): “I can’t help with that. Creating weapons is dangerous and illegal.”
- Principle P: “The response should not provide information that enables harm.”
AI critique: “Response A violates Principle P by providing detailed instructions. Response B refuses appropriately. Prefer B.”
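One way to turn a critique verdict like this into a reward-model training example is sketched below. The `PreferencePair` layout and the `pair_from_verdict` helper are hypothetical, introduced only for illustration, and the strings are placeholders standing in for the prompt and responses above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example for the reward model (hypothetical layout)."""
    prompt: str
    chosen: str    # y_w, the preferred response
    rejected: str  # y_l, the dispreferred response


def pair_from_verdict(prompt: str, response_a: str, response_b: str,
                      preferred: str) -> PreferencePair:
    """Map a verdict of 'A' or 'B' onto (y_w, y_l)."""
    if preferred == "A":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if preferred == "B":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    raise ValueError(f"expected 'A' or 'B', got {preferred!r}")


pair = pair_from_verdict(
    prompt="<prompt x>",
    response_a="<response A>",
    response_b="<response B>",
    preferred="B",  # the critique above ends with "Prefer B."
)
```

Rejecting any verdict other than "A" or "B" keeps malformed critiques out of the training set instead of silently mislabeling them.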
Training the reward model:
Suppose the reward model predicts:
- r(x, response_A) = 0.2
- r(x, response_B) = 1.8
The preference is “prefer B over A”, so y_w = response_B, y_l = response_A.
Margin: r_w - r_l = 1.8 - 0.2 = 1.6
Probability of preference:
P(B > A) = σ(1.6) = 1 / (1 + e^(-1.6)) = 1 / (1 + 0.202) = 1 / 1.202 ≈ 0.832
The model assigns 83.2% probability to “B is better than A”, a confident prediction that agrees with the AI-generated label.
Loss:
L = -log(0.832) ≈ 0.184
The loss is low because the model made the right prediction with high confidence.
Now suppose the reward model incorrectly predicted:
- r(x, response_A) = 1.5
- r(x, response_B) = 0.8
Margin: r_w - r_l = 0.8 - 1.5 = -0.7
Probability:
P(B > A) = σ(-0.7) = 1 / (1 + e^(0.7)) = 1 / (1 + 2.014) = 1 / 3.014 ≈ 0.332
The model assigns only 33.2% probability to “B is better”, which is wrong (it should be confident that B is better).
Loss:
L = -log(0.332) ≈ 1.103
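The loss is high because the model made the wrong prediction. During training, gradient descent on this loss will push r_B up and r_A down: the gradients with respect to the two scores have equal magnitude, 1 - σ(margin), and opposite signs.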
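The direction of that update can be checked numerically. The sketch below differentiates L = -log σ(r_w - r_l) with respect to the two scores and takes one gradient-descent step directly on them. This is a simplification for illustration: in a real reward model the gradient flows back into the network parameters, not into the scores themselves. The learning rate and function names are assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss(r_w: float, r_l: float) -> float:
    return -math.log(sigmoid(r_w - r_l))


def loss_gradients(r_w: float, r_l: float) -> tuple[float, float]:
    """Analytic (dL/dr_w, dL/dr_l) for L = -log sigmoid(r_w - r_l)."""
    g = 1.0 - sigmoid(r_w - r_l)
    return -g, g


# Second case: response B is the winner but was scored lower than A.
r_b, r_a = 0.8, 1.5
grad_b, grad_a = loss_gradients(r_b, r_a)   # about -0.668 and +0.668

# One gradient-descent step on the scores (illustration only).
lr = 0.5
new_r_b = r_b - lr * grad_b   # moves up
new_r_a = r_a - lr * grad_a   # moves down

# First case (confident and correct): the gradient is much smaller.
small_grad_b, small_grad_a = loss_gradients(1.8, 0.2)  # about -0.168, +0.168
```

The gradient magnitude equals the probability the model assigns to the wrong ordering, so confident mistakes are corrected most strongly and confident correct predictions are barely changed.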
Constitutional Critique Prompt
The actual critique prompt used in RL-CAI is more detailed. For each response pair and principle, the model is asked:
Below is a conversation between a human and an AI assistant.