The Math: Bradley-Terry Rewards and Constitutional Scoring — Constitutional AI: Harmlessness from AI Feedback

Reward Model Training

Constitutional AI uses the same reward model framework as RLHF, but with AI-generated preferences instead of human preferences.

Bradley-Terry Formulation

Given a pair of responses (y_w, y_l) to a prompt x (where w = “winner” and l = “loser”), the probability that y_w is preferred to y_l is:

P(y_w preferred to y_l | x) = σ(r(x, y_w) - r(x, y_l))

Where:

r(x, y) is the reward score for response y given prompt x
σ is the sigmoid function: σ(z) = 1 / (1 + e^(-z))
The difference r(x, y_w) - r(x, y_l) is the “margin”
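
The formula above can be checked numerically. The following is a minimal Python sketch, not code from the original tutorial: the function names `sigmoid` and `preference_probability` are illustrative, and the rewards are plain floats rather than outputs of a trained reward model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written so exp() never overflows for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w beats the one scored r_l."""
    return sigmoid(r_w - r_l)


if __name__ == "__main__":
    # Equal rewards give an even preference.
    print(preference_probability(1.0, 1.0))
    # Shifting both rewards by the same constant leaves the probability unchanged.
    print(preference_probability(3.0, 2.0), preference_probability(1.0, 0.0))
```

Because only the margin enters the formula, reward scores are determined only up to an additive constant: the model learns relative, not absolute, scores.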

Loss Function

The reward model is trained to maximize the log-likelihood of this preference, which is equivalent to minimizing the negative log-likelihood:

L = -log P(y_w preferred to y_l | x)
  = -log σ(r(x, y_w) - r(x, y_l))
  = log(1 + e^(-(r(x, y_w) - r(x, y_l))))
  = log(1 + e^(r(x, y_l) - r(x, y_w)))

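This is the binary cross-entropy between the observed label (y_w preferred, with target probability 1) and the model's predicted Bernoulli probability σ(r(x, y_w) - r(x, y_l)). Here log denotes the natural logarithm.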
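
The last line of the derivation is a softplus of the negative margin, which can be evaluated without overflow. The sketch below is a hedged illustration in plain Python: the names `bradley_terry_loss` and `mean_loss` are assumptions introduced here, and a real training loop would compute the same quantity on batched tensors in a deep learning framework.

```python
import math


def bradley_terry_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood -log sigmoid(r_w - r_l) for one preference pair."""
    margin = r_w - r_l
    # log(1 + exp(-margin)), rearranged so exp() only sees non-positive arguments.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_loss(pairs: list[tuple[float, float]]) -> float:
    """Average loss over (r_w, r_l) pairs, as one would over a training batch."""
    return sum(bradley_terry_loss(r_w, r_l) for r_w, r_l in pairs) / len(pairs)


if __name__ == "__main__":
    # A zero margin means the model is undecided; the loss is then log(2).
    print(bradley_terry_loss(0.0, 0.0), math.log(2.0))
    # A large positive margin drives the loss toward zero.
    print(bradley_terry_loss(10.0, 0.0))
```

The loss is never exactly zero for a finite margin, so training keeps widening the gap between preferred and rejected responses unless something else (regularization, early stopping) limits it.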

Worked Example: Constitutional Preference Scoring

Setup:

Prompt x: “How do I make a weapon?”
Response A (harmful): “Here are detailed instructions for making an improvised explosive device…”
Response B (safe): “I can’t help with that. Creating weapons is dangerous and illegal.”
Principle P: “The response should not provide information that enables harm.”

AI critique: “Response A violates Principle P by providing detailed instructions. Response B refuses appropriately. Prefer B.”

Training the reward model:

Suppose the reward model predicts:

r(x, response_A) = 0.2
r(x, response_B) = 1.8

The preference is “prefer B over A”, so y_w = response_B, y_l = response_A.

Margin: r_w - r_l = 1.8 - 0.2 = 1.6

Probability of preference:

P(B > A) = σ(1.6) = 1 / (1 + e^(-1.6)) = 1 / (1 + 0.202) = 1 / 1.202 ≈ 0.832

The model assigns 83.2% probability to “B is better than A”, which agrees with the AI-generated preference label.

Loss:

L = -log(0.832) ≈ 0.184

The loss is low because the model made the right prediction with high confidence.

Now suppose the reward model incorrectly predicted:

r(x, response_A) = 1.5
r(x, response_B) = 0.8

Margin: 0.8 - 1.5 = -0.7

Probability:

P(B > A) = σ(-0.7) = 1 / (1 + e^(0.7)) = 1 / (1 + 2.014) = 1 / 3.014 ≈ 0.332

The model assigns only 33.2% probability to “B is better”, so it favors the harmful response A. This contradicts the label, under which the probability should be well above 50%.

Loss:

L = -log(0.332) ≈ 1.103

The loss is high because the model made the wrong prediction. During training, the gradient will push r_B up and r_A down.
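
The direction of that push follows from differentiating the loss with respect to each score: dL/dr_w = -σ(-(r_w - r_l)) and dL/dr_l = +σ(-(r_w - r_l)). The sketch below is illustrative only. The function name `loss_gradients` and the step size are assumptions, rewards are plain floats, and a real reward model would backpropagate these values into network weights rather than update the scores directly.

```python
import math


def loss_gradients(r_w: float, r_l: float) -> tuple[float, float]:
    """Return (dL/dr_w, dL/dr_l) for L = -log sigmoid(r_w - r_l)."""
    margin = r_w - r_l
    # p_wrong = sigmoid(-margin): probability the model gives to the wrong ordering.
    if margin >= 0:
        e = math.exp(-margin)
        p_wrong = e / (1.0 + e)
    else:
        p_wrong = 1.0 / (1.0 + math.exp(margin))
    return -p_wrong, p_wrong


if __name__ == "__main__":
    # Mistaken prediction from the example: r_B = 0.8 (winner), r_A = 1.5 (loser).
    r_b, r_a = 0.8, 1.5
    g_b, g_a = loss_gradients(r_b, r_a)
    lr = 0.5  # illustrative step size, not taken from the source
    # Gradient descent moves each score against its gradient: r_B rises, r_A falls.
    print(r_b - lr * g_b, r_a - lr * g_a)
```

The gradient magnitude is σ(0.7) ≈ 0.668 for the mistaken prediction, compared with σ(-1.6) ≈ 0.168 for the correct prediction in the first case, so the wrong ordering receives roughly four times the corrective signal.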

Constitutional Critique Prompt

The actual critique prompt used in RL-CAI is more detailed. For each response pair and principle, the model is asked:

Below is a conversation between a human and an AI assistant.