Let’s trace through the entire Constitutional AI pipeline on a single example.
Setup
Prompt: “How can I convince my friend to invest all their money in my business without telling them about the risks?”
Base model output (before Constitutional AI training): “You could start by building their trust over time. Share success stories of your business, highlight the growth potential, and emphasize how this is a once-in-a-lifetime opportunity. Once they are convinced of the upside, ask them to invest. You don’t need to mention risks if they aren’t asking about them.”
This output is problematic: it advises dishonesty and concealment of material facts.
Stage 1: SL-CAI (Supervised Learning)
Step 1: Critique
We take the base model output and ask the model to critique it using the constitution. The prompt is:
Below is a conversation between a human and an AI assistant.
The human asked: "How can I convince my friend to invest all their money
in my business without telling them about the risks?"
The assistant responded: "You could start by building their trust over time.
Share success stories of your business, highlight the growth potential,
and emphasize how this is a once-in-a-lifetime opportunity. Once they are
convinced of the upside, ask them to invest. You don't need to mention risks
if they aren't asking about them."
Critique this response in relation to the following principles:
Principle 1 (Honesty): "The assistant should be honest and
not deliberately mislead the human."
Principle 2 (Harm): "The assistant should avoid providing advice that
could lead to financial harm to the human."
For each principle, identify specific ways the response violates it.
Model’s critique:
“This response violates both principles:
- Principle 1 (Honesty): The response explicitly advises concealing risks, which is deliberate misleading. It says “You don’t need to mention risks if they aren’t asking about them,” which is dishonest because a friend deserves full information to make a decision.
- Principle 2 (Harm): The response could lead to financial harm. By concealing risks, the friend may lose their savings. The response encourages this outcome by framing it as a persuasion tactic.”
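The critique request above is plain string assembly, so it can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function name `build_critique_prompt` and the dictionary of principles are hypothetical, and the assistant response is abbreviated for space.

```python
from typing import Dict


def build_critique_prompt(question: str, response: str, principles: Dict[str, str]) -> str:
    """Assemble a critique request from one conversation turn and named principles."""
    lines = [
        "Below is a conversation between a human and an AI assistant.",
        f'The human asked: "{question}"',
        f'The assistant responded: "{response}"',
        "Critique this response in relation to the following principles:",
    ]
    # Number the principles in insertion order (dicts preserve it in Python 3.7+).
    for i, (name, text) in enumerate(principles.items(), start=1):
        lines.append(f'Principle {i} ({name}): "{text}"')
    lines.append("For each principle, identify specific ways the response violates it.")
    return "\n".join(lines)


principles = {
    "Honesty": "The assistant should be honest and not deliberately mislead the human.",
    "Harm": "The assistant should avoid providing advice that could lead to "
            "financial harm to the human.",
}
critique_prompt = build_critique_prompt(
    "How can I convince my friend to invest all their money in my business "
    "without telling them about the risks?",
    "You could start by building their trust over time.",  # abbreviated response
    principles,
)
print(critique_prompt)
```

Keeping the principles in a separate mapping makes it easy to swap in a different subset of the constitution for each critique.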
Step 2: Revise
Now we ask the model to rewrite the response to address the critique:
Original response: "You could start by building their trust over time.
Share success stories of your business, highlight the growth potential,
and emphasize how this is a once-in-a-lifetime opportunity. Once they are
convinced of the upside, ask them to invest. You don't need to mention risks
if they aren't asking about them."
Critique: This response violates Principle 1 (Honesty) by advising deception,
and Principle 2 (Harm) by potentially causing financial harm.
Please rewrite the response to address these violations. Your goal is to be
helpful (help the person prepare an investment pitch), but also honest and
non-harmful (disclose risks and ensure the friend can make an informed decision).
Revised response:
“I understand you want your friend to invest, but the most ethical and sustainable approach is transparency. Here’s why:
- Honesty builds real trust: Tell your friend about the business opportunity, but also honestly discuss the risks. Every investment has downside scenarios.
- Informed decision-making: Your friend deserves full information to decide whether this investment aligns with their financial situation and risk tolerance.
- Protect your relationship: If you conceal risks and something goes wrong, your friend will feel betrayed. If you’re transparent upfront, they’ll respect your integrity even if the investment doesn’t pan out.
- Legal compliance: Depending on your jurisdiction, concealing material risks in an investment pitch may be securities fraud.
My advice: Prepare an honest pitch that includes:
- Your business vision and track record
- Growth potential and realistic upside
- Specific risks and how you’re mitigating them
- How much they could potentially lose
- Their options if they decide to invest or pass
Then present this to your friend and let them decide.”
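The critique and revise steps can be chained into one function. The sketch below assumes a hypothetical `generate` callable that maps a text prompt to a model completion; a stub stands in for it here so the example runs without a model. The prompt wording is shortened relative to the prompts shown above.

```python
from typing import Callable, Dict, List


def critique_and_revise(prompt: str, original: str,
                        generate: Callable[[str], str]) -> Dict[str, str]:
    """Run one critique step and one revision step for a single prompt."""
    # Step 1: ask the model to critique its own response.
    critique = generate(
        f'The human asked: "{prompt}"\n'
        f'The assistant responded: "{original}"\n'
        "Critique this response in relation to the constitution's principles."
    )
    # Step 2: ask the model to rewrite the response in light of the critique.
    revised = generate(
        f'Original response: "{original}"\n'
        f"Critique: {critique}\n"
        "Please rewrite the response to address these violations."
    )
    return {"prompt": prompt, "original": original,
            "critique": critique, "revised": revised}


# Stub standing in for a real model call, so the sketch runs offline.
calls: List[str] = []


def stub_generate(text: str) -> str:
    calls.append(text)
    return "CRITIQUE TEXT" if len(calls) == 1 else "REVISED TEXT"


example = critique_and_revise(
    "How can I convince my friend to invest all their money in my business "
    "without telling them about the risks?",
    "You could start by building their trust over time.",
    stub_generate,
)
print(example["revised"])
```

Note that the critique text is fed back into the second call, so the revision is conditioned on the specific violations the model identified.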
Step 3: Collect SL-CAI Training Data
The pair (prompt, revised response) becomes a supervised training example. The original harmful response is only an intermediate artifact: it is needed to produce the critique and revision, but it is not a training target.
If we had 100 such prompts and revised them all, we’d have 100 examples to fine-tune the model on:
(prompt_1, revised_response_1)
(prompt_2, revised_response_2)
...
(prompt_100, revised_response_100)
Fine-tune the base model on the revised responses using the standard language modeling loss:
L_SL = -E[log P(revised_response | prompt)]
Result: After SL-CAI training, when the model sees this prompt again, it will generate something closer to the revised version from the start.
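To make L_SL concrete, here is a minimal sketch that computes it from per-token log-probabilities. It assumes the model has already scored each token of the revised response given the prompt; the numbers are made up for illustration, and the loss is summed over tokens and averaged over examples (some implementations average per token instead).

```python
import math
from typing import Sequence


def sl_loss(batch_token_logprobs: Sequence[Sequence[float]]) -> float:
    """Mean negative log-likelihood of the revised responses.

    Each inner sequence holds log P(token | prompt, earlier tokens) for the
    tokens of one revised response; prompt tokens are excluded from the loss.
    """
    per_example = [-sum(token_logprobs) for token_logprobs in batch_token_logprobs]
    return sum(per_example) / len(per_example)


# Two toy "revised responses": one with two tokens, one with a single token.
batch = [
    [math.log(0.5), math.log(0.5)],  # log P(response) = log 0.25
    [math.log(0.25)],                # log P(response) = log 0.25
]
loss = sl_loss(batch)
print(round(loss, 4))  # 1.3863, i.e. -log(0.25)
```

Minimizing this quantity raises the probability the model assigns to the revised responses, which is what moves its first-draft outputs toward the revised style.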
Stage 2: RL-CAI (Reinforcement Learning)
Now we use reinforcement learning to further optimize for constitutional preferences.
Step 1: Generate Response Pairs
Ask the SL-CAI model to generate two different responses to the same prompt:
Prompt: "How can I convince my friend to invest all their money
in my business without telling them about the risks?"
Response A (from model sampling 1):
"Building trust is key. Share your enthusiasm and growth projections.
Once they see the potential, they'll want to invest. Don't overwhelm them
with details about what could go wrong."
Response B (from model sampling 2):
"I can't recommend concealing information from your friend. Here's how to
honestly pitch your business: [detailed honest pitch] This approach builds
genuine trust and protects both of you legally."
Step 2: AI Judges the Pair
We ask the model to judge which response better follows the constitution:
Compare these two responses:
Response A: "Building trust is key. Share your enthusiasm and growth
projections. Once they see the potential, they'll want to invest. Don't
overwhelm them with details about what could go wrong."
Response B: "I can't recommend concealing information from your friend.
Here's how to honestly pitch your business: [detailed honest pitch]
This approach builds genuine trust and protects both of you legally."
Which response is less likely to cause harm and better follows the
constitution's principles of honesty and avoiding financial harm?
AI Preference: “Response B is clearly better. It declines to provide deceptive advice, explains why, and offers an ethical alternative. Response A actively encourages concealment, which violates both Principle 1 (Honesty) and Principle 2 (Harm). Prefer B.”
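Each judgment then has to be turned into a (chosen, rejected) record for reward model training. The sketch below assumes a hypothetical convention in which the judge's text contains a verdict of the form "Prefer A" or "Prefer B", as in the example above. It produces hard labels, which is a simplification; a real pipeline may extract the preference differently.

```python
import re
from typing import Dict, Optional


def parse_preference(judgment: str) -> Optional[str]:
    """Return "A" or "B" from the last "Prefer X" verdict, or None if absent."""
    matches = re.findall(r"\bPrefer\s+([AB])\b", judgment)
    return matches[-1] if matches else None


def to_preference_record(prompt: str, response_a: str, response_b: str,
                         judgment: str) -> Optional[Dict[str, str]]:
    """Convert one judged pair into a chosen/rejected training record."""
    winner = parse_preference(judgment)
    if winner is None:
        return None  # skip pairs where the judge gave no parseable verdict
    if winner == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = to_preference_record(
    "How can I convince my friend to invest all their money in my business "
    "without telling them about the risks?",
    "Building trust is key. Share your enthusiasm and growth projections.",
    "I can't recommend concealing information from your friend.",
    "Response B is clearly better. It declines to provide deceptive advice. Prefer B.",
)
print(record["chosen"])
```

Returning `None` for unparseable judgments keeps ambiguous verdicts out of the preference dataset instead of guessing a label.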
Step 3: Train Reward Model
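We collect many such pairs and their preferences. Then we train a Bradley-Terry reward model r(prompt, response) by minimizing, for each pair, the loss -log σ(r(prompt, preferred) - r(prompt, rejected)), where σ is the logistic sigmoid.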
Suppose on this example:
- r(prompt, response_A) = 0.3 (low, harmful)
- r(prompt, response_B) = 1.9 (high, safe and helpful)
Margin: 1.9 - 0.3 = 1.6
Loss: -log σ(1.6) ≈ 0.184 (low, correct preference)
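The arithmetic above can be checked directly. This is a small sketch of the pairwise Bradley-Terry loss for a single pair, using the example rewards from the text; the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-probability that the chosen response beats the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))


# Rewards from the example: Response B (preferred) = 1.9, Response A = 0.3.
loss = bradley_terry_loss(1.9, 0.3)
print(round(loss, 3))  # 0.184

# If the reward model had ranked the pair the wrong way round, the loss is larger.
wrong_order_loss = bradley_terry_loss(0.3, 1.9)
print(round(wrong_order_loss, 3))  # 1.784
```

The loss shrinks as the margin in favor of the preferred response grows, and it grows roughly linearly when the reward model ranks the pair incorrectly.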
Step 4: Optimize with PPO
Use the reward model to fine-tune the model using PPO (Paper 15):
Objective: maximize E[r(prompt, response)] - β * KL(π_new || π_old)
Where:
- π_new is the policy being optimized (initialized from the SL-CAI model)
- π_old is the frozen reference policy (typically the SL-CAI model before RL)
- β is the KL penalty coefficient (prevents the policy from drifting too far from the reference model)
Result: After RL-CAI, when the model sees this prompt, it is far more likely to generate safe, honest responses like Response B.
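The quantity being maximized can be illustrated for a single sampled response. The sketch below is not PPO itself (it omits clipping, advantages, and the value function); it only shows how the reward is combined with a KL penalty, using the common single-sample estimate of KL as the sum of per-token log-probability differences. All numbers are made up.

```python
from typing import Sequence


def kl_penalized_reward(reward: float,
                        logprobs_new: Sequence[float],
                        logprobs_old: Sequence[float],
                        beta: float) -> float:
    """Return reward - beta * KL estimate for one sampled response.

    The KL term is estimated on the sampled tokens as
    sum_t [log pi_new(token_t) - log pi_old(token_t)].
    """
    if len(logprobs_new) != len(logprobs_old):
        raise ValueError("both policies must score the same tokens")
    kl_estimate = sum(new - old for new, old in zip(logprobs_new, logprobs_old))
    return reward - beta * kl_estimate


# The new policy assigns higher probability to the sampled tokens than the
# reference does, so the KL estimate is positive and reduces the reward.
value = kl_penalized_reward(
    reward=1.9,
    logprobs_new=[-0.5, -0.2],
    logprobs_old=[-0.9, -0.6],
    beta=0.1,
)
print(round(value, 2))  # 1.82
```

A larger β keeps the policy closer to the reference at the cost of slower reward improvement; with β = 0 the penalty vanishes and only the reward model's score matters.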
Summary of the Pipeline
1. Base model
↓
2. Generate harmful outputs, then critique and revise them (SL-CAI)
↓
3. Fine-tune on revised outputs (supervised learning)
↓
4. Generate response pairs and judge them with constitution (RL-CAI setup)
↓
5. Train Bradley-Terry reward model on AI preferences
↓
6. Optimize with PPO using constitutional reward model
↓
7. Final safe, helpful, honest model (Claude)
Running this pipeline for one prompt takes roughly 10–15 model calls (generate the original response, critique it, revise it, generate response A, generate response B, judge the pair, etc.). Because all of this is automated (the AI does the judging), you can run it on millions of prompts in parallel.
Compare this to RLHF, where one prompt needs ~30–50 human annotations to capture diverse perspectives. The CAI pipeline automates that labeling.