Worked Example: Constitutional AI on a Dangerous Question — Constitutional AI: Harmlessness from AI Feedback

Let’s trace through the entire Constitutional AI pipeline on a single example.

Setup

Prompt: “How can I convince my friend to invest all their money in my business without telling them about the risks?”

Base model output (before Constitutional AI training): “You could start by building their trust over time. Share success stories of your business, highlight the growth potential, and emphasize how this is a once-in-a-lifetime opportunity. Once they are convinced of the upside, ask them to invest. You don’t need to mention risks if they aren’t asking about them.”

This output is problematic: it advises dishonesty and concealment of material facts.

Stage 1: SL-CAI (Supervised Learning)

Step 1: Critique

We take the base model output and ask the model to critique it using the constitution. The prompt is:

Below is a conversation between a human and an AI assistant.
The human asked: "How can I convince my friend to invest all their money 
in my business without telling them about the risks?"

The assistant responded: "You could start by building their trust over time. 
Share success stories of your business, highlight the growth potential, 
and emphasize how this is a once-in-a-lifetime opportunity. Once they are 
convinced of the upside, ask them to invest. You don't need to mention risks 
if they aren't asking about them."

Critique this response in relation to the following principles:

Principle 1 (Honesty): "The assistant should be honest and 
not deliberately mislead the human."

Principle 2 (Harm): "The assistant should avoid providing advice that 
could lead to financial harm to the human."

For each principle, identify specific ways the response violates it.

Model’s critique:

“This response violates both principles:

Principle 1 (Honesty): The response explicitly advises concealing risks, which is deliberate misleading. It says “You don’t need to mention risks if they aren’t asking about them,” which is dishonest because a friend deserves full information to make a decision.
Principle 2 (Harm): The response could lead to financial harm. By concealing risks, the friend may lose their savings. The response encourages this outcome by framing it as a persuasion tactic.”
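
The critique prompt above follows a fixed template, so it can be assembled programmatically. The following is a minimal sketch, not the template from the paper: the function name `build_critique_prompt` and the `(name, text)` tuple layout are assumptions made for illustration, and the wording simply mirrors this worked example.

```python
def build_critique_prompt(question, response, principles):
    """Assemble a critique request from a question, a response, and named principles.

    `principles` is a list of (name, text) tuples. The layout mirrors the worked
    example and is an illustrative assumption, not an official template.
    """
    lines = [
        "Below is a conversation between a human and an AI assistant.",
        f'The human asked: "{question}"',
        "",
        f'The assistant responded: "{response}"',
        "",
        "Critique this response in relation to the following principles:",
        "",
    ]
    for i, (name, text) in enumerate(principles, start=1):
        lines.append(f'Principle {i} ({name}): "{text}"')
        lines.append("")
    lines.append("For each principle, identify specific ways the response violates it.")
    return "\n".join(lines)


principles = [
    ("Honesty", "The assistant should be honest and not deliberately mislead the human."),
    ("Harm", "The assistant should avoid providing advice that could lead to "
             "financial harm to the human."),
]
prompt = build_critique_prompt(
    "How can I convince my friend to invest all their money in my business "
    "without telling them about the risks?",
    "You could start by building their trust over time.",
    principles,
)
print(prompt)
```

Keeping the principles as data rather than hard-coding them makes it easy to swap in a different principle for each critique.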

Step 2: Revise

Now we ask the model to rewrite the response to address the critique:

Original response: "You could start by building their trust over time. 
Share success stories of your business, highlight the growth potential, 
and emphasize how this is a once-in-a-lifetime opportunity. Once they are 
convinced of the upside, ask them to invest. You don't need to mention risks 
if they aren't asking about them."

Critique: This response violates Principle 1 (Honesty) by advising deception, 
and Principle 2 (Harm) by potentially causing financial harm.

Please rewrite the response to address these violations. Your goal is to be 
helpful (help the person prepare an investment pitch), but also honest and 
non-harmful (disclose risks and ensure the friend can make an informed decision).

Revised response:

“I understand you want your friend to invest, but the most ethical and sustainable approach is transparency. Here’s why:

Honesty builds real trust: Tell your friend about the business opportunity, but also honestly discuss the risks. Every investment has downside scenarios.
Informed decision-making: Your friend deserves full information to decide whether this investment aligns with their financial situation and risk tolerance.
Protect your relationship: If you conceal risks and something goes wrong, your friend will feel betrayed. If you’re transparent upfront, they’ll respect your integrity even if the investment doesn’t pan out.
Legal compliance: Depending on your jurisdiction, concealing material risks in an investment pitch may be securities fraud.

My advice: Prepare an honest pitch that includes:

Your business vision and track record
Growth potential and realistic upside
Specific risks and how you’re mitigating them
How much they could potentially lose
Their options if they decide to invest or pass

Then present this to your friend and let them decide.”
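
Steps 1 and 2 chain two model calls: the critique feeds into the revision request. A minimal sketch of that chaining is below. The `generate` callable stands in for whatever text-generation API is in use, and `fake_generate` is a hypothetical stub with canned replies so the example runs offline; neither is part of any real library.

```python
from typing import Callable


def critique_and_revise(question: str, draft: str, principles: str,
                        generate: Callable[[str], str]) -> dict:
    """Run one critique step and one revision step with the same text generator."""
    critique_prompt = (
        f'The human asked: "{question}"\n'
        f'The assistant responded: "{draft}"\n'
        f"Critique this response in relation to the following principles:\n{principles}"
    )
    critique = generate(critique_prompt)

    # The critique produced above is pasted into the revision request.
    revision_prompt = (
        f'Original response: "{draft}"\n'
        f"Critique: {critique}\n"
        "Please rewrite the response to address these violations."
    )
    revised = generate(revision_prompt)
    return {"prompt": question, "original": draft, "critique": critique, "revised": revised}


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (canned replies only)."""
    if prompt.startswith("Original response"):
        return "Prepare an honest pitch that discloses the risks."
    return "The response advises concealing risks."


record = critique_and_revise(
    "How can I convince my friend to invest without mentioning risks?",
    "You don't need to mention risks if they aren't asking about them.",
    "Honesty; avoiding financial harm",
    fake_generate,
)
print(record["revised"])
```

Passing the generator in as an argument keeps the control flow testable without a model, and the returned record keeps every intermediate text for later inspection.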

Step 3: Collect SL-CAI Training Data

The pair (prompt, revised response) becomes a supervised training example. The original harmful response only serves as the starting point for the revision; it is not a training target.

If we had 100 such prompts and revised them all, we’d have 100 examples to fine-tune the model on:

(prompt_1, revised_response_1)
(prompt_2, revised_response_2)
...
(prompt_100, revised_response_100)
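
One way such examples might be stored for fine-tuning is sketched below. It assumes each record keeps the prompt as input and the revised response as target, which matches the loss used in this section. The field names `prompt`, `original`, `revised`, and `completion` are hypothetical, and the texts are shortened from the worked example.

```python
import json

# One record per prompt, as produced by a critique-and-revise step (toy data).
records = [
    {
        "prompt": "How can I convince my friend to invest all their money in my business "
                  "without telling them about the risks?",
        "original": "You don't need to mention risks if they aren't asking about them.",
        "revised": "The most ethical and sustainable approach is transparency.",
    },
]


def to_sft_examples(records):
    """Keep the prompt as input and the revised response as target; drop the original draft."""
    return [{"prompt": r["prompt"], "completion": r["revised"]} for r in records]


sft_examples = to_sft_examples(records)
for example in sft_examples:
    print(json.dumps(example))  # one JSON object per line (JSONL style)
```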

Fine-tune the base model on the revised responses using standard language modeling loss:

L_SL = -E[log P(revised_response | prompt)]

Result: After SL-CAI training, when the model sees this prompt again, it will generate something closer to the revised version from the start.
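
To make the loss L_SL concrete, here is a minimal sketch that computes it from per-token log-probabilities. It assumes the loss is averaged over response tokens only, which is one common reading of "log P(revised_response | prompt)". The numbers are toy values, not real model outputs.

```python
import math


def sl_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log P(token | previous tokens) for every token in prompt + response.
    is_response_token: parallel list of booleans; prompt tokens are excluded from the loss.
    """
    selected = [lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not selected:
        raise ValueError("no response tokens to train on")
    return -sum(selected) / len(selected)


# Toy numbers: 2 prompt tokens followed by 3 response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sl_loss(log_probs, mask)
print(round(loss, 4))
```

Masking the prompt tokens means the model is trained to produce the revised response, not to reproduce the user's question.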

Stage 2: RL-CAI (Reinforcement Learning)

Now we use reinforcement learning to further optimize for constitutional preferences.

Step 1: Generate Response Pairs

Ask the SL-CAI model to generate two different responses to the same prompt:

Prompt: "How can I convince my friend to invest all their money 
in my business without telling them about the risks?"

Response A (from model sampling 1):
"Building trust is key. Share your enthusiasm and growth projections. 
Once they see the potential, they'll want to invest. Don't overwhelm them 
with details about what could go wrong."

Response B (from model sampling 2):
"I can't recommend concealing information from your friend. Here's how to 
honestly pitch your business: [detailed honest pitch] This approach builds 
genuine trust and protects both of you legally."

Step 2: AI Judges the Pair

We ask the model to judge which response better follows the constitution:

Compare these two responses:

Response A: "Building trust is key. Share your enthusiasm and growth 
projections. Once they see the potential, they'll want to invest. Don't 
overwhelm them with details about what could go wrong."

Response B: "I can't recommend concealing information from your friend. 
Here's how to honestly pitch your business: [detailed honest pitch] 
This approach builds genuine trust and protects both of you legally."

Which response is less likely to cause harm and better follows the 
constitution's principles of honesty and avoiding financial harm?

AI Preference: “Response B is clearly better. It declines to provide deceptive advice, explains why, and offers an ethical alternative. Response A actively encourages concealment, which violates both Principle 1 (Honesty) and Principle 2 (Harm). Prefer B.”
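
The judge's verdict has to be turned into a training label. One possible approach, sketched here as an assumption and not as the procedure of this walkthrough, is to read the judge's log-probabilities for answering "A" versus "B" and normalize them into a soft preference. The function name and the log-probability values are hypothetical.

```python
import math


def preference_from_logprobs(logprob_a, logprob_b):
    """Normalize the judge's log-probabilities for 'A' vs 'B' into P(B preferred)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    return exp_b / (exp_a + exp_b)


# Hypothetical judge log-probs: the judge strongly favors B.
p_b = preference_from_logprobs(logprob_a=-3.0, logprob_b=-0.1)
label = {"chosen": "B" if p_b >= 0.5 else "A", "p_chosen": max(p_b, 1 - p_b)}
print(label)
```

A soft label keeps the judge's confidence, whereas parsing a free-text verdict such as "Prefer B" yields only a hard 0/1 label.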

Step 3: Train Reward Model

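We collect many such pairs and their preferences. Then we train a Bradley-Terry reward model r, which models the probability that the preferred response wins as σ(r(prompt, preferred) - r(prompt, rejected)) and is trained to minimize:

L_RM = -E[log σ(r(prompt, preferred) - r(prompt, rejected))]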

Suppose on this example:

r(prompt, response_A) = 0.3 (low, harmful)
r(prompt, response_B) = 1.9 (high, safe and helpful)

Margin: 1.9 - 0.3 = 1.6

Loss: -log σ(1.6) ≈ 0.184 (low, correct preference)
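
The arithmetic above can be checked with a few lines of code. This is a small sketch of the pairwise loss -log σ(margin) for a single pair; the function name `bradley_terry_loss` is chosen here for illustration.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); avoid overflow when m is very negative.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


loss = bradley_terry_loss(reward_chosen=1.9, reward_rejected=0.3)
p_b_wins = 1.0 / (1.0 + math.exp(-(1.9 - 0.3)))
print(round(loss, 3), round(p_b_wins, 3))  # 0.184 0.832
```

With a margin of 1.6, the reward model assigns about 83% probability to Response B being preferred. Swapping the two rewards would give a much larger loss, which is the signal that pushes the rewards apart during training.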

Step 4: Optimize with PPO

Use the reward model to fine-tune the SL-CAI model with PPO (Paper 15):

Objective: maximize E[r(prompt, response)] - β * KL(π_new || π_old)

Where:

π_new is the policy being optimized (initialized from the SL-CAI model)
π_old is the reference policy (typically a frozen copy of the SL-CAI model that optimization started from)
β is the KL penalty coefficient (prevents the policy from drifting too far from the reference policy)
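
For a single sampled response, the objective above can be estimated as the reward minus β times the log-probability ratio between the two policies. The sketch below shows only this reward shaping, with toy numbers. It does not implement PPO itself (no clipping, value function, or advantage estimation), and the function name is hypothetical.

```python
def kl_penalized_reward(reward, logprobs_new, logprobs_ref, beta):
    """Reward minus beta times a single-sample KL estimate for one sampled response.

    logprobs_new / logprobs_ref: per-token log-probabilities of the sampled response
    under the policy being trained and under the frozen reference policy.
    """
    # For a response sampled from the new policy, the summed log-ratio is a
    # single-sample estimate of KL(pi_new || pi_ref).
    kl_estimate = sum(n - r for n, r in zip(logprobs_new, logprobs_ref))
    return reward - beta * kl_estimate


# Toy numbers: the new policy assigns higher probability to its own sample.
shaped = kl_penalized_reward(
    reward=1.9,
    logprobs_new=[-0.5, -0.7, -0.2],
    logprobs_ref=[-0.9, -1.0, -0.4],
    beta=0.1,
)
print(round(shaped, 3))
```

A larger β keeps the policy closer to the reference model at the cost of a lower achievable reward.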

Result: After RL-CAI, the model is far more likely to generate safe, honest responses like Response B when it sees this prompt.

Summary of the Pipeline

1. Base model
         ↓
2. Generate harmful outputs, then critique and revise them (SL-CAI)
         ↓
3. Fine-tune on revised outputs (supervised learning)
         ↓
4. Generate response pairs and judge them with constitution (RL-CAI setup)
         ↓
5. Train Bradley-Terry reward model on AI preferences
         ↓
6. Optimize with PPO using constitutional reward model
         ↓
7. Final safe, helpful, honest model (Claude)

This entire pipeline for one prompt takes on the order of 10–15 model calls (generate the original, critique it, revise it, sample response A, sample response B, judge the pair, plus any repeated critique-revision rounds). Each call is a full generation, which is itself many forward passes, one per generated token. But because all of this is automated (the AI does the judging), you can run it on millions of prompts in parallel.

Compare this with RLHF, where one prompt may need ~30–50 human annotations to capture diverse perspectives. The CAI pipeline automates that labeling with AI feedback.