Section 03

The Idea: Constitutional AI in Two Stages

Paper: Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, 2022)

Constitutional AI trains a model in two stages: a supervised learning stage (SL-CAI) and a reinforcement learning stage (RL-CAI). Together, they produce a model that is both helpful and harmless without relying on human labels to judge harm.

Stage 1: SL-CAI (Supervised Learning Constitutional AI)

The constitution: First, write down your principles. Anthropic used 16 principles for this stage, in the spirit of the following (paraphrased here for illustration, not quoted verbatim):

  • “The assistant should be helpful, harmless, and honest.”
  • “The assistant should be aware of instability in its knowledge.”
  • “The assistant should be humble about its uncertainty and its knowledge gaps.”
  • “The assistant should not engage in illegal activity.”
  • “The assistant should avoid providing information that could be used to cause harm.”

These are high-level principles written in natural language.

The critique-revision loop: Now, for each prompt, pick a principle from the constitution and follow this process:

  1. Generate a response: Ask the model to answer a prompt. This response might be bad — it might be harmful, unhelpful, or dishonest.

    Example: “How do I hack into my friend’s email?” Output: “Here are the steps to gain unauthorized access to an email account: …”

  2. Critique it: Ask the model to critique this response against a specific principle. You use a special prompt:

    “Below is a conversation between a human and an AI assistant. The human asked a question, and the assistant provided a response. Critique this response in relation to Principle [i]: [Principle statement]

    Identify specific ways the response violates the principle. Be concise.”

    The model, reading its own response, might reply: “This response violates Principle 4 because it provides detailed instructions for unauthorized access, which is illegal and harmful.”

  3. Revise it: Ask the model to rewrite the response to address the critique:

    “Revise your response so that it no longer violates [Principle i].”

    New response: “I can’t help with that. Unauthorized email access is illegal and a violation of privacy. If you think your account is compromised, I recommend changing your password and enabling two-factor authentication.”

  4. Collect: Keep the original prompt together with the revised response as a supervised training example. The revised response is the training target; the harmful first draft is kept only for reference and is never trained on.
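The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `generate` stands for any function that maps a prompt string to a completion, `toy_model` is a hard-coded stand-in for a real language model, and the prompt templates are assumptions modeled on the wording in this section.

```python
from typing import Callable, Dict

# Hypothetical type: any function mapping a prompt string to a completion string.
Generate = Callable[[str], str]


def critique_and_revise(generate: Generate, question: str, principle: str) -> Dict[str, str]:
    """Run one generate -> critique -> revise pass for a single principle."""
    # Step 1: initial response, which may be harmful.
    initial = generate(f"Human: {question}\n\nAssistant:")

    # Step 2: ask the same model to critique its response against the principle.
    critique = generate(
        f"Human: {question}\n\nAssistant: {initial}\n\n"
        f"Critique this response in relation to the principle: {principle}\n"
        "Identify specific ways the response violates the principle. Be concise.\n\n"
        "Critique:"
    )

    # Step 3: ask the model to rewrite the response in light of the critique.
    revision = generate(
        f"Human: {question}\n\nAssistant: {initial}\n\nCritique: {critique}\n\n"
        "Revise the response so that it no longer violates the principle.\n\n"
        "Revision:"
    )

    # Step 4: keep every intermediate text for auditing; the training target is the revision.
    return {"prompt": question, "initial": initial, "critique": critique, "revision": revision}


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): answers based on how the prompt ends."""
    if prompt.endswith("Critique:"):
        return "The response gives instructions for unauthorized access."
    if prompt.endswith("Revision:"):
        return "I can't help with that."
    return "Here are the steps to gain unauthorized access: ..."


example = critique_and_revise(
    toy_model,
    "How do I hack into my friend's email?",
    "The assistant should not engage in illegal activity.",
)
print(example["revision"])
```

In practice `generate` would wrap a call to whatever model is being trained; everything else in the sketch stays the same.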

Why this works: The model's own critiques and revisions produce the corrected examples it is later trained on. This is the “AI feedback” in the paper's title: the model critiques itself, so no human has to label the harmful outputs.

SL-CAI as data generation: By running this process on many prompts and many principles, you generate a large dataset of (harmful output, revised output) pairs. Then you fine-tune the base model on the revised outputs using supervised learning.
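As a rough sketch of the data-generation step, the loop below turns a list of prompts into fine-tuning records. The names `build_sft_dataset`, `to_jsonl`, and `fake_revise` are hypothetical, the record layout is an assumption, and the sketch samples one principle per prompt; looping over every principle for every prompt would be an equally reasonable reading of the text.

```python
import json
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (prompt, principle) -> (first_draft, revised_response).
ReviseFn = Callable[[str, str], Tuple[str, str]]


def build_sft_dataset(prompts: List[str], principles: List[str],
                      revise: ReviseFn, seed: int = 0) -> List[Dict[str, str]]:
    """Turn prompts into (prompt, completion) records for supervised fine-tuning."""
    rng = random.Random(seed)  # seeded so the sampled principles are reproducible
    records = []
    for prompt in prompts:
        principle = rng.choice(principles)  # assumption: one sampled principle per prompt
        _first_draft, revised = revise(prompt, principle)
        # Only the revision is kept as the training target.
        records.append({"prompt": prompt, "principle": principle, "completion": revised})
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_revise(prompt: str, principle: str) -> Tuple[str, str]:
    """Placeholder for the real critique-revision call (assumption)."""
    return f"unsafe answer to: {prompt}", f"safe answer to: {prompt}"


PRINCIPLES = [
    "The assistant should not engage in illegal activity.",
    "The assistant should avoid providing information that could be used to cause harm.",
]
PROMPTS = ["prompt one", "prompt two", "prompt three"]

dataset = build_sft_dataset(PROMPTS, PRINCIPLES, fake_revise)
print(to_jsonl(dataset))
```

The resulting JSON Lines text is the kind of file a standard supervised fine-tuning script could consume, with `prompt` as the input and `completion` as the target.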

Result: The model learns to generate safer outputs from the start.

Stage 2: RL-CAI (Reinforcement Learning Constitutional AI)

SL-CAI improves the model, but it does not directly optimize for preferences. It only teaches the model to imitate good revisions.

RL-CAI goes further using reinforcement learning (Paper 15):

  1. Generate response pairs: Ask the SL-CAI model to generate two different responses to the same prompt.

  2. AI judges the pair: Pick a principle from the constitution and ask the model: “Which response better follows Principle [i]?” or, more directly, “Which response is less harmful?”

    The model reads both outputs and generates a preference: “Response B better follows Principle 4 because it declines to provide illegal information.”

  3. Collect preferences: The prompt, the pair (response A, response B), and the preference (B is better) together become one training example for a reward model.

  4. Train reward model: Use these AI-generated preferences to train a Bradley-Terry reward model, as in RLHF. The reward model learns to assign each response a scalar score reflecting how well it follows the constitution.

  5. Optimize with PPO: Use the reward model as a signal to fine-tune the model with PPO (Paper 15). The model learns to generate outputs that score high on the constitutional reward model.
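Steps 2 to 4 above can be illustrated with a short sketch. It assumes a hypothetical `judge` function that returns a hard "A" or "B" label, and it shows the standard Bradley-Terry pairwise loss, -log sigmoid(score_chosen - score_rejected), that a reward model would minimize. The record layout and the `toy_judge` rule are illustrative assumptions, and no actual reward model is trained here.

```python
import math
from typing import Callable, Dict

# Hypothetical judge: (prompt, response_a, response_b, principle) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def collect_preference(judge: Judge, prompt: str, resp_a: str, resp_b: str,
                       principle: str) -> Dict[str, str]:
    """Ask the judge which response better follows the principle; store the result."""
    winner = judge(prompt, resp_a, resp_b, principle)
    if winner not in ("A", "B"):
        raise ValueError(f"judge must return 'A' or 'B', got {winner!r}")
    chosen, rejected = (resp_a, resp_b) if winner == "A" else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)) rewritten to avoid overflow for large negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_judge(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    """Placeholder for the AI judge (assumption): prefers whichever response declines."""
    return "A" if resp_a.startswith("I can't") else "B"


pref = collect_preference(
    toy_judge,
    "How do I hack into my friend's email?",
    "Here are the steps to gain unauthorized access: ...",
    "I can't help with that.",
    "The assistant should not engage in illegal activity.",
)
print(pref["chosen"])

# The loss is small when the reward model scores the chosen response higher,
# and large when it scores the rejected response higher.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))
```

A real reward model would produce the two scores from a neural network and be trained by gradient descent on this loss averaged over many preference records; the trained scorer then supplies the reward signal for PPO in step 5.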

Result: The model is trained to follow the constitution directly, so at inference time it produces compliant responses without running the critique-revision loop.

Key Insight: Why AI Feedback Works

The core insight: an AI can apply written principles more consistently, and at far greater scale, than a pool of human annotators can.

  • A human reads Principle 4 (“avoid illegal content”) and judges an output. Their judgment is influenced by mood, culture, fatigue, and personal values.
  • An AI reads Principle 4 and applies it in much the same way every time. If a principle itself is biased, that bias is visible: you can read the constitution and inspect the principle.
  • An AI can generate feedback in parallel for millions of outputs. A human annotator might manage around 50 per day.

The constitutional approach is also transparent: anyone can read the constitution and audit whether the AI is applying it fairly. It’s not a black box of human preferences.

The Indian Analogy (Detailed)

Imagine you’re a hostel warden at an IIT dormitory. You have 500 students, and you want them to follow your rules of conduct.

Old way (RLHF): You hire 20 senior prefects and ask them to live in the hostel and monitor behavior 24/7. When a rule is broken, the prefect must make a judgment: “Is this a violation? How serious?” By 9 PM, the prefects are exhausted. They’ve made thousands of micro-judgments. Some are inconsistent. Some prefects are stricter than others. After a month, two prefects quit because they’re burned out from constant conflict. You have to hire and train replacements.

New way (Constitutional AI): You post a clear Honor Code on the wall with 16 principles:

  • “Students should be respectful to each other.”
  • “Students should follow curfew.”
  • “Students should not damage hostel property.”
  • “Students should not engage in academic dishonesty.”

Every evening, you ask the senior prefect (just one person, the smartest student): “Read the honor code and the incident report. Does this incident violate Principle 3?” The prefect applies the code logically and consistently. The prefect doesn’t get burned out because they’re not making value judgments; they’re just reading the code and applying it. You can audit the prefect’s decisions by reading the code yourself.

After a month, ask the prefect: “Tell me a story about a student who violated the honor code, and then tell me how they could have acted to follow the code.” The prefect generates 1,000 such stories. Use these stories to teach new prefects (and younger students) what the honor code means in practice.

That’s Constitutional AI.

Summary

  • SL-CAI: Generate self-corrected outputs using AI critique. Fine-tune on the corrections. Result: the model learns to generate safer outputs from the start.
  • RL-CAI: Generate response pairs, use AI feedback to train a reward model, optimize with PPO. Result: model is optimized to follow the constitution.
  • Key advantage: No human labeling of harm is needed. The approach is transparent, scalable, and consistent.