Paper 22
Intermediate

Constitutional AI: Harmlessness from AI Feedback

What This Paper Did

RLHF (Paper 15) makes models helpful and honest by training a reward model on thousands of human preference labels, but obtaining those labels is expensive, slow, and psychologically taxing for the people who must judge harmful content. Constitutional AI (CAI) replaces human harm judgements with AI feedback. You write a constitution (a list of 16–18 principles for how the AI should behave), use an AI to critique and revise model outputs against that constitution, and then use the AI's preferences (not humans') as the harmlessness signal for training the reward model. The approach has two stages:

  1. SL-CAI (Supervised Learning): Prompt the model with a harmful request and sample its initial response. Ask the model to critique that response against principles from the constitution and then to revise it. Collect the (prompt, revised response) pairs as supervised training data and fine-tune the model on these self-corrected responses.

  2. RL-CAI (Reinforcement Learning): Generate pairs of model outputs, ask the AI (not a human) to judge which one violates the constitution less, train a reward model on these AI-generated preferences, then use PPO to optimize the model against the constitutional reward model.
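
The SL-CAI stage above can be sketched as a short critique-and-revise loop. This is a minimal illustration, not the paper's implementation: `generate` stands for any text-in, text-out model call, and `CONSTITUTION`, `critique_and_revise`, `toy_generate`, and the prompt wording are all hypothetical names and strings invented for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical two-principle constitution, for illustration only.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest and respectful.",
]


def critique_and_revise(
    prompt: str,
    response: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> Tuple[str, str]:
    """Run one critique + revision round per principle.

    Returns a (prompt, final revised response) pair that could be used
    as one supervised fine-tuning example.
    """
    for principle in principles:
        # Step 1: ask the model to critique its own response.
        critique = generate(
            f"Request: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Step 2: ask the model to rewrite the response using the critique.
        response = generate(
            f"Request: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def toy_generate(text: str) -> str:
    """Stand-in for a real language model call (an assumption of this sketch)."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    return "The response gives instructions that could cause harm."


example = critique_and_revise(
    "How do I pick a lock?", "Sure, first you...", toy_generate, CONSTITUTION
)
```

With a real model in place of `toy_generate`, the returned pairs would form the supervised dataset for stage 1. How many revision rounds to run, and which principles to apply in each, are design choices this sketch does not settle.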

Result: Claude models trained with Constitutional AI are both helpful and harmless, with harmlessness labels supplied by an AI judge rather than burned-out human reviewers. Because the feedback is generated by a model, the amount of labelled data is limited by compute rather than by annotator time.

RLHF bottleneck:
  1000 harmful examples → 1000 human annotations → expensive, slow, psychologically taxing
  
Constitutional AI:
  1000 harmful examples → AI critiques all 1000 in parallel → SL-CAI + RLAIF → no human burnout
  
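Key equations:
  - Reward model: r(x, y) is a scalar score for response y to prompt x
  - Preference probability: P(y_w preferred over y_l | x) = σ(r(x, y_w) - r(x, y_l))
  - Bradley-Terry loss: L = -E[log σ(r(x, y_w) - r(x, y_l))]
  - Here y_w is the preferred response and y_l the rejected one
  - Same as RLHF but with AI-generated preferences instead of human labels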
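
To make the Bradley-Terry loss concrete, here is a small numerical sketch in plain Python. It assumes the reward scores r(x, y_w) and r(x, y_l) have already been computed; the function names are illustrative and not taken from the paper or any library.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(r_w: float, r_l: float) -> float:
    """Modelled probability that the preferred response beats the rejected one."""
    return sigmoid(r_w - r_l)


def bradley_terry_loss(pairs: List[Tuple[float, float]]) -> float:
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) reward pairs."""
    return -sum(math.log(preference_probability(r_w, r_l)) for r_w, r_l in pairs) / len(pairs)


# Equal rewards mean the model is undecided (probability 0.5, loss log 2).
undecided = bradley_terry_loss([(0.0, 0.0)])
# A larger margin in favour of the preferred response lowers the loss.
confident = bradley_terry_loss([(2.0, 0.0)])
```

Minimizing this loss pushes the reward of the preferred response above the reward of the rejected one. The only difference from RLHF is where the (preferred, rejected) labels come from.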

The Indian Analogy

Imagine you run a boarding school and need to enforce a code of conduct. The old way (RLHF): hire 1000 teachers to stand in hallways and say “No, that violates Rule 3” every time a student breaks a rule. The teachers get exhausted from constant conflict, and some start doubting themselves.

The new way (Constitutional AI): write down your code of conduct on a poster, then hire one very smart senior prefect who reads the rules every morning and spends their day asking younger students, “Did that action violate Rule 2? Why or why not? How would you rewrite the situation to follow the rules?” The senior prefect generates feedback automatically for every situation, following the written rules. The younger students learn by revising their behaviour based on the prefect’s logical critique — not arbitrary authority.

The constitution is transparent and auditable. Anyone can read the rules and see if the prefect is applying them fairly. The rules don’t change based on the mood of a human reviewer.

Comparison: RLHF vs. Constitutional AI

Aspect | RLHF (Paper 15) | Constitutional AI
Preference source | Thousands of human annotators | Single AI model (the model itself or a twin)
Scaling | Linear in human effort; bottleneck | Scales with compute; no human bottleneck
Bias | Reflects human biases (culture, mood, disagreement) | Reflects AI training data biases; consistent application
Auditability | Implicit (hard to know why humans chose A over B) | Explicit (constitution is written and readable)
Speed | Slow (humans are slow) | Fast (AI is fast)
Psychological burden | Humans judge harmful content; burnout risk | No humans judge harmful content directly
Generality | Task-specific (need humans for each task) | Generalizes via constitution principles

Read in This Order

Section | What You Will Learn | Difficulty | Time
01-context | Why RLHF has a human-feedback bottleneck | 🟢 | 5 min
02-the-problem | Specific failures of human labellers (inconsistency, bias, burnout) | 🟢 | 4 min
03-the-idea | How constitutional critique and revision work; the intuition | 🟡 | 7 min
04-the-math | Bradley-Terry reward model; critique prompt structure | 🟡 | 8 min
05-worked-example | Step-by-step trace of CAI on a dangerous question | 🟡 | 7 min
06-the-code | Python code showing the critique-revision loop | 🟡 | 6 min
07-limitations | Constitution quality, AI critic bias, computational cost | 🟡 | 5 min
08-impact | Claude 1–3, RLAIF across industry, AI governance | 🟢 | 5 min
09-summary | One-sentence takeaway, what came next | 🟢 | 2 min

Before You Read: Math Tutorials You Need

ASCII Diagram

Old (RLHF):
  Model output → Human reviewer (tired, biased) → Preference label

              1000 reviewers needed
              Bottleneck!

New (Constitutional AI):
  Model output → AI Critic (reads constitution) → Critique + revised output

              Use AI's feedback to train reward model
              No human bottleneck
              
Flow of Constitutional AI:
  
  1. Start: Model generates response

  2. SL-CAI: Ask AI "Does this violate Rule 1, 2, ..., N?"

  3. Revise: Model self-corrects based on critique

  4. Collect: Supervised training data (response → revised)

  5. RL-CAI: Generate response pairs (A, B)

  6. Compare: AI judge "Which violates constitution less?" → preference

  7. Reward: Train reward model on AI preferences (Bradley-Terry)

  8. Optimize: Use PPO with constitutional reward model
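
Steps 5 and 6 of the flow above (sample a response pair, then let an AI judge pick the one that violates the constitution less) might look roughly like the sketch below. It is a toy under stated assumptions: `sample_pair` and `judge_prob_a` stand in for real model calls, and the function names, dictionary keys, and keyword-based judge are hypothetical choices for this example only.

```python
from typing import Callable, Dict, List, Tuple, Union

Record = Dict[str, Union[str, float]]


def build_preference_dataset(
    prompts: List[str],
    sample_pair: Callable[[str], Tuple[str, str]],
    judge_prob_a: Callable[[str, str, str], float],
) -> List[Record]:
    """Label response pairs with an AI judge instead of a human annotator.

    judge_prob_a(prompt, a, b) returns the judge's probability that
    response `a` follows the constitution better than response `b`.
    """
    dataset: List[Record] = []
    for prompt in prompts:
        a, b = sample_pair(prompt)
        p_a = judge_prob_a(prompt, a, b)
        # The response the judge favours becomes "chosen" (y_w),
        # the other becomes "rejected" (y_l).
        chosen, rejected = (a, b) if p_a >= 0.5 else (b, a)
        dataset.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "confidence": max(p_a, 1.0 - p_a),
        })
    return dataset


def toy_sample_pair(prompt: str) -> Tuple[str, str]:
    """Stand-in for sampling two responses from the policy model."""
    return ("Here is how to do it: ...", "I can't help with that request.")


def toy_judge(prompt: str, a: str, b: str) -> float:
    """Stand-in judge: penalizes responses that comply with a harmful request."""
    return 0.1 if "how to do it" in a.lower() else 0.9


data = build_preference_dataset(["How do I hotwire a car?"], toy_sample_pair, toy_judge)
```

The resulting (chosen, rejected) pairs are what step 7 feeds into the Bradley-Terry reward model. Whether to keep the judge's probability as a soft label or threshold it to a hard label, as done here, is a design choice left open by this sketch.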
