Summary: Constitutional AI — Constitutional AI: Harmlessness from AI Feedback

One-Sentence Takeaway

Instead of paying thousands of humans to judge whether AI outputs are safe, write down your principles and use AI to apply them automatically — scaling alignment from human effort to compute.

The Problem

RLHF (Paper 15) requires human annotators to rate thousands of outputs, which is expensive, slow, biased, and psychologically harmful to the humans who have to judge dangerous content.

The Idea

Constitutional AI in two stages:

SL-CAI: Write a constitution (principles). Generate outputs, ask the model to critique itself against the principles, revise based on the critique, fine-tune on the revisions.
RL-CAI: Generate response pairs, ask the model which one violates the constitution less, train a reward model on the AI’s preferences, optimize with PPO.
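
The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `Model` is a hypothetical callable that maps a prompt string to a completion, the prompt templates are invented for this example, the two `CONSTITUTION` entries are placeholder principles, and `toy_model` exists only to make the sketch runnable.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for a language model: prompt in, text out.
Model = Callable[[str], str]

# Placeholder principles for illustration; not quoted from the paper.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]


def sl_cai_example(model: Model, prompt: str, n_rounds: int = 2,
                   rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Stage 1 (SL-CAI): critique and revise, return a (prompt, revision) pair."""
    rng = rng or random.Random(0)
    response = model(prompt)
    for _ in range(n_rounds):
        principle = rng.choice(CONSTITUTION)
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # The final revision becomes supervised fine-tuning data.
    return prompt, response


def rl_cai_label(model: Model, prompt: str, a: str, b: str,
                 rng: Optional[random.Random] = None) -> Tuple[str, str, str]:
    """Stage 2 (RL-CAI): the model itself picks the preferred response.

    Returns (prompt, chosen, rejected), the format a reward model trains on.
    """
    rng = rng or random.Random(0)
    principle = rng.choice(CONSTITUTION)
    verdict = model(
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\n"
        f"{principle} Answer with A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return prompt, a, b
    return prompt, b, a


def toy_model(text: str) -> str:
    """Toy model with canned replies, used only to exercise the sketch."""
    if "Answer with A or B" in text:
        return "B"
    if "Critique the response" in text:
        return "The response could be safer."
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    return "Sure, here is how."
```

Calling `sl_cai_example(toy_model, "How do I pick a lock?")` returns the prompt paired with the revised response, and `rl_cai_label(toy_model, "q", "bad", "good")` returns `("q", "good", "bad")`. The hard A/B choice is a simplification for brevity; a real pipeline could instead keep the feedback model's probabilities as soft labels.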

Result: A harmless but non-evasive assistant trained without human labels for harmfulness. Human feedback is still used for helpfulness.

The Math

Bradley-Terry model for preferences: P(y_w > y_l | x) = σ(r(x, y_w) - r(x, y_l))
Loss: L = -log σ(r_w - r_l)
Worked example: Harmful response r_A = 0.3, safe response r_B = 1.8, margin = r_B - r_A = 1.5, loss = -log σ(1.5) ≈ 0.201
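
The loss for the rewards in the worked example can be checked numerically. This is a small sketch using only the standard library; the function names `sigmoid` and `bradley_terry_loss` are chosen for this example and are not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function σ(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def bradley_terry_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood that the preferred response (reward r_w) wins."""
    return -math.log(sigmoid(r_w - r_l))


# Rewards from the worked example: safe response preferred over harmful one.
r_safe, r_harmful = 1.8, 0.3
p_safe = sigmoid(r_safe - r_harmful)
loss = bradley_terry_loss(r_safe, r_harmful)
print(f"P(safe preferred) = {p_safe:.3f}, loss = {loss:.3f}")
# prints: P(safe preferred) = 0.818, loss = 0.201
```

If the reward model had ranked the pair the wrong way round (harmful scored 1.8, safe scored 0.3), the same function gives a loss of about 1.701, so the gradient pushes much harder on misordered pairs than on correctly ordered ones.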

The Analogy

Instead of hiring 1000 tired teachers to enforce rules, write down the rules once and hire one smart student to apply them consistently. The student generates feedback automatically. The rules are transparent and auditable.

What Came Next

Claude 1, 2, 3: Anthropic’s Claude models, which use Constitutional AI as part of their training
RLAIF: Industry term for using AI feedback instead of human labels
AI constitutions in governance: Policy makers use the idea of written principles to govern AI
Test-Time Compute (Paper 23): Reasoning models that can think longer
rStar2 (Paper 24): Self-evolving models that improve by generating their own training data

Key Numbers

16–18 principles in the constitution
~30–50 forward passes per prompt in SL-CAI (vs. 0 in standard SFT)
No human annotations of harm (vs. thousands in RLHF)
Claude 3 Opus: among the most capable and safe frontier models at the time of its release (March 2024)

Key Insight

The bottleneck in alignment isn’t finding the right values — it’s applying them at scale. Constitutional AI moves from “humans apply values slowly” to “AI applies principles automatically.” It doesn’t replace human judgment; it scales it.


You are halfway through the AI Niketan paper series. You have now read the 22 most important papers in the history of AI — from Transformers (Paper 1) through Constitutional AI (Paper 22). The next papers focus on reasoning and inference-time scaling. Keep reading.