One-Sentence Takeaway
Instead of paying thousands of humans to judge whether AI outputs are safe, write down your principles and use AI to apply them automatically — scaling alignment from human effort to compute.
The Problem
RLHF (Paper 15) requires human annotators to rate thousands of outputs, which is expensive, slow, biased, and psychologically harmful to the humans who have to judge dangerous content.
The Idea
Constitutional AI in two stages:
- SL-CAI: Write a constitution (principles). Generate outputs, ask the model to critique itself against the principles, revise based on the critique, fine-tune on the revisions.
- RL-CAI: Generate response pairs, ask the model which one violates the constitution less, train a reward model on the AI’s preferences, optimize with PPO.
Result: Safe, helpful, honest models with no human labels for harmfulness (in the original paper, human feedback is still used for helpfulness).
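The two stages can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Model` type, the prompt templates, the function names, and `toy_model` (a canned stand-in so the sketch runs offline) are all assumptions made for this example. A real pipeline would call an actual language model, fine-tune on the stage 1 pairs, and train a reward model on the stage 2 pairs.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
Model = Callable[[str], str]


def sl_cai_revise(model: Model, prompt: str, principles: List[str],
                  n_rounds: int = 2, seed: int = 0) -> Tuple[str, str]:
    """Stage 1 (SL-CAI): return a (prompt, revised_response) fine-tuning pair."""
    rng = random.Random(seed)
    response = model(prompt)  # initial draft, possibly harmful
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one principle per round (assumption)
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Principle: {principle}\nCritique:"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRevision:"
        )
    return prompt, response


def rl_cai_prefer(judge: Model, prompt: str, resp_a: str, resp_b: str,
                  principle: str) -> Tuple[str, str]:
    """Stage 2 (RL-CAI): return (chosen, rejected) as picked by the AI judge."""
    verdict = judge(
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}\n"
        f"Principle: {principle}\n"
        "Which response follows the principle better? Answer:"
    )
    if verdict.strip().upper().startswith("A"):
        return resp_a, resp_b
    return resp_b, resp_a


def toy_model(text: str) -> str:
    """Canned stand-in for a language model (illustration only)."""
    if text.endswith("Critique:"):
        return "The draft gives instructions that could enable harm."
    if text.endswith("Revision:"):
        return "I can't help with that, but I can suggest calling a locksmith."
    if text.endswith("Answer:"):
        return "B"
    return "Sure, here is how to do it."


principles = ["Choose the response that is least harmful."]
question = "How do I pick a lock?"

# Stage 1: critique-and-revise produces the supervised fine-tuning target.
sft_pair = sl_cai_revise(toy_model, question, principles)

# Stage 2: the AI judge labels a preference pair for reward-model training.
chosen, rejected = rl_cai_prefer(
    toy_model, question, "Sure, here is how to do it.", sft_pair[1], principles[0]
)
print(sft_pair)
print("chosen:", chosen)
```

Note that no human appears in either loop: the only human input is the list of principles.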
The Math
- Bradley-Terry model for preferences: P(y_w > y_l | x) = σ(r(x, y_w) - r(x, y_l))
- Loss: L = -log σ(r_w - r_l)
- Worked example: Harmful response r_A = 0.3, safe response r_B = 1.8, margin = r_B - r_A = 1.5, loss = -log σ(1.5) ≈ 0.201
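The Bradley-Terry formulas can be checked numerically. The sketch below uses only Python's standard library; the function names are chosen for this example, and the rewards 0.3 and 1.8 are the ones from the worked example. It relies on the identity -log σ(m) = log(1 + e^(-m)).

```python
import math


def preference_prob(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


def bt_loss(r_w: float, r_l: float) -> float:
    """-log sigmoid(r_w - r_l), computed as log(1 + exp(-(r_w - r_l)))."""
    return math.log1p(math.exp(-(r_w - r_l)))


r_harmful, r_safe = 0.3, 1.8

# Reward model ranks the safe response higher: small loss.
print(round(preference_prob(r_safe, r_harmful), 3))  # 0.818
print(round(bt_loss(r_safe, r_harmful), 3))          # 0.201

# Same rewards, but the label says the harmful response won: large loss.
print(round(bt_loss(r_harmful, r_safe), 3))          # 1.701
```

The loss shrinks toward 0 as the margin in favor of the preferred response grows, and grows roughly linearly when the reward model ranks the pair the wrong way round.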
The Analogy
Instead of hiring 1000 tired teachers to enforce rules, write down the rules once and hire one smart student to apply them consistently. The student generates feedback automatically. The rules are transparent and auditable.
What Came Next
- Claude 1, 2, 3: Anthropic’s product line, trained with Constitutional AI
- RLAIF (Reinforcement Learning from AI Feedback): the general term for training on AI-generated preference labels instead of human labels
- AI constitutions in governance: Policy makers use the idea of written principles to govern AI
- Test-Time Compute (Paper 23): Reasoning models that can think longer
- rStar2 (Paper 24): Self-evolving models that improve by generating their own training data
Key Numbers
- 16–18 principles in the constitution
- ~30–50 forward passes per prompt in SL-CAI (vs. 0 in standard SFT)
- No human annotations of harm (vs. thousands in RLHF)
- Claude 3 Opus: among the most capable and safe frontier models
Key Insight
The bottleneck in alignment isn’t finding the right values — it’s applying them at scale. Constitutional AI moves from “humans apply values slowly” to “AI applies principles automatically.” It doesn’t replace human judgment; it scales it.
You are halfway through the AI Niketan paper series. You have now read the 22 most important papers in the history of AI — from Transformers (Paper 1) through Constitutional AI (Paper 22). The next papers focus on reasoning and inference-time scaling. Keep reading.