Further Reading — Constitutional AI: Harmlessness from AI Feedback
Original Papers and Technical Reports
-
“Constitutional AI: Harmlessness from AI Feedback” (Bai et al., 2022)
- https://arxiv.org/abs/2212.08073
- The primary paper covering both SL-CAI and RL-CAI. Read this after the tutorial.
-
“Claude 3 Model Card” (Anthropic, 2024)
- https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
- Describes how Constitutional AI is applied to the Claude 3 family of models. Shows real-world deployment.
-
“Towards A Unified Framework for Deep Learning” (related to RLAIF methodology)
- Constitutional AI builds on RLHF. Read Paper 15 if you haven’t already.
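As a rough illustration of the RL-CAI labeling step covered in the primary paper: a feedback model is shown a prompt, two candidate responses, and a constitutional principle framed as a multiple-choice question, and the normalized probabilities of the two options serve as a soft preference label. The sketch below assumes the two option log-probabilities have already been obtained from some model; the function name `soft_preference_label` and the example numbers are hypothetical.

```python
import math


def soft_preference_label(logprob_a: float, logprob_b: float) -> tuple[float, float]:
    """Normalize two option log-probabilities into a soft preference label."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


# Hypothetical log-probabilities a feedback model might assign to "(A)" and "(B)".
p_a, p_b = soft_preference_label(-0.4, -1.6)
print(f"P(A preferred) = {p_a:.3f}, P(B preferred) = {p_b:.3f}")
```

The resulting pair sums to 1 and can be used directly as a target when training a preference model on AI-generated comparisons.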
Related Research
-
“Let’s Verify Step by Step” (Lightman et al., OpenAI, 2023)
- https://arxiv.org/abs/2305.20050
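- Explains Process Reward Models (PRMs), which score intermediate reasoning steps. Constitutional AI’s RL-CAI stage instead uses a preference model trained on AI-generated comparisons of whole responses, so read this as a contrasting approach to reward modeling.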
-
“OpenAI’s Model Spec” (2024)
- https://cdn.openai.com/spec/model-spec-2024-05-08.html
- OpenAI’s approach to specifying model behavior. Similar in spirit to Constitutional AI’s constitution — written principles instead of implicit human preferences.
-
“A General Language Assistant as a Laboratory for Alignment” (Askell et al., Anthropic, 2021)
- Earlier Anthropic work that frames alignment in terms of a helpful, honest, and harmless (HHH) assistant. That framing carries over directly into Constitutional AI.
-
“Training AI Systems to Self-Criticize” (related to self-critique loop)
- Constitutional AI’s SL-CAI stage uses self-critique. Research on teaching models to critique their own outputs.
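The self-critique loop mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate` stands in for any text-generation call, `toy_model` is a hypothetical stub so the example runs without a real model, and the prompt wording is invented for illustration. The paper samples a principle at random for each revision step; the sketch simply iterates over a given list.

```python
from typing import Callable


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: list[str],
) -> str:
    """Run one critique-and-revision pass per principle; return the final revision."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a language model call."""
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "critique text"
    return "initial response"


final = critique_and_revise(
    "Explain how vaccines work.",
    toy_model,
    ["Choose the response that is most helpful, honest, and harmless."],
)
print(final)
```

In SL-CAI, pairs of the original prompt and the final revision become supervised fine-tuning data.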
Empirical Evaluations and Benchmarks
-
“TruthfulQA: Measuring How Models Mimic Human Falsehoods” (Lin et al., 2021)
- https://arxiv.org/abs/2109.07958
- A benchmark for evaluating truthfulness. Relevant for measuring the “honesty” principle in Constitutional AI.
-
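“Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark” (Ji et al., 2023)
- https://arxiv.org/abs/2310.12567
- A benchmark suite for safe reinforcement learning in simulated control environments. It does not evaluate language models directly, but it is useful background on treating safety as an explicit constraint during RL.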
-
“HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal” (Mazeika et al., 2024)
- https://arxiv.org/abs/2402.04249
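- Benchmark for measuring how robustly language models refuse harmful requests under automated red-teaming attacks. Applicable to models trained with Constitutional AI as well as other methods.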
Policy and Governance
-
“The EU AI Act” (Regulation (EU) 2024/1689, 2024)
- https://eur-lex.europa.eu/eli/reg/2024/1689/oj
- EU regulation that sets risk-based obligations for AI systems, including transparency and documentation requirements. Relevant to the question of whether AI systems should have auditable principles.
-
“AI Governance by Humans-in-the-Loop” (Zhang et al., 2024)
- Discusses how Constitutional AI fits into broader AI governance frameworks.
-
“Values in AI” (Gabriel et al., 2024)
- Essays on how to encode values into AI systems. Constitutional AI as a concrete approach.
Blog Posts and Explainers
-
Anthropic’s Blog: “Constitutional AI: Harmlessness from AI Feedback” (2022)
- https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
- Anthropic’s own blog post explaining Constitutional AI in plain language. Good complement to the paper.
-
“How Claude is made” (Anthropic, 2024)
- Overview of Claude’s training, including Constitutional AI. High-level and accessible.
-
LessWrong discussion threads
- Multiple posts analyzing Constitutional AI’s implications for AI safety and alignment.
Related Code and Implementations
-
Anthropic’s Constitutional AI code (if released)
- Check https://github.com/anthropics for open-source implementations.
-
Hugging Face: RLHF implementations
- https://github.com/huggingface/trl
- While focused on RLHF, includes reward model training code relevant to Constitutional AI’s RL-CAI stage.
-
Reward model training frameworks
- Frameworks for training Bradley-Terry reward models from preference data.
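For readers unfamiliar with the Bradley-Terry model named above, it assigns each response a scalar reward r and models the probability that the chosen response beats the rejected one as sigmoid(r_chosen - r_rejected). Training minimizes the negative log of that probability over preference pairs. The sketch below shows the per-pair loss in plain Python; the function names are hypothetical, and a real framework would compute this over batches of model outputs with automatic differentiation.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return math.exp(-bradley_terry_loss(reward_chosen, reward_rejected))


# Hypothetical reward scores for one preference pair.
loss = bradley_terry_loss(1.5, 0.2)
prob = preference_probability(1.5, 0.2)
print(f"loss = {loss:.4f}, P(chosen preferred) = {prob:.4f}")
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, and equals log 2 when the two scores are tied.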
What’s Next: Follow-up Research
-
“Reasoning as Language Processing” (related to reasoning and search)
- Constitutional AI can be combined with test-time compute (Paper 23) in reasoning models, with constitutional principles guiding the reasoning process.
-
“Self-Evolved Machine Learning” (rStar2-Agent, Paper 24)
- Self-evolving models generate training data using a constitution of what “good reasoning” looks like. Combines Constitutional AI + test-time compute.
-
“Interpretability of Constitutional Models”
- Emerging research: can we mechanistically understand what principles a Constitutional AI model has learned?
-
“Multimodal Constitutional AI”
- Extending Constitutional AI from text to images, video, and other modalities.
Discussion Questions
-
Constitution Design: If you were writing a constitution for an AI assistant in your country, what 10 principles would you include? How might they differ from Anthropic’s principles?
-
Principle Conflicts: Constitutional AI handles principle conflicts through the reward model. But the reward model learns a balance implicitly. Should conflicts be resolved explicitly in the constitution, or is implicit balance better?
-
Cultural Variation: Should different cultures have different constitutions for their AI systems? Or should AI systems follow universal principles? What are the trade-offs?
-
Auditability: Constitutional AI is more transparent than RLHF because the principles are written. But the reward model that learns from those principles is still a black box. Is this transparency sufficient?
-
Verifiability: How would you verify that a model is actually following a published constitution? What would constitute evidence that the constitution is being applied fairly?
Recommended Reading Order
If you’re diving deep into Constitutional AI:
- Start: This tutorial (sections 01–09)
- Next: Anthropic’s blog post (plain language)
- Then: The full Constitutional AI paper (https://arxiv.org/abs/2212.08073)
- Then: Claude 3 Model Card (real-world application)
- Then: Paper 15 (RLHF) for background
- Then: Paper 16 (Process Reward Models) for a contrasting, step-level approach to reward modeling
- Then: Related work on AI governance and values alignment
Key Takeaways for Further Exploration
- Constitutional AI is not just about safety: It’s about scaling human values (encoded in principles) to AI systems. The constitution is a tool for governance and transparency.
- AI feedback is a powerful primitive: Using AI to critique AI opens up new possibilities for self-improvement and scaling. See Paper 24 (rStar2) for an example.
- The constitution is auditable: Unlike RLHF (where human preferences are implicit), the constitution is explicit. This is valuable for transparency and governance.
- Open questions remain: How to write good constitutions, how to handle conflicts, how to verify compliance, how to adapt to new harms — these are still active research areas.