Paper 22

Further Reading — Constitutional AI: Harmlessness from AI Feedback

Original Papers and Technical Reports

  1. “Constitutional AI: Harmlessness from AI Feedback” (Bai et al., 2022)

  2. “Claude 3 Model Card” (Anthropic, 2024)

  3. “Towards A Unified Framework for Deep Learning” (related to RLAIF methodology)

    • Constitutional AI builds on RLHF. Read Paper 15 if you haven’t already.

Related Work

  1. “Let’s Verify Step by Step” (Lightman et al., OpenAI, 2023)

  2. “OpenAI’s Model Spec” (2024)

  3. “A General Language Assistant as a Laboratory for Alignment” (Askell et al., Anthropic, 2021)

    • Foundational work on alignment and transparency. Influenced thinking about Constitutional AI.
  4. “Training AI Systems to Self-Criticize” (related to self-critique loop)

    • Research on teaching models to critique and revise their own outputs. This is the mechanism behind Constitutional AI’s supervised stage (SL-CAI).
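
The self-critique loop mentioned above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: the `generate` callable, the prompt templates, the two entries in `PRINCIPLES`, and the `toy_model` stand-in are all hypothetical names introduced here. The only part taken from the Constitutional AI paper is the overall shape: sample a response, critique it against a randomly drawn principle, revise, and keep the final revision as a supervised fine-tuning example.

```python
import random
from typing import Callable, Dict, List

# Hypothetical principles for illustration only; not the published constitution.
PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or misleading.",
    "Identify ways the response overstates its certainty.",
]


def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    principles: List[str],
    n_rounds: int = 2,
    seed: int = 0,
) -> Dict[str, str]:
    """Run critique-revision rounds and return one fine-tuning example."""
    rng = random.Random(seed)
    response = generate(prompt)  # initial draft
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
    # The final revision becomes the supervised training target.
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Stand-in for a language model call; returns canned text by prompt suffix."""
    if text.endswith("Critique:"):
        return "The response could be more careful."
    if text.endswith("Revision:"):
        return "A more careful answer."
    return "A first-draft answer."


example = critique_and_revise(toy_model, "How do I pick a lock?", PRINCIPLES)
```

In practice `generate` would wrap a real model call, and the collected `example` dictionaries would form the dataset for the supervised stage.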

Empirical Evaluations and Benchmarks

  1. “TruthfulQA: Measuring How Models Mimic Human Falsehoods” (Lin et al., 2021)

  2. “Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark”

  3. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal” (Mazeika et al., 2024)

Policy and Governance

  1. “The EU AI Act” (2024)

  2. “AI Governance by Humans-in-the-Loop” (Zhang et al., 2024)

    • Discusses how Constitutional AI fits into broader AI governance frameworks.
  3. “Values in AI” (Gabriel et al., 2024)

    • Essays on how to encode values into AI systems. Constitutional AI as a concrete approach.

Blog Posts and Explainers

  1. Anthropic’s Blog: “Constitutional AI: Harmlessness from AI Feedback” (2022)

  2. “How Claude is made” (Anthropic, 2024)

    • Overview of Claude’s training, including Constitutional AI. High-level and accessible.
  3. LessWrong discussion threads

    • Multiple posts analyzing Constitutional AI’s implications for AI safety and alignment.

Code and Implementations

  1. Anthropic’s Constitutional AI code (if released)

  2. Hugging Face: RLHF implementations

  3. Reward model training frameworks

    • Frameworks for training Bradley-Terry reward models from preference data.
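
As a rough guide to what such frameworks optimize, here is a minimal sketch of the Bradley-Terry objective in plain Python. Under this model, the probability that the chosen response beats the rejected one is the sigmoid of the difference in their scalar rewards, and training minimizes the average negative log of that probability. The function names and the `(reward_chosen, reward_rejected)` pair layout are assumptions made for this example, not the API of any particular library.

```python
import math
from typing import List, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))


def bt_loss(pairs: List[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs."""
    return -sum(log_sigmoid(rc - rr) for rc, rr in pairs) / len(pairs)


# Equal rewards mean the model is indifferent between the two responses.
indifferent = bt_preference_prob(1.0, 1.0)

# The loss is lower when the chosen response receives the higher reward.
good_ordering = bt_loss([(2.0, 0.0)])
bad_ordering = bt_loss([(0.0, 2.0)])
```

A real reward model would produce the two rewards from a neural network and backpropagate through this loss; the arithmetic is the same.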

What’s Next: Follow-up Research

  1. “Reasoning as Language Processing” (related to reasoning and search)

    • Constitutional AI + test-time compute (Paper 23) enables reasoning models. Constitutional principles guide the reasoning process.
  2. “Self-Evolved Machine Learning” (rStar2-Agent, Paper 24)

    • Self-evolving models generate training data using a constitution of what “good reasoning” looks like. Combines Constitutional AI + test-time compute.
  3. “Interpretability of Constitutional Models”

    • Emerging research: can we mechanistically understand what principles a Constitutional AI model has learned?
  4. “Multimodal Constitutional AI”

    • Extending Constitutional AI from text to images, video, and other modalities.

Discussion Questions

  1. Constitution Design: If you were writing a constitution for an AI assistant in your country, what 10 principles would you include? How might they differ from Anthropic’s principles?

  2. Principle Conflicts: Constitutional AI handles principle conflicts through the reward model. But the reward model learns a balance implicitly. Should conflicts be resolved explicitly in the constitution, or is implicit balance better?

  3. Cultural Variation: Should different cultures have different constitutions for their AI systems? Or should AI systems follow universal principles? What are the trade-offs?

  4. Auditability: Constitutional AI is more transparent than RLHF because the principles are written. But the reward model that learns from those principles is still a black box. Is this transparency sufficient?

  5. Verifiability: How would you verify that a model is actually following a published constitution? What would constitute evidence that the constitution is being applied fairly?

If you’re diving deep into Constitutional AI:

  1. Start: This tutorial (sections 01–09)
  2. Next: Anthropic’s blog post (plain language)
  3. Then: The full Constitutional AI paper (https://arxiv.org/abs/2212.08073)
  4. Then: Claude 3 Model Card (real-world application)
  5. Then: Paper 15 (RLHF) for background
  6. Then: Paper 16 (Process Reward Models) for reward model details
  7. Then: Related work on AI governance and values alignment

Key Takeaways for Further Exploration

  • Constitutional AI is not just about safety: It’s about scaling human values (encoded in principles) to AI systems. The constitution is a tool for governance and transparency.
  • AI feedback is a powerful primitive: Using AI to critique AI opens up new possibilities for self-improvement and scaling. See Paper 24 (rStar2) for an example.
  • The constitution is auditable: Unlike RLHF (where human preferences are implicit), the constitution is explicit. This is valuable for transparency and governance.
  • Open questions remain: How to write good constitutions, how to handle conflicts, how to verify compliance, how to adapt to new harms — these are still active research areas.