Section 06

The Code: Constitutional AI Critique-Revision Loop

Constitutional AI: Harmlessness from AI Feedback 2022

The Python code below demonstrates the core critique-revision loop of Constitutional AI. It runs in Google Colab against the Anthropic API and needs an API key; whether the calls cost anything depends on the credits available on your account.

Installation

# In Colab: !pip install anthropic
import anthropic

Full Example: Constitutional AI Pipeline

import anthropic

# Initialize the API client (get key from https://console.anthropic.com)
client = anthropic.Anthropic(api_key="YOUR_API_KEY")

# Define the constitution - principles the AI should follow
CONSTITUTION = [
    {"principle": "Honesty", 
     "desc": "Be honest, not deliberately misleading"},
    {"principle": "Harm prevention", 
     "desc": "Avoid advice that could cause harm"},
    {"principle": "Helpfulness", 
     "desc": "Be genuinely helpful to the human"}
]

def generate_response(prompt):
    """Generate initial response to a prompt."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def critique_response(prompt, response, principle):
    """Ask the model to critique response against principle."""
    critique_prompt = f"""The human asked: {prompt}

The AI responded: {response}

Principle: {principle['principle']} - {principle['desc']}

Does this response violate the principle? Explain briefly."""
    
    result = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=150,
        messages=[{"role": "user", "content": critique_prompt}]
    )
    return result.content[0].text

def revise_response(prompt, response, critique, principle):
    """Ask model to revise response based on critique."""
    revision_prompt = f"""Original prompt: {prompt}

Original response: {response}

Critique: {critique}

Principle to follow: {principle['principle']} - {principle['desc']}

Rewrite the response to address the critique while staying helpful."""
    
    result = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": revision_prompt}]
    )
    return result.content[0].text

# Example: Constitutional AI on a prompt
prompt = "How can I convince my friend to invest money without telling them risks?"

print("INITIAL RESPONSE:")
initial = generate_response(prompt)
print(initial[:200] + "...")

print("\n" + "="*60)
print("APPLYING CONSTITUTION:")

# Critique and revise for each principle
current_response = initial
for principle in CONSTITUTION:
    print(f"\nPrinciple: {principle['principle']}")
    critique = critique_response(prompt, current_response, principle)
    print(f"Critique: {critique[:150]}...")
    revised = revise_response(prompt, current_response, critique, principle)
    print(f"Revised: {revised[:150]}...")
    current_response = revised

print("\n" + "="*60)
print("FINAL RESPONSE AFTER CONSTITUTIONAL AI:")
print(current_response)

How It Works

  1. generate_response(): Ask the model to respond to a prompt. The response might be problematic.

  2. critique_response(): Ask the model to read its own response and explain whether it violates a specific principle. This uses the model’s ability to reason about text.

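  3. revise_response(): Ask the model to rewrite the response so that it addresses the critique. The revision is intended to follow the principle more closely, though nothing in the loop verifies that it does.

  4. Loop: Repeat for each principle in the constitution, feeding each revision into the next critique. The paper samples a principle at random for each revision step; this example simply applies all three in order.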
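The four steps above can be condensed into a single function that receives the three model calls as arguments. This is a minimal sketch rather than the paper's implementation: `run_constitution` and the `fake_*` stubs are hypothetical names introduced here, and the stubs replace the API-backed functions so the control flow can be checked without a key.

```python
from typing import Callable, Dict, List

Principle = Dict[str, str]


def run_constitution(
    prompt: str,
    constitution: List[Principle],
    generate: Callable[[str], str],
    critique: Callable[[str, str, Principle], str],
    revise: Callable[[str, str, str, Principle], str],
) -> Dict[str, object]:
    """Run one critique-revision pass per principle and keep a trace of each step."""
    current = generate(prompt)
    trace = [{"step": "initial", "response": current}]
    for principle in constitution:
        note = critique(prompt, current, principle)
        # Each revision becomes the input to the next principle's critique.
        current = revise(prompt, current, note, principle)
        trace.append(
            {"step": principle["principle"], "critique": note, "response": current}
        )
    return {"final": current, "trace": trace}


# Hypothetical stubs standing in for the real model calls.
def fake_generate(prompt: str) -> str:
    return "draft"


def fake_critique(prompt: str, response: str, principle: Principle) -> str:
    return f"check {principle['principle']}"


def fake_revise(prompt: str, response: str, critique: str, principle: Principle) -> str:
    return f"{response}+{principle['principle']}"


demo_constitution = [
    {"principle": "Honesty", "desc": "Be honest, not deliberately misleading"},
    {"principle": "Helpfulness", "desc": "Be genuinely helpful to the human"},
]
demo = run_constitution(
    "example prompt", demo_constitution, fake_generate, fake_critique, fake_revise
)
print(demo["final"])
```

Passing `generate_response`, `critique_response`, and `revise_response` from the listing above in place of the stubs would run the same loop against the real API, and the returned trace keeps every intermediate critique for inspection.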

Key Points

  • Transparent: The constitution is written and readable. Anyone can see what principles the AI is following.
  • Scalable: Each prompt is processed independently, so the loop parallelizes across prompts. Throughput is limited by API rate limits or, for a self-hosted model, by available compute.
  • No human labels for harm: Humans write the principles and the AI applies them, so annotators do not have to read and rate harmful content.
  • Consistent: The same written principle is applied to every prompt, although sampled model outputs can still vary from run to run.
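The scalability point can be illustrated with a thread pool, since each prompt's critique-revision run is independent of the others. This is only a sketch under stated assumptions: `run_in_parallel` and `fake_pipeline` are hypothetical names, and the stub pipeline stands in for a function that would call the API for one prompt.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_in_parallel(
    prompts: List[str],
    pipeline: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Apply `pipeline` to every prompt concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(pipeline, prompts))


# Hypothetical stand-in for a full critique-revision run on one prompt.
def fake_pipeline(prompt: str) -> str:
    return prompt.upper()


results = run_in_parallel(["first prompt", "second prompt"], fake_pipeline)
print(results)
```

Threads suit this workload because the time is spent waiting on network calls. With a real API, `max_workers` would need to stay within the account's rate limits.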

To Run on Google Colab

  1. Create an account and an API key at https://console.anthropic.com
  2. Copy the code above into a Colab cell
  3. Replace “YOUR_API_KEY” with your actual API key
  4. Run the cell

The output will show the initial response, critiques for each principle, and the final revised response that follows the constitution.
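If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The helper below is a small sketch, not part of the original example: `load_api_key` is a hypothetical name, and the variable name `ANTHROPIC_API_KEY` is assumed to be the one you set.

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "ANTHROPIC_API_KEY",
) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Demonstration with an explicit mapping, so no real key is needed.
print(load_api_key({"ANTHROPIC_API_KEY": "placeholder-key"}))
```

In the main listing this would replace the hard-coded string, for example `client = anthropic.Anthropic(api_key=load_api_key())`.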

Limitations

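  • This example covers only the critique-revision step that produces training data for the “SL-CAI” (supervised learning) stage; it does not fine-tune a model. The full paper also has an “RL-CAI” (reinforcement learning) stage, which trains a preference (reward) model on AI-generated comparisons and optimizes against it with PPO.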
  • The critique quality depends on the model’s understanding of the principles. A principle that is vague will produce vague critiques.
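  • The loop is slower than a single response: it needs one generation plus a critique and a revision per principle, which is seven model calls for the three principles here. In the paper's setup the loop runs offline to build training data, and the fine-tuned model then answers in a single pass at inference time.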
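To make the training-time use concrete, the final revisions can be collected as prompt/response pairs for supervised fine-tuning. The sketch below is an assumption about how such data might be stored, not the paper's format: `to_sft_records` and the field names `prompt`, `completion`, and `final_revision` are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def to_sft_records(runs: Iterable[Dict[str, str]]) -> List[str]:
    """Turn critique-revision results into JSONL lines for supervised fine-tuning.

    Only the original prompt and the final revision are kept; the initial
    response and the critiques are dropped.
    """
    lines = []
    for run in runs:
        record = {"prompt": run["prompt"], "completion": run["final_revision"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Hypothetical result of one run of the loop (placeholder text, not model output).
runs = [
    {
        "prompt": "example prompt",
        "initial": "first draft",
        "final_revision": "revised draft",
    }
]
jsonl_lines = to_sft_records(runs)
print("\n".join(jsonl_lines))
```

A model fine-tuned on records like these would be trained to produce the revised style of answer directly, without running the critique and revision calls at inference time.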