The Code: Constitutional AI Critique-Revision Loop — Constitutional AI: Harmlessness from AI Feedback

Below is Python code demonstrating the core critique-revision loop of Constitutional AI. It runs in Google Colab and calls the Anthropic API, which requires an API key; check the Anthropic console for current pricing and any available credits.

Installation

# In Colab: !pip install anthropic
import anthropic

Full Example: Constitutional AI Pipeline

import anthropic

# Initialize the API client (get key from https://console.anthropic.com)
client = anthropic.Anthropic(api_key="YOUR_API_KEY")

# Define the constitution - principles the AI should follow
CONSTITUTION = [
    {"principle": "Honesty", 
     "desc": "Be honest, not deliberately misleading"},
    {"principle": "Harm prevention", 
     "desc": "Avoid advice that could cause harm"},
    {"principle": "Helpfulness", 
     "desc": "Be genuinely helpful to the human"}
]

def generate_response(prompt):
    """Generate initial response to a prompt."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def critique_response(prompt, response, principle):
    """Ask the model to critique response against principle."""
    critique_prompt = f"""The human asked: {prompt}

The AI responded: {response}

Principle: {principle['principle']} - {principle['desc']}

Does this response violate the principle? Explain briefly."""
    
    result = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=150,
        messages=[{"role": "user", "content": critique_prompt}]
    )
    return result.content[0].text

def revise_response(prompt, response, critique, principle):
    """Ask model to revise response based on critique."""
    revision_prompt = f"""Original prompt: {prompt}

Original response: {response}

Critique: {critique}

Principle to follow: {principle['principle']}

Rewrite the response to address the critique while staying helpful."""
    
    result = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": revision_prompt}]
    )
    return result.content[0].text

# Example: Constitutional AI on a prompt
prompt = "How can I convince my friend to invest money without telling them risks?"

print("INITIAL RESPONSE:")
initial = generate_response(prompt)
print(initial[:200] + "...")

print("\n" + "="*60)
print("APPLYING CONSTITUTION:")

# Critique and revise for each principle
current_response = initial
for principle in CONSTITUTION:
    print(f"\nPrinciple: {principle['principle']}")
    critique = critique_response(prompt, current_response, principle)
    print(f"Critique: {critique[:150]}...")
    revised = revise_response(prompt, current_response, critique, principle)
    print(f"Revised: {revised[:150]}...")
    current_response = revised

print("\n" + "="*60)
print("FINAL RESPONSE AFTER CONSTITUTIONAL AI:")
print(current_response)

How It Works

generate_response(): Ask the model to respond to the prompt. This initial response might be problematic.
critique_response(): Ask the model to read the current response and explain whether it violates one specific principle. This relies on the model’s ability to reason about text.
revise_response(): Ask the model to rewrite the response so that it addresses the critique. The revision is intended to follow the principle more closely, though this is not guaranteed.
Loop: Repeat the critique and revision for each principle in the constitution. Each revision becomes the input to the next critique, so the response is refined one principle at a time.
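
The control flow described above can be sketched without any network access. This is a minimal illustration, not the paper's implementation: `critique_revision_loop`, its `ask` parameter, and the `fake_ask` stub are hypothetical names introduced here, and the stub simply returns numbered replies so the order of calls is visible.

```python
from typing import Callable


def critique_revision_loop(prompt: str, principles: list[dict],
                           ask: Callable[[str], str]) -> dict:
    """Generate once, then critique and revise once per principle.

    `ask` is any function that maps a prompt string to a reply string
    (a hypothetical stand-in for a real model call).
    """
    response = ask(prompt)
    history = [response]
    for p in principles:
        critique = ask(
            f"The human asked: {prompt}\n\nThe AI responded: {response}\n\n"
            f"Principle: {p['principle']} - {p['desc']}\n\n"
            "Does this response violate the principle? Explain briefly."
        )
        # The revision replaces the current response for the next principle.
        response = ask(
            f"Original prompt: {prompt}\n\nOriginal response: {response}\n\n"
            f"Critique: {critique}\n\nPrinciple to follow: {p['principle']}\n\n"
            "Rewrite the response to address the critique while staying helpful."
        )
        history.append(response)
    return {"final": response, "history": history}


# Offline stub: records each prompt it receives and returns a numbered reply.
calls = []


def fake_ask(text: str) -> str:
    calls.append(text)
    return f"reply-{len(calls)}"


principles = [
    {"principle": "Honesty", "desc": "Be honest, not deliberately misleading"},
    {"principle": "Harm prevention", "desc": "Avoid advice that could cause harm"},
]
result = critique_revision_loop("example prompt", principles, fake_ask)
print(len(calls))       # one generation plus two calls per principle
print(result["final"])  # the last revision
```

Passing the model call in as a function keeps the loop testable: to use the real API, `ask` could wrap a `client.messages.create` call like the ones in the listing above.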

Key Points

Transparent: The constitution is written and readable. Anyone can see what principles the AI is following.
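Scalable: Each prompt is processed independently, so the loop can run over many prompts in parallel, subject to API rate limits or, for a self-hosted model, the available compute.
No human judgment of harm: Humans write the principles and the AI applies them, so humans do not have to read and label harmful outputs one by one. (The paper still uses human feedback for helpfulness.)
Consistent: The same written principle is applied to every prompt, although model outputs are sampled, so individual critiques and revisions can vary between runs.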
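
Because prompts do not depend on each other, the scalability point can be illustrated with a thread pool. This is a rough sketch under stated assumptions: `run_pipeline` is a hypothetical placeholder for the full generate/critique/revise pipeline on one prompt, and `max_workers=4` is an arbitrary value that would need tuning against real rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(prompt: str) -> str:
    """Hypothetical placeholder for the critique-revision pipeline on one prompt."""
    return f"revised answer to: {prompt}"


def run_many(prompts: list[str], max_workers: int = 4) -> list[str]:
    """Apply run_pipeline to every prompt concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_pipeline, prompts))


prompts = ["prompt A", "prompt B", "prompt C"]
results = run_many(prompts)
for p, r in zip(prompts, results):
    print(p, "->", r)
```

Threads suit this workload because each call mostly waits on the network; a large-scale run would more likely use a batch or cluster setup.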

To Run on Google Colab

Create an account and generate an API key at https://console.anthropic.com
Copy the code above into a Colab cell
Replace "YOUR_API_KEY" with your actual API key
Run the cell

The output will show the initial response, critiques for each principle, and the final revised response that follows the constitution.
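
For the API key step above, pasting the key directly into a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched here, reads the key from an environment variable and falls back to a hidden prompt. `get_api_key` is a hypothetical helper introduced for illustration; `ANTHROPIC_API_KEY` is the variable name the `anthropic` client also looks for by default.

```python
import os
from getpass import getpass


def get_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the key from the environment, or prompt for it without echoing."""
    key = os.environ.get(env_var)
    if not key:
        key = getpass(f"Enter {env_var}: ")
    return key


# Usage (assumes the `anthropic` package is installed):
# client = anthropic.Anthropic(api_key=get_api_key())
```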

Limitations

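This code covers only the critique-revision step of the “SL-CAI” (supervised learning) stage; in the paper, the revised responses are then used to fine-tune a model. The paper also describes an “RL-CAI” (reinforcement learning) stage, which trains a reward (preference) model on AI-generated preferences and optimizes against it with PPO.
The critique quality depends on the model’s understanding of the principles. A vague principle tends to produce vague critiques.
This is slower than a single generation: each principle adds two model calls (one critique, one revision), so the three-principle example above makes seven calls per prompt. In practice, the loop is used to generate training data, not run at inference time.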
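
Two of these points can be made concrete with a small sketch: the number of model calls per prompt, and saving prompt/revision pairs as data for later supervised fine-tuning. Everything here is an assumption for illustration: the helper names, the `prompt`/`completion` field names, and the output file name are hypothetical, and the paper does not prescribe this format.

```python
import json
import os
import tempfile


def calls_per_prompt(num_principles: int) -> int:
    """One initial generation plus a critique and a revision per principle."""
    return 1 + 2 * num_principles


def write_sl_dataset(pairs: list[tuple[str, str]], path: str) -> int:
    """Write (prompt, final revision) pairs as JSON Lines; return the record count."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, revision in pairs:
            record = {"prompt": prompt, "completion": revision}  # hypothetical schema
            f.write(json.dumps(record) + "\n")
    return len(pairs)


pairs = [("example prompt", "final revised response")]  # placeholder data
path = os.path.join(tempfile.gettempdir(), "sl_cai_pairs.jsonl")
print(write_sl_dataset(pairs, path), "record(s) written")
print(calls_per_prompt(3), "model calls per prompt with three principles")
```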