Below is Python code demonstrating the core critique-revision loop of Constitutional AI. The code runs on Google Colab and calls the Anthropic API, which requires an API key; requests are billed per token, so check your account's available credits before running it.
Installation
# In Colab: !pip install anthropic
import anthropic
Full Example: Constitutional AI Pipeline
import anthropic

# Initialize the API client (get key from https://console.anthropic.com)
client = anthropic.Anthropic(api_key="YOUR_API_KEY")

# Define the constitution - principles the AI should follow
CONSTITUTION = [
    {"principle": "Honesty",
     "desc": "Be honest, not deliberately misleading"},
    {"principle": "Harm prevention",
     "desc": "Avoid advice that could cause harm"},
    {"principle": "Helpfulness",
     "desc": "Be genuinely helpful to the human"}
]

def generate_response(prompt):
    """Generate initial response to a prompt."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def critique_response(prompt, response, principle):
    """Ask the model to critique response against principle."""
    critique_prompt = f"""The human asked: {prompt}
The AI responded: {response}
Principle: {principle['principle']} - {principle['desc']}
Does this response violate the principle? Explain briefly."""
    result = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=150,
        messages=[{"role": "user", "content": critique_prompt}]
    )
    return result.content[0].text

def revise_response(prompt, response, critique, principle):
    """Ask model to revise response based on critique."""
    revision_prompt = f"""Original prompt: {prompt}
Original response: {response}
Critique: {critique}
Principle to follow: {principle['principle']}
Rewrite the response to address the critique while staying helpful."""
    result = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": revision_prompt}]
    )
    return result.content[0].text

# Example: Constitutional AI on a prompt
prompt = "How can I convince my friend to invest money without telling them risks?"
print("INITIAL RESPONSE:")
initial = generate_response(prompt)
print(initial[:200] + "...")
print("\n" + "="*60)
print("APPLYING CONSTITUTION:")

# Critique and revise for each principle
current_response = initial
for principle in CONSTITUTION:
    print(f"\nPrinciple: {principle['principle']}")
    critique = critique_response(prompt, current_response, principle)
    print(f"Critique: {critique[:150]}...")
    revised = revise_response(prompt, current_response, critique, principle)
    print(f"Revised: {revised[:150]}...")
    # The next principle critiques the latest revision, not the original
    current_response = revised

print("\n" + "="*60)
print("FINAL RESPONSE AFTER CONSTITUTIONAL AI:")
print(current_response)
How It Works
- generate_response(): Ask the model to respond to a prompt. The response might be problematic.
- critique_response(): Ask the model to read its own response and explain whether it violates a specific principle. This relies on the model’s ability to reason about text.
- revise_response(): Ask the model to rewrite the response so that it addresses the critique. The goal is a response that follows the principle more closely.
- Loop: Repeat for each principle in the constitution. Each iteration critiques and revises the output of the previous one.
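The control flow of these four steps can be checked without an API key. The sketch below is an illustration, not part of the original example: `run_constitution` and the `ModelFn` type are hypothetical names, and the model is passed in as a plain function so that a stub can stand in for the API call.

```python
from typing import Callable, Dict, List

# Hypothetical alias: any function that maps a prompt string to a reply string.
ModelFn = Callable[[str], str]


def run_constitution(prompt: str, initial: str,
                     principles: List[Dict[str, str]],
                     model: ModelFn) -> List[Dict[str, str]]:
    """Apply one critique and one revision per principle, recording each step."""
    history = []
    current = initial
    for p in principles:
        # Step 1: ask for a critique of the current response.
        critique = model(
            f"The human asked: {prompt}\n"
            f"The AI responded: {current}\n"
            f"Principle: {p['principle']} - {p['desc']}\n"
            "Does this response violate the principle? Explain briefly."
        )
        # Step 2: ask for a revision that addresses the critique.
        revised = model(
            f"Original prompt: {prompt}\n"
            f"Original response: {current}\n"
            f"Critique: {critique}\n"
            f"Principle to follow: {p['principle']}\n"
            "Rewrite the response to address the critique while staying helpful."
        )
        history.append({"principle": p["principle"],
                        "critique": critique,
                        "revised": revised})
        current = revised  # the next principle sees the latest revision
    return history
```

Passing a stub such as `lambda text: "stub reply"` as `model` makes it easy to confirm that each principle receives the previous revision rather than the original response.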
Key Points
- Transparent: The constitution is written and readable. Anyone can see what principles the AI is following.
- Scalable: You can run this on millions of prompts in parallel (in practice, you’d use a large cluster).
- No human labeling of harm: Humans write the principles and the AI applies them, so humans don’t have to read and label harmful content.
- Consistent: The same written principles are applied to every prompt, although the model’s critiques can still vary from run to run.
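As a small-scale illustration of the scalability point, prompts can be processed concurrently because each one is revised independently. This is a minimal sketch using a thread pool from the standard library; `revise_many` and `toy_pipeline` are hypothetical names, and a real run would be limited by the API's rate limits rather than by the number of threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def revise_many(prompts: List[str], pipeline: Callable[[str], str],
                max_workers: int = 4) -> List[str]:
    """Run an independent critique-revision pipeline on each prompt.

    Results are returned in the same order as `prompts`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(pipeline, prompts))


# Stand-in pipeline for illustration only; a real one would call the API.
def toy_pipeline(prompt: str) -> str:
    return prompt.upper()
```

Threads suit this workload because each call spends most of its time waiting on the network.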
To Run on Google Colab
- Create an account and an API key at https://console.anthropic.com
- Copy the code above into a Colab cell
- Replace "YOUR_API_KEY" with your actual API key
- Run the cell
The output will show the initial response, critiques for each principle, and the final revised response that follows the constitution.
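As an alternative to pasting the key into the cell, the key can be read from an environment variable and requested interactively only when it is missing, which keeps it out of a shared notebook. This is a sketch: `load_api_key` is a hypothetical helper, and `ANTHROPIC_API_KEY` is assumed as the variable name.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key is not echoed in the output.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    return key
```

In the notebook, the client line would then become `client = anthropic.Anthropic(api_key=load_api_key())`.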
Limitations
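- This covers only the critique-revision step of the “SL-CAI” (supervised learning) stage; in the paper, the revised responses are then used to fine-tune the model, which this code does not do. The paper also describes “RL-CAI” (reinforcement learning), which trains a reward model on AI-generated preferences and optimizes with PPO.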
- The critique quality depends on the model’s understanding of the principles. A principle that is vague will produce vague critiques.
- This is slower than a single model call: every principle adds one critique call and one revision call on top of the initial generation. In practice, you’d do this during training, not at inference time.
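The extra cost of the loop can be estimated with simple arithmetic. The sketch below uses hypothetical helper names; the default token limits are the `max_tokens` values from the example above, and the result is an upper bound on generated tokens, not a measured figure.

```python
def api_calls_per_prompt(num_principles: int) -> int:
    """One initial generation plus a critique and a revision per principle."""
    if num_principles < 0:
        raise ValueError("num_principles must be non-negative")
    return 1 + 2 * num_principles


def max_output_tokens(num_principles: int, generate: int = 300,
                      critique: int = 150, revise: int = 300) -> int:
    """Upper bound on generated tokens for one prompt, from the max_tokens limits."""
    if num_principles < 0:
        raise ValueError("num_principles must be non-negative")
    return generate + num_principles * (critique + revise)
```

For the three-principle constitution above, this gives 7 API calls and at most 1650 generated tokens per prompt, compared with 1 call and at most 300 tokens for the unrevised response.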