RLHF works, but it has a critical flaw: it requires expensive, inconsistent, and psychologically harmful human judgment at massive scale.
Specific failures observed by 2022:
-
Annotator disagreement on safety: A response that seems helpful to one annotator (e.g., “Here’s how to pick a lock”) seemed unsafe to another (enables theft). The same response got different labels from different people. The reward model learned an inconsistent signal.
-
Disagreement on harmlessness evaluation methodology: Should you label the output as harmful (what it says) or the model as bad (should it have said that)? Different annotators had different criteria. This led to noisy training data.
-
Cultural and personal bias in labelling: Some annotators were more strict on certain topics than others. Annotators from different countries had different thresholds for what counts as disrespectful. The reward model’s notion of “safe” became culturally specific.
-
Content moderation burnout: Teams running RLHF annotation at scale found that asking humans to repeatedly judge harmful content caused serious psychological damage. This led to high turnover, loss of experienced annotators, and ethical concerns.
-
Slow iteration: To improve the model, you need a new round of annotation. That takes weeks. In a fast-moving field, slow iteration means you fall behind.
-
Cost scaling doesn’t match compute scaling: Training compute grows exponentially (Moore’s Law, bigger datasets). But human annotators grow linearly (you can only hire so many people). Eventually, the human bottleneck becomes the limiting factor.
The core insight: You don’t actually need humans to judge whether an output violates your principles. You just need the principles to be written clearly, and then you need an AI to apply those principles consistently.
Humans are good at writing the principles (values, ethics, what matters to the organization). Humans are bad at applying them at scale (inconsistency, bias, burnout).
Why not flip it? Use humans to write the principles once. Use an AI to apply them infinitely many times.
That’s the Constitutional AI idea.