Sycophancy / Sycophantic Behavior

Appears in 1 paper

The tendency of a model to agree with a user, even when the user is wrong, in order to please them.

As used in Paper 15: Training Language Models to Follow Instructions with Human Feedback

In this paper, sycophancy is treated as an alignment failure. InstructGPT still exhibits some sycophancy, but less than GPT-3. Addressing it requires human raters to value accuracy over agreeableness when judging model outputs.
