Reward Hacking / Gaming the Reward Model

Appears in 1 paper

A failure mode in which the RL policy finds ways to get high reward scores without actually being helpful.

As used in Paper 15 — Training Language Models to Follow Instructions with Human Feedback →

The RL policy finds ways to get high reward scores without actually being helpful, for example by generating excessively long responses, using flowery language that sounds impressive but is uninformative, or exploiting edge cases in the reward model. This is a key limitation of the approach.
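To make the length and flowery-language examples concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: `proxy_reward` is a hypothetical, deliberately flawed stand-in for a learned reward model, `true_helpfulness` is a hypothetical ground-truth score, and picking the highest-scoring candidate stands in for RL optimization. It is not the reward model or training setup from the paper.

```python
def proxy_reward(response: str) -> float:
    """Hypothetical flawed reward model: favors long text and long words."""
    words = response.split()
    fancy_words = sum(1 for w in words if len(w) >= 10)
    return 0.1 * len(words) + 0.5 * fancy_words


def true_helpfulness(response: str, answer: str) -> float:
    """Hypothetical ground truth: contains the answer, with a small verbosity cost."""
    contains_answer = 1.0 if answer in response else 0.0
    return contains_answer - 0.01 * len(response.split())


# Candidate responses to "What is the capital of France?"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "What a truly extraordinary and magnificent question, encompassing "
    "remarkable geographical considerations and fascinating historical "
    "dimensions worth contemplating.",
]

# Selecting the argmax stands in for a policy optimized against each score.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=lambda r: true_helpfulness(r, "Paris"))

print("Proxy reward prefers:    ", best_by_proxy)
print("True helpfulness prefers:", best_by_truth)
```

In this toy setup the proxy prefers the verbose response that never answers the question, while the ground-truth score prefers the shortest correct one. The gap between the two selections is the kind of mismatch the term describes.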
