Reward Hacking / Gaming the Reward Model
When the RL policy finds ways to get high reward scores without actually being helpful.
When the RL policy finds ways to get high reward scores without actually being helpful. Examples: generating excessively long responses, using flowery language that sounds impressive but is uninformative, or exploiting edge cases in the reward model. A key limitation of the approach.