Distributional Shift

Appears in 1 paper

The mismatch that arises when the RL policy generates responses that differ substantially from the distribution the reward model (RM) was trained on.

As used in Paper 15: Training Language Models to Follow Instructions with Human Feedback

Distributional shift occurs when the RL policy generates responses that differ substantially from the distribution the reward model (RM) was trained on. The RM's scores become unreliable on these out-of-distribution responses, which leads to poor learning signals or reward hacking. It is addressed by periodically retraining the RM.
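
As a rough illustration of how one might decide when the RM needs retraining, the sketch below flags drift by measuring how many policy samples fall far outside the range seen in the RM's training data. This is not the paper's procedure. The function names, the z-score test, the thresholds, and the idea of summarizing each response with a single scalar feature (for example, its length or its log-probability under a reference model) are all assumptions made for the example.

```python
from statistics import mean, stdev


def ood_fraction(train_features, policy_features, z_threshold=3.0):
    """Fraction of policy samples more than z_threshold standard
    deviations from the mean of the RM training features.

    Both arguments are lists of one scalar per response. The choice of
    feature is hypothetical and left to the caller.
    """
    if not policy_features:
        raise ValueError("policy_features must not be empty")
    mu = mean(train_features)
    sigma = stdev(train_features)  # needs at least two training values
    if sigma == 0:
        raise ValueError("training features have zero variance")
    outliers = [x for x in policy_features if abs(x - mu) / sigma > z_threshold]
    return len(outliers) / len(policy_features)


def should_retrain_rm(train_features, policy_features,
                      z_threshold=3.0, max_ood_fraction=0.2):
    """Return True when too many policy samples look out-of-distribution.

    max_ood_fraction is an illustrative tolerance, not a recommended value.
    """
    fraction = ood_fraction(train_features, policy_features, z_threshold)
    return fraction > max_ood_fraction


# Toy data: feature values for RM training responses and two policy batches.
rm_train = [10, 12, 11, 9, 10, 11, 12, 10]
early_policy = [10, 11, 12, 9]     # still close to the RM training data
drifted_policy = [30, 28, 11, 35]  # mostly far outside the training range

print(should_retrain_rm(rm_train, early_policy))
print(should_retrain_rm(rm_train, drifted_policy))
```

A single scalar feature is a crude proxy. A practical check would more plausibly compare embeddings or track RM agreement with fresh human labels, but the control flow (measure drift, compare to a tolerance, trigger retraining) would look similar.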