Distributional Shift
When the RL policy generates responses very different from the distribution the reward model was trained on.
When the RL policy generates responses very different from the distribution the reward model was trained on. The RM becomes unreliable for out-of-distribution examples, leading to poor learning signals or reward hacking. Addressed by periodically retraining the RM.