KL Divergence Penalty

Appears in 1 paper

A regularization term in the RL objective that constrains the policy to stay close to the SFT baseline: β · KL[π_RL || π_SFT].

As used in Paper 15 — Training Language Models to Follow Instructions with Human Feedback →

A regularization term in the RL objective that constrains the policy to stay close to the SFT baseline: β · KL[π_RL || π_SFT]. Prevents the RL policy from diverging too far in pursuit of reward, avoiding catastrophic forgetting of pretraining knowledge and reducing reward hacking. β ≈ 0.01–0.1 is typical.