Policy Gradient / Policy Optimization
RL algorithms that improve a policy (a probability distribution over actions) by taking gradient steps that increase expected reward. PPO is an example. In this paper, policy optimization is used to train the language model policy to maximize reward from the RM while staying close to the SFT model via a KL penalty.
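
As a rough illustration of the idea (not the paper's PPO setup), the sketch below runs exact gradient ascent on a KL-penalized objective for a toy softmax policy over three actions. The reward values, the reference distribution `ref` (standing in for the SFT model), `beta`, the learning rate, and all function names are assumptions chosen for the example. Real PPO estimates the gradient from sampled outputs and adds clipping, both of which are omitted here.

```python
import math

def softmax(logits):
    # Numerically stable softmax: turns logits into a probability distribution.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_penalized_objective(pi, ref, rewards, beta):
    # Expected reward minus beta * KL(pi || ref).
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, ref, rewards))

def policy_gradient_step(logits, ref, rewards, beta, lr):
    # Exact gradient of the objective w.r.t. softmax logits:
    # grad_k = pi_k * (f_k - E_pi[f]), with f_a = r_a - beta * log(pi_a / ref_a).
    pi = softmax(logits)
    f = [r - beta * math.log(p / q) for p, q, r in zip(pi, ref, rewards)]
    baseline = sum(p * fa for p, fa in zip(pi, f))
    grad = [p * (fa - baseline) for p, fa in zip(pi, f)]
    return [z + lr * g for z, g in zip(logits, grad)]

# Hypothetical toy values: three actions, rewards playing the role of RM scores.
rewards = [1.0, 0.0, 0.5]
ref = [0.2, 0.5, 0.3]   # stand-in for the SFT policy
beta = 0.1              # strength of the KL penalty

# Start the policy at the reference distribution, as in fine-tuning from SFT.
logits = [math.log(q) for q in ref]
initial_objective = kl_penalized_objective(softmax(logits), ref, rewards, beta)

for _ in range(2000):
    logits = policy_gradient_step(logits, ref, rewards, beta, lr=0.5)

final_policy = softmax(logits)
final_objective = kl_penalized_objective(final_policy, ref, rewards, beta)
```

The policy shifts most of its probability mass to the highest-reward action, while the KL term keeps the low-reward actions from collapsing to exactly zero probability. A larger `beta` would hold `final_policy` closer to `ref`.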