Policy Gradient / Policy Optimization

Appears in 1 paper

A family of reinforcement learning (RL) algorithms that improve a policy (a parameterized probability distribution over actions) by taking gradient steps on its parameters that increase expected reward.
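As an illustration of "gradient steps that increase expected reward", here is a minimal sketch on a three-armed bandit with a softmax policy. The rewards, step size, and function names are assumptions made for this example, not taken from any paper. To keep the run deterministic, the score-function gradient, sum over a of pi(a) * r(a) * grad log pi(a), is computed exactly over all actions; practical algorithms such as REINFORCE or PPO estimate it from sampled actions instead.

```python
import math

def softmax(logits):
    """Turn logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """J(theta) = sum_a pi(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient(logits, rewards):
    """Exact score-function gradient of J with respect to the logits.

    For a softmax policy, d log pi(a) / d theta_k = 1[a == k] - pi(k).
    """
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            score = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += probs[a] * rewards[a] * score
    return grad

def train(rewards, steps=200, lr=0.5):
    """Plain gradient ascent on the logits, starting from a uniform policy."""
    logits = [0.0] * len(rewards)
    history = [expected_reward(logits, rewards)]
    for _ in range(steps):
        grad = policy_gradient(logits, rewards)
        logits = [t + lr * g for t, g in zip(logits, grad)]
        history.append(expected_reward(logits, rewards))
    return logits, history

# Illustrative rewards (an assumption): the third action is the best one.
REWARDS = [0.1, 0.5, 1.0]
final_logits, history = train(REWARDS)
```

After training, `softmax(final_logits)` places most of its probability on the highest-reward action, and `history` records the expected reward rising from its starting value under the uniform policy.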

As used in Paper 15: Training Language Models to Follow Instructions with Human Feedback

PPO (Proximal Policy Optimization) is one example. In this paper, policy optimization is used to fine-tune the language model policy so that it maximizes the reward assigned by the reward model (RM), while a KL penalty keeps the policy close to the supervised fine-tuned (SFT) model.
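The RM-reward-minus-KL-penalty idea can be sketched as a small reward-shaping function. This is a hedged illustration, not the paper's implementation: the function name, the argument layout (per-token log-probabilities of one sampled response under each model), and the default `beta` are all assumptions. The sum of per-token log-ratios is a single-sample estimate of the KL term, not the exact divergence.

```python
def kl_penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Shaped reward: rm_score - beta * log(pi_policy(y|x) / pi_sft(y|x)).

    rm_score        : scalar score from the reward model for response y
    policy_logprobs : log-probs of each token of y under the current policy
    sft_logprobs    : log-probs of the same tokens under the SFT model
    beta            : penalty strength (illustrative value, an assumption)
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    # The sequence-level log-ratio is the sum of per-token log-ratios.
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * log_ratio

# The policy assigns the response higher probability than the SFT model,
# so the penalty lowers the reward the policy is trained on.
shaped = kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -1.5])

# With identical log-probs the penalty vanishes and the RM score passes through.
unchanged = kl_penalized_reward(1.0, [-0.5, -1.0], [-0.5, -1.0])
```

A larger `beta` pulls the policy back toward the SFT model more strongly; `beta = 0` recovers optimization of the raw RM score.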
