Off-Policy vs. On-Policy RL

Appears in 1 paper

Off-policy: Learning from data generated by a policy other than the one being trained (e.g., human-written demonstrations used as supervised data).

As used in Paper 15: Training Language Models to Follow Instructions with Human Feedback

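Off-policy: Learning from data generated by a policy other than the one being trained (e.g., human-written demonstrations used as supervised data). On-policy: Learning from data generated by the current policy. RLHF as described in this paper is mostly on-policy: during the PPO stage the policy generates its own rollouts, which are then scored by the reward model. It also has off-policy elements, since the supervised fine-tuning (SFT) stage trains on demonstrations that come from other sources rather than from the policy itself.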
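
The distinction can be made concrete with a toy example. The sketch below is not the paper's method: it assumes a three-armed bandit with made-up rewards (REWARDS), a softmax policy, and a plain REINFORCE update, and all function names are hypothetical. The on-policy loop samples each action from the current policy. The off-policy loop samples from a fixed uniform behavior policy and reweights each update by an importance ratio, which is one standard correction and is swapped in here purely for illustration (SFT itself uses no such weighting).

```python
import math
import random

REWARDS = [0.0, 0.2, 1.0]  # hypothetical per-action rewards for the toy bandit


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw one action index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_step(logits, action, reward, weight=1.0, lr=0.1):
    """One REINFORCE update: logits += lr * weight * reward * grad log pi(action)."""
    probs = softmax(logits)
    return [
        x + lr * weight * reward * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


def train_on_policy(steps=2000, seed=0):
    """On-policy: every action is sampled from the policy being trained."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        action = sample(softmax(logits), rng)  # data from the current policy
        logits = reinforce_step(logits, action, REWARDS[action])
    return softmax(logits)


def train_off_policy(steps=2000, seed=0):
    """Off-policy: actions come from a fixed behavior policy, not the learner."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    behavior = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform behavior policy
    for _ in range(steps):
        action = sample(behavior, rng)  # data from another policy
        # Importance ratio pi(a) / b(a) corrects for the sampling mismatch.
        ratio = softmax(logits)[action] / behavior[action]
        logits = reinforce_step(logits, action, REWARDS[action], weight=ratio)
    return softmax(logits)


if __name__ == "__main__":
    print("on-policy: ", train_on_policy())
    print("off-policy:", train_off_policy())
```

In this toy setting both loops should end up preferring the highest-reward action. The only difference is where the training data comes from, which is the property the definitions above turn on.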