Value Function / Baseline

Appears in 1 paper

In reinforcement learning (RL), an estimate of expected future reward that is subtracted from the observed reward to reduce the variance of policy-gradient estimates.

As used in Paper 15 — Training Language Models to Follow Instructions with Human Feedback →

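In RLHF, a learned value function V(prompt) estimates the expected reward for a given prompt. Subtracting this estimate from the reward actually observed gives the advantage, which measures how much better or worse a response was than expected. Weighting policy-gradient updates by the advantage rather than the raw reward reduces noise in the gradient estimates and improves training stability.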
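The effect of a baseline can be illustrated with a small simulation. The sketch below is not the paper's implementation: the prompt names, reward distribution, and the per-prompt average used as a stand-in for a learned V(prompt) are all assumptions chosen for illustration. It compares the spread of raw rewards with the spread of advantages (reward minus baseline), which is the scalar that weights each policy-gradient term.

```python
import random
import statistics


def compute_advantages(rewards, values):
    """Advantage = observed reward minus the baseline's prediction."""
    return [r - v for r, v in zip(rewards, values)]


# Hypothetical setup: three prompts whose expected rewards differ.
rng = random.Random(0)
true_mean = {"easy": 2.0, "medium": 0.0, "hard": -2.0}

# Sample (prompt, reward) pairs: prompt-dependent mean plus Gaussian noise.
prompts = [rng.choice(list(true_mean)) for _ in range(3000)]
rewards = [true_mean[p] + rng.gauss(0.0, 0.5) for p in prompts]

# Stand-in for a learned V(prompt): the average reward observed per prompt.
# A real system would fit a neural network instead (assumption for brevity).
value_table = {
    p: statistics.fmean([r for q, r in zip(prompts, rewards) if q == p])
    for p in true_mean
}
values = [value_table[p] for p in prompts]

advantages = compute_advantages(rewards, values)

raw_var = statistics.pvariance(rewards)
adv_var = statistics.pvariance(advantages)
print(f"variance of raw rewards: {raw_var:.3f}")
print(f"variance of advantages:  {adv_var:.3f}")
```

Because the baseline absorbs the part of the reward that depends only on the prompt, the advantages vary far less than the raw rewards. What remains is mostly the noise that distinguishes one response from another, which is the signal the policy update needs.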