Reward Model (RM)
A neural network trained in the second stage of RLHF to predict which of two responses humans prefer. It takes a single (prompt, response) pair and outputs a scalar reward (an unnormalized logit). It is trained on human preference comparisons with a Bradley-Terry loss, which pushes the reward of the preferred response above that of the rejected one. Once trained, it gives fast reward estimates during RL without human raters in the loop.
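The Bradley-Terry objective can be illustrated with a minimal sketch. It assumes the reward model has already produced one scalar reward for each response in a comparison. The function names `preference_probability` and `bradley_terry_loss` are hypothetical, chosen for illustration, and the code uses plain floats rather than any particular training framework.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred.

    Equals sigmoid(reward_chosen - reward_rejected), computed in a way
    that avoids overflow for large negative margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human preference for one comparison.

    -log(sigmoid(margin)) is rewritten as a numerically stable softplus
    of the negated margin.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative rewards only (not real model outputs).
loss_when_correct = bradley_terry_loss(2.0, 0.0)  # model agrees with the label
loss_when_wrong = bradley_terry_loss(0.0, 2.0)    # model disagrees with the label
```

The loss depends only on the difference between the two rewards, so it is small when the model ranks the preferred response higher and grows roughly linearly with the margin when it ranks them the wrong way round.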
A neural network trained to predict how "good" an AI output is. In RLAIF, the reward model learns from AI-generated preference labels that indicate which response better follows the constitution. The trained reward model then provides the reward signal during PPO optimization.