Supervised Fine-Tuning (SFT)
The first stage of RLHF. Fine-tune a pretrained model (e.g., GPT-3) on human-written examples of good behavior using standard cross-entropy loss. Result: a model that follows instructions better than the base model, but hasn't yet learned to optimize for human preferences.
More generally: training a model on high-quality examples using standard cross-entropy loss, so that it learns to generate outputs similar to the training examples. rStar-Math uses SFT to train on solutions generated by Monte Carlo Tree Search (MCTS).
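The "standard cross-entropy loss" mentioned above can be made concrete with a small sketch. This is an illustration in plain Python, not the training code of any particular system: the function names (log_softmax, sft_loss), the toy logits, and the loss mask that excludes prompt tokens are all assumptions introduced here. Masking the prompt is a common convention, but the entries above do not say whether it is used.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(logits_per_step, target_ids, loss_mask):
    """Mean negative log-likelihood of the target tokens.

    logits_per_step: one list of logits per sequence position
    target_ids:      the reference (example) token id at each position
    loss_mask:       1 to count a position, 0 to skip it (assumption:
                     prompt tokens are skipped, response tokens counted)
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_step, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 3 tokens, sequence of 3 positions.
logits = [
    [2.0, 0.0, 0.0],  # position 0: treated as prompt, masked out
    [0.0, 0.0, 0.0],  # position 1: model is uniform over the vocabulary
    [0.0, 3.0, 0.0],  # position 2: model strongly prefers token 1
]
targets = [0, 2, 1]
mask = [0, 1, 1]

loss = sft_loss(logits, targets, mask)
print(f"SFT loss: {loss:.4f}")
```

Minimizing this quantity pushes probability mass toward the tokens in the reference examples, which is why the fine-tuned model ends up producing outputs similar to its training data. A real setup would compute the same loss with an autodiff framework and update the model's weights by gradient descent.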