Fine-tuning

Appears in 4 papers

Adapting a pre-trained model to a specific task by continuing training on labelled task data with a small learning rate.

As used in Paper 10 — Improving Language Understanding by Generative Pre-Training →

Adapting a pre-trained model to a specific task by continuing training on labelled task data with a small learning rate. Fine-tuning updates all model weights, not just the classification head.
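The difference between updating all weights and updating only the head can be sketched with a toy two-parameter model. This is an illustration under stated assumptions, not the paper's training code: the model y_hat = head * tanh(body * x), the names sgd_step and fine_tune, the starting weights, and the data are all hypothetical.

```python
import math

def sgd_step(params, x, y, lr, update_body=True):
    """One SGD step on squared error for the toy model y_hat = head * tanh(body * x)."""
    body, head = params["body"], params["head"]
    h = math.tanh(body * x)                      # output of the "pre-trained body"
    err = head * h - y                           # derivative of 0.5 * err**2 w.r.t. y_hat
    grad_head = err * h
    grad_body = err * head * (1.0 - h * h) * x   # chain rule through tanh
    new_head = head - lr * grad_head
    # Full fine-tuning also moves the body; head-only training freezes it.
    new_body = body - lr * grad_body if update_body else body
    return {"body": new_body, "head": new_head}

def fine_tune(pretrained, data, lr=2e-5, epochs=3, update_body=True):
    """Continue training from a copy of the pre-trained weights on labelled task data."""
    params = dict(pretrained)
    for _ in range(epochs):
        for x, y in data:
            params = sgd_step(params, x, y, lr, update_body)
    return params

# Hypothetical "pre-trained" weights and a tiny labelled task dataset.
pretrained = {"body": 0.8, "head": 0.1}
task_data = [(1.0, 1.0), (-1.0, -1.0), (0.5, 0.4)]

full = fine_tune(pretrained, task_data, update_body=True)
head_only = fine_tune(pretrained, task_data, update_body=False)
```

With update_body=True both parameters move away from their pre-trained values; with update_body=False only the head changes, which corresponds to training a classifier on top of frozen features.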

As used in Paper 11 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding →

The second phase of BERT's training. The pre-trained model is loaded and trained further on a small labelled dataset for a specific task (e.g. sentiment analysis). All parameters — including the Transformer encoder — are updated, but the learning rate is kept very small (2e-5 to 5e-5) to avoid catastrophic forgetting.
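Why a very small learning rate keeps the model close to its pre-trained state can be illustrated with a deliberately simplified sketch. It assumes a single scalar weight and a quadratic task loss, neither of which comes from the paper; the function name drift_after_fine_tuning and the contrasting learning rate of 1e-1 are hypothetical choices made only for comparison.

```python
def drift_after_fine_tuning(w_pretrained, w_task_optimum, lr, steps):
    """Run gradient descent on the toy task loss 0.5 * (w - w_task_optimum)**2
    and return how far the weight has moved from its pre-trained value."""
    w = w_pretrained
    for _ in range(steps):
        grad = w - w_task_optimum   # gradient of the quadratic task loss
        w -= lr * grad
    return abs(w - w_pretrained)

# Same starting point, same number of steps, two learning rates.
small_lr_drift = drift_after_fine_tuning(1.0, 5.0, lr=2e-5, steps=1000)
large_lr_drift = drift_after_fine_tuning(1.0, 5.0, lr=1e-1, steps=1000)
```

In this toy setting the small learning rate leaves the weight within a small fraction of the distance to the task optimum, while the large one moves it essentially all the way there. Real catastrophic forgetting depends on much more than distance in weight space, so this is only an intuition aid.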

As used in Paper 12 — Language Models are Few-Shot Learners →

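Training a pre-trained model on labelled data specific to a particular task. In NLP before GPT-3, each new task typically required its own fine-tuning run. GPT-3 showed that a sufficiently large model can handle many tasks without fine-tuning, using prompt engineering and in-context learning: the task description and a few examples are placed in the prompt, and no weights are updated.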
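To make the contrast concrete, here is a minimal sketch of how a few-shot prompt might be assembled for in-context learning. The function name build_few_shot_prompt, the "Review:" / "Sentiment:" layout, and the example reviews are assumptions for illustration, not a format prescribed by the paper.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: task description, labelled demonstrations,
    then the unlabelled query for the model to complete."""
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item has no label; the model is expected to continue from here.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("Great film, loved it.", "positive"),
        ("Dull and far too long.", "negative"),
    ],
    "A charming surprise.",
)
print(prompt)
```

The labelled examples play the role that a fine-tuning dataset would otherwise play, but they are consumed at inference time and the model's weights stay fixed.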

As used in Paper 13 — Scaling Laws for Neural Language Models →

Training a pre-trained model on a smaller amount of labelled data specific to a downstream task. The scaling laws describe pre-training; fine-tuning typically uses far less data and compute, so it has a different compute profile and is not what the laws model.