Cross-Entropy Loss

Appears in 2 papers

The standard loss function for language modeling.

As used in Paper 12 — Language Models are Few-Shot Learners →

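The standard loss function for language modeling. It measures how well the model's predicted probability distribution matches the true distribution (i.e., the correct next token): the loss on each token is the negative log of the probability the model assigned to the correct token, averaged over all tokens. Lower loss means the model puts more probability on the correct tokens.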
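Written out, one common form of this loss is shown below. The notation is an assumption made for illustration, not taken from the paper: N is the number of tokens, x_i is the i-th token, x_{<i} is the context before it, and p_theta is the model's predicted distribution.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)
```

With the natural logarithm the loss is in nats; with base 2 it is in bits.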

As used in Paper 13 — Scaling Laws for Neural Language Models →

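The standard performance metric for language models. It measures how well the model's predicted probability distribution matches the true distribution. Lower loss means a better model. The loss is 0 when the model assigns probability 1 to every correct token, and it grows without bound as the probability assigned to a correct token approaches 0.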
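A minimal sketch of how the loss behaves across that range, assuming we already have the probability the model assigned to each correct token. The function name `cross_entropy` and the example probabilities are hypothetical, chosen only for illustration.

```python
import math


def cross_entropy(probs_for_correct_token):
    """Mean negative log-probability (in nats) assigned to the correct tokens.

    Hypothetical helper: takes a list of probabilities, one per token,
    each in the interval (0, 1].
    """
    total = sum(-math.log(p) for p in probs_for_correct_token)
    return total / len(probs_for_correct_token)


# Made-up probabilities, not results from any paper.
perfect = cross_entropy([1.0, 1.0])            # every correct token gets probability 1
confident = cross_entropy([0.9, 0.8, 0.95])    # mostly right
uncertain = cross_entropy([0.1, 0.2, 0.05])    # mostly wrong
coin_flip = cross_entropy([0.5])               # equals ln(2), about 0.693 nats

print(perfect, confident, uncertain, coin_flip)
```

The perfect case gives exactly 0, and the loss rises as the probabilities on the correct tokens shrink. A probability of exactly 0 is not a valid input here, since the log is undefined there, which corresponds to the loss being unbounded above.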