Benchmark

Appears in 1 paper

Standardised tests for evaluating language model quality (e.g., MMLU for broad knowledge and reasoning, GSM8K for grade-school math, HumanEval for coding).

As used in Paper 18 — Mistral 7B →

Mistral 7B outperforms the larger LLaMA 2 13B on most benchmarks, which surprised many who assumed that more parameters always means better results. Benchmarks do not capture every aspect of model quality, but they give a common, reproducible basis for comparing models.
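
To make "comparison point" concrete, here is a minimal sketch of how a benchmark score is often computed: exact-match accuracy against reference answers. The data and the names `model_a` and `model_b` are made up for illustration and are not results from any paper. Real benchmarks such as MMLU or GSM8K add their own prompt formats and answer-extraction rules, which this sketch leaves out.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty benchmark")
    # Compare each answer after trimming surrounding whitespace.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical multiple-choice items (illustrative only, not real benchmark data).
references = ["B", "A", "D", "C"]
model_a_answers = ["B", "A", "C", "C"]
model_b_answers = ["B", "C", "C", "C"]

# Both models are scored on the same questions with the same metric,
# which is what makes the resulting numbers comparable.
scores = {
    "model_a": accuracy(model_a_answers, references),
    "model_b": accuracy(model_b_answers, references),
}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A single accuracy number says nothing about which questions a model missed or why, which is one reason benchmark scores are best read as partial evidence.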