GSM8K

Appears in 1 paper

A benchmark dataset of 8,500 grade-school math word problems, ranging from simple arithmetic to multi-step reasoning.

As used in Paper 14: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Each problem takes 2 to 8 reasoning steps to solve, and problems often include irrelevant distractors. GSM8K was a key evaluation benchmark in the chain-of-thought (CoT) paper and has since become a standard benchmark for measuring the multi-step reasoning capabilities of large language models.
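
To make the evaluation concrete, here is a minimal sketch of how a model completion might be scored against a GSM8K reference solution. It relies on the dataset's convention that each reference solution ends with a line of the form `#### <final answer>`. The function names are hypothetical, the "take the last number in the completion" heuristic is an assumption rather than the method of any particular paper, and the sample solution is made up for illustration, not taken from the dataset.

```python
import re
from typing import Optional


def extract_reference_answer(solution: str) -> str:
    """Return the final answer that follows the '####' marker in a reference solution."""
    marker = "####"
    if marker not in solution:
        raise ValueError("reference solution has no '####' marker")
    answer = solution.split(marker)[-1].strip()
    # Drop thousands separators so "1,234" compares equal to "1234".
    return answer.replace(",", "")


def extract_predicted_answer(completion: str) -> Optional[str]:
    """Assumed heuristic: treat the last number in the completion as the prediction."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")


def is_correct(completion: str, solution: str) -> bool:
    """Numeric exact match between the predicted and the reference answer."""
    predicted = extract_predicted_answer(completion)
    if predicted is None:
        return False
    return float(predicted) == float(extract_reference_answer(solution))


# Made-up example in the GSM8K solution style (not an actual dataset item).
sample_solution = (
    "Tom has 3 boxes with 4 pens each, so 3*4 = <<3*4=12>>12 pens.\n"
    "He gives away 5 pens, leaving 12-5 = <<12-5=7>>7 pens.\n"
    "#### 7"
)

if __name__ == "__main__":
    print(is_correct("3 boxes of 4 is 12, minus 5 is 7. The answer is 7.", sample_solution))
```

Accuracy on the benchmark is then the fraction of test problems for which `is_correct` returns `True`. Real evaluation harnesses differ in how they parse the model's answer, so reported scores can depend on that choice.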