MATH Benchmark

Appears in 2 papers

A dataset of 12,500 competition-level math problems drawn from contests including the AMC (American Mathematics Competitions) and the AIME (American Invitational Mathematics Examination).

As used in Paper 23 — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters →

A standard benchmark for evaluating reasoning in LLMs, and the paper's main experimental domain.

As used in Paper 24 — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking →

A dataset of 12,500 competition-level math problems, including problems from the AMC and AIME exams. A standard benchmark for evaluating mathematical reasoning in language models.
