MATH Benchmark
A dataset of 12,500 competition-level math problems drawn from competitions including the AMC (American Mathematics Competitions) and AIME (American Invitational Mathematics Examination). A standard benchmark for evaluating mathematical reasoning in LLMs, and the paper's main experimental domain.