By 2024, scaling had hit diminishing returns on hard reasoning tasks.
Benchmark: MATH (12,500 competition-level mathematics problems from AMC, AIME, etc.)
Results in 2024:
- GPT-4: ~42% accuracy
- Claude 3 Opus: ~45% accuracy
- Best open-source models (LLaMA 70B): ~25% accuracy
- Random guess: ~0% (problems are hard, multiple choice typically has 4–5 options)
The ceiling: Even the largest models in the world couldn’t get past ~50% accuracy on competition math. That’s still failing half the time.
Why can’t we scale out of this problem?
-
Bigger models don’t help much anymore: Going from 70B to 100B parameters gives maybe a 2–3% improvement on MATH. The scaling exponent has flattened.
-
More training data alone doesn’t help: There’s only so much high-quality mathematical training data available. You can’t just throw more data at the problem.
-
The problem requires reasoning, not memorization: Competition math problems require multi-step reasoning, backtracking on failed approaches, and verifying solutions. These are not things that scale with memorized patterns.
A concrete example:
Problem: “A circle is inscribed in a triangle with sides 5, 6, and 7. Find the radius of the inscribed circle.”
A model can either:
- Remember the formula for inradius: r = Area / s, where s = (a+b+c)/2 (memorization).
- Derive it from scratch (reasoning).
If the problem is slightly different (e.g., a different triangle, or a parabola instead of circle), pure memorization fails.
Why sequential revision or best-of-N might help:
If a single attempt has success probability p = 0.4 (GPT-4 on MATH), then:
- Best-of-10: success probability = 1 - (0.6)^10 ≈ 0.994 (99.4%)
- But you need a perfect verifier that can judge whether each answer is correct
If the model can refine its reasoning:
- Attempt 1: “I’ll use the inradius formula…”
- Attempt 2: “Wait, I made an error. Let me recalculate using Heron’s formula for area…”
- Attempt 3: “That’s better. Let me verify by checking if the circle actually fits inside the triangle…”
The insight: The problem isn’t that the model is stupid. It’s that the model doesn’t have time to think — to try multiple approaches, to critique its own reasoning, to refine.
Test-time compute gives the model this luxury.