The Problem: Scaling Laws Plateau on Hard Tasks — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters

By 2024, scaling had hit diminishing returns on hard reasoning tasks.

Benchmark: MATH (12,500 competition-level mathematics problems from AMC, AIME, etc.)

Results in 2024:

GPT-4: ~42% accuracy
Claude 3 Opus: ~45% accuracy
Best open-source models (LLaMA 70B): ~25% accuracy
Random guess: ~0% (problems are hard, multiple choice typically has 4–5 options)

The ceiling: Even the largest models in the world couldn’t get past ~50% accuracy on competition math. That’s still failing half the time.

Why can’t we scale out of this problem?

Bigger models don’t help much anymore: Going from 70B to 100B parameters gives maybe a 2–3% improvement on MATH. The scaling exponent has flattened.
More training data alone doesn’t help: There’s only so much high-quality mathematical training data available. You can’t just throw more data at the problem.
The problem requires reasoning, not memorization: Competition math problems require multi-step reasoning, backtracking on failed approaches, and verifying solutions. These are not things that scale with memorized patterns.

A concrete example:

Problem: “A circle is inscribed in a triangle with sides 5, 6, and 7. Find the radius of the inscribed circle.”

A model can either:

Remember the formula for inradius: r = Area / s, where s = (a+b+c)/2 (memorization).
Derive it from scratch (reasoning).

If the problem is slightly different (e.g., a different triangle, or a parabola instead of circle), pure memorization fails.

Why sequential revision or best-of-N might help:

If a single attempt has success probability p = 0.4 (GPT-4 on MATH), then:

Best-of-10: success probability = 1 - (0.6)^10 ≈ 0.994 (99.4%)
But you need a perfect verifier that can judge whether each answer is correct

If the model can refine its reasoning:

Attempt 1: “I’ll use the inradius formula…”
Attempt 2: “Wait, I made an error. Let me recalculate using Heron’s formula for area…”
Attempt 3: “That’s better. Let me verify by checking if the circle actually fits inside the triangle…”

The insight: The problem isn’t that the model is stupid. It’s that the model doesn’t have time to think — to try multiple approaches, to critique its own reasoning, to refine.

Test-time compute gives the model this luxury.