Section 01

Context: The Scaling Law Paradigm

Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters (2024)

By 2023, the AI field had a clear understanding of scaling laws: bigger models are better models. Papers like “Scaling Laws for Neural Language Models” (Kaplan et al., 2020) showed that:

  1. Model loss decreases predictably with model size (more parameters = lower loss)
  2. Model loss decreases predictably with training data size (more tokens = lower loss)
  3. Model loss decreases predictably with training compute (more FLOPs = lower loss)

The three factors (parameters, data, compute) scale together predictably. Each one follows a power law when the other two are not the bottleneck: loss ∝ N^(-α), loss ∝ D^(-β), and loss ∝ C^(-γ), where N is parameters, D is data, C is compute, and α, β, γ are small constants of roughly 0.05 to 0.1.
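To get a feel for what such a small exponent implies, here is a minimal sketch of a single power-law term. The exponent value 0.076 and the helper name `loss_ratio` are assumptions chosen for illustration; they are not taken from this section.

```python
def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Factor by which a pure power-law loss term (loss ~ N^-alpha) changes
    when N is multiplied by scale_factor."""
    return scale_factor ** (-alpha)


# Assumed exponent for the parameter-count term (illustrative only).
ALPHA = 0.076

for factor in (10, 100, 1000):
    ratio = loss_ratio(factor, ALPHA)
    print(f"{factor:>5}x parameters -> loss term multiplied by {ratio:.3f}")
```

Under this assumed exponent, a 10x increase in parameters lowers the loss term by only about 16%, which is why each further improvement requires a multiplicative, not additive, increase in scale.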

This led to a straightforward strategy: to improve your model, make it bigger and train on more data with more compute.

OpenAI’s scaling strategy: GPT-2 (1.5B) → GPT-3 (175B) → GPT-4 (~1.7T, rumored). Each step brought a bigger model, more data, and more training compute.

Meta’s strategy: LLaMA (7B, 13B, 33B, 65B), then Llama 2 (up to 70B). Pushing the parameter count.

Google’s strategy: PaLM (540B), followed by Gemini from Google DeepMind (competing on scale and task coverage).

The narrative was: “Scaling is all you need.” Just make the models bigger, and they’ll get smarter at everything.

By 2024, this narrative was still dominant, but a few cracks were showing:

  1. Diminishing returns: Scaling from 1B to 7B gave huge improvements. Scaling from 7B to 70B was good. Scaling from 70B to 1.7T gave measurable but smaller gains per dollar spent.

  2. Inference cost is critical: Training a 70B model takes weeks, but that cost is paid once. Inference is paid on every request, and the cost per request grows with model size. For many applications, you can’t afford inference on the largest models.

  3. Some problems resist pure scaling: Competitive math problems, hard reasoning tasks, and complex multi-step problems still defeated even the largest models at the time (GPT-4, PaLM 2). You could scale past 100B parameters and still get only around 40% accuracy on the MATH benchmark.
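The inference-cost point in the list above can be made concrete with a common rule of thumb: a dense transformer's forward pass costs roughly 2 FLOPs per parameter per generated token. The sketch below assumes that approximation (it ignores attention over long contexts, batching, and hardware effects), and the function names and the 500-token response length are hypothetical.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rough forward-pass cost per token, assuming ~2 FLOPs per parameter."""
    return 2.0 * n_params


def flops_per_response(n_params: float, n_tokens: int) -> float:
    """Rough cost of generating one response of n_tokens tokens."""
    return forward_flops_per_token(n_params) * n_tokens


RESPONSE_TOKENS = 500  # hypothetical response length

for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    cost = flops_per_response(n_params, RESPONSE_TOKENS)
    print(f"{name}: ~{cost:.1e} FLOPs per {RESPONSE_TOKENS}-token response")
```

Under this approximation, cost per response grows linearly with parameter count: one 70B response costs about as many forward-pass FLOPs as ten 7B responses of the same length.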

The question: If scaling model size isn’t working for hard problems, what else can we do?

This paper’s answer: Stop thinking about making bigger models. Start thinking about making models that spend more time thinking.

The key insight: test-time compute (compute spent at inference time) is a separate scaling axis from training-time compute.

This shifts the research agenda from “make bigger models” to “make models that can solve problems better by thinking longer.”
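One way to see why extra inference compute can help is an idealized best-of-k calculation. This is a sketch under strong assumptions that are not from the paper: each sampled answer is independently correct with probability p, and a perfect verifier always picks a correct answer when one exists. The function name `best_of_k_success` and the value p = 0.4 are illustrative only.

```python
def best_of_k_success(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    given a per-sample success probability p (idealized: perfect verifier)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


P_SINGLE = 0.4  # hypothetical single-sample accuracy on a hard benchmark

for k in (1, 4, 16, 64):
    print(f"k={k:>2}: idealized success rate {best_of_k_success(P_SINGLE, k):.4f}")
```

With p = 0.4, four samples already raise the idealized success rate to about 87% without changing the model. Real verifiers are imperfect and samples are correlated, so this is an upper bound on what sampling alone can deliver, not a prediction.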