The Idea: Best-of-N vs. Sequential Revision — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters

Instead of training a bigger model, let the model solve the same problem multiple times or refine its solution iteratively. This is test-time compute: using computation at inference time, not training time.

Strategy 1: Best-of-N

Intuition: If you’re not sure about an answer, try multiple times and pick the best one.

Process:

Generate N independent solutions to the same problem (by sampling from the model with temperature > 0, or just running the model N times).
Use a verifier (Process Reward Model from Paper 16, or a simple correctness checker) to score each solution.
Return the solution with the highest score.

Why it works: If each solution is correct with probability p, then at least one is correct with probability 1 - (1-p)^N.

Example: If p = 0.4 (40% chance each solution is correct):

N = 1: P(at least one correct) = 0.4
N = 2: P(at least one correct) = 1 - 0.6^2 = 0.64
N = 5: P(at least one correct) = 1 - 0.6^5 = 0.922
N = 10: P(at least one correct) = 1 - 0.6^10 = 0.994

So with just 10 tries, a model with 40% single-attempt accuracy becomes 99.4% accurate (assuming the verifier is perfect).

Compute cost: N forward passes, each generating a full solution.

Bottleneck: The verifier must be accurate. If the verifier scores the wrong solution as best, you pick a wrong answer.

Strategy 2: Sequential Revision

Intuition: Instead of trying multiple independent solutions, refine one solution iteratively.

Process:

Generate an initial solution.
Ask the model to critique the solution (or use a separate critic model).
Ask the model to revise the solution based on the critique.
Repeat steps 2–3 for K rounds.
Return the final revised solution.

Why it works: The model learns from feedback and improves with each round.

Example: Starting p = 0.1 (10% chance of solving first try), with improvement of Δp = 0.05 per round:

Round 1: P = 0.1
Round 2: P = 0.15
Round 3: P = 0.2
Round 5: P = 0.3
Round 10: P = 0.55

By round 10, a 10% model becomes a 55% model through iterative refinement.

Compute cost: K forward passes (one per round), where K is the number of revision rounds. Each forward pass is for critique + revision, which can be longer than a single solution.

Bottleneck: The model must actually improve with feedback. If it repeats the same mistakes or doesn’t understand the critique, iterative refinement fails.

When Is Each Strategy Better?

The paper shows that the compute-optimal strategy depends on task difficulty:

For easy problems (p_initial = 0.8):

Best-of-N: Generate 5 tries, pick best. Success probability ≈ 0.997.
Sequential revision: Refine same solution. Probably already at 0.8, refinement adds little.
Winner: Best-of-N (cheaper, sampling beats refinement when you’re already good)

For hard problems (p_initial = 0.1):

Best-of-N: Generate 50 tries to reach 99%+ accuracy. Expensive (50 forward passes).
Sequential revision: Refine 10 times, improve from 0.1 to 0.55. Much cheaper (10 forward passes) and you reach reasonable accuracy.
Winner: Sequential revision (refinement beats repeated sampling when you’re struggling)

The optimal strategy is problem-dependent: Given a fixed token budget at inference time, you must decide: do I sample more or think deeper?

The Indian Analogy

You’re taking an entrance exam. You have limited time.

For an easy multiple-choice question: Best-of-N strategy. Read the question, quickly generate a few possible answers in your head, eliminate the bad ones, pick the best. Don’t spend time refining — just go with your gut.

For a hard essay question: Sequential revision strategy. Don’t try writing three completely different essays. Write one essay, then read it over, fix errors, think about counterarguments, revise. Spend time on one thoughtful answer.

The two strategies are optimal for different types of problems. The exam setter (the AI lab) must choose which type of problem to give, or the solver must choose which strategy to spend time on.

Connection to System 1 vs. System 2

Daniel Kahneman’s framework from his book “Thinking, Fast and Slow”:

System 1 (fast, intuitive): Generates quick answers without much thought.
System 2 (slow, deliberate): Reasons carefully through a problem.

Best-of-N is like asking System 1 many times and taking a vote. Sequential revision is like asking System 2 to think deeply once.

This paper shows: System 2 (think longer) beats System 1 (think faster, many times) on hard problems. The frontier moves from “fast, parallel thinking” to “slow, deep thinking.”

Summary

Best-of-N: Try multiple independent solutions, pick the best. Good for easy problems where sampling diversity helps.
Sequential revision: Refine one solution iteratively with feedback. Good for hard problems where thinking deeper helps.
Trade-off: Given a fixed token budget, you must choose between breadth (N tries) and depth (K rounds of refinement).
The insight: Test-time compute is as important as model size on hard tasks. A smaller model given more time can beat a larger model given less time.