The Math: Coverage Probabilities and Compute Trade-offs — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters

Prerequisite Tutorials

Best-of-N Coverage Probability

The key formula for best-of-N: given N independent attempts, each correct with probability p, what is the probability that at least one is correct?

P(at least one correct | N attempts) = 1 - P(all incorrect)
                                      = 1 - (1 - p)^N

This is the coverage — the probability the verifier sees at least one correct solution among the N tries.

Worked Example 1: MATH Benchmark

Model: GPT-4 on MATH. Single-attempt accuracy: p = 0.42

Suppose we generate N = 10 independent solutions and use a perfect verifier to check them:

P(at least one correct) = 1 - (1 - 0.42)^10
                        = 1 - (0.58)^10
                        = 1 - 0.00324
                        = 0.99676 ≈ 99.7%

So 10 attempts with a perfect verifier turns a 42% model into a 99.7% model.

But this assumes a perfect verifier. In reality, the verifier also makes mistakes.

Coverage with Imperfect Verifier

If the verifier has accuracy v (chance of correctly identifying which solution is right), then:

P(solve | N attempts, verifier accuracy v) 
  = P(at least one correct AND verifier picks it)
  ≈ 1 - (1-p)^N × v + correction terms

Simplified: even with a slightly imperfect verifier (v = 0.95), the coverage drops. For N = 10, p = 0.42:

Perfect verifier:   P ≈ 0.997
v = 0.95 verifier: P ≈ 0.997 × 0.95 ≈ 0.947

This is still excellent, but shows the verifier quality matters.

Sequential Revision Model

For sequential revision, we need a model of how the solution improves with each round.

Simple linear improvement model: After round i, the success probability is:

p_i = p_0 + i × Δp

Where:

p_0 = initial success probability (first attempt)
Δp = improvement per revision round
Maximum p_i ≤ 1.0 (can’t exceed 100%)

Worked Example 2: Sequential Revision

Model: 3.8B LLaMA on MATH. Initial success probability: p_0 = 0.2 (20%)

After each round of critique + revision, the model improves by Δp = 0.05 (5 percentage points).

Round-by-round success probability:

Round 1: p_1 = 0.2 + 0 × 0.05 = 0.2
Round 2: p_2 = 0.2 + 1 × 0.05 = 0.25
Round 3: p_3 = 0.2 + 2 × 0.05 = 0.30
Round 4: p_4 = 0.2 + 3 × 0.05 = 0.35
Round 5: p_5 = 0.2 + 4 × 0.05 = 0.40
Round 10: p_10 = 0.2 + 9 × 0.05 = 0.65

By round 10, the 20% model becomes a 65% model.

Compute Budget Trade-off

Given a fixed token budget B at test time, how do you allocate it between best-of-N and sequential revision?

Best-of-N: Each solution requires approximately K tokens (length of a typical solution). To generate N solutions: B = N × K, so N = B / K.

Sequential revision: Initial solution is K tokens. Each revision round adds roughly K tokens (critique + revised solution). For R rounds: B ≈ K × (1 + R), so R = (B / K) - 1.

Worked Example 3: Compute Budget Allocation

Budget: B = 10,000 tokens. Solution length: K = 1,000 tokens.

Option A (Best-of-N): N = 10,000 / 1,000 = 10 tries

Coverage: 1 - (0.42)^10 ≈ 0.9968 (99.68%)

Option B (Sequential Revision): R = (10,000 / 1,000) - 1 = 9 rounds

Final accuracy: p_0 + 9 × Δp = 0.2 + 9 × 0.05 = 0.65 (65%)

For this problem:

Best-of-N wins (99.68% vs 65%)

Now change the budget to B = 50,000 tokens:

Option A (Best-of-N): N = 50 tries

Coverage: 1 - (0.42)^50 ≈ 0.99999 (essentially 100%)

Option B (Sequential Revision): R = 49 rounds

Final accuracy: 0.2 + 49 × 0.05 = 2.65 (capped at 1.0 = 100%)

The model saturates — it can’t improve beyond 100% accuracy. But in this regime, both strategies reach near-perfect accuracy with the same budget.

Cost-Benefit Analysis: When Sequential Revision Wins

For a hard problem with p_0 = 0.1 (10%), Δp = 0.05:

Best-of-N Analysis:

N = 1:   P = 0.1
N = 2:   P = 1 - 0.9^2 = 0.19
N = 5:   P = 1 - 0.9^5 = 0.409
N = 10:  P = 1 - 0.9^10 = 0.652
N = 20:  P = 1 - 0.9^20 = 0.878
N = 50:  P = 1 - 0.9^50 = 0.995

Sequential Revision Analysis:

Round 1:  P = 0.1
Round 2:  P = 0.15
Round 5:  P = 0.30
Round 10: P = 0.55
Round 20: P = 1.0 (saturates)

To reach 65% accuracy:

Best-of-N: need N ≈ 10 tries
Sequential Revision: need R ≈ 10 rounds

Same number of rounds! But the compute is different:

Best-of-N: 10 forward passes, each generating a full solution
Sequential Revision: 10 critique + revision rounds (might be cheaper per round if critique is shorter)

The paper’s insight: sequential revision scales better for hard problems because each refinement directly targets the weakness, whereas best-of-N is sampling randomly.

Key Equations Summary

Best-of-N coverage:
  P_N = 1 - (1 - p)^N

Sequential revision (linear model):
  p_r = p_0 + r × Δp (capped at 1.0)

Compute budget constraint:
  B = N × K (best-of-N)
  B ≈ K × (1 + R) (sequential revision)

Optimal strategy:
  If (1 - p)^N > p_0 + r × Δp for same budget → use best-of-N
  If p_0 + r × Δp > (1 - (1-p)^N) for same budget → use sequential revision

The paper shows empirically that for MATH and hard reasoning tasks, sequential revision (and variants like search) is compute-optimal.