Section 07

Limitations of Scaling Laws

Scaling Laws for Neural Language Models 2020

The scaling laws are powerful, but they have real limitations. Understanding them prevents overconfidence.

Limitation 1: The Exponents Are Approximate

The paper found α_N ≈ 0.076, but this is an average across experiments. Individual runs vary:

  • Some models: α_N = 0.070
  • Some models: α_N = 0.082

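When predicting at very large scales (1 trillion parameters), small errors in α compound with the distance you extrapolate. A 0.006 difference in the exponent changes the predicted loss by a factor of (N_target / N_ref)^0.006: about 4% across three orders of magnitude in N, and 10% or more across seven.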

Impact: The laws are reliable for planning, but don’t trust a single predicted number. Use error bands (confidence intervals).
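
To see how exponent uncertainty turns into an error band, here is a minimal sketch. It assumes a pure power law L(N) ∝ N^(-α) anchored at one measured reference point. The reference loss of 3.0 at 1e9 parameters is a made-up value, and extrapolate_loss is a hypothetical helper, not something from the paper.

```python
def extrapolate_loss(loss_ref, n_ref, n_target, alpha):
    """Scale a reference loss along L(N) proportional to N^(-alpha)."""
    return loss_ref * (n_target / n_ref) ** (-alpha)

# Hypothetical anchor: a measured loss of 3.0 at 1e9 parameters (illustrative only).
LOSS_REF, N_REF, N_TARGET = 3.0, 1e9, 1e12

# Central exponent plus the two variations quoted in the text.
alphas = (0.070, 0.076, 0.082)
band = {a: extrapolate_loss(LOSS_REF, N_REF, N_TARGET, a) for a in alphas}

central = band[0.076]
spread = (band[0.070] - band[0.082]) / central  # width of the band, relative to the center

for a in alphas:
    print(f"alpha = {a:.3f} -> predicted loss {band[a]:.3f}")
print(f"relative width of the band: {spread:.1%}")
```

The gap between the optimistic and pessimistic exponents is the band to report next to the central estimate. It widens as the target moves further from the anchor point.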

Limitation 2: The Laws Don’t Account for Data Quality

The formulas assume all tokens are equally useful:

L(D) = a * D^(-0.103)

But in reality, 1 trillion tokens of high-quality Wikipedia text ≠ 1 trillion tokens of garbage text.

Real-world example:

  • 100B tokens of random internet text (low quality)
  • 50B tokens of books + structured data (high quality)

A model trained on 50B high-quality tokens might outperform one trained on 100B low-quality tokens. The laws don’t capture this.

Impact: You can’t just collect any 1 trillion tokens. Data curation matters. The fitted constants describe the dataset they were measured on; better-curated data can beat the predicted loss.
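
The sketch below shows why the formula cannot express this. It evaluates only the ratio of two predicted losses, so the unknown prefactor a cancels. The effective_tokens function and its quality factor are purely hypothetical, added to show what kind of extra input the law would need; neither comes from the paper.

```python
def predicted_loss_ratio(tokens_a, tokens_b, alpha_d=0.103):
    """Ratio L(tokens_a) / L(tokens_b) under L(D) = a * D^(-alpha_d); a cancels out."""
    return (tokens_a / tokens_b) ** (-alpha_d)

# The formula sees only token counts, so 100B tokens always beat 50B tokens.
ratio = predicted_loss_ratio(100e9, 50e9)
print(f"L(100B) / L(50B) = {ratio:.3f}")

def effective_tokens(tokens, quality):
    """Hypothetical adjustment: scale the token count by a made-up quality factor."""
    return tokens * quality

# With an assumed quality factor above 2, the 50B curated set comes out ahead.
curated = effective_tokens(50e9, quality=2.5)
ratio_curated = predicted_loss_ratio(curated, 100e9)
print(f"L(50B curated) / L(100B raw) = {ratio_curated:.3f}")
```

A ratio below 1 means the first dataset is predicted to reach the lower loss. Without a quality term, the second comparison could never favor the smaller dataset.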

Limitation 3: The Laws Don’t Account for Different Architectures

The paper focused on decoder-only Transformers. Do the same laws hold for:

  • Encoder-only models (BERT)?
  • Encoder-decoder models (T5)?
  • Mixture-of-Experts models?
  • Attention-free models (State Space Models, RNNs)?

Partial evidence: Scaling laws seem to generalize across Transformers, but the exponents (α_N, α_D) might differ slightly for different architectures.

Impact: If you switch to a different architecture, the predicted losses might be off. You’d need to re-calibrate on that architecture.
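
Re-calibrating can be as simple as refitting the power law on a handful of small runs of the new architecture. The following is a minimal sketch using ordinary least squares in log-log space. The data points are synthetic, generated from a known exponent so the fit can be checked; they are not real measurements, and fit_power_law is a hypothetical helper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit L = a * N^(-alpha) by least squares on log L = log a - alpha * log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic "measurements" from a known law, standing in for real training runs.
TRUE_A, TRUE_ALPHA = 12.0, 0.09
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a_fit, alpha_fit = fit_power_law(sizes, losses)
print(f"fitted a = {a_fit:.3f}, fitted alpha = {alpha_fit:.4f}")
```

With real runs the points will not sit exactly on a line, so it is worth checking the residuals before trusting the fitted exponent.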

Limitation 4: The Laws Assume Dense Training

The compute formula C ≈ 6 * N * D assumes dense models (all parameters used in every forward pass). Modern models often break that assumption:

  • Mixture-of-Experts (MoE): Only a fraction of parameters are active per token, so actual compute is lower than 6 * N * D with N counted as total parameters.
  • Sparse Attention: Not all tokens attend to all tokens, so attention compute is lower.
  • Quantization: Lower-precision arithmetic leaves the operation count unchanged but makes each operation cheaper, so FLOPs overstate the real cost.

In these cases, C ≈ 6 * N * D doesn’t hold exactly.

Impact: If you use MoE or sparse architectures, the effective compute is lower than the formula suggests. You need architecture-specific constants.
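
As a rough illustration of the MoE case, the sketch below compares the dense formula with a version that counts only active parameters. Using 6 * N_active * D is a common rule of thumb and is treated here as an assumption: it ignores router overhead and any layers shared by all tokens. The model size, active fraction, and token count are hypothetical.

```python
def dense_training_flops(n_params, n_tokens):
    """Dense estimate: C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def moe_training_flops(n_total_params, active_fraction, n_tokens):
    """Assumed MoE estimate: only the active share of parameters costs compute."""
    n_active = n_total_params * active_fraction
    return 6.0 * n_active * n_tokens

# Hypothetical configuration, not taken from any published model.
N_TOTAL = 400e9          # total parameters
ACTIVE_FRACTION = 0.125  # share of parameters used per token
N_TOKENS = 1e12          # training tokens

dense = dense_training_flops(N_TOTAL, N_TOKENS)
moe = moe_training_flops(N_TOTAL, ACTIVE_FRACTION, N_TOKENS)
print(f"dense formula:  {dense:.2e} FLOPs")
print(f"active-only:    {moe:.2e} FLOPs ({moe / dense:.1%} of dense)")
```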

Limitation 5: Compute-Optimal Ratio Was Later Refined (Chinchilla, 2022)

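This paper found that, at the compute-optimal point, N ∝ C^0.73 and D ∝ C^0.27: most of any extra compute should go into a bigger model.

But DeepMind’s Chinchilla paper (Hoffmann et al., 2022) re-examined this and found that parameters and data should grow at roughly the same rate:

  • N ∝ C^0.5
  • D ∝ C^0.5

Implication: Allocate far more data relative to parameters than this paper suggested. By that standard, GPT-3 (175B parameters, about 300B training tokens) was substantially undertrained for its size.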

Impact: This paper’s compute allocation advice was superseded. But the methodology (run experiments, fit power laws) remains valid.
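
To make the allocation exponents concrete, the sketch below shows how a 10x compute increase splits between parameters and data for a given parameter exponent. It assumes C ∝ N * D, so the data exponent is one minus the parameter exponent. The 0.73 value is the one quoted above for this paper; scale_allocation is a hypothetical helper.

```python
def scale_allocation(compute_multiplier, n_exponent):
    """Return (parameter multiplier, token multiplier) for N ∝ C^n_exponent.

    Assumes C ∝ N * D, so D takes the remaining exponent (1 - n_exponent).
    """
    d_exponent = 1.0 - n_exponent
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent

# This paper's split: N ∝ C^0.73, D ∝ C^0.27.
n_mult, d_mult = scale_allocation(10.0, n_exponent=0.73)
print(f"10x compute -> {n_mult:.2f}x parameters, {d_mult:.2f}x tokens")
```

Passing a smaller parameter exponent shifts more of the budget toward data, which is the direction of the Chinchilla refinement.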

Limitation 6: Loss Doesn’t Directly Predict Benchmark Performance

The scaling laws predict loss (cross-entropy on the test set). But real task performance (accuracy on classification, BLEU on translation, correctness on math) doesn’t always improve smoothly as loss falls.

Example (illustrative numbers):

Loss 1.5 bits per token → 60% accuracy on sentiment
Loss 1.4 bits per token → 65% accuracy
Loss 1.3 bits per token → 75% accuracy (sudden jump)
Loss 1.2 bits per token → 76% accuracy (back to gradual gains)

Sometimes there are emergent abilities or phase transitions where performance jumps suddenly at a certain scale. The smooth loss curve doesn’t capture these.

Impact: Predicting loss is useful but incomplete. For specific tasks, you need task-specific benchmarks.
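
One way a smooth loss curve can still produce a sharp-looking benchmark curve is an all-or-nothing metric. The toy sketch below assumes a task scored by exact match on a fixed-length answer, where the chance of getting every token right is exp(-loss * length). This is one possible mechanism offered for illustration, not a claim from the paper; the answer length and loss values are hypothetical.

```python
import math

def exact_match_probability(loss_nats_per_token, answer_length):
    """Probability of producing every answer token correctly.

    Assumes the per-token loss (in nats) on the answer tokens is uniform,
    so the joint probability is exp(-loss * length).
    """
    return math.exp(-loss_nats_per_token * answer_length)

ANSWER_LENGTH = 20  # hypothetical answer length in tokens
losses = [0.30, 0.25, 0.20, 0.15, 0.10, 0.05]  # evenly spaced, smoothly improving

scores = [exact_match_probability(l, ANSWER_LENGTH) for l in losses]
for l, s in zip(losses, scores):
    print(f"loss {l:.2f} nats/token -> exact match {s:.1%}")
```

The loss falls in equal steps, yet the score stays near zero for the first few steps and then climbs quickly, which is why a task-specific benchmark is still needed.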

Limitation 7: The Laws May Break at Extreme Scales

The paper’s own experiments covered models up to roughly 1.5B parameters; later models such as GPT-3 (175B) broadly followed the same trend. But what about 1 trillion? 10 trillion?

Open questions:

  • Do the laws still hold?
  • Do new phenomena (phase transitions, different optimal ratios) emerge?
  • Does the dataset become the bottleneck (we run out of text)?

At extreme scales, the assumptions might break.

Impact: The laws have empirical support up to the order of 100B parameters. Beyond that, extrapolation is riskier.

Limitation 8: The Laws Don’t Account for Inference Cost

The paper focuses on pre-training compute (C ≈ 6 * N * D). But once trained, the model must be run at inference (serving users).

Inference compute behaves differently. A forward pass costs roughly 2 * N FLOPs per token, and that cost recurs for every token served:

  • A 175B parameter model is expensive to run in production
  • Smaller models with more data might be cheaper to serve

The laws don’t optimize for total cost (pre-training + inference). A compute-optimal training configuration might be sub-optimal for deployment.

Impact: In practice, you might want a smaller, faster model even if it trains less efficiently. The laws are pre-training-centric.
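
A back-of-the-envelope comparison makes the trade-off visible. The sketch below adds the training estimate 6 * N * D to an inference estimate of 2 * N FLOPs per served token. Both configurations and the number of served tokens are hypothetical, chosen so that training compute is identical; the sketch says nothing about whether the two models reach the same loss.

```python
def training_flops(n_params, train_tokens):
    """Training estimate: C ≈ 6 * N * D."""
    return 6.0 * n_params * train_tokens

def inference_flops(n_params, served_tokens):
    """Inference estimate: roughly 2 * N FLOPs per token served."""
    return 2.0 * n_params * served_tokens

SERVED_TOKENS = 1e13  # hypothetical lifetime serving volume

# Two hypothetical configurations with the same training compute.
configs = {
    "larger model, less data": (100e9, 1e12),
    "smaller model, more data": (50e9, 2e12),
}

totals = {}
for name, (n_params, train_tokens) in configs.items():
    train = training_flops(n_params, train_tokens)
    serve = inference_flops(n_params, SERVED_TOKENS)
    totals[name] = train + serve
    print(f"{name}: train {train:.1e} + serve {serve:.1e} = {totals[name]:.1e} FLOPs")
```

Once serving volume is large, the inference term dominates and the smaller model has the lower lifetime cost, even though the training bills match.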


Key Takeaways from This Section

  • Exponents are approximate: Use confidence bands, not point estimates.
  • Data quality matters: The laws assume all tokens are equal; they’re not.
  • Architecture-dependent: Different Transformer variants might have different exponents.
  • Dense training assumed: MoE and sparse models don’t follow the C ≈ 6ND formula exactly.
  • Chinchilla refined the ratio: Allocate more data than this paper suggested.
  • Loss ≠ task performance: Loss is smooth, but benchmark performance can have phase transitions.
  • May break at extremes: The laws are verified up to ~100B; extrapolation beyond is uncertain.
  • Doesn’t optimize inference: Pre-training efficiency ≠ deployment efficiency.

These limitations don’t invalidate the scaling laws. They just clarify their scope and assumptions.

Next: Section 08: Impact