Section 07

Limitations: When Chain-of-Thought Fails

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models 2022

Chain-of-thought prompting is powerful but not a silver bullet. Here are the main limitations, some reported in the paper itself and others identified in follow-up work.

1. Only Works for Large Models

The Problem: CoT shows almost no benefit for models smaller than 100B parameters.

  • 8B parameter model: 5% → 6% (only +1 percentage point)
  • 62B parameter model: 13% → 15% (only +2 percentage points)
  • 540B parameter model: 25% → 58% (+33 percentage points)

If your model is small (say, 7B like Llama-2), CoT won’t help much. You need to either:

  • Upgrade to a larger model (expensive)
  • Fine-tune the small model on CoT examples (requires lots of data)
  • Use knowledge distillation (transfer reasoning from large to small model)

Real-world impact: Many users have resource constraints and can’t access 100B+ models. For them, this paper’s benefits are out of reach.

2. Longer Prompts = More Tokens = Higher Cost

The Problem: Chain-of-thought prompts are significantly longer than standard prompts.

Compare:

Standard prompt:

Q: A train travels 60 km in 2 hours. How far in 5 hours?
A: 150 km

(~15 tokens)

CoT prompt:

Q: A train travels 60 km in 2 hours. How far in 5 hours?
A: Speed = 60 km / 2 hours = 30 km/hour
   Distance = 30 km/hour × 5 hours = 150 km
   Answer: 150 km

(~40 tokens)

CoT uses 2-3× more tokens in the context window. For API-based services (OpenAI, Anthropic, Google), you pay per token. Longer prompts = higher cost.

Example math:

  • Standard prompt: 100 tokens × $0.001/token = $0.10
  • CoT prompt: 250 tokens × $0.001/token = $0.25

A 2.5× cost increase for a large accuracy gain (25% → 58% for the 540B model above, roughly 2.3×) might be worth it, but it’s a trade-off.
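The example math above can be reproduced with a few lines of Python. This is a minimal sketch: the flat $0.001 per-token price and the token counts are the illustrative figures from this section, not real API prices, and `prompt_cost` is a hypothetical helper name.

```python
def prompt_cost(num_tokens: int, price_per_token: float) -> float:
    """Return the dollar cost of num_tokens at a flat per-token price."""
    return num_tokens * price_per_token


# Illustrative figures from the example above, not real API pricing.
PRICE_PER_TOKEN = 0.001

standard_cost = prompt_cost(100, PRICE_PER_TOKEN)
cot_cost = prompt_cost(250, PRICE_PER_TOKEN)
cost_ratio = cot_cost / standard_cost

print(f"standard: ${standard_cost:.2f}, CoT: ${cot_cost:.2f}, ratio: {cost_ratio:.1f}x")
```

Real providers often price input and output tokens differently, so an actual estimate would need separate rates for the prompt and the generated reasoning.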

3. Reasoning Chains Can Be Plausible But Wrong

The Problem: The model generates reasoning that looks logical but leads to the wrong answer.

Example:

Q: How many times does the digit 7 appear in numbers from 1 to 100?

Generated reasoning:
"7 appears once in 7, and once in 17, 27, 37, 47, 57, 67, 77, 87, 97.
That's 9 times. But wait, 77 has two 7s, so add 1 more.
So 7 appears 10 times total."

Correct answer: 20 times
(ones digit: 7, 17, 27, 37, 47, 57, 67, 77, 87, 97 [ten times]; tens digit: 70-79 [ten times])

The model’s reasoning is fluent and plausible: it looks like careful step-by-step thinking. But it’s wrong. A related problem is unfaithful reasoning, where the intermediate steps don’t actually reflect the computation that produced the model’s answer.

Why this happens: The model learned to generate reasoning text from internet examples (tutorials, explanations). It can generate plausible-looking steps without actually verifying them.

Real-world impact: Users might trust the reasoning chain and implement the wrong solution. It’s especially dangerous for domains like math, medicine, or law where “sounding confident” is not the same as “being correct.”
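For a question like the digit-counting example above, the final answer can be checked with a brute-force count instead of trusting the reasoning chain. The sketch below uses a hypothetical helper, `count_digit`, and only illustrates the idea of programmatic verification; most real tasks have no such simple checker.

```python
def count_digit(digit: str, start: int, end: int) -> int:
    """Count how often a digit character appears in the integers start..end inclusive."""
    return sum(str(n).count(digit) for n in range(start, end + 1))


sevens = count_digit("7", 1, 100)
print(sevens)  # prints 20
```

The count is 10 occurrences in the ones place plus 10 in the tens place, with 77 contributing to both.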

4. Doesn’t Work Well for All Task Types

The Problem: CoT is specifically designed for multi-step reasoning tasks. It doesn’t help (and sometimes hurts) for other task types.

Tasks where CoT helps:

  • Arithmetic word problems ✓
  • Multi-hop reasoning ✓
  • Logic puzzles ✓
  • Strategic questions ✓

Tasks where CoT doesn’t help:

  • Factual retrieval (“What is the capital of France?”) ✗
  • Sentiment classification (“Is this review positive?”) ✗
  • Information extraction (“Extract the date from this text”) ✗
  • Multiple choice with single facts ✗

For these tasks, the model should just know the answer or find it directly. Adding reasoning steps doesn’t help and wastes tokens.

Real-world impact: You have to decide per-task whether CoT is appropriate. It’s not a universal amplifier.
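One way to apply CoT selectively is to route on task type before building the prompt. The following is a rough sketch under assumptions: the task labels and the `build_prompt` function are hypothetical, the grouping simply mirrors the two lists above, and the step-by-step cue is the zero-shot phrasing from follow-up work rather than the paper's few-shot format.

```python
# Hypothetical task labels; the set mirrors the "CoT helps" list above.
COT_TASKS = {
    "arithmetic_word_problem",
    "multi_hop_reasoning",
    "logic_puzzle",
    "strategic_question",
}


def build_prompt(task_type: str, question: str) -> str:
    """Add a step-by-step cue only for task types where CoT is expected to help."""
    if task_type in COT_TASKS:
        return f"Q: {question}\nA: Let's think step by step."
    # Retrieval, classification, and extraction get a direct-answer prompt.
    return f"Q: {question}\nA:"


print(build_prompt("logic_puzzle", "Who owns the fish?"))
print(build_prompt("factual_retrieval", "What is the capital of France?"))
```

In practice the task type itself has to come from somewhere, such as a fixed label per endpoint or a lightweight classifier.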

5. Requires Human-Authored Reasoning Examples

The Problem: To use CoT, you need to write out the reasoning steps for your few-shot examples. This requires human effort.

Example workflow:

  1. Choose 5 representative problems
  2. For each, write out the correct reasoning steps (2-3 minutes per problem)
  3. Test the few-shot prompt
  4. Iterate if needed

For simple domains (math, logic), this is doable. But for specialized domains (biomedical research, legal contracts), you need domain experts to write reasoning chains.

Cost: 5 examples × 3 minutes = 15 minutes of expert time. For 100 examples, that’s 5 hours.

Real-world impact: Making CoT accessible to non-technical domains is hard. Data annotation is expensive.

Workaround: Zero-shot CoT (mentioned in the paper’s follow-ups) — just say “Let’s think step by step” without examples. But this only works for large models and simpler tasks.
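The difference between the two approaches is easiest to see in how the prompt is assembled. This is a minimal sketch with hypothetical function names; the exemplar reuses the train problem from earlier in this section, and the exact "The answer is" phrasing is an assumption about formatting, not a requirement.

```python
from typing import List, Tuple


def few_shot_cot_prompt(examples: List[Tuple[str, str, str]], question: str) -> str:
    """Build a few-shot CoT prompt from (question, reasoning, answer) exemplars."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # The new question ends with an open "A:" for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question: str) -> str:
    """Build a zero-shot CoT prompt: no exemplars, only a reasoning cue."""
    return f"Q: {question}\nA: Let's think step by step."


# Human-authored exemplar, taken from the train example in this section.
examples = [(
    "A train travels 60 km in 2 hours. How far in 5 hours?",
    "Speed = 60 km / 2 hours = 30 km/hour. Distance = 30 km/hour × 5 hours = 150 km.",
    "150 km",
)]

new_question = "A car travels 90 km in 3 hours. How far in 4 hours?"
print(few_shot_cot_prompt(examples, new_question))
print()
print(zero_shot_cot_prompt(new_question))
```

The few-shot version needs the hand-written `examples` list; the zero-shot version needs none, which is where the annotation savings come from.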

6. Self-Consistency Multiplies Inference Cost

The Problem: The paper mentions that sampling multiple reasoning chains and taking a majority vote improves accuracy further.

But here’s the catch: you’re running inference multiple times.

Example:

  • Single reasoning chain: 250 tokens × 1 sample = 250 tokens
  • Multiple chains (5 samples): 250 tokens × 5 = 1250 tokens

That’s 5× the inference cost for an accuracy gain that varies by task and model and may be small on easier problems.

Real-world impact: In production systems, inference latency and cost matter. A 5× increase in compute (and in latency, if the samples are generated sequentially) is often not acceptable.
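The majority-vote step and its token cost can be sketched as follows. This is an illustration under assumptions: `sample_answer` stands in for a call that samples one reasoning chain and returns only its final answer, and the canned answers below replace a real model.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_answer: Callable[[], str], num_samples: int = 5) -> str:
    """Sample several final answers and return the most common one."""
    answers = [sample_answer() for _ in range(num_samples)]
    # On a tie, Counter.most_common keeps the answer that was seen first.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner


def total_tokens(tokens_per_chain: int, num_samples: int) -> int:
    """Token usage grows linearly with the number of sampled chains."""
    return tokens_per_chain * num_samples


# Stand-in sampler that replays canned answers instead of calling a model.
canned_answers = iter(["150 km", "150 km", "120 km", "150 km", "300 km"])
majority = self_consistent_answer(lambda: next(canned_answers), num_samples=5)

print(majority, total_tokens(250, 5))  # prints 150 km 1250
```

Because the samples are independent, they can be requested in parallel, which trades the latency penalty for a throughput one; the total token count is unchanged either way.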

7. KL Divergence Penalty Requires Tuning (for Fine-Tuning)

The Problem: While this paper is about prompting (no fine-tuning), related work on CoT fine-tuning uses KL divergence penalties to prevent catastrophic forgetting. These penalties require hyperparameter tuning.

Example:

  • KL penalty coefficient β = 0.01: Penalty is weak, so the model can diverge too much and lose base capabilities
  • KL penalty coefficient β = 0.1: Allows moderate change and better improvement, at the risk of forgetting some original knowledge
  • KL penalty coefficient β = 1.0: Keeps model close to original, but doesn’t improve much

Finding the sweet spot requires experimentation and compute.

Real-world impact: Not directly relevant to prompting, but relevant for teams that want to fine-tune models with CoT.
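To make the role of β concrete, here is a toy sketch of a KL-penalized loss over small discrete distributions. It assumes the common form "task loss + β × KL(tuned ‖ reference)"; the function names and the probability values are made up for illustration and are not taken from any specific fine-tuning method.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_loss(task_loss: float, tuned: Sequence[float],
                   reference: Sequence[float], beta: float) -> float:
    """Task loss plus a KL penalty for drifting away from the reference model."""
    return task_loss + beta * kl_divergence(tuned, reference)


# Made-up next-token distributions for the reference and fine-tuned models.
reference = [0.7, 0.2, 0.1]
tuned = [0.4, 0.4, 0.2]

for beta in (0.01, 0.1, 1.0):
    loss = penalized_loss(1.0, tuned, reference, beta)
    print(f"beta={beta}: loss={loss:.4f}")
```

For the same amount of drift, a larger β adds a larger penalty, so the optimizer is pushed harder to stay near the reference model.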

8. Model Capability Ceiling Still Exists

The Problem: CoT doesn’t make a model learn new facts. It only helps it use what it already knows.

Q: What did the Indus Valley Civilization invent?
CoT reasoning: "The Indus Valley was an ancient civilization.
   It was known for... [model doesn't know]
   So probably... [guessing]"

If the model hasn’t learned about the Indus Valley, CoT can’t teach it. CoT is a reasoning amplifier, not a knowledge amplifier.

Real-world impact: You still need pretraining on diverse corpora. CoT amplifies existing knowledge but doesn’t create new knowledge.

Summary Table

Limitation                                   | Severity | Workaround
Requires 100B+ parameters                    | High     | Use larger model or fine-tune
Longer prompts, higher cost                  | Medium   | Balance accuracy vs. cost; use shorter examples
Reasoning can be unfaithful                  | High     | Verify reasoning programmatically; use reward models
Not all tasks benefit                        | Medium   | Use CoT selectively, task-aware routing
Needs human-authored examples                | Medium   | Use zero-shot CoT or automatic annotation
Self-consistency multiplies cost             | Medium   | Use single chain; accept lower accuracy
Requires hyperparameter tuning (fine-tuning) | Low      | Use defaults or AutoML
No new knowledge creation                    | Low      | Ensure pretraining is comprehensive

Despite these limitations, the paper’s core finding stands: CoT is a powerful prompting technique that requires no fine-tuning and works well for large models on reasoning tasks. Follow-up work addresses several of these limitations (zero-shot CoT, self-consistency, reward modeling, etc.).