Summary: Key Takeaways

One-Sentence Version

Chain-of-thought prompting shows intermediate reasoning steps in few-shot examples, causing large language models (100B+ parameters) to generate their own reasoning before answering, dramatically improving performance on multi-step reasoning tasks.

The Problem

Large language models excel at pattern matching but often fail at multi-step reasoning. With standard few-shot prompting, GPT-3 (175B parameters) reached only about 17% accuracy on GSM8K (grade-school math word problems), because the prompt asks it to predict the final answer directly, with no intermediate reasoning.

The Idea

Show the model how to solve problems by including intermediate reasoning steps in few-shot examples:

Instead of (question, answer) pairs
Use (question, reasoning chain, answer) triples

The Math

Few-shot setup (standard):

Prompt: {(x₁, y₁), (x₂, y₂), ..., (xₖ, yₖ), x_test}
Model generates: y_test

Few-shot setup (CoT):

Prompt: {(x₁, r₁, y₁), (x₂, r₂, y₂), ..., (xₖ, rₖ, yₖ), x_test}
Model generates: r_test, then y_test

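Here xᵢ is a question, yᵢ is its answer, rᵢ is the reasoning chain that leads from the question to the answer, and k is the number of examples in the prompt.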
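The two setups differ only in what each example contains. A minimal Python sketch of that difference follows; the "Q:/A:" layout, the `Exemplar` class, the `build_prompt` function, and the sample questions are all illustrative assumptions, not the paper's exact prompts.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str   # x_i
    reasoning: str  # r_i, the worked reasoning chain
    answer: str     # y_i


def build_prompt(exemplars, test_question, use_cot):
    """Assemble a few-shot prompt. The Q:/A: layout is an assumption."""
    blocks = []
    for ex in exemplars:
        if use_cot:
            # CoT triple (x_i, r_i, y_i): reasoning comes before the answer.
            target = f"{ex.reasoning} The answer is {ex.answer}."
        else:
            # Standard pair (x_i, y_i): answer only.
            target = f"The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\nA: {target}")
    # The test question ends with an open "A:" for the model to complete.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    Exemplar(
        question="Asha has 10 pens. She buys 5 more and gives away 3. "
                 "How many pens does she have?",
        reasoning="Asha starts with 10 pens. 10 + 5 = 15. 15 - 3 = 12.",
        answer="12",
    )
]
test_question = "A shop has 7 boxes with 6 eggs each. How many eggs are there?"

standard_prompt = build_prompt(exemplars, test_question, use_cot=False)
cot_prompt = build_prompt(exemplars, test_question, use_cot=True)
print(cot_prompt)
```

With `use_cot=True`, the model sees a worked solution before the open "A:", so its most natural continuation is a reasoning chain followed by an answer.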

Key Results

On GSM8K (grade-school math word problems):

Model (PaLM)	Standard accuracy	CoT accuracy	Improvement (percentage points)
8B	5%	6%	+1
62B	13%	15%	+2
540B	25%	58%	+33

Critical insight: CoT helps dramatically at the largest scale (roughly 100B+ parameters) but barely helps the 8B and 62B models. The benefit of step-by-step reasoning emerges only at scale.
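The gap between absolute and relative improvement is easy to check from the table. The short Python sketch below uses only the figures listed above; the variable and function names are illustrative.

```python
# GSM8K accuracy (%) from the table above: (standard, CoT).
results = {"8B": (5, 6), "62B": (13, 15), "540B": (25, 58)}


def summarize(standard, cot):
    """Return the absolute gain in percentage points and the relative ratio."""
    return cot - standard, cot / standard


gains = {size: summarize(*scores) for size, scores in results.items()}
for size, (points, ratio) in gains.items():
    print(f"{size}: +{points} points, {ratio:.2f}x")
```

For the 540B model this gives a gain of 33 points and a ratio of 58/25 = 2.32, while the two smaller models gain only 1 and 2 points.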

The Indian Analogy

Think of a student showing their work on a maths exam. A student who writes only the final answer often gets it wrong. A student who writes “Starting with 10, add 5 to get 15, subtract 3 to get 12” is far more likely to catch mistakes and arrive at the correct answer.

Chain-of-thought makes models “show their work.”

Key Numbers to Remember

PaLM 540B with standard prompting: 25% on GSM8K
PaLM 540B with CoT: 58% on GSM8K
Improvement: 33 percentage points (58/25 ≈ 2.3× the accuracy)
Minimum model size for benefit: ~100 billion parameters
Prompt overhead: 2-3× longer (more tokens, more cost)
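To get a feel for the prompt overhead, the Python sketch below compares the length of one example with and without its reasoning chain. The whitespace word count is only a crude stand-in for a real tokenizer, and the example text is made up for illustration, so the ratio it produces will not necessarily match the 2-3× figure.

```python
def rough_token_count(text):
    """Crude proxy for tokens: whitespace-separated words.
    Real tokenizers split text differently."""
    return len(text.split())


question = ("Q: Asha has 10 pens. She buys 5 more and gives away 3. "
            "How many pens does she have?")

# Standard example: question plus answer only.
standard_exemplar = question + "\nA: The answer is 12."

# CoT example: the same question with a reasoning chain before the answer.
cot_exemplar = (question
                + "\nA: Asha starts with 10 pens. 10 + 5 = 15. 15 - 3 = 12. "
                + "The answer is 12.")

overhead = rough_token_count(cot_exemplar) / rough_token_count(standard_exemplar)
print(f"CoT example is about {overhead:.1f}x the length of the standard one")
```

The overhead applies twice in practice: every example in the prompt is longer, and the model also generates a reasoning chain before its answer.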

What Came Next

Zero-Shot CoT (Kojima et al., May 2022): Append “Let’s think step by step” to the question; no examples needed
Self-Consistency (Wang et al., Mar 2022): Sample multiple chains, majority vote
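Program-Aided Language Models, PAL (Gao et al., 2022) and Program-of-Thoughts (Chen et al., 2022): Write the reasoning as code and execute it instead of reasoning in natural language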
Tree-of-Thought (Yao et al., 2023): Explore multiple reasoning paths
Reasoning Models (OpenAI o1, DeepSeek R1, 2024–2025): Test-time compute for reasoning
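Two of these follow-ups are simple enough to sketch in a few lines of Python. The code below shows a zero-shot CoT prompt and a self-consistency majority vote. The function names, the "The answer is N" output format, and the hard-coded sample chains are assumptions for illustration; in real use the chains would be sampled from a model.

```python
import re
from collections import Counter


def zero_shot_cot_prompt(question):
    """Zero-shot CoT: append the trigger phrase instead of worked examples."""
    return f"Q: {question}\nA: Let's think step by step."


def extract_answer(chain):
    """Pull the number after 'The answer is'. This format is an assumption."""
    match = re.search(r"The answer is (-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None


def self_consistent_answer(chains):
    """Self-consistency: majority vote over the answers of sampled chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stand-in for several chains sampled from a model for the same question.
sampled_chains = [
    "10 + 5 = 15. 15 - 3 = 12. The answer is 12.",
    "10 + 5 = 15. 15 - 3 = 13. The answer is 13.",  # arithmetic slip
    "10 - 3 = 7. 7 + 5 = 12. The answer is 12.",
]
final = self_consistent_answer(sampled_chains)
print(final)
```

The vote discards the chain with the arithmetic slip: two different reasoning paths agree on 12, so 12 wins.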

Limitations

✗ Doesn’t help small models (< 100B)
✗ Makes prompts longer (higher cost)
✗ Reasoning can be unfaithful (the written steps may sound logical without reflecting how the model actually reached its answer)
✗ Not useful for all tasks (helps mainly on multi-step reasoning)
✗ Requires human-written examples (or zero-shot variant)

Why This Paper Matters

Before CoT: Researchers thought reasoning required new architectures, new training objectives, or symbolic AI.

After CoT: We learned that large models already have reasoning capability. You just need to ask them to show their work.

This simple insight, that prompting matters more than previously thought and that reasoning emerges at scale, shaped much of LLM development from 2022 to 2025.

Every time you prompt an LLM to “think step by step,” you’re using the insight from this paper.

Glossary

Chain-of-Thought (CoT): A prompting technique where you show intermediate reasoning steps in examples, causing the model to generate reasoning steps before answering.

Emergent Capability: A capability (like reasoning) that appears only above a certain model size threshold, even though the model was not explicitly trained for it.

Few-Shot Prompting: Showing a language model a few examples (typically 2-8) before asking it to solve a new problem, allowing in-context learning.

GSM8K: A benchmark of about 8,500 grade-school math word problems, each requiring roughly 2 to 8 steps of basic arithmetic to solve.

Unfaithful Reasoning: When a model generates reasoning steps that sound logical but don’t actually match how the model computed the answer.

Zero-Shot CoT: Appending a trigger phrase such as “Let’s think step by step” to the question, so the model reasons step by step without being shown any worked examples.

Self-Consistency: Sampling multiple reasoning chains from the same prompt and taking a majority vote on the answer.
