Section 09

Summary: Key Takeaways

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)

One-Sentence Version

Chain-of-thought prompting shows intermediate reasoning steps in few-shot examples, causing large language models (100B+ parameters) to generate their own reasoning before answering, dramatically improving performance on multi-step reasoning tasks.

The Problem

Large language models excel at pattern-matching but often fail at multi-step reasoning. With standard prompting, GPT-3 (175B parameters) achieved only 17% accuracy on GSM8K (grade-school math) because it predicted final answers directly, with no intermediate steps.

The Idea

Show the model how to solve problems by including intermediate reasoning steps in few-shot examples:

  • Instead of (question, answer) pairs
  • Use (question, reasoning chain, answer) triples

The Math

Few-shot setup (standard):

Prompt: {(x₁, y₁), (x₂, y₂), ..., (xₖ, yₖ), x_test}
Model generates: y_test

Few-shot setup (CoT):

Prompt: {(x₁, r₁, y₁), (x₂, r₂, y₂), ..., (xₖ, rₖ, yₖ), x_test}
Model generates: r_test, then y_test

Where r is the reasoning chain.
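
To make the two setups concrete, here is a minimal sketch of how each prompt could be assembled as plain text. The `Exemplar` class, the `build_prompt` helper, the "Q:/A:" layout, and the sample questions are illustrative assumptions, not the paper's exact prompts.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str   # x_i
    reasoning: str  # r_i (used only by the CoT format)
    answer: str     # y_i


def build_prompt(exemplars, test_question, use_cot):
    """Concatenate k exemplars followed by the unanswered test question."""
    blocks = []
    for ex in exemplars:
        if use_cot:
            # (x, r, y): the reasoning chain precedes the final answer.
            completion = f"{ex.reasoning} The answer is {ex.answer}."
        else:
            # (x, y): the final answer alone.
            completion = f"The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\nA: {completion}")
    # The test question ends with an open "A:" for the model to complete.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical exemplar and test question, for illustration only.
exemplars = [
    Exemplar(
        question="Riya has 10 pencils. She buys 5 more and gives away 3. "
                 "How many pencils does she have?",
        reasoning="Riya starts with 10 pencils. 10 + 5 = 15. 15 - 3 = 12.",
        answer="12",
    )
]
test_question = (
    "A shop has 20 apples. It sells 8 and receives 6 more. "
    "How many apples are there now?"
)

standard_prompt = build_prompt(exemplars, test_question, use_cot=False)
cot_prompt = build_prompt(exemplars, test_question, use_cot=True)
print(cot_prompt)
```

The only difference between the two prompts is the reasoning text inside each exemplar. The test question is left open in both cases, so any reasoning in the output comes from the model imitating the exemplars.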

Key Results

On GSM8K (grade-school math word problems):

Model (PaLM)   Standard   CoT    Improvement (points)
8B             5%         6%     +1
62B            13%        15%    +2
540B           25%        58%    +33

Critical insight: CoT helps dramatically at 100B+ but barely helps at smaller scales. Reasoning emerges only at scale.
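
The gap between scales is easier to see when the absolute gain (in percentage points) is separated from the relative gain (CoT accuracy divided by standard accuracy). The short sketch below only redoes the arithmetic on the figures quoted in this summary. The `results` dictionary and the `gains` helper are illustrative names.

```python
# GSM8K accuracy (%) by model size, copied from the table in this summary:
# (standard prompting, chain-of-thought prompting)
results = {
    "8B": (5, 6),
    "62B": (13, 15),
    "540B": (25, 58),
}


def gains(standard, cot):
    """Return (gain in percentage points, ratio of CoT to standard accuracy)."""
    return cot - standard, cot / standard


for size, (standard, cot) in results.items():
    points, ratio = gains(standard, cot)
    print(f"{size:>5}: +{points} points, {ratio:.2f}x the standard accuracy")
```

At 8B and 62B the ratio stays close to 1, while at 540B CoT more than doubles the standard accuracy.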

The Indian Analogy

Like a student showing their work on a maths exam. A student who just writes the final answer often gets it wrong. A student who writes “Starting with 10, add 5 to get 15, subtract 3 to get 12” catches mistakes and arrives at the correct answer.

Chain-of-thought makes models “show their work.”

Key Numbers to Remember

  • PaLM 540B with standard prompting: 25% on GSM8K
  • PaLM 540B with CoT: 58% on GSM8K
  • Improvement: 33 percentage points (more than 2× better)
  • Minimum model size for benefit: ~100 billion parameters
  • Prompt overhead: 2-3× longer (more tokens, more cost)

What Came Next

  1. Zero-Shot CoT (Kojima et al., May 2022): “Let’s think step by step”, with no examples needed
  2. Self-Consistency (Wang et al., Mar 2022): Sample multiple chains, majority vote
  3. Program-Aided Language Models (Gao et al., 2022) and Program-of-Thoughts (Chen et al., 2022): Generate code and execute it instead of reasoning in natural language
  4. Tree-of-Thought (Yao et al., 2023): Explore multiple reasoning paths
  5. Reasoning Models (OpenAI o1, DeepSeek R1, 2024–2025): Test-time compute for reasoning
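
The first item in the list above, Zero-Shot CoT, can be sketched as a two-call procedure: one call to elicit reasoning and a second to extract the final answer. This is a minimal sketch under stated assumptions. `zero_shot_cot`, `fake_generate`, and the exact wording of `ANSWER_TRIGGER` are illustrative, and `generate` stands in for whatever model call is actually available.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
# The answer-extraction phrasing is an assumption; adapt it to the task format.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Two calls: first elicit reasoning, then ask for the final answer."""
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(reasoning_prompt)
    # Feed the model's own reasoning back in, then prompt for the answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    return generate(answer_prompt).strip()


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 12."
    return "She starts with 10. 10 + 5 = 15. 15 - 3 = 12."


answer = zero_shot_cot(
    "Riya has 10 pencils. She buys 5 more and gives away 3. How many are left?",
    fake_generate,
)
print(answer)
```

No worked examples appear anywhere in the prompt. The trigger phrase alone asks the model to produce the reasoning chain.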

Limitations

  • ✗ Doesn’t help small models (< 100B)
  • ✗ Makes prompts longer (higher cost)
  • ✗ Reasoning can be unfaithful (looks correct but isn’t)
  • ✗ Not useful for all tasks (only multi-step reasoning)
  • ✗ Requires human-written examples (or zero-shot variant)

Why This Paper Matters

Before CoT: Researchers thought reasoning required new architectures, new training objectives, or symbolic AI.

After CoT: We learned that large models already have reasoning capability. You just need to ask them to show their work.

This simple insight — prompting matters more than we thought, and reasoning emerges at scale — shaped the entire trajectory of LLM development from 2022 to 2025.

Every time you prompt an LLM to “think step by step,” you’re using the insight from this paper.


Glossary

Chain-of-Thought (CoT): A prompting technique where you show intermediate reasoning steps in examples, causing the model to generate reasoning steps before answering.

Emergent Capability: A capability (like reasoning) that appears only above a certain model size threshold, even though the model was not explicitly trained for it.

Few-Shot Prompting: Showing a language model a few examples (typically 2-8) before asking it to solve a new problem, allowing in-context learning.

GSM8K: A benchmark of about 8,500 grade-school math word problems, each requiring between 2 and 8 steps of basic arithmetic to solve.

Unfaithful Reasoning: When a model generates reasoning steps that sound logical but don’t actually match how the model computed the answer.

Zero-Shot CoT: Appending a trigger phrase such as “Let’s think step by step” to the question, so the model reasons without being given any worked examples.

Self-Consistency: Sampling multiple reasoning chains from the same prompt and taking a majority vote on the answer.
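
The Self-Consistency entry above reduces to a majority vote over final answers. The following is a minimal sketch, assuming the final answer is the last number in each chain. `extract_answer`, `self_consistent_answer`, and the sample chains are hypothetical and stand in for real sampled model outputs.

```python
import re
from collections import Counter


def extract_answer(chain):
    """Return the last number in a reasoning chain, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistent_answer(chains):
    """Majority vote over the final answers of several sampled chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampled outputs; in practice these would come from sampling
# the same CoT prompt several times at a nonzero temperature.
sampled_chains = [
    "10 + 5 = 15. 15 - 3 = 12. The answer is 12.",
    "10 - 3 = 7. 7 + 5 = 12. The answer is 12.",
    "10 + 5 = 16. 16 - 3 = 13. The answer is 13.",
]
print(self_consistent_answer(sampled_chains))  # prints 12
```

The third chain contains an arithmetic slip, but it is outvoted by the two chains that reach 12 by different routes. That is the intuition behind the method: correct reasoning paths tend to agree on the answer, while errors scatter.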


Further Reading

Original paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Wei et al., NeurIPS 2022

Blog resources:

  • Hugging Face’s summary: “Chain-of-Thought Prompting”
  • OpenAI’s documentation on chain-of-thought in GPT-4
  • Anthropic’s work on Constitutional AI (reasoning with CoT)


Continue the AI reasoning journey:

  1. Paper 15: Training Language Models to Follow Instructions with Human Feedback (RLHF/InstructGPT): How human feedback is used to fine-tune models to follow instructions
  2. Paper 16: Let’s Verify Step by Step (Self-Verification): Making sure reasoning steps are actually correct
  3. Zero-Shot CoT deep dive: Large Language Models are Zero-Shot Reasoners
