Section 08

Impact: What Changed After This Paper

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models 2022

The Chain-of-Thought (CoT) paper appeared on arXiv in January 2022. Within months, CoT prompting had become a default technique for reasoning tasks in large language models. Here’s what changed:

Immediate Follow-Ups (2022–2023)

Zero-Shot CoT (Kojima et al., May 2022)

The problem: You need human-written reasoning examples for standard CoT. What if you don’t have them?

The solution: Just add “Let’s think step by step” to your prompt.

Standard zero-shot:
Q: A store has 50 books. They receive 30 more. They sell 20. How many books are left?
A: ___

Zero-shot CoT:
Q: A store has 50 books. They receive 30 more. They sell 20. How many books are left?
A: Let's think step by step. ___

Result: On GSM8K, zero-shot CoT raised accuracy from 10.4% to 40.7% with GPT-3 (text-davinci-002), without any worked examples. On MultiArith, accuracy rose from 17.7% to 78.7%.

Impact: Made CoT usable on new tasks and domains without hand-crafting reasoning examples.
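The prompt construction can be sketched in Python. This is a minimal illustration, not the authors' code: `complete` is a hypothetical stand-in for any text-completion call (prompt in, continuation out), and the answer-extraction wording is modeled on the two-stage prompting described by Kojima et al., so treat the exact phrasing as an assumption.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
# Assumed wording for the second (answer-extraction) stage.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """Stage 1: ask the model to write out a reasoning chain."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2: feed the reasoning back and ask for the final answer only."""
    return f"{build_reasoning_prompt(question)} {reasoning.strip()}\n{ANSWER_TRIGGER}"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Run both stages with a caller-supplied completion function."""
    reasoning = complete(build_reasoning_prompt(question))
    return complete(build_answer_prompt(question, reasoning)).strip()
```

Passing the model call in as a function keeps the sketch independent of any particular API and makes it easy to test with a stub.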

Self-Consistency (Wang et al., Mar 2022)

The problem: A single reasoning chain might be wrong. What if you sample multiple chains?

The solution: Sample K different reasoning chains (using a non-zero temperature), extract the final answer from each, and take the majority vote.

Sample 1: Reasoning → Answer: 60
Sample 2: Reasoning → Answer: 60
Sample 3: Reasoning → Answer: 65
Sample 4: Reasoning → Answer: 60
Sample 5: Reasoning → Answer: 60

Final answer (majority): 60
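The voting step is simple enough to sketch in Python. The snippet below assumes the K chains have already been sampled and that the final answer is the last number in each chain; both the helper names and that extraction rule are illustrative assumptions, not the paper's implementation.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Return the last number mentioned in a reasoning chain, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(chains: List[str]) -> Optional[str]:
    """Majority vote over final answers; ties go to the answer seen first."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


samples = [
    "50 + 30 = 80, then 80 - 20 = 60. Answer: 60",
    "After the delivery there are 80 books; selling 20 leaves 60",
    "50 + 30 = 85, then 85 - 20 = 65. Answer: 65",  # arithmetic slip
    "80 books minus 20 sold is 60",
    "50 - 20 = 30, then 30 + 30 = 60. Answer: 60",
]
print(self_consistency(samples))  # prints 60
```

The one faulty chain is outvoted, which is the whole point: errors tend to scatter across different wrong answers while correct chains agree.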

Result: On GSM8K, the paper reports a 17.9-point absolute gain over single-chain CoT (for example, 56.5% to 74.4% with PaLM 540B).

Impact: Pushed accuracy higher at the cost of K times the inference compute; became a common choice when accuracy matters more than cost.

Program-Aided Language Models (PAL; Gao et al., 2022) and Program-of-Thoughts (PoT; Chen et al., 2022)

The problem: Language models are good at generating code. What if we use code execution instead of natural language reasoning?

The solution: Generate Python code to solve the problem, then execute it.

Q: A store has 50 books. They receive 30 more. They sell 20. How many books are left?
A: 
starting_books = 50
received = 30
sold = 20
final = starting_books + received - sold
print(final)
# Output: 60

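Result: Removed arithmetic errors from the reasoning, because an interpreter does the computation. The answer is still only as good as the program: code that runs can encode the wrong logic.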

Impact: The same idea, handing computation to an external tool, underlies code-interpreter features, integrations such as Wolfram|Alpha, and tool-using agents.
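A minimal sketch of the execution step, assuming the model's output is a string of Python that stores its result in a variable named `final` (a convention chosen here for illustration). The empty `__builtins__` is only a thin guard and is not a real sandbox; untrusted code needs proper isolation.

```python
def run_generated_program(code: str, result_var: str = "final"):
    """Execute model-written code and return the value bound to result_var."""
    namespace = {"__builtins__": {}}  # thin guard only, NOT a security boundary
    exec(code, namespace)
    return namespace[result_var]


correct_program = """
starting_books = 50
received = 30
sold = 20
final = starting_books + received - sold
"""

# Same program with a logic error: it runs fine but models the problem wrongly.
wrong_program = correct_program.replace("- sold", "+ sold")

print(run_generated_program(correct_program))  # prints 60
print(run_generated_program(wrong_program))    # prints 100
```

The second call shows why successful execution is not proof of correctness: the interpreter guarantees the arithmetic, not the translation from words to code.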

Tree-of-Thought (ToT) (Yao et al., May 2023)

The problem: A linear chain of thought commits to one path. What if the model could explore several branches and back out of dead ends?

The solution: Use a tree search algorithm to explore multiple reasoning paths and their continuations.

Q: Solve a complex puzzle

                    Root (question)
                   /    |    \
                Path1  Path2  Path3
                /        |      \
             Goal    Dead-end  Continue...

Result: Much better performance on tasks that need search (e.g., 74% on Game of 24 with GPT-4 vs. 4% with CoT).

Impact: Shifted thinking from linear to branching reasoning; paved the way for search-based planning.
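The search loop can be sketched as a breadth-first search with a fixed beam. In the paper the "propose" and "score" steps are language-model calls; here they are plain Python stand-ins on a toy problem (reach 10 from 1 using "+3" or "*2"), so every function name below is a hypothetical illustration rather than the authors' code.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # a partial reasoning path; here, the numbers visited


def tree_search(
    root: State,
    propose: Callable[[State], List[State]],
    score: Callable[[State], float],
    is_goal: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[State]:
    """Breadth-first search that keeps only the best beam_width states per level."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        if not candidates:
            return None
        # Prune: keep the highest-scoring partial paths, drop the rest.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return None


TARGET = 10


def propose(state: State) -> List[State]:
    last = state[-1]
    return [state + (last + 3,), state + (last * 2,)]


def score(state: State) -> float:
    return -abs(TARGET - state[-1])  # closer to the target is better


def is_goal(state: State) -> bool:
    return state[-1] == TARGET


path = tree_search((1,), propose, score, is_goal)
print(path)
```

The beam width controls the trade-off: a wider beam explores more branches per level at proportionally higher cost.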

Adoption in Production Systems

ChatGPT and Claude

Assistants such as ChatGPT (OpenAI) and Claude (Anthropic) routinely produce step-by-step reasoning, although the details of their training are only partly public:

  • They often write out intermediate steps before answering complex questions
  • Anthropic's Constitutional AI, a training method aimed at harmlessness, uses chain-of-thought-style reasoning in its critiques
  • Some versions show part of their reasoning to users for transparency

Reasoning Models: OpenAI o1 and DeepSeek R1

From late 2024 into 2025, a wave of “reasoning models” emerged:

OpenAI o1 (preview released September 2024):

  • Explicitly allocates test-time compute to reasoning
  • Generates extended internal reasoning chains before producing outputs
  • Builds on CoT, but the model is trained with reinforcement learning to produce the chain rather than being prompted for it

DeepSeek R1 (January 2025):

  • Similar approach: long reasoning chains, then answers
  • Open-weight alternative to o1

Both treat reasoning as a first-class part of training and inference rather than as a prompting trick.

Theoretical Insights

CoT revealed important properties of large language models:

1. Emergent Reasoning

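  • In the paper's experiments, CoT helped only at large scale (around 100B parameters); smaller models produced fluent but illogical chains
  • This fed renewed interest in scaling laws and emergent capabilities
  • Similar scale-dependent patterns were later reported for other capabilities, such as instruction following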

2. In-Context Learning

  • CoT demonstrated that models can pick up a procedure from a few examples, not just an input-output pattern
  • This encouraged more systematic work on in-context learning and prompt design

3. Decoupling Generation from Computation

  • CoT showed that generating intermediate steps gives the model more computation per problem, and that this improves results
  • This insight is central to recent work on test-time compute (spending more inference time on hard problems)

Cascading Research

Many subsequent papers built on CoT:

Paper                       Contribution
Least-to-Most Prompting     Solve sub-problems before harder problems
Decompose-Then-Integrate    Break complex tasks into parts, integrate results
Faithful CoT Explanation    Verify that reasoning actually matches the model’s computation
Automatic CoT               Learn to generate CoT examples automatically (no humans)
RLHF with CoT               Train reward models that prefer step-by-step reasoning
Chain-of-Code               Alternate between natural language and code for reasoning

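Connection to InstructGPT and RLHF (Paper 15)

CoT and InstructGPT were developed at about the same time, so CoT's influence shows up mainly in later instruction-tuned models rather than in InstructGPT itself:

  • Human preference data used for RLHF can reward outputs that explain their reasoning
  • Later instruction-tuning mixtures added chain-of-thought examples so that models explain their thinking
  • Explaining the steps became a common default for instruction-following models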

Key Insight: Cost vs. Benefit

CoT’s impact stems from a favorable trade-off:

For free (just prompt engineering), you get:

  • Up to roughly 3× higher accuracy on hard reasoning tasks (e.g., GSM8K with PaLM 540B: 17.9% to 56.9%)
  • Better explainability (users see reasoning)
  • Better debugging (easier to catch errors)

For a modest cost:

  • Roughly 2-3× longer prompts, since each example includes its reasoning, so input token cost rises in proportion
  • Higher latency, since the model generates the reasoning before the answer

This favorable trade-off made CoT ubiquitous.
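The trade-off can be made concrete with a little arithmetic. The token counts below are hypothetical, chosen only to illustrate the calculation, and the sketch assumes cost is proportional to total tokens with input and output priced equally, which real pricing usually does not do.

```python
def relative_cost(
    base_prompt: int,
    base_output: int,
    cot_prompt: int,
    cot_output: int,
    samples: int = 1,
) -> float:
    """Token usage of CoT relative to standard prompting.

    With self-consistency, the full prompt is sent once per sample,
    so the CoT cost scales linearly with `samples`.
    """
    base_total = base_prompt + base_output
    cot_total = samples * (cot_prompt + cot_output)
    return cot_total / base_total


# Hypothetical numbers: 100-token prompt and 10-token answer without CoT,
# 250-token prompt and 80-token reasoning-plus-answer with CoT.
single_chain = relative_cost(100, 10, 250, 80)
five_samples = relative_cost(100, 10, 250, 80, samples=5)
print(single_chain, five_samples)  # prints 3.0 15.0
```

Under these assumptions a single chain costs about three times the baseline, while five-sample self-consistency costs about fifteen times, which is why sampling-based methods are usually reserved for cases where accuracy matters most.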

Long-Term Legacy

Today, in 2025:

  • CoT is foundational knowledge for any LLM engineer
  • Zero-shot CoT (“Let’s think step by step”) is a basic prompt engineering technique
  • Self-consistency is a common choice when accuracy matters more than inference cost
  • Reasoning-focused models (o1, R1) are the frontier of AI capability

Without this paper: Progress on reasoning would likely have been slower, with more effort spent tuning answer-only prompt templates.

With this paper: We learned that reasoning in large models can be elicited and dramatically improved through simple prompting, and that the effect grows with scale. That insight shaped much of the roadmap of LLM development through 2025.