Paper 14

Further Reading: Chain-of-Thought Prompting

Further Reading: Chain-of-Thought Prompting

Dive deeper into chain-of-thought reasoning and related work.


The Original Paper

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou
NeurIPS 2022 | January 2022

The foundational paper. Introduces CoT, demonstrates emergence at scale (100B+), and benchmarks on GSM8K, MATH, StrategyQA, and AQuA. Essential reading.


Key Follow-Up Papers (Read These Next)

Zero-Shot CoT: Removing the Need for Examples

Large Language Models are Zero-Shot Reasoners
Kojima, Gu, Reid, Matsuo, Iwasawa
NeurIPS 2022 | May 2022

Key insight: You don’t need human-written reasoning examples. Simply adding “Let’s think step by step” to the prompt enables reasoning on large models.

Results: On GSM8K, GPT-3 achieved 41% with zero-shot CoT (vs. 17% standard). Made CoT accessible to any task without manual example creation.

Why read it: Directly addresses the practical limitation of CoT (needing good examples). Shows that the emergent capability is so strong, even random prompts for reasoning work.


Self-Consistency: Voting on Multiple Chains

Self-Consistency Improves Chain of Thought Reasoning in Language Models
Wang, Wei, Zhou, Huang, Kumar, Liu, Shi, Chang, Cui
NeurIPS 2022 | March 2022

Key insight: Instead of generating one reasoning chain, generate multiple (e.g., 5) chains and take a majority vote on the answer. The diversity of reasoning paths compensates for individual errors.

Results: On GSM8K, self-consistency pushed text-davinci-002 from 58% to 71% (single CoT to majority vote). Now standard practice for high-stakes reasoning.

Why read it: Shows how to get even better accuracy by trading off inference cost. Also demonstrates that reasoning chains aren’t deterministic—different decoding temperatures produce different (but valid) reasoning paths.


Code-as-Reasoning: Executable Reasoning

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Gao, Mao, Chen, Pasupat, Abdelaziz, Klakow, Meng, Sen
ICLR 2023 | November 2022

Key insight: Instead of generating reasoning in natural language, generate Python code that solves the problem. Then execute the code to get the answer.

Results: Eliminates unfaithful reasoning (if code runs, answer is provably correct). Achieves strong performance on numerical tasks. Code execution makes reasoning transparent and verifiable.

Why read it: Addresses a fundamental limitation of CoT (unfaithful reasoning). Bridges reasoning and tools/computation.


Tree-of-Thought: Branching Reasoning

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Yao, Yu, Zhao, Shang, Yuan, McAuliffe, Sun, Dai
NeurIPS 2023 | May 2023

Key insight: CoT explores one linear path. What if you explore multiple branches, like a search tree? Use tree search to find the best reasoning path.

Results: On Game of 24 (a puzzle game), ToT achieved 73% vs. 66% with standard CoT. Particularly strong on tasks with branching decision points.

Why read it: Extends CoT beyond linear chains to structured search. Paves the way for more sophisticated reasoning algorithms.


Least-to-Most Prompting: Decomposing Hard Problems

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Zhou, Schärli, Hou, Cai, Chang, Liu, Sui, Clune, Schuurmans
ICLR 2023

Key insight: Solve sub-problems in order of increasing complexity. Solve simple versions first, then use those solutions to solve harder versions.

Why relevant: Another form of problem decomposition, complementary to CoT. Useful for compositional reasoning.


Constitutional AI: Chain-of-Thought with Principles

Constitutional AI: Harmlessness from AI Feedback
Bai, Kadavath, Kundu, Askell, Kernian, Jones, Chen, Conubeer, Conerly, Drain, Ghosh, Jackson, Hernandez, Hernandez, Herrick, Joseph, Kravec, Kravtsov, Loer, Olsson, Olton, Picciotto, Saunders, Sang, Santagata, Satterfield, Schroeder, Shih, Shivakumar, Sokol, Song, Staudacher, Such, Theriault, Tindall, Tsvetkova, Tworkowski, Wang, Weiss, WeLB, Weng, Weys, Whitelaw, Wiethoff, Willson, Wirth, Witter, Xia, Yan, Zaremba, Zellers, Zhang, Zhong, Zhou, Zhuang, Zoph
EMNLP 2023

Key insight: Use CoT in an RLHF setting where the reward model is trained to evaluate reasoning steps, not just final answers. Combines CoT with Constitutional AI principles.

Why relevant: Shows how CoT integrates with instruction-following and alignment (the next paper in this series, RLHF/InstructGPT).


Faithful Reasoning: Verifying that CoT is Real

Towards Faithful Reasoning in Large Language Models with Symbolic Planning and Grounding
Thawani, Prabhumoye, Deschamps
NeurIPS 2023 Workshop

Key insight: CoT reasoning is often unfaithful. Can we verify that reasoning actually led to the answer? Proposes grounding reasoning in symbolic logic.

Why relevant: Addresses the unfaithful reasoning limitation. Important for safety-critical applications.


Benchmarks and Datasets

GSM8K: Grade-School Math

Solving Quantitative Reasoning Problems with Language Models
Cobbe, Kosaraju, Bavarian, Chen, Jun, Kaiser, Plappert, Tworek, Hilton, Nakano, Hesse, Schulman
NeurIPS 2021

The benchmark used to evaluate CoT in the original paper. 8,500 grade-school math word problems.

Access: GitHub: openai/grade-school-math


MATH: Competition-Level Mathematics

Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, Burns, Kadavath, Ramamurti, Zhou, Basart, Wang, Carlini, Perez, Pettit
NeurIPS 2021

12,500 competition math problems (high school and undergraduate level). Much harder than GSM8K. Used to test CoT on harder reasoning.

Access: GitHub: openai/MATH


StrategyQA: Multi-Hop Reasoning

Did the Model Understand the Question?
Geva, Khot, Srikumar
ACL 2021

Multi-hop reasoning questions requiring chaining ideas across paragraphs. CoT helped on this benchmark because it forces explicit intermediate steps.

Access: GitHub: allenai/strategyqa


Blog Posts and Tutorials

Hugging Face: Chain-of-Thought Prompting

Hugging Face has excellent tutorials on prompt engineering, including detailed guides on chain-of-thought. Search “Chain-of-Thought Prompting” on huggingface.co.

Why read it: Practical implementation details, code examples, performance comparisons across models.


OpenAI Cookbook: Using Chain-of-Thought with GPT-4

OpenAI’s cookbook has examples of using CoT with their models (GPT-3.5, GPT-4).

Access: github.com/openai/openai-cookbook


Anthropic: Chain-of-Thought and Constitutional AI

Anthropic’s blog and papers on Constitutional AI explain how CoT is used in their alignment process.

Why read it: Shows how CoT integrates with RLHF and safety-focused reasoning.


Advanced Topics

Scaling Laws and Emergence

Emergent Abilities of Large Language Models
Wei, Tay, Bommasani, et al.
arXiv 2022

Comprehensive survey of emergent capabilities in LLMs, including reasoning. Positions CoT in the broader context of emergence.

Why read it: Theoretical understanding of why reasoning emerges at scale.


Test-Time Compute

Scaling Laws for Transfer
Bahri, Dyer, Kaplan, Lee, Sharma
arXiv 2021

Early work on test-time compute trade-offs. CoT is a form of test-time compute. Later work (OpenAI o1, DeepSeek R1) pushes this much further.

Why read it: Foundational concepts for understanding why models benefit from “thinking” (generating more tokens) at inference.


What’s Coming Next (2025+)

Reasoning Models: o1 and Beyond

OpenAI o1 (November 2024) and DeepSeek R1 (January 2025) represent the frontier: models that spend massive computation at inference time for reasoning.

These models directly extend the CoT insight:

  • If thinking helps, allocate more compute for thinking
  • Use RL to train models to reason effectively at test time
  • Achieve 90%+ on MATH, 97%+ on GSM8K

These papers will likely be released in 2025. Follow OpenAI and DeepSeek’s research pages.


Quick Reference: The CoT Ecosystem (2022–2025)

2022 Jan: Chain-of-Thought Prompting (Wei et al.) ← You are here

2022 Feb: Zero-Shot CoT (Kojima et al.)

2022 Mar: Self-Consistency (Wang et al.)

2022 May: Least-to-Most Prompting (Zhou et al.)

2022 Nov: Program-of-Thoughts (Gao et al.)

2023 May: Tree-of-Thoughts (Yao et al.)

2023 Dec: Constitutional AI (Bai et al.)

2024+: Reasoning Models (o1, R1) — massive test-time compute

Key Papers to Read in Order

  1. This paper: Chain-of-Thought Prompting — Foundation
  2. Zero-Shot CoT — Remove examples requirement
  3. Self-Consistency — Improve accuracy via voting
  4. Program-of-Thoughts — Code as reasoning
  5. Tree-of-Thoughts — Search over reasoning paths

Then read the next paper in this series: Paper 15: RLHF / InstructGPT — How CoT integrates with instruction-following.


Code Implementations

Official: Google Research

The original authors’ code repository:

github.com/google-research/google-research/tree/master/chain_of_thought

Includes evaluation scripts and prompt templates.

Hugging Face Transformers

Most examples work with Transformers library:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: ...\nA: Let me think step by step."
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(inputs, max_length=200)

API-Based: OpenAI, Anthropic

Most modern LLM APIs support CoT out of the box. Just include reasoning examples in your system prompt.


Tools and Extensions

Prompt Caching for CoT

Since CoT prompts are longer, prompt caching (storing repeated context) can reduce cost. OpenAI supports prompt caching for CoT examples.


Navigation: ← Back to Paper 14