The Idea: Show Your Work

Chain-of-Thought Prompting

The core insight is deceptively simple: include the reasoning steps in the few-shot examples.

Instead of:

Q: A store has 50 books. They order 30 more. They sell 20.
   How many books remain?
A: 60

Write:

Q: A store has 50 books. They order 30 more. They sell 20.
   How many books remain?

A: Let me work through this step by step.
   Starting inventory: 50 books
   After ordering: 50 + 30 = 80 books
   After selling: 80 - 20 = 60 books
   So the store has 60 books remaining.

When the model sees these examples with explicit intermediate steps, it learns to generate intermediate steps for the test question too.
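As a minimal sketch of how such a prompt might be assembled in code: the helper names `format_example` and `build_cot_prompt` are hypothetical, and the example text is taken from the bookstore problem above. No particular model API is assumed; the function only builds the prompt string.

```python
# Hypothetical helpers for illustration; only the prompt text is built here.
COT_EXAMPLES = [
    {
        "question": "A store has 50 books. They order 30 more. They sell 20. "
                    "How many books remain?",
        "reasoning": [
            "Starting inventory: 50 books",
            "After ordering: 50 + 30 = 80 books",
            "After selling: 80 - 20 = 60 books",
        ],
        "answer": "So the store has 60 books remaining.",
    },
]


def format_example(example: dict) -> str:
    """Render one worked example: question, then steps, then the answer."""
    lines = [
        f"Q: {example['question']}",
        "A: Let me work through this step by step.",
    ]
    lines += [f"   {step}" for step in example["reasoning"]]
    lines.append(f"   {example['answer']}")
    return "\n".join(lines)


def build_cot_prompt(examples: list[dict], test_question: str) -> str:
    """Join the worked examples and end with an open 'A:' for the model."""
    blocks = [format_example(e) for e in examples]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


prompt = build_cot_prompt(
    COT_EXAMPLES,
    "Sarah has 10 apples. She gives 3 to her friend and buys 7 more. "
    "How many apples does she have?",
)
print(prompt)
```

The prompt ends with a bare "A:" so that the model's continuation starts where the worked examples put their reasoning.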

Why This Works: Three Reasons

1. The Model Learns the Pattern of Reasoning

When the examples write out their reasoning steps, the model picks up from the prompt that this is how these problems are solved. It’s not just pattern-matching to final answers; it’s following a procedure.

Think of it like showing a student the worked solution, not just the answer. The student internalizes the method, not just the result.

2. Decomposition Reduces Error

Breaking the problem into smaller parts makes each part easier to get right. The model can:

Identify which numbers are relevant
Apply the correct operation to each pair
Combine intermediate results

Multi-step decomposition is less error-prone than trying to do everything in one shot.
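The same decomposition can be sketched in ordinary code, using the bookstore numbers from earlier. The function name `solve_in_steps` is hypothetical; the point is only that each step is a single small operation whose result is written down before the next one starts.

```python
def solve_in_steps(start: int, changes: list[tuple[str, int]]) -> tuple[int, list[str]]:
    """Apply one signed change at a time, recording each intermediate result."""
    total = start
    trace = [f"Start: {total}"]
    for label, delta in changes:
        new_total = total + delta
        sign = "+" if delta >= 0 else "-"
        trace.append(f"{label}: {total} {sign} {abs(delta)} = {new_total}")
        total = new_total
    return total, trace


remaining, steps = solve_in_steps(
    50, [("After ordering", 30), ("After selling", -20)]
)
for line in steps:
    print(line)
print("Answer:", remaining)
```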

3. The Model Can Self-Correct

When the model writes intermediate steps, it can sometimes catch its own mistakes:

Q: Sarah has 10 apples. She gives 3 to her friend and buys 7 more.
   How many apples does she have?

Generated Reasoning (with self-correction):
- Sarah starts with 10 apples.
- She gives 3 away: 10 - 3 = 7 apples.
- She buys 7 more: 7 + 7 = 14 apples.
- (Wait, let me double-check: 10 - 3 + 7 = 14. Yes, that's right.)

Answer: 14 apples

The act of writing intermediate steps creates a paper trail the model can review: every earlier step stays in its context while it generates the next one.
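A written-out chain is also easy to check from outside the model. The sketch below is an assumption about how one might do that, not something the model itself runs: it finds simple whole-number "a + b - c = d" statements in the generated reasoning and recomputes them. The names `EQUATION`, `evaluate`, and `check_steps` are hypothetical.

```python
import re

# Matches chains of whole-number additions/subtractions followed by "= result".
EQUATION = re.compile(r"\b(\d+(?:\s*[+-]\s*\d+)+)\s*=\s*(\d+)")


def evaluate(expression: str) -> int:
    """Evaluate a left-to-right chain such as '10 - 3 + 7'."""
    tokens = re.findall(r"\d+|[+-]", expression)
    total = int(tokens[0])
    for op, number in zip(tokens[1::2], tokens[2::2]):
        total = total + int(number) if op == "+" else total - int(number)
    return total


def check_steps(reasoning: str) -> list[tuple[str, bool]]:
    """Return each arithmetic statement found and whether it is correct."""
    checks = []
    for expression, claimed in EQUATION.findall(reasoning):
        is_correct = evaluate(expression) == int(claimed)
        checks.append((f"{expression} = {claimed}", is_correct))
    return checks


reasoning = """\
- Sarah starts with 10 apples.
- She gives 3 away: 10 - 3 = 7 apples.
- She buys 7 more: 7 + 7 = 14 apples.
- (Wait, let me double-check: 10 - 3 + 7 = 14. Yes, that's right.)
"""
for step, ok in check_steps(reasoning):
    print("OK " if ok else "BAD", step)
```

This only covers addition and subtraction of whole numbers; anything richer would need a real expression parser.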

The Indian Analogy (Expanded)

In an Indian coaching class preparing students for JEE or board exams, here’s what separates good students from great ones:

A struggling student (standard prompting):

Reads the problem quickly
Jumps to a formula
Writes a number
Moves on
Gets it wrong 60% of the time

A strong student (chain-of-thought):

Reads the problem carefully
Writes: “Given: mass = 5 kg, acceleration = 2 m/s², find force”
Writes: “Formula: F = ma”
Writes: “F = 5 × 2 = 10 N”
Checks: “Unit is Newtons. Does this make sense? Yes.”
Gets it right 95% of the time

The strong student doesn’t just know the formula; they show the path from problem to answer. This visible reasoning catches errors, clarifies thinking, and produces better results.

Chain-of-thought prompting teaches the language model to be the strong student.

The Emergent Capability

Here’s the shocking part: chain-of-thought gives a large gain only for large models.

The paper tests this on models of different sizes:

Model Size      | Standard Prompting | CoT Prompting | Improvement (percentage points)
----------------|--------------------|---------------|--------------------------------
8B parameters   | 10%                | 11%           | +1 (almost nothing)
62B parameters  | 14%                | 16%           | +2 (small)
540B parameters | 17%                | 58%           | +41 (massive)

This is the key discovery: reasoning is an emergent capability. Small models don’t have the capacity to learn reasoning from examples. But large models do — and when shown examples with reasoning steps, they activate this dormant capability.
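To make the comparison concrete, the gains can be recomputed from the table. The figures below are copied from the table above as given; the dictionary layout and variable names are illustrative only. The gain is an absolute difference in percentage points, not a relative percentage.

```python
# Scores in percent, copied from the table above.
results = {
    "8B": {"standard": 10, "cot": 11},
    "62B": {"standard": 14, "cot": 16},
    "540B": {"standard": 17, "cot": 58},
}

# Absolute gain in percentage points for each model size.
gains = {size: row["cot"] - row["standard"] for size, row in results.items()}

for size, gain in gains.items():
    print(f"{size:>5}: +{gain} percentage points")
```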

It’s like the difference between asking a toddler to “say the word step-by-step” (useless; they can’t yet) and asking a teenager the same thing (suddenly they’re self-aware about their speech).

What Chain-of-Thought Changes

In the model’s token-generation process:

Without CoT:

Input: [problem]
Model decodes: [answer token] [end]

With CoT:

Input: [problem]
Model decodes: [reasoning token 1] [reasoning token 2] ... [answer token] [end]

The model is still doing next-token prediction. But it predicts the reasoning tokens first, and every answer token is then conditioned on them. A number that contradicts the reasoning just written becomes much less likely than one that follows from it.

This constraint, that the answer should be consistent with the reasoning that precedes it, is what makes CoT work.
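One way to write this down is the chain rule of probability, with x for the problem, r for the reasoning tokens, and y for the answer. This is a sketch in standard autoregressive notation, not a claim about the model's internals:

```latex
% Without CoT: the answer is predicted directly from the problem.
p(y \mid x)

% With CoT: the reasoning is generated first, then the answer is
% predicted conditioned on both the problem and that reasoning.
p(r, y \mid x) = p(r \mid x)\, p(y \mid x, r)
```

The second factor, p(y | x, r), is where the written reasoning narrows down which answers are likely.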

Formal Definition

Chain-of-Thought Few-Shot Prompt Structure:

Few-shot examples:
    (x₁, r₁, y₁)  where x₁ = question, r₁ = reasoning chain, y₁ = answer
    (x₂, r₂, y₂)
    ...
    (xₖ, rₖ, yₖ)

Test input:
    x_test

Model generates:
    r_test (reasoning chain for test)
    y_test (answer based on reasoning)

The model learns to generate both the reasoning and the answer, and to ensure they’re consistent.
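The (x, r, y) triples map naturally onto a small data structure. The sketch below is illustrative: the `Example` class and `split_completion` are hypothetical names, and it assumes the generated text ends with a line starting with "Answer:", as in the apples example earlier. The completion string is hand-written for the demonstration, not real model output.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str   # x_i
    reasoning: str  # r_i
    answer: str     # y_i


def split_completion(completion: str) -> tuple[str, str]:
    """Split generated text into (r_test, y_test) at the last 'Answer:' marker.

    Assumes the completion ends with an 'Answer: ...' line.
    """
    reasoning, marker, answer = completion.rpartition("Answer:")
    if not marker:
        raise ValueError("completion has no 'Answer:' line")
    return reasoning.strip(), answer.strip()


# Hand-written stand-in for a model completion.
completion = (
    "Sarah starts with 10 apples.\n"
    "She gives 3 away: 10 - 3 = 7 apples.\n"
    "She buys 7 more: 7 + 7 = 14 apples.\n"
    "Answer: 14 apples"
)
r_test, y_test = split_completion(completion)
solved = Example(
    question="Sarah has 10 apples. She gives 3 to her friend and buys 7 more. "
             "How many apples does she have?",
    reasoning=r_test,
    answer=y_test,
)
print(solved.answer)
```

Keeping r_test and y_test separate makes it possible to score the answer on its own while still inspecting the reasoning that led to it.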