The Problem: Standard Prompting Fails at Reasoning

Standard Few-Shot Prompting

Here’s how you typically use a language model to solve a problem. You give it a few solved examples, each a question followed by only its final answer (few-shot prompting):

Q: If there are 3 cars in the parking lot and 2 more arrive,
   how many cars are there?
A: 5

Q: A baker makes 12 cookies. He sells 5. How many remain?
A: 7

Q: Sarah has 20 apples. She gives 8 to her friend.
   Her friend gives her 3 back. How many does Sarah have?
A: 15

Then you ask the model the test question:

Q: A store has 50 books. They order 30 more.
   They sell 20. How many books remain?
A: ___

And the model generates: 50 (wrong). Or 80 (wrong). Or 60 (correct, since 50 + 30 - 20 = 60, but sometimes only by luck).
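
To make the setup concrete, here is a minimal sketch of how the prompt above could be assembled in code. The helper name `build_few_shot_prompt` and the `EXAMPLES` list are illustrative assumptions, not part of any library; the point is only that every example carries a final answer and no reasoning.

```python
# Illustrative sketch: assemble a standard few-shot prompt (answers only, no reasoning).
# `build_few_shot_prompt` is a hypothetical helper, not a library function.

EXAMPLES = [
    ("If there are 3 cars in the parking lot and 2 more arrive, "
     "how many cars are there?", "5"),
    ("A baker makes 12 cookies. He sells 5. How many remain?", "7"),
    ("Sarah has 20 apples. She gives 8 to her friend. "
     "Her friend gives her 3 back. How many does Sarah have?", "15"),
]

TEST_QUESTION = ("A store has 50 books. They order 30 more. "
                 "They sell 20. How many books remain?")


def build_few_shot_prompt(examples, question):
    """Format solved examples as Q/A pairs, then append the unanswered question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")  # the model is expected to continue from here
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(EXAMPLES, TEST_QUESTION)
print(prompt)
```

The resulting string is what gets sent to the model; nothing in it shows how any answer was derived.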

Why Standard Prompting Fails

The model is not actually solving the problem. It’s:

Pattern-matching to similar sentences in the training data
Predicting a plausible number without executing logic
Guessing the format of the answer

When the problem gets complex (multi-step, unusual numbers, distractors), the model has no internal process to fall back on. It just outputs a number that “sounds right” and is often wrong.

Concrete Failure Examples

Example 1: Multi-step arithmetic

Q: Mira starts with 10 candies. She buys 2 packs, each with 5 candies.
   Then she eats 4. How many candies does she have?

Standard Prompting Output: 15
Correct Answer: 16

(Model fails because it sees "10" and "2 packs of 5" but doesn't 
reliably execute 10 + 2×5 - 4 = 10 + 10 - 4 = 16)
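
The intermediate steps for this problem can be written out explicitly. This is only a sketch of the arithmetic with illustrative variable names; it says nothing about what happens inside the model.

```python
# Illustrative trace of the intermediate steps for the candy problem.
start = 10              # candies Mira starts with
packs, per_pack = 2, 5  # she buys 2 packs of 5 candies each
eaten = 4               # candies she eats afterwards

bought = packs * per_pack         # step 1: 2 x 5 = 10
after_buying = start + bought     # step 2: 10 + 10 = 20
remaining = after_buying - eaten  # step 3: 20 - 4 = 16

print(bought, after_buying, remaining)  # prints: 10 20 16
```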

Example 2: Commonsense reasoning with a false premise

Q: All birds can fly. Penguins are birds. Can penguins fly?

Standard Prompting Output: Yes
Correct Answer: No

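(Model follows the stated premise "all birds can fly" and misses the
real-world fact that contradicts it: penguins cannot fly. Note that "Yes"
is the valid deduction from the premises as written; "No" is correct only
when the question is read as being about real penguins.)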

Example 3: Word problem with distractors

Q: A farmer has 8 cows. Each cow produces 10 liters of milk.
   The farmer's name is James. He also has 3 chickens.
   How many liters of milk does he have?

Standard Prompting Output: 80 or 83 or 24
Correct Answer: 80

(Model gets distracted by irrelevant details—the farmer's name,
the chickens—and may include or exclude them randomly)
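
A short sketch shows which quantities matter here. The two wrong values are reproduced by mixing in the chicken count; that is one plausible way such numbers could arise, stated as an assumption rather than an observed model behavior.

```python
# Illustrative sketch: which quantities in the farmer problem are relevant.
cows = 8
liters_per_cow = 10
chickens = 3  # distractor: chickens produce no milk

correct = cows * liters_per_cow          # 8 x 10 = 80

# Assumed illustrations of how the distractor could leak into the answer:
adds_distractor = correct + chickens     # 80 + 3 = 83
multiplies_distractor = cows * chickens  # 8 x 3 = 24

print(correct, adds_distractor, multiplies_distractor)  # prints: 80 83 24
```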

The Core Issue

Multi-step reasoning requires:

Breaking the problem into substeps
Solving each substep correctly
Combining results to get the final answer
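
The three requirements above can be sketched as a tiny program, here applied to the Sarah question from the few-shot examples. The `Step` class and `solve` function are hypothetical names chosen for illustration; this is a picture of explicit step-by-step execution, not a description of how a language model works internally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One substep: a description plus the operation it applies to the running total."""
    description: str
    apply: Callable[[int], int]


def solve(initial: int, steps: List[Step]) -> int:
    """Solve each substep in order and carry the result forward."""
    total = initial
    for step in steps:
        total = step.apply(total)
        print(f"{step.description}: {total}")
    return total


# Break the problem into substeps (Sarah starts with 20 apples).
steps = [
    Step("gives 8 to her friend", lambda n: n - 8),
    Step("friend gives 3 back", lambda n: n + 3),
]

answer = solve(20, steps)  # running totals: 12, then 15
```

Each intermediate total is visible, so an error in any single substep can be spotted.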

A language model trained only to predict the next token has no explicit mechanism for this. It’s trained to match patterns, not to execute algorithms.

Empirical Data (Before CoT)

On GSM8K (grade-school math), the paper reports:

GPT-3 (175B, standard prompting): 15.6% accuracy
PaLM (540B, standard prompting): 17.9% accuracy
Going from 175B to 540B parameters barely changes accuracy under standard prompting

The plateau is real: throwing more parameters at the problem doesn’t solve it if the model isn’t reasoning.
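
For context on what an accuracy figure like this measures: math benchmarks of this kind are typically scored by whether the final answer matches the reference. Below is a minimal sketch of that metric, assuming answers have already been parsed to integers. The function name and the toy data are illustrative assumptions, not benchmark results.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of problems whose predicted final answer equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one problem to score")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data based on problems from this section (hypothetical predictions).
references = [16, 80, 15]   # candy problem, farmer problem, Sarah's apples
predictions = [15, 80, 15]  # the first prediction is off by one

accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.0%}")  # prints: 67%
```

A near-miss such as 15 instead of 16 counts as fully wrong under this metric, which is part of why multi-step errors are so costly.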

The question becomes: Can we change how we present the problem to make the model reason?