The Math: Language Modeling and In-Context Learning
GPT-3 uses the same objective function as GPT-1. The innovation is scale, not mathematics. But understanding the math clarifies how in-context learning works.
Prerequisites: Cross-Entropy Loss, Conditional Probability
The Objective: Causal Language Modeling
The model learns to predict the next token given all previous tokens. This is called causal language modeling: each position can use only the context that precedes it.
Objective:
L = -1/N * Σ log P(u_i | u_1, u_2, ..., u_{i-1})
where:
u_i = the i-th token in the sequence
N = number of token predictions being averaged (the worked example below scores 2 predictions for a 3-token sequence)
P(u_i | u_1,...,u_{i-1}) = probability the model assigns to token u_i
given all previous tokens
log P(...) = log-probability (always ≤ 0; values closer to 0 = model is confident, large negative values = model is unsure)
The negative sign and the average over tokens turn this into the cross-entropy loss
We minimize this loss: the model learns to assign high probability to the correct next token.
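The objective above can be sketched in a few lines of Python. This is a minimal illustration, not GPT-3's training code: the helper names (`log_softmax`, `causal_lm_loss`) and the 3-word toy vocabulary are assumptions made for the example, and the "logits" are hand-written scores rather than outputs of a real model.

```python
import math

def log_softmax(logits):
    """Convert raw scores to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def causal_lm_loss(logits_per_position, target_ids):
    """Average negative log-probability of each correct next token.

    logits_per_position[i] holds scores over the vocabulary for the token
    that follows position i; target_ids[i] is the index of the token that
    actually came next.
    """
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        total += -log_softmax(logits)[target]
    return total / len(target_ids)

# Toy 3-word vocabulary. Uniform scores give P = 1/3 for every token,
# so the loss equals log(3), about 1.0986.
uniform_loss = causal_lm_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])

# A model that strongly favors the correct token gets a loss near 0.
confident_loss = causal_lm_loss([[10.0, 0.0, 0.0]], [0])
```

Working in log space (subtracting the maximum before exponentiating) avoids overflow when scores are large, which is why the sketch computes log-probabilities directly instead of taking the log of a softmax.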
Worked Example: Computing Cross-Entropy Loss
Sequence: “I love cats”
Tokens: [I, love, cats]
Let’s compute the loss. Assume the vocabulary has 50,000 words. All logarithms are natural logs.
Step 1: Predict token 2 (love) from token 1 (I)
The model outputs probabilities over all 50,000 words. Let’s say:
- P(love | I) = 0.3
- P(dogs | I) = 0.2
- P(hate | I) = 0.1
- [all other words share remaining 0.4]
The correct token is “love”. Loss contribution:
L_1 = -log(0.3) = -(-1.204) = 1.204
(Higher probability → lower loss. If P(love|I) were 0.9, loss would be -log(0.9) = 0.105.)
Step 2: Predict token 3 (cats) from tokens 1–2 (I love)
The model computes:
- P(cats | I, love) = 0.5
- P(dogs | I, love) = 0.2
- P(people | I, love) = 0.1
- [remaining 0.2]
The correct token is “cats”. Loss contribution:
L_2 = -log(0.5) = 0.693
Step 3: Total Loss
Average loss over the 2 predictions:
L = (L_1 + L_2) / 2 = (1.204 + 0.693) / 2 ≈ 0.949
The model learns by minimizing this. If it can increase P(love|I) from 0.3 to 0.8 and increase P(cats|I,love) from 0.5 to 0.9, the loss drops to:
L = (-log(0.8) - log(0.9)) / 2 = (0.223 + 0.105) / 2 = 0.164
Much lower loss = better model.
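The arithmetic in this worked example can be checked with a short script. The function name `average_nll` is an assumption for illustration, and the probabilities are the ones assumed in the example, not real model outputs. Results may differ from the text in the last digit because the text rounds intermediate values.

```python
import math

def average_nll(probs_of_correct_tokens):
    """Mean of -log(p) over the probabilities assigned to the correct tokens."""
    losses = [-math.log(p) for p in probs_of_correct_tokens]
    return sum(losses) / len(losses)

# P(love | I) = 0.3 and P(cats | I, love) = 0.5, as in Steps 1 and 2.
before = average_nll([0.3, 0.5])

# The improved model: P(love | I) = 0.8 and P(cats | I, love) = 0.9.
after = average_nll([0.8, 0.9])

print(f"before: {before:.3f}, after: {after:.3f}")
```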
In-Context Learning: Conditional Probability
During inference (generation), the model doesn’t change its weights. Instead, the prompt conditions the probability distribution.
For a sentiment task:
Prompt examples:
[Review: "great movie", Sentiment: positive]
[Review: "bad food", Sentiment: negative]
Task:
[Review: "nice book", Sentiment: ?]
The full input is a sequence of tokens, with [?] marking the position the model must fill in:
[Review:] [great] [movie] [Sentiment:] [positive] [Review:] [bad] [food] [Sentiment:] [negative] [Review:] [nice] [book] [Sentiment:] [?]
The model predicts the next token after “Sentiment:”. It computes:
P(next token | all previous tokens in the prompt)
Because the previous tokens include examples, the model’s distribution shifts. Informally, three things happen, none of which involves changing the weights:
- Task recognition: the examples show review → sentiment pairs, which signals a sentiment task
- Pattern inference: the prompt shows positive after “great”, negative after “bad”
- Application: when the model sees “nice”, the probability of “positive” rises because “nice” resembles “great” more than “bad”
All of this happens in the forward pass (inference). No weight updates.
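To make the setup concrete, here is a minimal sketch of how such a few-shot prompt might be assembled as plain text. The function name `build_few_shot_prompt` and the exact layout (one example per line, comma-separated fields) are assumptions for illustration; the source does not prescribe a format.

```python
def build_few_shot_prompt(examples, test_review):
    """Join labeled examples and an unlabeled test input into one prompt.

    The prompt ends right after "Sentiment:" so that the model's next-token
    prediction is the answer. No weights are touched; the examples only
    become part of the context the model conditions on.
    """
    lines = [f'Review: "{review}", Sentiment: {label}' for review, label in examples]
    lines.append(f'Review: "{test_review}", Sentiment:')
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("great movie", "positive"), ("bad food", "negative")],
    "nice book",
)
print(prompt)
```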
Formal Definition
In-context learning on a task with examples (x_1, y_1), …, (x_k, y_k) and a test input x_test:
Input sequence: [x_1, y_1, x_2, y_2, ..., x_k, y_k, x_test]
Output prediction: ŷ = argmax_y P(y | x_1, y_1, ..., x_k, y_k, x_test)
The prediction ŷ is the completion to which the model assigns the highest probability given the examples and the test input.
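The argmax in this definition can be sketched as follows. Everything here is illustrative: `predict_label` is a hypothetical helper, and `toy_log_prob` is a stand-in that returns made-up numbers in place of a real language model's log-probabilities.

```python
import math

def predict_label(log_prob_fn, prompt, candidate_labels):
    """Return the label y that maximizes log P(y | prompt)."""
    return max(candidate_labels, key=lambda y: log_prob_fn(prompt, y))

def toy_log_prob(prompt, label):
    """Stand-in scorer with invented probabilities (NOT a real model).

    A real system would run the frozen model on the prompt and read off
    the log-probability of the label token(s).
    """
    made_up = {"positive": 0.7, "negative": 0.2}
    return math.log(made_up.get(label, 0.1))

prompt = (
    'Review: "great movie", Sentiment: positive\n'
    'Review: "bad food", Sentiment: negative\n'
    'Review: "nice book", Sentiment:'
)
prediction = predict_label(toy_log_prob, prompt, ["positive", "negative"])
```

Restricting the argmax to a fixed set of candidate labels is one common design choice for classification tasks; free-form generation would instead search over the whole vocabulary.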
The model is trained on the objective:
L = -1/N * Σ log P(u_i | u_1, ..., u_{i-1})
applied to all training sequences, so it is trained to predict the next token given its context. During in-context learning, that context simply includes the prompt examples.
Why Scale Enables In-Context Learning
Smaller models (GPT-1, 117M parameters) show only weak in-context learning because they have limited capacity. When they must both infer the task pattern from the prompt and generate the answer, they often fail.
Larger models (GPT-3, 175B parameters) can hold the task pattern in attention and in the hidden states while generating the answer. They have enough capacity:
- To store patterns about what a sentiment classifier should do
- To recognize the task from examples
- To apply the pattern to new inputs
Mathematically, this isn’t a different mechanism. It’s the same transformer forward pass. But the capacity allows the pattern-matching to work.
The Attention Mechanism’s Role
The transformer’s self-attention layer is key to in-context learning:
Query: q_i = W_q * h_i (current token representation)
Key: k_j = W_k * h_j (the current and all previous tokens, j ≤ i)
Value: v_j = W_v * h_j (the current and all previous tokens, j ≤ i)
Attention weights: α_ij = softmax over j of ( (q_i · k_j) / √d_k )
Output: o_i = Σ_j α_ij * v_j (weighted sum of values)
When the model attends to the prompt examples (high α for example tokens), information from those tokens flows into the current position’s representation. Attending to the test input lets the model combine that information with the new input.
GPT-3 has 96 attention heads in each of its 96 layers, which allows different parts of the model to attend to different aspects simultaneously: one head might focus on the task format, another on semantic similarity between examples and the test input.
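The attention equations above can be sketched for a single head in plain Python. This is a simplified illustration under stated assumptions: the queries, keys, and values are given directly as small hand-written vectors (the W_q, W_k, W_v projections are skipped), and the function names are chosen for this example only.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i attends only to positions j <= i, so later tokens can
    never influence earlier ones.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [dot(q, keys[j]) / math.sqrt(d_k) for j in range(i + 1)]
        weights = softmax(scores)
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three positions with 2-dimensional toy vectors.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = causal_attention(q, k, v)
```

In this toy input, the first position can only attend to itself, while the last position spreads its weight over all three positions and favors the key most aligned with its query.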
Worked Example: Attention in In-Context Learning
Consider the input sequence (simplified, using position-based indexing):
Position 0: "Review:"
Position 1: "great"
Position 2: "movie"
Position 3: "Sentiment:"
Position 4: "positive"
Position 5: "Review:"
Position 6: "bad"
Position 7: "food"
Position 8: "Sentiment:"
Position 9: "negative"
Position 10: "Review:"
Position 11: "nice"
Position 12: "book"
Position 13: "Sentiment:"
Position 14: ?
When the model generates the token at position 14, a plausible (simplified) account is that it:
- Computes attention weights over positions 0–13 (all previous tokens)
- Might attend heavily to position 4 (positive) and position 9 (negative) because they’re example sentiments
- Might represent “nice book” (positions 11–12) as similar to “great movie” (positions 1–2)
- Might carry information from position 4 (positive) forward to predict “positive” at position 14
This is all done in the forward pass, via attention. No weight updates.
Key Equations Summary
Causal Language Model Loss:
L = -1/N * Σ log P(u_i | u_1, ..., u_{i-1})
In-Context Learning Setup:
Input: [Example tokens] + [Test input tokens]
Output: P(y_test | example tokens, test input tokens)
Attention:
Attention(Q, K, V) = softmax((Q * K^T) / √d_k) * V
Full Transformer:
output = Attention(input, input, input) + input [+ layer norm]
output = FFN(output) + output [+ layer norm]
(repeat 96 times for GPT-3)
No new equations compared to GPT-1. The innovation is in how scale enables these mechanisms to work powerfully.
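The residual structure of the stack can be sketched in Python with deliberately simple stand-ins for the sublayers. These are assumptions for illustration only: `uniform_causal_attention` averages over the current and earlier positions instead of computing learned attention weights, `relu_ffn` replaces the feed-forward network with an element-wise ReLU, layer norm is omitted, and every layer reuses the same stand-ins, whereas a real model has separate learned weights per layer.

```python
def add(a, b):
    """Element-wise sum of two sequences of vectors (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def uniform_causal_attention(hidden):
    """Stand-in for attention: position i gets the mean of positions 0..i."""
    out = []
    for i in range(len(hidden)):
        prefix = hidden[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def relu_ffn(hidden):
    """Stand-in for the feed-forward network: element-wise ReLU."""
    return [[max(0.0, x) for x in row] for row in hidden]

def run_stack(hidden, num_layers):
    """Apply attention + residual, then FFN + residual, num_layers times."""
    for _ in range(num_layers):
        hidden = add(hidden, uniform_causal_attention(hidden))
        hidden = add(hidden, relu_ffn(hidden))
    return hidden

# Two inputs that differ only at the last position.
a = run_stack([[1.0, -1.0], [0.5, 2.0], [3.0, 0.0]], num_layers=2)
b = run_stack([[1.0, -1.0], [0.5, 2.0], [-7.0, 9.0]], num_layers=2)
```

Because the stand-in attention is causal, changing the last position leaves the outputs at earlier positions unchanged, which mirrors the property that makes next-token prediction well defined.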
Key Takeaways from This Section
- Objective: Minimize cross-entropy loss on causal language modeling.
- In-context learning: Prompt examples condition the probability distribution; the model learns via attention in the forward pass.
- No fine-tuning: All learning happens through the input prompt, not weight updates.
- Attention is the mechanism: Different heads attend to different parts of the prompt examples and test input.
- Scale enables capacity: 175B parameters give the model room to both hold task patterns and generate answers.