Conditional Probability
🟡 Intermediate. You need basic probability (what P(A) means, how to count outcomes). Read Probability Basics first.
The core question
You roll a six-sided die. Someone peeks at the result and tells you “the number is even.” What is the probability it is a 6?
Without that hint, P(6) = 1/6. With the hint, you know the result is one of {2, 4, 6} — three equally likely outcomes — and only one of them is 6. So P(6 | even) = 1/3.
That vertical bar is the symbol for “given.” P(6 | even) means “the probability of rolling a 6, given that the roll was even.”
The formula:
P(A | B) = P(A ∩ B) / P(B)
Read: “the probability of A given B equals the probability that both A and B happen, divided by the probability that B happens.”
In the die example:
- A = “roll is 6”, B = “roll is even”
- P(A ∩ B) = P(“roll is 6 AND even”) = P(6) = 1/6 (6 is even, so this is just P(6))
- P(B) = P(“roll is even”) = 3/6 = 1/2
- P(A | B) = (1/6) / (1/2) = 2/6 = 1/3 ✓
The Indian analogy
Think of a bus stop. Every 10 minutes, a bus arrives. Of those buses, 60% go to Chowk (town centre) and 40% go to the railway station. Of the Chowk buses, 70% are red (government buses). Of the railway buses, only 20% are red.
You see a red bus approaching. What is the probability it goes to Chowk?
This is the classic conditional probability setup. The new information (“it’s red”) changes your probabilities. Without seeing the colour, you’d say 60% Chowk. With the colour, you have to update — because red buses are much more common on the Chowk route. The answer turns out to be about 84% (we’ll compute this below).
Numerical example 1: the die
Already done above. P(6 | even) = 1/3.
Let us verify with counting. When the die is even, the sample space shrinks from {1,2,3,4,5,6} to {2,4,6}. In this reduced space, 6 appears once out of three. P(6|even) = 1/3. Same answer, no formula needed — but the formula works for any probability, not just equally-likely outcomes.
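The two routes to the answer can be checked against each other in a few lines. This is a minimal sketch assuming a fair die; the helper `prob` and the predicate names are illustrative, not part of any library.

```python
from fractions import Fraction

# Sample space of a fair six-sided die; every outcome is equally likely.
outcomes = [1, 2, 3, 4, 5, 6]

def prob(event):
    """P(event) under equally likely outcomes; `event` is a predicate on a roll."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

is_six = lambda o: o == 6
is_even = lambda o: o % 2 == 0

# Route 1: the formula P(A | B) = P(A ∩ B) / P(B).
p_formula = prob(lambda o: is_six(o) and is_even(o)) / prob(is_even)

# Route 2: shrink the sample space to the even rolls and count.
even_rolls = [o for o in outcomes if is_even(o)]
p_counting = Fraction(sum(is_six(o) for o in even_rolls), len(even_rolls))

print(p_formula, p_counting)  # 1/3 1/3
```

Using `Fraction` keeps the arithmetic exact, so the two results can be compared with `==` rather than a tolerance.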
Numerical example 2: the bus stop (Bayes’ theorem preview)
Setup:
- P(Chowk) = 0.60, P(Railway) = 0.40
- P(Red | Chowk) = 0.70
- P(Red | Railway) = 0.20
Question: P(Chowk | Red) = ?
Step 1 — find P(Red) using the law of total probability:
P(Red) = P(Red | Chowk) × P(Chowk) + P(Red | Railway) × P(Railway)
= 0.70 × 0.60 + 0.20 × 0.40
= 0.420 + 0.080
= 0.500
So 50% of buses are red overall.
Step 2 — apply the formula:
P(Chowk | Red) = P(Red ∩ Chowk) / P(Red)
= P(Red | Chowk) × P(Chowk) / P(Red)
= (0.70 × 0.60) / 0.500
= 0.420 / 0.500
= 0.840
A red bus is 84% likely to be a Chowk bus, even though only 60% of buses overall go to Chowk. The colour gave you information.
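The two steps translate directly into code. The sketch below uses only the numbers from the setup; the dictionary names `prior` and `p_red_given` are illustrative choices.

```python
# Numbers from the bus-stop setup.
prior = {"Chowk": 0.60, "Railway": 0.40}        # P(route)
p_red_given = {"Chowk": 0.70, "Railway": 0.20}  # P(Red | route)

# Step 1, law of total probability:
# P(Red) = sum over routes of P(Red | route) * P(route).
p_red = sum(p_red_given[r] * prior[r] for r in prior)

# Step 2, definition of conditional probability applied to each route:
# P(route | Red) = P(Red | route) * P(route) / P(Red).
posterior = {r: p_red_given[r] * prior[r] / p_red for r in prior}

print(round(p_red, 3))                  # 0.5
print(round(posterior["Chowk"], 3))     # 0.84
print(round(posterior["Railway"], 3))   # 0.16
```

The two posteriors sum to 1, as they must: a red bus goes to one of the two destinations.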
The chain rule of probability
Here is the most important identity for language models.
For two events:
P(A ∩ B) = P(A) × P(B | A)
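This just rearranges the definition. Writing the definition with the roles of A and B swapped gives P(B | A) = P(A ∩ B) / P(A), so P(A ∩ B) = P(A) × P(B | A). The intersection is symmetric in A and B, so P(A ∩ B) = P(B) × P(A | B) holds as well.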
For three events:
P(A ∩ B ∩ C) = P(A) × P(B | A) × P(C | A, B)
For four events:
P(A ∩ B ∩ C ∩ D) = P(A) × P(B | A) × P(C | A, B) × P(D | A, B, C)
The pattern: you can always factor a joint probability into a chain of conditionals. Each factor conditions on everything that came before.
Why this matters: there is no approximation here. The chain rule is exact for any probability distribution, provided each event you condition on has non-zero probability.
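One way to convince yourself that the factorisation is exact is to test it on an arbitrary joint distribution. The sketch below builds a made-up distribution over three binary variables (the random weights and the seed are assumptions for illustration only), computes each conditional as a ratio of marginals, and checks that the product recovers every joint probability up to floating-point rounding.

```python
import itertools
import random

random.seed(0)  # arbitrary seed, only for repeatability

# A made-up joint distribution over three binary variables (A, B, C).
states = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in states]
total = sum(weights)
joint = {s: w / total for s, w in zip(states, weights)}

def marginal(prefix):
    """Probability that the first len(prefix) variables equal `prefix`."""
    return sum(p for s, p in joint.items() if s[:len(prefix)] == prefix)

def chain_rule(state):
    """P(a) * P(b | a) * P(c | a, b), each conditional a ratio of marginals."""
    product = 1.0
    for t in range(1, len(state) + 1):
        product *= marginal(state[:t]) / marginal(state[:t - 1])
    return product

max_gap = max(abs(chain_rule(s) - joint[s]) for s in states)
print(max_gap < 1e-12)  # True
```

The product telescopes: each denominator cancels the previous numerator, which is why no approximation is involved.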
Numerical example 3: the chain rule
Consider the probability that a randomly chosen student in a class (A) is in the science stream, (B) scored above 80% in maths, and (C) chose engineering for college.
P(science) = 0.40
P(above 80% | science) = 0.50
P(engineering | science AND above 80%) = 0.80
Joint probability (all three):
P(science ∩ above 80% ∩ engineering)
= P(science) × P(above 80% | science) × P(engineering | science, above 80%)
= 0.40 × 0.50 × 0.80
= 0.160
16% of students are science students who scored above 80% and chose engineering. Each factor narrows the probability by conditioning on what came before.
Conditional probability in language models
A sentence is a sequence of words: w₁, w₂, w₃, …, wₙ.
The probability of the full sentence is:
P(w₁, w₂, w₃, ..., wₙ)
Using the chain rule, this factors exactly as:
P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂ | w₁) × P(w₃ | w₁, w₂) × ... × P(wₙ | w₁, ..., wₙ₋₁)
In product notation:
P(sentence) = ∏ₜ P(wₜ | w₁, w₂, ..., wₜ₋₁)
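This is the autoregressive factorisation. It says: the probability of a sentence equals the product of the probability of each word given all the words before it. The product runs over t = 1, …, n; for t = 1 the context is empty, so the first factor is just P(w₁).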
Why is this useful? Because it converts the hard problem of “assign a probability to this whole sentence” into a chain of simpler problems: “given the words so far, what is the probability of the next word?”
A language model learns to estimate P(wₜ | w₁, …, wₜ₋₁) for every possible context. Once it can do that, it can:
- Score any sentence (multiply the conditional probabilities)
- Generate text (sample the next word repeatedly)
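Both uses can be sketched with a toy stand-in for the model. The code below does not train a neural network; it swaps in plain relative-frequency counts over a hypothetical four-sentence corpus to estimate P(next word | all previous words). The corpus, the `</s>` end marker, and the function names are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus; "</s>" is an assumed end-of-sentence marker.
corpus = [
    "chai bahut garam hai",
    "chai bahut garam hai",
    "chai bahut thanda hai",
    "paani bahut thanda hai",
]

# counts[context][word] = how often `word` followed the full prefix `context`.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split() + ["</s>"]
    for t, word in enumerate(words):
        counts[tuple(words[:t])][word] += 1

def next_word_probs(context):
    """Estimate P(next word | context) as relative frequencies."""
    seen = counts[tuple(context)]
    total = sum(seen.values())
    return {w: c / total for w, c in seen.items()}

def score(sentence):
    """Multiply the conditional probability of each word, end marker included."""
    words = sentence.split() + ["</s>"]
    p = 1.0
    for t, word in enumerate(words):
        p *= next_word_probs(words[:t]).get(word, 0.0)
    return p

def generate(rng):
    """Sample one word at a time until the end marker is drawn."""
    words = []
    while True:
        probs = next_word_probs(words)
        word = rng.choices(list(probs), weights=list(probs.values()))[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)

print(round(score("chai bahut garam hai"), 3))  # 0.5
print(generate(random.Random(0)))
```

A count table like this assigns probability zero to any prefix it has never seen, which is one reason real language models use a learned function instead of a lookup.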
Numerical example 4: sentence probability
Sentence: “chai bahut garam hai” (the tea is very hot)
Suppose a language model has learned these conditional probabilities from a corpus:
P(chai) = 0.010                          ← "chai" is the first word of 1% of sentences
P(bahut | chai) = 0.200 ← 20% of the time after "chai", "bahut" follows
P(garam | chai bahut) = 0.150 ← 15% of the time after "chai bahut", "garam" follows
P(hai | chai bahut garam) = 0.700 ← 70% of the time after that trigram, "hai" follows
Joint probability of the full sentence:
P("chai bahut garam hai")
= P(chai) × P(bahut | chai) × P(garam | chai bahut) × P(hai | chai bahut garam)
= 0.010 × 0.200 × 0.150 × 0.700
= 0.000210
This sentence has probability 0.021% — low in absolute terms (there are millions of possible sentences), but reasonable for a four-word phrase. Now compare:
“chai bahut thanda hai” (the tea is very cold): same structure, but “thanda” after “chai bahut” might have P = 0.050. Keeping the other three factors unchanged (which assumes “hai” is just as likely after “chai bahut thanda”) gives 0.010 × 0.200 × 0.050 × 0.700 = 0.000070. That is a third as likely as the “garam” version, perhaps because in this corpus tea is more often mentioned as hot.
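The comparison is a pair of four-term products. A minimal sketch, using the illustrative probabilities quoted above (the list names `hot` and `cold` are just labels for the two sentences):

```python
import math

# Conditional probabilities for the "hot" sentence, in word order.
hot = [0.010, 0.200, 0.150, 0.700]
# The "cold" sentence: only the third factor changes (0.050 instead of 0.150).
cold = [0.010, 0.200, 0.050, 0.700]

p_hot = math.prod(hot)
p_cold = math.prod(cold)

print(f"{p_hot:.6f}")              # 0.000210
print(f"{p_cold:.6f}")             # 0.000070
print(round(p_hot / p_cold, 3))    # 3.0
```

Because three of the four factors are shared, the ratio of the sentence probabilities reduces to 0.150 / 0.050 = 3.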
What language models learn to predict
A language model trained on the autoregressive objective sees a sequence like:
[chai] [bahut] [garam] [???]
and must predict the distribution over all possible next words. It learns to give high probability to “hai”, lower probability to “tha” (was), very low probability to “rocket”, essentially zero to random symbols.
The training loss is the negative log-likelihood:
L = − Σₜ log P(wₜ | w₁, ..., wₜ₋₁)
Minimising this loss is equivalent to making the model assign high probability to the actual next words in the training data. After training on millions of sentences, the model’s conditional distributions capture grammar, facts, reasoning patterns — everything that determines which words follow which.
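Log probabilities: we use logs because probabilities multiply (chain rule) while log-probabilities add, and a sum of logs stays numerically stable where a long product of small numbers would underflow to zero. Also, the log of a small probability is a large negative number, which makes a bad prediction easy to spot. Here log is the natural logarithm, so the result is in nats.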
log(0.010 × 0.200 × 0.150 × 0.700)
= log(0.010) + log(0.200) + log(0.150) + log(0.700)
= (−4.605) + (−1.609) + (−1.897) + (−0.357)
= −8.468
Negative log-likelihood of “chai bahut garam hai” = 8.468 nats. Lower is better.
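The same sum can be reproduced with the standard library, along with a small demonstration of why the log form is preferred. The 200-word sequence with every conditional equal to 0.01 is a made-up stress case, not taken from any corpus.

```python
import math

# The four conditional probabilities from the worked example.
factors = [0.010, 0.200, 0.150, 0.700]

log_probs = [math.log(p) for p in factors]  # natural log, so units are nats
nll = -sum(log_probs)

print(round(nll, 3))             # 8.468
print(round(math.exp(-nll), 6))  # 0.00021

# A long product of small numbers underflows to 0.0 in 64-bit floats,
# while the equivalent sum of logs remains an ordinary number.
long_sequence = [0.01] * 200
print(math.prod(long_sequence))                            # 0.0
print(round(sum(math.log(p) for p in long_sequence), 1))   # -921.0
```

Exponentiating the negated loss recovers the sentence probability, so nothing is lost by working in log space.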
Independence vs. dependence
Two events A and B are independent if knowing B tells you nothing about A:
P(A | B) = P(A) ↔ A and B are independent
Equivalently: P(A ∩ B) = P(A) × P(B).
In language, almost nothing is independent. Knowing the previous words tells you a lot about the next word. That dependence is exactly what language models exploit. If words were independent, a model could not do better than word-frequency tables: the context would carry no information, so it could capture nothing about grammar or meaning.
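The product test for independence is easy to run on the die from the first example. In this sketch the event "roll is at most 4" is an extra event introduced purely for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

outcomes = range(1, 7)  # fair six-sided die

def prob(event):
    """P(event) under equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), 6)

def independent(a, b):
    """True when P(A ∩ B) == P(A) * P(B), checked with exact fractions."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

even = lambda o: o % 2 == 0
at_most_four = lambda o: o <= 4
six = lambda o: o == 6

print(independent(even, at_most_four))  # True:  1/3 == 1/2 * 2/3
print(independent(even, six))           # False: 1/6 != 1/2 * 1/6
```

Learning that the roll is even leaves the chance of "at most 4" at 2/3, but it doubles the chance of a 6 from 1/6 to 1/3, which is the dependence the opening example relied on.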
Key formulas — quick reference
Definition: P(A | B) = P(A ∩ B) / P(B)
Product rule: P(A ∩ B) = P(A) × P(B | A)
Chain rule (n vars): P(x₁,...,xₙ) = ∏ᵢ P(xᵢ | x₁,...,xᵢ₋₁)
Independence: P(A | B) = P(A) [iff A, B independent]
LM objective: L = − Σₜ log P(wₜ | w₁,...,wₜ₋₁)
Summary
Conditional probability P(A|B) measures how knowing B changes your belief about A. The chain rule lets you factor any joint probability into a product of conditionals. Applied to sentences, this gives the autoregressive language modelling objective: predict each word given the words before it. GPT-1, GPT-2, GPT-3, and other modern decoder-only language models are pre-trained on this objective (over sub-word tokens rather than whole words): billions of applications of P(wₜ | w₁, …, wₜ₋₁).
Next: GPT-1 — Paper 10 →