Probability Basics

1. What is this and why do we care?

Every language model — GPT, Claude, Gemini — is, at its heart, a probability machine.

When you type “The capital of India is” and the model completes it with “New Delhi,” what actually happened is: the model computed the probability of every possible next word, and picked the most likely one. The probability of “New Delhi” was very high. The probability of “banana” was very low.

When a Perceptron makes a decision, it is making a probabilistic judgement: this input probably belongs to class 1, rather than definitely belonging to it. When we train a model, we measure how wrong it is using a concept called cross-entropy, which is built on probability.

Probability is not optional background knowledge for AI. It is the language AI speaks.

2. Prerequisites

None. Basic arithmetic only — you need to know fractions and percentages. If you know that 50% = 0.5 = 1/2, you already have the foundation.

3. The intuition — before any symbols

Every day in India, people make probabilistic statements without realising it:

“Kal shayad barish hogi.” (There might be rain tomorrow.)
“India will probably win this match.”
“He might get into IIT.”
“The bus is usually late.”

All of these are probability statements. They express how likely something is — not certainty, not impossibility, but somewhere in between.

Probability is a way of turning that “how likely” feeling into a precise number between 0 and 1.

0 means: impossible. It will definitely not happen.
1 means: certain. It will definitely happen.
0.5 means: exactly 50-50. A coin toss.
0.9 means: very likely. India winning against Namibia, perhaps.
0.1 means: unlikely, but possible.

4. A tiny worked example with real numbers

Example: Tossing a fair coin

A fair coin has two outcomes: Heads (H) and Tails (T). Each is equally likely.

P(Heads) = 1/2 = 0.5
P(Tails) = 1/2 = 0.5

Check: the probabilities of all possible outcomes must add up to 1.

P(Heads) + P(Tails) = 0.5 + 0.5 = 1.0  ✓

This is the first rule of probability: probabilities of all outcomes sum to 1.

Example: Rolling a fair die

A die has 6 faces: 1, 2, 3, 4, 5, 6.

P(rolling a 3) = 1/6 ≈ 0.167
P(rolling an even number) = 3/6 = 0.5   (faces 2, 4, 6)
P(rolling a number > 4) = 2/6 ≈ 0.333   (faces 5, 6)

Check: P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 6 × (1/6) = 1.0 ✓
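
The die calculations above can be reproduced by counting outcomes. The following is a minimal Python sketch, not part of the original tutorial; the helper name `probability` is made up for illustration, and it assumes every face is equally likely.

```python
from fractions import Fraction

# Sample space of a fair six-sided die: every face is equally likely.
faces = [1, 2, 3, 4, 5, 6]

def probability(event, outcomes):
    """P(event) = favourable outcomes / total outcomes (equally likely outcomes only)."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

p_three = probability(lambda face: face == 3, faces)     # rolling a 3
p_even = probability(lambda face: face % 2 == 0, faces)  # faces 2, 4, 6
p_gt_four = probability(lambda face: face > 4, faces)    # faces 5, 6

# The six single-face probabilities must add up to exactly 1.
total = sum(probability(lambda face, f=f: face == f, faces) for f in faces)

print(p_three, p_even, p_gt_four, total)  # 1/6 1/2 1/3 1
```

Using `Fraction` instead of decimals keeps the arithmetic exact, so the sum-to-1 check is not blurred by rounding.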

5. The general rule

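If an experiment has n equally likely outcomes o₁, o₂, …, oₙ, then the probability of an event A is:

P(A) = (number of outcomes in which A happens) / (total number of equally likely outcomes)

For a single outcome this gives P(oᵢ) = 1/n. When the outcomes are not equally likely (as with next-word probabilities), this counting formula does not apply, but the rules below still hold.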

And always:

P(o₁) + P(o₂) + ... + P(oₙ) = 1

The probability of an event not happening:

P(not A) = 1 - P(A)

Example: If P(rain tomorrow) = 0.3, then P(no rain tomorrow) = 1 - 0.3 = 0.7.
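
As a small illustration of the "not A" rule, here is a Python sketch. The function name `complement` is hypothetical, chosen only for this example; the range check reflects the rule that a probability lies between 0 and 1.

```python
def complement(p):
    """Return P(not A) given P(A)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    return 1.0 - p

p_rain = 0.3
p_no_rain = complement(p_rain)

# Rounded only to hide tiny floating-point noise when printing.
print(round(p_no_rain, 10))  # 0.7
```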

6. A slightly bigger example — monsoon season

In June in Mumbai, it rains on roughly 15 out of 30 days.

P(rain on a random June day in Mumbai) = 15/30 = 0.5
P(no rain) = 1 - 0.5 = 0.5

In September in Rajasthan, it rains on roughly 3 out of 30 days:

P(rain in Rajasthan in September) = 3/30 = 0.1
P(no rain) = 0.9
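
The same arithmetic can be written as a short Python sketch that turns day counts into probabilities. The function name `rain_probability` is an illustrative assumption; the day counts are the rough figures quoted above, not measured data.

```python
from fractions import Fraction

def rain_probability(rainy_days, total_days):
    """Estimate P(rain) as the share of days on which it rained."""
    return Fraction(rainy_days, total_days)

# Rough figures from the text above.
p_mumbai_june = rain_probability(15, 30)
p_rajasthan_sept = rain_probability(3, 30)

for place, p in [("Mumbai, June", p_mumbai_june),
                 ("Rajasthan, September", p_rajasthan_sept)]:
    # P(no rain) follows from the complement rule: 1 - P(rain).
    print(place, "P(rain) =", float(p), "P(no rain) =", float(1 - p))
```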

Now: what if you see dark clouds? The probability of rain given dark clouds is much higher than the base probability. This is conditional probability — covered in the next tutorial. It is the mathematical foundation of how AI models update their beliefs based on context.

7. How probability connects to language models

Here is the most important connection between probability and AI.

A language model learns to predict: given the words so far, what is the probability of each possible next word?

Example: You type “The sun rises in the”

The model assigns probabilities to every word in its vocabulary:

P("east")    = 0.85
P("morning") = 0.08
P("west")    = 0.04
P("sky")     = 0.02
P("banana")  = 0.000001
...

These all add up to 1.0 (across thousands of words).

The model picks “east”, the highest-probability word. This is called greedy decoding (also loosely called greedy sampling). (Real models often sample from the probability distribution instead, so that lower-probability words are sometimes picked and responses feel varied and creative.)
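
The difference between always taking the top word and drawing a word at random can be sketched in a few lines of Python. This is only an illustration using the example numbers above, not how any particular model is implemented; the `"<other>"` entry is a hypothetical bucket standing in for the rest of the vocabulary so that the probabilities sum to 1.

```python
import random

# Probabilities from the example above.
next_word_probs = {
    "east": 0.85,
    "morning": 0.08,
    "west": 0.04,
    "sky": 0.02,
    "banana": 0.000001,
}
# Hypothetical bucket for every remaining vocabulary word.
next_word_probs["<other>"] = 1.0 - sum(next_word_probs.values())

# Greedy choice: always take the single most probable word.
greedy_word = max(next_word_probs, key=next_word_probs.get)

# Sampling: draw words at random in proportion to their probabilities.
rng = random.Random(0)  # fixed seed so the run is repeatable
words = list(next_word_probs)
weights = list(next_word_probs.values())
sampled_words = rng.choices(words, weights=weights, k=10)

print(greedy_word)    # east
print(sampled_words)  # a list of 10 randomly drawn words
```

The greedy choice gives the same answer every time, while the sampled list can include less likely words now and then.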

Training goal: Adjust the model’s weights so that the true next word gets as high a probability as possible. This is measured using cross-entropy loss, which is covered in its own tutorial.
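
To give a feel for what the loss measures: for a single prediction, cross-entropy reduces to the negative logarithm of the probability the model gave to the true word. The Python sketch below assumes that single-word form; the function name `cross_entropy` and the two example probabilities are chosen for illustration.

```python
import math

def cross_entropy(prob_of_true_word):
    """Loss for one prediction: negative log of the probability given to the true word."""
    return -math.log(prob_of_true_word)

confident_loss = cross_entropy(0.85)  # the true word received probability 0.85
unsure_loss = cross_entropy(0.04)     # the true word received only 0.04

print(round(confident_loss, 3), round(unsure_loss, 3))  # 0.163 3.219
```

A high probability on the true word gives a loss near 0, and a low probability gives a large loss. This is why pushing the loss down pushes the true word's probability up.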

8. Where does this appear in AI?

Paper 02 — The Perceptron: The paper’s full title includes “probabilistic model.” Rosenblatt framed the Perceptron’s weights as encoding the probability that a given input connection is relevant. The output — 0 or 1 — is a probabilistic decision.

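Papers 10, 11, 12 — GPT-1, BERT, GPT-3: Every output of these models is a probability distribution over the vocabulary. For GPT-1 and GPT-3 the question is “What is the next word?”; for BERT it is “Which word belongs in this masked position?”. In both cases the answer is a probability for every word in the vocabulary.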

Paper 15 — RLHF: Human feedback is used to adjust the probability that the model produces helpful responses. The entire alignment pipeline is probabilistic.

9. Common mistakes

Probability greater than 1. Probabilities are always between 0 and 1, inclusive. If your calculation gives P = 1.5 or a negative value, you made an error. Percentages (like 85%) must be converted: 85% = 0.85.
Forgetting that probabilities must sum to 1. If you assign probabilities to all outcomes and they do not sum to 1, something is wrong. Language models guarantee this by construction: their final layer applies the softmax function (covered in its own tutorial), whose outputs always sum to 1.
Confusing probability with frequency. P(heads) = 0.5 does not mean exactly 5 out of every 10 tosses will be heads. It means that over many tosses, heads appears about half the time. In a small sample, you might get 7 heads in 10 tosses, and that is normal.
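
The gap between probability and observed frequency is easy to see in a simulation. The Python sketch below tosses a simulated fair coin; the function name `heads_fraction` and the seed are arbitrary choices for illustration, and the exact fractions printed depend on the random draws.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def heads_fraction(num_tosses):
    """Toss a simulated fair coin num_tosses times and return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses

# Small samples can stray far from 0.5; large samples settle close to it.
for n in (10, 100, 10_000):
    print(n, heads_fraction(n))
```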

10. Try it yourself

Exercise 1: A bag has 3 red marbles and 7 blue marbles. You pick one at random. What is P(red)? What is P(blue)? What is P(not red)?

Answer:

Total marbles = 10.

P(red) = 3/10 = 0.3

P(blue) = 7/10 = 0.7

P(not red) = 1 - P(red) = 1 - 0.3 = 0.7 (same as P(blue), which makes sense)

Check: P(red) + P(blue) = 0.3 + 0.7 = 1.0 ✓

Exercise 2: A language model is predicting the next word after “Sachin Tendulkar is a great”. It assigns these probabilities:

Word	Probability
“cricketer”	0.70
“player”	0.20
“batsman”	0.08
“person”	0.02

Do these sum to 1? Which word will the model pick with greedy sampling? What is P(not “cricketer”)?

Answer:

Sum: 0.70 + 0.20 + 0.08 + 0.02 = 1.0 ✓

Greedy sampling picks the highest probability word: “cricketer” (0.70)

P(not “cricketer”) = 1 - 0.70 = 0.30

Exercise 3: In a class of 40 students, 12 scored above 90 in the last maths exam. What is the probability that a randomly chosen student scored above 90? If you pick two students randomly (one at a time, putting the first back), what is the probability that both scored above 90?

Answer:

P(above 90) = 12/40 = 0.3

P(both above 90) = P(first above 90) × P(second above 90) = 0.3 × 0.3 = 0.09

(This uses the multiplication rule for independent events — when one outcome does not affect the next.)
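
The multiplication rule can be cross-checked by brute force. The Python sketch below, written for illustration only, lists every ordered pair of picks (with the first student put back) and counts the pairs in which both scored above 90.

```python
from fractions import Fraction
from itertools import product

# 40 students, 12 of whom scored above 90 (True marks "above 90").
students = [True] * 12 + [False] * 28

p_above = Fraction(sum(students), len(students))

# Multiplication rule for two independent picks.
p_both_rule = p_above * p_above

# Cross-check: enumerate all 40 x 40 ordered pairs of picks.
pairs = list(product(students, repeat=2))
p_both_count = Fraction(sum(a and b for a, b in pairs), len(pairs))

print(p_above, p_both_rule, p_both_count)  # 3/10 9/100 9/100
```

Both routes give 9/100 = 0.09, matching the answer above. If the first student were not put back, the two picks would no longer be independent and this rule would not apply directly.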

Next tutorial: Conditional Probability → Also needed for Paper 02: Vectors → · Dot Product → Used in: Paper 02 — The Perceptron →
