Section 05

The Mathematics

First Learning Machine | The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain (1958)

This is the first paper on Ainiketan where real mathematics appears. Do not be intimidated. Every symbol is explained below, and there is a worked example with actual numbers so you can verify everything with pen and paper.


Mathematical concepts used in this paper


Concept: Vectors
Why needed: The inputs to a Perceptron (pixel values, sensor readings, features) are a list of numbers, which is exactly what a vector is. Thinking of inputs as vectors lets us use compact notation and reason about them geometrically.
Where in paper: Every input to the Perceptron is a vector x = [x₁, x₂, …, xₙ]
Tutorial: Vectors: Introduction


Concept: Dot Product
Why needed: The weighted sum, the core computation of the Perceptron, is exactly the dot product of the weight vector and the input vector.
Where in paper: The forward pass: ŷ = 1 if w · x ≥ θ, otherwise ŷ = 0
Tutorial: Dot Product


Concept: Probability Basics
Why needed: Rosenblatt framed the Perceptron as a probabilistic model. The word “probabilistic” is literally in the paper’s title. He thought of the weights as encoding probabilities of connection strengths in a biological network.
Where in paper: Throughout the theoretical framing of the paper
Tutorial: Probability Basics


The key equation: the Perceptron’s output

The Perceptron’s prediction is:

ŷ = 1   if  (w₁x₁ + w₂x₂ + ... + wₙxₙ) ≥ θ
ŷ = 0   if  (w₁x₁ + w₂x₂ + ... + wₙxₙ) < θ

Where:

  • x₁, x₂, …, xₙ = the input features (numbers describing the example)
  • w₁, w₂, …, wₙ = the weights (how important each feature is; these are learned)
  • θ (theta) = the threshold (the minimum weighted sum needed to output 1)
  • ŷ (y-hat) = the Perceptron’s prediction (0 or 1)

In compact vector notation:

ŷ = 1   if   w · x ≥ θ
ŷ = 0   if   w · x < θ

Where w · x means the dot product of vectors w and x.
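
The prediction rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name predict and the example weights are assumptions made for this sketch.

```python
def predict(weights, inputs, theta):
    """Return 1 if the weighted sum reaches the threshold theta, else 0."""
    # Dot product w · x: multiply matching entries and add the results.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= theta else 0


# Example with made-up weights (chosen for illustration only).
print(predict([0.6, 0.6], [0, 1], theta=0.5))  # 0.6 >= 0.5, prints 1
print(predict([0.6, 0.6], [0, 0], theta=0.5))  # 0 < 0.5, prints 0
```

Note that the comparison is ≥, so a weighted sum exactly equal to θ produces an output of 1.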


The key equation: the learning rule

When the Perceptron makes a mistake, it updates each weight:

wᵢ ← wᵢ + η × (y − ŷ) × xᵢ

Where:

  • wᵢ = the weight being updated (weight for input i)
  • η (eta) = the learning rate (a small positive number, e.g. 0.1)
  • y = the correct answer (0 or 1, provided in the training data)
  • ŷ = the Perceptron’s prediction (0 or 1)
  • xᵢ = the value of input i for this example
  • (y − ŷ) = the error: +1 if we predicted 0 but answer was 1; −1 if we predicted 1 but answer was 0; 0 if correct

Notice what happens in each case:

  • If correct (y = ŷ): error = 0, so wᵢ ← wᵢ + 0 = wᵢ. No change. ✓
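  • If false negative (y = 1, ŷ = 0): error = +1, so wᵢ increases for every input that was active (xᵢ > 0); weights whose input was 0 stay the same. The weighted sum for this example grows, so the Perceptron will be more likely to say 1 next time. ✓
  • If false positive (y = 0, ŷ = 1): error = −1, so wᵢ decreases for every input that was active (xᵢ > 0); weights whose input was 0 stay the same. The weighted sum for this example shrinks, so the Perceptron will be less likely to say 1 next time. ✓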
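
The three cases above can be checked with a small Python sketch of the update rule. The function name update and the sample numbers are assumptions made for illustration; only the formula itself comes from the text.

```python
def update(weights, inputs, y, y_hat, eta):
    """Apply w_i <- w_i + eta * (y - y_hat) * x_i to every weight."""
    error = y - y_hat  # +1 (false negative), -1 (false positive), or 0 (correct)
    return [w + eta * error * x for w, x in zip(weights, inputs)]


# Correct prediction: error is 0, so the weights come back unchanged.
print(update([0.2, 0.2], [1, 1], y=1, y_hat=1, eta=0.1))

# False negative: only the weight whose input is 1 increases.
print(update([0.0, 0.0], [0, 1], y=1, y_hat=0, eta=0.1))  # prints [0.0, 0.1]

# False positive: weights of active inputs decrease.
print(update([0.5, 0.5], [1, 1], y=0, y_hat=1, eta=0.1))
```

The function returns a new list rather than modifying the old one, which makes it easy to compare the weights before and after a step.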

Worked numerical example — full step by step

We train a Perceptron to learn the OR gate: output 1 if either input is 1.

Training data:

x₁   x₂   Correct y
0    0    0
0    1    1
1    0    1
1    1    1

Initial values: w₁ = 0, w₂ = 0, θ = 0.5, η = 0.1


Epoch 1, Example 1: x = [0, 0], y = 0

Weighted sum = (0 × 0) + (0 × 0) = 0
0 < 0.5 → ŷ = 0
Error = y − ŷ = 0 − 0 = 0  → No update

Weights unchanged: w₁ = 0, w₂ = 0


Epoch 1, Example 2: x = [0, 1], y = 1

Weighted sum = (0 × 0) + (0 × 1) = 0
0 < 0.5 → ŷ = 0
Error = 1 − 0 = +1  → UPDATE
w₁ ← 0 + 0.1 × 1 × 0 = 0      (x₁ = 0, so w₁ doesn't change)
w₂ ← 0 + 0.1 × 1 × 1 = 0.1    (x₂ = 1, so w₂ increases)

Weights: w₁ = 0, w₂ = 0.1


Epoch 1, Example 3: x = [1, 0], y = 1

Weighted sum = (0 × 1) + (0.1 × 0) = 0
0 < 0.5 → ŷ = 0
Error = 1 − 0 = +1  → UPDATE
w₁ ← 0 + 0.1 × 1 × 1 = 0.1    (x₁ = 1, so w₁ increases)
w₂ ← 0.1 + 0.1 × 1 × 0 = 0.1  (x₂ = 0, so w₂ unchanged)

Weights: w₁ = 0.1, w₂ = 0.1


Epoch 1, Example 4: x = [1, 1], y = 1

Weighted sum = (0.1 × 1) + (0.1 × 1) = 0.2
0.2 < 0.5 → ŷ = 0
Error = 1 − 0 = +1  → UPDATE
w₁ ← 0.1 + 0.1 × 1 × 1 = 0.2
w₂ ← 0.1 + 0.1 × 1 × 1 = 0.2

Weights: w₁ = 0.2, w₂ = 0.2
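
If you want to check the four steps of Epoch 1 by machine as well as by hand, the following Python sketch replays them. It is an illustration under assumptions: the variable names are invented here, and fractions.Fraction is used so that the arithmetic is exact, as it would be on paper, rather than subject to floating-point rounding.

```python
from fractions import Fraction

# OR-gate training data: ((x1, x2), correct y)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

# Initial values from the worked example, stored as exact fractions.
w = [Fraction(0), Fraction(0)]
theta = Fraction(1, 2)
eta = Fraction(1, 10)

history = []  # weights after each example, as plain floats for printing
for x, y in data:
    s = sum(wi * xi for wi, xi in zip(w, x))             # weighted sum
    y_hat = 1 if s >= theta else 0                       # prediction
    error = y - y_hat                                    # +1, -1, or 0
    w = [wi + eta * error * xi for wi, xi in zip(w, x)]  # learning rule
    history.append([float(wi) for wi in w])
    print(f"x={x} sum={float(s)} y_hat={y_hat} error={error} w={history[-1]}")
```

The last line printed should show w=[0.2, 0.2], matching the weights reached by hand at the end of Epoch 1.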


After Epoch 1, we still have errors, but both weights have grown from 0 to 0.2. Repeating the same four steps, each further epoch adds 0.1 to both weights until (0,1) and (1,0) reach the threshold. With these starting values that happens at the end of Epoch 4, when w₁ = 0.5, w₂ = 0.5, at which point:

  • (0,0) → sum = 0 < 0.5 → output 0 ✓
  • (0,1) → sum = 0.5 ≥ 0.5 → output 1 ✓
  • (1,0) → sum = 0.5 ≥ 0.5 → output 1 ✓
  • (1,1) → sum = 1.0 ≥ 0.5 → output 1 ✓

The OR gate is learned. Try verifying these by hand on paper.
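
To see the whole training run rather than a single epoch, here is a minimal Python sketch that repeats the learning rule until one full pass over the data produces no mistakes. The function name train and the max_epochs cap are assumptions for this sketch, and exact fractions are used so that a sum exactly equal to the threshold is handled as it would be on paper. With the starting values from the example (zero weights, θ = 0.5, η = 0.1), the loop stops updating once both weights equal 0.5, because 0.5 already meets the threshold.

```python
from fractions import Fraction


def train(data, theta, eta, max_epochs=100):
    """Repeat the learning rule until one full pass makes no mistakes."""
    w = [Fraction(0)] * len(data[0][0])  # start with all weights at 0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            y_hat = 1 if s >= theta else 0
            if y_hat != y:
                mistakes += 1
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            return w, epoch  # epoch = first pass with zero errors
    return w, None  # no error-free pass within max_epochs


or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, clean_epoch = train(or_data, theta=Fraction(1, 2), eta=Fraction(1, 10))
print([float(wi) for wi in weights], clean_epoch)  # prints [0.5, 0.5] 5
```

The weights stop changing at the end of Epoch 4; Epoch 5 is the first pass in which every example is classified correctly, which is why the sketch reports 5.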


What the Perceptron Convergence Theorem says

Rosenblatt proved: if the training data is linearly separable, the Perceptron learning rule will always find a set of weights that correctly classifies all training examples, in a finite number of steps.

“Linearly separable” means you can draw a straight line (or, in higher dimensions, a flat plane called a hyperplane) that separates the two classes perfectly.

The AND and OR gates are linearly separable. The XOR gate is not, and that is what broke the Perceptron. We discuss this in the Limitations section.
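
The difference between AND and XOR can be observed with a short Python experiment. This is a sketch under assumptions: the function name train, the starting values (zero weights, θ = 0.5, η = 0.1), and the cap of 1000 epochs are choices made here for illustration. Hitting the cap does not by itself prove that XOR is not linearly separable; it only shows the behaviour the theorem leads us to expect.

```python
from fractions import Fraction


def train(data, theta, eta, max_epochs=1000):
    """Return (weights, first error-free epoch), or (weights, None) if capped."""
    w = [Fraction(0)] * len(data[0][0])
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            y_hat = 1 if s >= theta else 0
            if y_hat != y:
                mistakes += 1
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            return w, epoch
    return w, None


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

theta, eta = Fraction(1, 2), Fraction(1, 10)
_, and_epoch = train(and_data, theta, eta)
_, xor_epoch = train(xor_data, theta, eta)

print("AND: first error-free epoch =", and_epoch)
print("XOR: first error-free epoch =", xor_epoch)  # None: never error-free
```

For XOR the result is None no matter how large the cap is, because no choice of w₁, w₂ classifies all four rows correctly, so no pass over the data can ever be free of mistakes.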

