Matrix Multiplication

1. What is this and why do we care?

If the dot product is the heartbeat of neural networks, matrix multiplication is the whole circulatory system. Almost every layer in every neural network you have heard of (perceptron, LSTM, Transformer, and the models behind Claude, GPT, and Gemini) is, underneath, built around a matrix multiplication, usually followed by a nonlinearity.

Once you understand this one operation, the internals of large language models stop looking intimidating. You are no longer watching a magic box. You are watching a very, very large sequence of matrix multiplications.

2. Prerequisites

You need two things: vectors and the dot product.

If the dot product feels fuzzy, go back. Matrix multiplication is, quite literally, just a lot of dot products arranged in a grid.

3. The intuition — before any symbols

Imagine you run a small coaching institute in Patna with three students — Ravi, Priya, and Anjali. Each student just took three tests: Math, Physics, Chemistry.

Their scores go in a grid:

              Math   Physics   Chem
Ravi     [    85      78        92   ]
Priya    [    70      90        88   ]
Anjali   [    95      85        80   ]

This grid is a matrix: a rectangle of numbers. It has 3 rows (one per student) and 3 columns (one per subject). We say it is a “3 × 3” matrix (always “rows × columns”).

Now the institute wants to compute each student’s weighted final score, using different weights per subject for three different scholarship schemes:

                Scheme A   Scheme B   Scheme C
Math weight  [    0.5        0.3        0.2     ]
Physics      [    0.3        0.4        0.3     ]
Chemistry    [    0.2        0.3        0.5     ]

This is another matrix — 3 rows (one per subject) × 3 columns (one per scheme).

Question: for each (student, scheme) pair, what is the weighted score?

For Ravi in Scheme A: 85×0.5 + 78×0.3 + 92×0.2 = 42.5 + 23.4 + 18.4 = 84.3.

That was a dot product between Ravi’s row and Scheme A’s column.

We need to compute this for every student-scheme combination — 3 students × 3 schemes = 9 dot products. The answer comes out as a new matrix, 3 × 3, where each cell is one of those dot products.

That is matrix multiplication: take a row of the first matrix, dot it with a column of the second matrix, and place the result in the output cell at that row and that column.
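The nine dot products above can be checked in a few lines of NumPy. This is only a small sketch: the variable names `scores`, `weights`, and `final` are made up for illustration, and the numbers are copied from the two grids in this section.

```python
import numpy as np

# Rows: Ravi, Priya, Anjali. Columns: Math, Physics, Chemistry.
scores = np.array([[85, 78, 92],
                   [70, 90, 88],
                   [95, 85, 80]])

# Rows: Math, Physics, Chemistry. Columns: Scheme A, Scheme B, Scheme C.
weights = np.array([[0.5, 0.3, 0.2],
                    [0.3, 0.4, 0.3],
                    [0.2, 0.3, 0.5]])

# One weighted score per (student, scheme) pair: a 3x3 matrix.
final = scores @ weights

# Ravi in Scheme A as a single dot product: Ravi's row with Scheme A's column.
ravi_a = scores[0] @ weights[:, 0]
print(round(float(ravi_a), 1))    # 84.3
print(final.round(1))             # all nine weighted scores
```

The top-left cell of `final` is the same 84.3 we computed by hand for Ravi in Scheme A.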

4. The formal definition

If A is an (m × k) matrix and B is a (k × n) matrix, then the product C = A × B is an (m × n) matrix where:

C[i][j] = dot product of (row i of A) and (column j of B)
        = A[i][0]·B[0][j] + A[i][1]·B[1][j] + … + A[i][k-1]·B[k-1][j]

Three rules from this definition that you must burn into memory:

Inner dimensions must match. A is m×k, B is k×n — the two ks must agree. If they don’t, the multiplication is undefined.
The output shape is the outer dimensions. (m × k) × (k × n) → (m × n).
Matrix multiplication is NOT commutative. A × B ≠ B × A in general. In fact, often one of the two is not even defined (shapes don’t match).
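To see the definition and the three rules in action, here is a deliberately slow, plain-Python sketch. The function name `matmul` and the list-of-rows representation are choices made for this illustration, not how any real library implements it.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows, straight from the definition."""
    m, k = len(A), len(A[0])       # A is m x k
    k2, n = len(B), len(B[0])      # B is k2 x n
    if k != k2:                    # rule 1: inner dimensions must match
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    C = [[0] * n for _ in range(m)]            # rule 2: output is m x n
    for i in range(m):
        for j in range(n):
            for t in range(k):                 # dot product of row i and column j
                C[i][j] += A[i][t] * B[t][j]
    return C

row = [[1, 2, 3]]          # 1 x 3
col = [[4], [5], [6]]      # 3 x 1

print(matmul(row, col))    # (1x3)(3x1) -> [[32]]
print(matmul(col, row))    # (3x1)(1x3) -> [[4, 8, 12], [5, 10, 15], [6, 12, 18]]
```

The last two lines illustrate rule 3: swapping the order changes not just the numbers but even the shape of the result.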

5. A worked numeric example

Let’s do one from scratch, slowly.

A = [  1   2  ]      B = [  5   6  ]
    [  3   4  ]          [  7   8  ]

Both are 2×2. Inner dimension is 2 = 2 ✓. Output shape is 2×2.

C[0][0] = row 0 of A · column 0 of B = (1, 2) · (5, 7) = 1·5 + 2·7 = 5 + 14 = 19.

C[0][1] = row 0 of A · column 1 of B = (1, 2) · (6, 8) = 1·6 + 2·8 = 6 + 16 = 22.

C[1][0] = row 1 of A · column 0 of B = (3, 4) · (5, 7) = 3·5 + 4·7 = 15 + 28 = 43.

C[1][1] = row 1 of A · column 1 of B = (3, 4) · (6, 8) = 3·6 + 4·8 = 18 + 32 = 50.

C = A × B = [  19   22  ]
            [  43   50  ]

Now try the other order:

(B × A)[0][0] = (5, 6) · (1, 3) = 5 + 18 = 23.

Already we see 23 ≠ 19, so A × B ≠ B × A. This is worth feeling in your bones: order matters in matrix multiplication.

6. Why this is everywhere in neural networks

A single neural-network layer looks like this:

y = W · x + b

x is the input vector (say, a 784-dimensional image pixel vector).
W is a weight matrix (say, 128 × 784 — 128 output neurons, 784 inputs).
b is a bias vector (128 numbers).
y is the output vector (128 numbers).

W · x is a matrix-vector multiplication: a 128 × 784 matrix times a 784 × 1 vector, giving a 128 × 1 vector. Each of the 128 output numbers is a dot product of one row of W with x.

In batched form, inputs come as a matrix too: X is (batch_size × 784), one input per row. Then W · Xᵀ is (128 × batch_size) and holds the results for all inputs at once, one column per input. This batching, which runs thousands of dot products in parallel, is exactly why GPUs are so good at neural nets: matrix multiplication is the operation they are optimised for.
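The shapes in this section are easy to check in NumPy. The sketch below fills W, b, and the inputs with random numbers, since only the shapes matter here; the batch size of 32 and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, values do not matter

W = rng.normal(size=(128, 784))       # weight matrix: 128 neurons, 784 inputs
b = rng.normal(size=128)              # one bias per neuron
x = rng.normal(size=784)              # a single input vector

y = W @ x + b                         # (128,784) @ (784,) + (128,) -> (128,)
print(y.shape)                        # (128,)

X = rng.normal(size=(32, 784))        # a batch of 32 inputs, one per row
Y = W @ X.T + b[:, None]              # (128,784) @ (784,32) -> (128,32)
print(Y.shape)                        # (128, 32)

# Column 0 of the batched output equals the single-input result for row 0 of X.
print(np.allclose(Y[:, 0], W @ X[0] + b))   # True
```

Many libraries keep one input per row in the output as well and compute `X @ W.T + b` instead, which gives the same numbers in a (32, 128) layout.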

The transformer layer you will read about in Paper 08 is literally a stack of matrix multiplications with small twists. When people estimate that training GPT-4 took on the order of 10²⁵ floating-point operations, almost all of those operations are the multiplications and additions inside matrix multiplications.

7. An Indian-life analogy — the ration shop

Here is another way to feel matrix multiplication.

You run a ration shop. You serve four customers — let’s call them rows — and you sell three items — wheat, rice, dal. Matrix A is “quantity each customer buys”:

              wheat   rice   dal
Ramu     [      5      3      2   ]
Sita     [      2      4      1   ]
Kareem   [      6      1      3   ]
Meera    [      4      2      2   ]

Matrix B is “price-per-unit in each of 2 currencies” (say rupees and the local grain-barter unit):

          rupees   barter
wheat  [    40      3.0   ]
rice   [    50      4.0   ]
dal    [    90      6.5   ]

A is 4 × 3. B is 3 × 2. Inner dimension 3 matches. Output is 4 × 2 — a bill for each customer in each currency.

The first row of the output, Ramu’s bill:

In rupees: 5·40 + 3·50 + 2·90 = 200 + 150 + 180 = 530.
In barter: 5·3.0 + 3·4.0 + 2·6.5 = 15 + 12 + 13 = 40.

Each cell of the output is one dot product between one customer’s purchase row and one currency’s price column. The whole bill-sheet is one matrix multiplication.

This is essentially what a neural-network layer does. The rows of a weight matrix are “neurons”, and the entries of each input are “features”. Each output cell is a weighted sum (one dot product), and the whole layer’s output is one matrix multiplication.
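Here is the whole bill-sheet as one matrix multiplication, as a quick sketch. The names `quantities`, `prices`, and `bills` are invented for this example; the numbers are the ones from the two grids above.

```python
import numpy as np

# Rows: Ramu, Sita, Kareem, Meera. Columns: wheat, rice, dal.
quantities = np.array([[5, 3, 2],
                       [2, 4, 1],
                       [6, 1, 3],
                       [4, 2, 2]])

# Rows: wheat, rice, dal. Columns: rupees, barter.
prices = np.array([[40, 3.0],
                   [50, 4.0],
                   [90, 6.5]])

bills = quantities @ prices      # (4x3) @ (3x2) -> (4x2)
print(bills.shape)               # (4, 2)
print(bills.tolist())
# [[530.0, 40.0], [370.0, 28.5], [560.0, 41.5], [440.0, 33.0]]
```

The first row is Ramu's bill, 530 rupees or 40 barter units, matching the hand calculation.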

8. Common shapes to recognise

You’ll see these shapes thousands of times in papers:

Operation	Input shapes	Output shape	Meaning
Matrix × vector	`(m,k) · (k,)`	`(m,)`	One layer of a small network
Matrix × matrix	`(m,k) · (k,n)`	`(m,n)`	Batched layer, or compose two linear transforms
Vector · vector (dot)	`(k,) · (k,)`	scalar	Similarity score
Outer product	`(m,1) · (1,n)`	`(m,n)`	Rank-1 update

Also note: transpose of a matrix swaps rows and columns. If A is m×n, then Aᵀ is n×m. Transposing is often needed to make the inner dimensions agree before multiplication. We cover transpose in its own tutorial.
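A quick way to get comfortable with these shapes is to let NumPy report them. The sizes m = 4, k = 3, n = 2 and the arrays of ones below are arbitrary placeholders chosen for this sketch.

```python
import numpy as np

m, k, n = 4, 3, 2                  # arbitrary sizes for illustration
M = np.ones((m, k))                # an (m, k) matrix
N = np.ones((k, n))                # a (k, n) matrix
v = np.ones(k)                     # a length-k vector
u = np.ones(m)                     # a length-m vector
w = np.ones(n)                     # a length-n vector

print((M @ v).shape)               # (4,)    matrix x vector
print((M @ N).shape)               # (4, 2)  matrix x matrix
print((v @ v).shape)               # ()      dot product, a scalar
print(np.outer(u, w).shape)        # (4, 2)  outer product
print((u[:, None] @ w[None, :]).shape)   # (4, 2)  same thing as column x row
print(M.T.shape)                   # (3, 4)  transpose swaps rows and columns
```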

9. A tiny Python check

Paste this into Google Colab to verify the worked example from Section 5:

import numpy as np                     # NumPy for array math
A = np.array([[1, 2], [3, 4]])          # our 2x2 matrix A
B = np.array([[5, 6], [7, 8]])          # our 2x2 matrix B
print(A @ B)                            # @ is matrix multiplication
# Expected: [[19 22], [43 50]]
print(B @ A)                            # different order, different answer
# Expected: [[23 34], [31 46]]

@ is Python’s matrix-multiplication operator (added in Python 3.5). On NumPy arrays it dispatches to optimised linear-algebra routines (typically a BLAS library). PyTorch and TensorFlow rely on similarly optimised kernels, and all of them compute exactly what the definition in Section 4 describes.

10. Pitfalls students hit

Mixing up rows and columns. The standard convention is “row × column”. A[2][5] means row 2, column 5. Always check which axis you’re indexing.
Forgetting the inner-dimension rule. If someone says “multiply A by B” and the shapes don’t match, you probably need to transpose one of them. Papers often write Wᵀx instead of Wx for this reason.
Expecting commutativity. A × B ≠ B × A. This trips up everyone at first. Trust the definition: rows of the left, columns of the right.
Confusing element-wise product with matrix multiplication. The element-wise (Hadamard) product multiplies corresponding cells and needs matching shapes (NumPy will also broadcast compatible ones). It is written A ⊙ B in maths and A * B in NumPy. Matrix multiplication is something else entirely and uses @ or np.matmul in NumPy.
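Two of these pitfalls are easy to reproduce. The sketch below reuses A and B from Section 5; the matrices `P` and `Q` are placeholders of ones, invented here just to trigger a shape mismatch.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print((A * B).tolist())    # element-wise: [[5, 12], [21, 32]]
print((A @ B).tolist())    # matrix product: [[19, 22], [43, 50]]

P = np.ones((3, 5))
Q = np.ones((2, 5))
try:
    P @ Q                  # inner dimensions 5 and 2 do not match
except ValueError as err:
    print("shape mismatch:", err)

print((P @ Q.T).shape)     # transposing Q makes it 5x2, so the result is (3, 2)
```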

11. Self-check

Try these with pen and paper. Answers below.

Q1. What is the shape of (3 × 5) × (5 × 2)?

Q2. What is the shape of (3 × 5) × (2 × 5)?

Q3. Compute:

[ 2  0 ]   [ 1   3 ]
[ 1  1 ] × [ 2   1 ]  =  ?

Q4. If A is the 4×3 customer-quantity matrix from Section 7, and p is the column vector [40, 50, 90]ᵀ (prices), what is A · p and what does it represent?

Answers

Q1. 3 × 2 (inner dimension 5 matches, output is the outer dimensions).

Q2. Undefined — inner dimensions are 5 and 2, they don’t match. You would need to transpose the second matrix to make it 5×2 first.

Q3.

[ 2·1 + 0·2    2·3 + 0·1 ]   [ 2   6 ]
[ 1·1 + 1·2    1·3 + 1·1 ] = [ 3   4 ]

Q4. A · p is a 4×1 column vector where each entry is one customer’s total rupee bill:

[ 5·40 + 3·50 + 2·90 ]   [ 530 ]
[ 2·40 + 4·50 + 1·90 ] = [ 370 ]
[ 6·40 + 1·50 + 3·90 ]   [ 560 ]
[ 4·40 + 2·50 + 2·90 ]   [ 440 ]

Each entry is a dot product. The whole operation is matrix-vector multiplication.

12. Where this shows up next

You’ll meet matrix multiplication in every remaining paper from 05 onwards. The shapes get bigger, but the operation is exactly the one you just learned:

Paper 05 (Word2Vec): input one-hot vectors are multiplied by an embedding matrix to look up word vectors.
Paper 08 (Transformer): query, key, value are all computed by multiplying the input by three different weight matrices. Attention itself is Q · Kᵀ — a matrix multiplication.
Paper 12 (GPT-3): 175 billion parameters is 175 billion numbers sitting inside matrices, waiting to be multiplied.
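Two of these previews fit in a few lines. This is a toy sketch only: the vocabulary size, dimensions, and the names `E`, `Q`, and `K` are assumptions made for illustration, and the real models in those papers involve much more than what is shown.

```python
import numpy as np

# Word2Vec-style lookup: a one-hot vector times an embedding matrix picks out one row.
vocab_size, dim = 5, 3
E = np.arange(vocab_size * dim).reshape(vocab_size, dim)   # toy embedding matrix
one_hot = np.zeros(vocab_size, dtype=int)
one_hot[2] = 1                                             # "word number 2"

print((one_hot @ E).tolist())     # [6, 7, 8]
print(E[2].tolist())              # [6, 7, 8], the same row

# Attention-style scores: every query row dotted with every key row.
rng = np.random.default_rng(0)    # arbitrary seed, only the shapes matter
seq_len, d = 4, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
scores = Q @ K.T                  # (4,8) @ (8,4) -> (4,4)
print(scores.shape)               # (4, 4)
```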

Every time you see W · x in a paper from here on, you now know exactly what is happening and why the shapes must line up. The rest is scale.
