Matrix Transpose

1. What is this and why do we care?

One of the most important equations in all of modern AI is:

Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V

That small superscript ᵀ on the K is the transpose. Without understanding what it means, the Transformer paper (Paper 08) is unreadable.

The matrix transpose is one of the simplest operations in linear algebra — it takes about two minutes to understand — but it appears everywhere in AI: in attention mechanisms, in computing gradients during backpropagation, in the geometry of neural networks.

2. Prerequisites

You need to know what matrices are and how to multiply them. Read Matrix Multiplication first if you have not.

3. The intuition — before any symbols

Imagine a seating chart for a class of students. You have a 3-row × 4-column grid:

Original chart (3 rows, 4 columns):

       Col1  Col2  Col3  Col4
Row1 [  A     B     C     D  ]
Row2 [  E     F     G     H  ]
Row3 [  I     J     K     L  ]

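The transpose is what you get when you flip the chart over its top-left-to-bottom-right diagonal, so that rows become columns and columns become rows. (This is a flip, not a 90° rotation, which would also reverse the order of the entries.)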

Transposed chart (4 rows, 3 columns):

       Col1  Col2  Col3
Row1 [  A     E     I  ]
Row2 [  B     F     J  ]
Row3 [  C     G     K  ]
Row4 [  D     H     L  ]

What was in row 1 is now in column 1. What was in column 2 is now in row 2. The element that was at position (row 2, col 3) — which was G — is now at position (row 3, col 2). The grid is flipped along its main diagonal (top-left to bottom-right).

That is the entire idea.

4. A tiny worked example with real numbers

Let matrix A be:

A = [ 1  2  3 ]
    [ 4  5  6 ]

A has 2 rows and 3 columns. We write its shape as (2 × 3).

The transpose Aᵀ is formed by turning every row into a column:

Aᵀ = [ 1  4 ]
     [ 2  5 ]
     [ 3  6 ]

Aᵀ has 3 rows and 2 columns — shape (3 × 2). The shape flipped.

The rule: Element at position (i, j) in A moves to position (j, i) in Aᵀ.

Check: A[row 1, col 2] = 2. In Aᵀ it should be at [row 2, col 1] = 2. ✓

Check: A[row 2, col 3] = 6. In Aᵀ it should be at [row 3, col 2] = 6. ✓
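The rule can be sketched in a few lines of Python. This is a minimal illustration using plain nested lists; the function name `transpose` is our own choice, not a library call. Note that Python counts from 0, so the tutorial's (row 1, col 2) is `A[0][1]` in code.

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Element (i, j) of the input lands at (j, i) in the result.
    return [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]


A = [[1, 2, 3],
     [4, 5, 6]]

A_T = transpose(A)
print(A_T)                    # [[1, 4], [2, 5], [3, 6]]
print(len(A_T), len(A_T[0]))  # 3 2
```

The two checks from the text become `A_T[1][0] == A[0][1]` and `A_T[2][1] == A[1][2]`.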

5. The general rule

If A is an (m × n) matrix, then Aᵀ is an (n × m) matrix.

For each element:

(Aᵀ)ᵢⱼ = Aⱼᵢ

The row index and column index are swapped.

Properties of transpose:

Transposing twice gives you back the original: (Aᵀ)ᵀ = A
(A + B)ᵀ = Aᵀ + Bᵀ
(AB)ᵀ = BᵀAᵀ ← note the order reversal — this matters!
If A is square and Aᵀ = A, A is called a symmetric matrix
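These properties are easy to spot-check numerically. The sketch below uses small matrices we made up for illustration (A, B, C and S are not from any paper), and the helpers `transpose`, `matmul` and `add` are hypothetical names written for this example. A check on one set of matrices is not a proof, but it is a quick way to catch a wrong order.

```python
def transpose(m):
    # zip(*m) groups the first entries of every row, then the second, and so on.
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # Entry (i, j) is the dot product of row i of a with column j of b.
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def add(a, b):
    # Element-wise sum of two matrices with the same shape.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]         # 3 x 2
C = [[0, 1, 0],
     [2, 2, 2]]      # 2 x 3, same shape as A
S = [[1, 7],
     [7, 2]]         # square, equal to its own transpose

double_transpose_ok = transpose(transpose(A)) == A
sum_rule_ok = transpose(add(A, C)) == add(transpose(A), transpose(C))
product_rule_ok = transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))
symmetric_ok = transpose(S) == S
print(double_transpose_ok, sum_rule_ok, product_rule_ok, symmetric_ok)  # True True True True

# The wrong order does not even give the right shape here: (3 x 2)(2 x 3) is 3 x 3.
wrong_order = matmul(transpose(A), transpose(B))
print(len(wrong_order), len(wrong_order[0]))  # 3 3
```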

6. Transpose and the dot product

The dot product of two vectors a and b can be written as a matrix multiplication:

a · b = aᵀ b

Here a is treated as a column vector (n × 1 matrix). aᵀ is a row vector (1 × n matrix). Their product is (1 × n) × (n × 1) = (1 × 1) — a single number. That is the dot product.

This notation is used everywhere in ML papers. When you see wᵀx in a paper, it means the dot product of vector w with vector x.
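To see that aᵀ b really is the dot product, here is a small sketch with column vectors stored as (n × 1) nested lists. The numbers and the helper names `transpose` and `matmul` are assumptions chosen for this illustration.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # Entry (i, j) is the dot product of row i of a with column j of b.
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Column vectors are stored as (3 x 1) matrices: one entry per row.
a = [[1], [2], [3]]
b = [[4], [5], [6]]

product = matmul(transpose(a), b)             # (1 x 3) times (3 x 1) gives (1 x 1)
dot = sum(x[0] * y[0] for x, y in zip(a, b))  # the usual element-wise definition

print(product)  # [[32]]
print(dot)      # 32
```

The matrix product comes back as a (1 × 1) matrix wrapping the same number the element-wise sum gives.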

7. Why Q · Kᵀ in the Transformer?

In the Transformer, Q (queries) is a matrix of shape (sequence_length × dₖ). Each row is a query vector for one position.

K (keys) is also (sequence_length × dₖ). Each row is a key vector.

We want to compute a dot product between every query and every key — that gives us the attention scores.

If we compute Q · Kᵀ:

Q  shape: (seq_len × dₖ)
Kᵀ shape: (dₖ × seq_len)     ← K transposed

Q · Kᵀ shape: (seq_len × seq_len)

The result is a (seq_len × seq_len) matrix where entry [i, j] is the dot product of query i with key j, which is exactly the raw attention score between position i and position j. Without the transpose, Q · K would pair a (seq_len × dₖ) matrix with another (seq_len × dₖ) matrix; the inner dimensions (dₖ and seq_len) generally do not match, so the product is not defined.
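The shape argument can be checked mechanically. The helper `matmul_shape` below is a hypothetical function written for this sketch, and the sizes 4 and 8 are arbitrary illustrative values, not taken from the Transformer paper.

```python
def matmul_shape(left, right):
    """Return the shape of left @ right, or raise if the inner sizes differ."""
    (m, n), (p, q) = left, right
    if n != p:
        raise ValueError(f"cannot multiply {left} by {right}: {n} != {p}")
    return (m, q)


seq_len, d_k = 4, 8                    # illustrative sizes only
Q_shape = (seq_len, d_k)
K_shape = (seq_len, d_k)
K_T_shape = (K_shape[1], K_shape[0])   # transposing swaps the two sizes

print(matmul_shape(Q_shape, K_T_shape))  # (4, 4)

try:
    matmul_shape(Q_shape, K_shape)       # Q times K without the transpose
except ValueError as err:
    print("error:", err)
```

Only the version with the transpose lines up the inner dimensions, and its result has one row and one column per sequence position.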

Numerical example with tiny matrices:

Q = [ 1  2 ]     K = [ 0  1 ]
    [ 3  1 ]         [ 2  0 ]
    [ 0  2 ]         [ 1  1 ]

Kᵀ = [ 0  2  1 ]
     [ 1  0  1 ]

Q · Kᵀ = [ (1×0 + 2×1)   (1×2 + 2×0)   (1×1 + 2×1) ]
          [ (3×0 + 1×1)   (3×2 + 1×0)   (3×1 + 1×1) ]
          [ (0×0 + 2×1)   (0×2 + 2×0)   (0×1 + 2×1) ]

       = [ 2   2   3 ]
         [ 1   6   4 ]
         [ 2   0   2 ]

Entry [1,2] = 2: query 1 (row [1,2]) dotted with key 2 (row [2,0]) = 1×2 + 2×0 = 2. Entry [2,2] = 6: query 2 (row [3,1]) dotted with key 2 (row [2,0]) = 3×2 + 1×0 = 6.

These nine numbers are the raw attention scores from which softmax (after scaling by 1/√dₖ) produces the attention weights. The query in row 2 has a strong preference for key 2 (score 6), so the model will attend mostly to position 2 when processing position 2.
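The numerical example can be reproduced with a short script. This is a sketch in plain Python, assuming the scaling by √dₖ from the attention formula at the top of this tutorial; the helper names `transpose`, `matmul` and `softmax` are our own.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # Entry (i, j) is the dot product of row i of a with column j of b.
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def softmax(row):
    # Subtracting the maximum first avoids overflow and leaves the result unchanged.
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


Q = [[1, 2], [3, 1], [0, 2]]
K = [[0, 1], [2, 0], [1, 1]]
d_k = len(Q[0])  # 2 in this toy example

scores = matmul(Q, transpose(K))
print(scores)  # [[2, 2, 3], [1, 6, 4], [2, 0, 2]]

# Scale each score by 1 / sqrt(d_k), then apply softmax to each row separately.
weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and in the second row the weight on key 2 comes out at roughly 0.79, which matches the "attends mostly to position 2" reading of the raw score 6.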

8. Common mistakes

Confusing Aᵀ and A⁻¹. The transpose and the inverse are completely different operations. Aᵀ just flips rows and columns, and it exists for every matrix. A⁻¹ (the inverse) is the matrix equivalent of dividing: it is defined only for square matrices, and only for those that are invertible (nonzero determinant), and it is much harder to compute. The two superscripts look similar but mean completely different things.
Forgetting the shape flip. A (3 × 5) matrix becomes (5 × 3) after transposing. Students sometimes remember the elements moved but forget the shape changed, leading to dimension errors in matrix multiplication.
(AB)ᵀ = BᵀAᵀ, not AᵀBᵀ. The order reverses. This trips up even experienced practitioners.

9. Try it yourself

Exercise 1: Transpose this matrix:

M = [ 5   1   9 ]
    [ 2   8   4 ]

Answer:

Mᵀ = [ 5   2 ]
     [ 1   8 ]
     [ 9   4 ]

Original shape (2 × 3) → transposed shape (3 × 2).

Exercise 2: In the Transformer, Q has shape (5 × 64) and K has shape (5 × 64). What is the shape of Q · Kᵀ, and what does each entry represent?

Answer:

Kᵀ has shape (64 × 5).

Q · Kᵀ has shape (5 × 5).

Each entry [i, j] is the dot product of query i with key j: the raw attention score measuring how relevant position j is to position i. After division by √dₖ = √64 = 8, this 5×5 matrix goes through softmax row by row to give the attention weight matrix.

Exercise 3: Compute Q · Kᵀ for:

Q = [ 2  1 ]     K = [ 1  0 ]
    [ 0  3 ]         [ 1  2 ]

Which position-pair has the highest attention score?

Answer:

Kᵀ = [ 1  1 ]
     [ 0  2 ]

Q · Kᵀ = [ (2×1 + 1×0)   (2×1 + 1×2) ]   =   [ 2   4 ]
          [ (0×1 + 3×0)   (0×1 + 3×2) ]       [ 0   6 ]

Highest score: entry [2, 2] = 6, meaning query 2 attending to key 2 has the strongest raw match. After softmax, position 2 will pay most attention to itself.
