Scaled dot-product attention
`Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V`.
The fundamental operation of the Transformer. "Scaled" refers to the √dₖ divisor, where dₖ is the dimension of the query and key vectors. "Dot-product" refers to Q·Kᵀ, the matrix of all pairwise dot products between queries and keys. This is faster in practice than Bahdanau's additive attention because it maps directly to efficient GPU matrix operations.
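As a rough illustration of the formula, here is a minimal NumPy sketch for a single head with no masking or batching. The function names `softmax` and `scaled_dot_product_attention`, the shapes, and the random inputs are assumptions chosen for the example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) pairwise dot products, scaled
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # output: (n_q, d_v)

# Hypothetical sizes: 3 queries, 5 keys/values, d_k = 8, d_v = 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 4))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Everything reduces to two matrix multiplications and a row-wise softmax, which is why the operation maps so well onto GPU hardware.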
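The core attention mechanism: Scores = Q @ K^T / √d_head, then softmax(Scores) @ V, where d_head is the per-head dₖ. Each row of softmax(Scores) is a set of weights over the key tokens, indicating how relevant each one is to the current (query) token; the output is the correspondingly weighted sum of value vectors. Scaling by 1/√d_head stabilises training: without it, the dot products grow in magnitude with d_head and push the softmax into saturated regions where gradients are very small. Standard across Transformer variants, including Mistral.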
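The effect of the scaling can be checked numerically. The sketch below assumes that query and key components are independent with zero mean and unit variance, which is an idealisation rather than a property of any trained model. The value `d_head = 64` and the sample count are arbitrary choices for the example. Under that assumption the raw dot product has variance of about d_head, and dividing by √d_head brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head = 64         # assumed head dimension, for illustration only
n_samples = 10_000  # number of random (query, key) pairs

# Idealised queries and keys: i.i.d. standard normal components.
q = rng.standard_normal((n_samples, d_head))
k = rng.standard_normal((n_samples, d_head))

raw = np.sum(q * k, axis=-1)    # unscaled dot products q·k
scaled = raw / np.sqrt(d_head)  # scaled as in Q @ K^T / sqrt(d_head)

raw_var = raw.var()
scaled_var = scaled.var()
print(f"unscaled variance: {raw_var:.2f} (expected near {d_head})")
print(f"scaled variance:   {scaled_var:.2f} (expected near 1)")
```

Keeping the scores at roughly unit variance regardless of d_head keeps the softmax away from near one-hot outputs at initialisation.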