Head (attention head)
One of h = 8 parallel attention computations in multi-head attention, each operating in a lower-dimensional subspace (dₖ = d_model / h).
One of h = 8 parallel attention computations in multi-head attention, each operating in a lower-dimensional subspace (dₖ = d_model / h). Each head has its own Q, K, V projection matrices and learns to attend to different relationships. Outputs are concatenated and projected by W^O.