Head (attention head)

Appears in 1 paper

One of h = 8 parallel attention computations in multi-head attention, each operating in a lower-dimensional subspace (dₖ = d_model / h).

As used in Paper 08: Attention Is All You Need

One of the h = 8 parallel attention computations in multi-head attention, each operating in a lower-dimensional subspace (dₖ = dᵥ = d_model / h = 64 for d_model = 512). Each head has its own learned Q, K, V projection matrices and can learn to attend to different relationships. The outputs of all heads are concatenated and projected by W^O back to d_model dimensions.
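
The definition above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (matmul, attention_head, multi_head_attention) are hypothetical, the weights are random placeholders rather than learned values, the dimensions are toy-sized (d_model = 16 instead of 512), and masking, dropout, and batching are left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: softmax(Q K^T / sqrt(d_k)) V, with its own projections."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """Run every head, concatenate outputs per position, project by W^O."""
    outputs = [attention_head(x, *params)[0] for params in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    """Placeholder weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumption). The paper uses d_model = 512, h = 8, d_k = 64.
d_model, h, seq_len = 16, 8, 4
d_k = d_model // h  # each head works in a d_model / h dimensional subspace

rng = random.Random(0)
x = random_matrix(seq_len, d_model, rng)
# One (W_Q, W_K, W_V) triple per head, each of shape d_model x d_k.
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_k, d_model, rng)

y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 4 16
```

Because each head projects into only d_model / h dimensions, the concatenated result has h × dₖ = d_model columns, so running all h heads costs roughly the same as one full-width attention computation.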