Grouped Query Attention (GQA)
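A variant of Multi-Head Attention in which groups of query heads share a single key-value (KV) head. Instead of one KV head per query head (n_heads in total), the model uses a smaller number, n_kv_heads, and each KV head serves n_heads / n_kv_heads query heads. Mistral 7B has 32 query heads but only 8 KV heads, so each KV head serves 4 query heads and the KV cache is 4× smaller than it would be with full Multi-Head Attention. GQA keeps quality close to that of Multi-Head Attention while substantially reducing the inference memory footprint.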
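
The 4× figure can be checked with a little arithmetic. The sketch below is illustrative only: the head counts (32 and 8) come from the entry above, while the layer count, head dimension, sequence length, and 2-byte (fp16) values are assumptions chosen for the example. The helper names `kv_cache_bytes` and `kv_head_for_query_head` are hypothetical, and the contiguous grouping of query heads is one common convention, not the only possible one.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def kv_head_for_query_head(query_head, n_heads, n_kv_heads):
    """Index of the KV head shared by a query head (assumes contiguous groups)."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads
    return query_head // group_size


# From the entry above: 32 query heads, 8 KV heads.
N_HEADS, N_KV_HEADS = 32, 8
# Assumed for illustration: layer count, head size, context length.
N_LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 4096

mha_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)     # one KV head per query head
gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)  # shared KV heads
reduction = mha_bytes / gqa_bytes
head_map = [kv_head_for_query_head(q, N_HEADS, N_KV_HEADS) for q in range(N_HEADS)]

print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB")
print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"Reduction: {reduction:.0f}x")
print("KV head for query heads 0-7:", head_map[:8])
```

The reduction factor depends only on the ratio n_heads / n_kv_heads, so the assumed layer count, head dimension, and sequence length cancel out; they affect only the absolute sizes.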