Rotary Position Embeddings (RoPE)
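A method of encoding token position information by rotating query and key vectors through position-dependent angles, so that the attention score between two tokens depends on their relative offset. Mistral uses RoPE instead of absolute position embeddings. RoPE adds no learned position parameters and tends to generalise better to sequences longer than those seen in training, although Mistral's sliding window attention (SWA) means extrapolation is still limited.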
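The rotation can be illustrated with a minimal pure-Python sketch. It is not Mistral's implementation: the function names `rope_rotate` and `dot`, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration, and production code typically works on batched tensors and may pair dimensions differently.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Hypothetical helper: the interleaved pairing and `base` value are assumptions.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even vector dimension")
    out = []
    for i in range(0, dim, 2):
        # The pair starting at index i rotates at frequency base^(-i/dim).
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation of the pair (x, y) by angle theta.
        out.append(x * cos_t - y * sin_t)
        out.append(x * sin_t + y * cos_t)
    return out

def dot(a, b):
    """Plain dot product, standing in for an unscaled attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

In this sketch `score_near` and `score_far` agree up to floating-point error, because rotating both vectors leaves the dot product dependent only on the difference between the two positions. The rotation also preserves each vector's length, so position information is carried entirely by the angles.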