Context Parallelism
Distributing a long sequence across multiple GPUs along the sequence dimension (distinct from data, tensor, or pipeline parallelism).
Distributing a long sequence across multiple GPUs along the sequence dimension (distinct from data, tensor, or pipeline parallelism). Ring Attention is the primary implementation of context parallelism. Enables training and inference on sequences far exceeding single-GPU memory.