Ring Attention

Appears in 1 paper

A distributed attention algorithm where P GPUs are arranged in a ring topology.

As used in Paper 19: Ring Attention with Blockwise Transformers for Near-Infinite Context

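A distributed attention algorithm where P GPUs are arranged in a ring topology. The sequence of n tokens is split into P blocks, so each GPU holds the queries, keys, and values for n/P tokens. In each round, a GPU computes blockwise attention between its local queries and the KV block it currently holds, then passes that KV block to the next GPU in the ring; the transfer can be overlapped with the computation. Partial results are merged with a running (online) softmax, so after P rounds every GPU has seen every KV block and holds exact full attention for its own queries. Because per-GPU memory scales with n/P rather than n, the maximum context length grows with the number of GPUs, which makes very long contexts (1M+ tokens) feasible.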
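The circulation of KV chunks described above can be illustrated with a small single-process simulation. This is a minimal sketch, not the paper's implementation: the function names (`full_attention`, `ring_attention`) are hypothetical, the "GPUs" are plain list indices, attention is non-causal, and the rounds run sequentially, so no communication is actually overlapped with computation. It assumes n is divisible by P.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V computed in one place."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def ring_attention(Q, K, V, P):
    """Simulate P devices in a ring (assumption: len(Q) is divisible by P)."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    c = n // P  # tokens held per device
    # Each device starts with its own query block and its own KV block.
    q_blocks = [Q[i * c:(i + 1) * c] for i in range(P)]
    kv_blocks = [(K[i * c:(i + 1) * c], V[i * c:(i + 1) * c]) for i in range(P)]
    # Online-softmax state per query: running max, normalizer, weighted sum.
    run_max = [[-math.inf] * c for _ in range(P)]
    run_sum = [[0.0] * c for _ in range(P)]
    acc = [[[0.0] * dv for _ in range(c)] for _ in range(P)]
    for step in range(P):
        for dev in range(P):
            Kb, Vb = kv_blocks[dev]
            for i, q in enumerate(q_blocks[dev]):
                scores = [dot(q, k) / math.sqrt(d) for k in Kb]
                new_max = max(run_max[dev][i], max(scores))
                # Rescale what was accumulated under the old running max.
                scale = math.exp(run_max[dev][i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[dev][i] = run_sum[dev][i] * scale + sum(weights)
                acc[dev][i] = [
                    a * scale + sum(w * v[j] for w, v in zip(weights, Vb))
                    for j, a in enumerate(acc[dev][i])
                ]
                run_max[dev][i] = new_max
        if step < P - 1:
            # Ring step: every device receives the KV block of its predecessor.
            kv_blocks = [kv_blocks[(dev - 1) % P] for dev in range(P)]
    # Normalize and concatenate the per-device outputs in sequence order.
    return [[a / run_sum[dev][i] for a in acc[dev][i]]
            for dev in range(P) for i in range(c)]

random.seed(0)
n, d, P = 8, 4, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ref = full_attention(Q, K, V)
out = ring_attention(Q, K, V, P)
max_err = max(abs(a - b) for r, o in zip(ref, out) for a, b in zip(r, o))
print("max abs difference vs. full attention:", max_err)
```

In this sketch each simulated device only ever touches one KV block of n/P tokens at a time, yet its final output agrees with single-device attention up to floating-point rounding. Only P - 1 transfers are needed, because the first round uses the block each device already holds.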