Memory Scaling

Appears in 1 paper


As used in Paper 19: Ring Attention with Blockwise Transformers for Near-Infinite Context

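With P GPUs using Ring Attention, each GPU keeps only its own block of n/P query tokens plus a constant number of n/P-token key/value blocks that circulate around the ring. Per-GPU memory is therefore O((n/P) × d) for sequence length n and hidden size d. It shrinks in inverse proportion to P, and the maximum sequence length that fits grows linearly with the number of GPUs. Contrast: a single GPU must hold activations for the whole sequence, O(n × d), even with blockwise attention (naive attention additionally materializes an O(n²) score matrix). Longer sequences can thus be processed by adding GPUs. This linear scaling of context length with device count is what the paper's "near-infinite context" refers to.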
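To make the O((n/P) × d) claim concrete, here is a minimal single-process sketch. It assumes Python lists stand in for GPUs, a single non-causal attention head, and no real communication: the ring step is just a list rotation. The helper names `full_attention` and `ring_attention` are hypothetical and are not taken from the paper's code. Each simulated device keeps its own query block, the one key/value block currently visiting it, and online-softmax accumulators. The function counts how many scalars a device holds at once and returns the attention output so it can be compared against ordinary full attention.

```python
"""Toy, single-process simulation of Ring Attention's per-device memory.

Assumptions for illustration only: plain Python lists stand in for P GPUs,
attention is single-head and non-causal, and "sending" a K/V block to the
next device is just a list rotation (no real communication).
"""
import math
import random


def full_attention(q, k, v):
    """Reference softmax attention; every query sees all n keys at once."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z for c in range(len(v[0]))])
    return out


def ring_attention(q, k, v, num_devices):
    """Simulate Ring Attention over num_devices blocks.

    Returns (output, peak), where peak is the largest number of scalars any
    simulated device holds at once: its query block, the K/V block currently
    visiting it, and its online-softmax accumulators.
    """
    n, d = len(q), len(q[0])
    assert n % num_devices == 0, "this toy assumes P divides n"
    c = n // num_devices  # block size: n / P tokens per device

    def split(x):
        return [x[i * c:(i + 1) * c] for i in range(num_devices)]

    q_blk = split(q)
    held_kv = list(zip(split(k), split(v)))  # device i starts with K/V block i
    # Per query row: running max m, running normalizer z, unnormalized output acc.
    state = [[(-math.inf, 0.0, [0.0] * d) for _ in range(c)] for _ in range(num_devices)]
    peak = 0

    for _ in range(num_devices):  # P steps: every K/V block visits every device once
        for dev in range(num_devices):
            kb, vb = held_kv[dev]
            # Count what this device holds right now (O((n/P) * d) scalars).
            held = (sum(map(len, q_blk[dev])) + sum(map(len, kb)) + sum(map(len, vb))
                    + sum(len(acc) + 2 for _, _, acc in state[dev]))
            peak = max(peak, held)
            for r, qi in enumerate(q_blk[dev]):
                m, z, acc = state[dev][r]
                for kj, vj in zip(kb, vb):
                    s = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                    m_new = max(m, s)
                    scale = math.exp(m - m_new)  # rescale earlier partial sums
                    w = math.exp(s - m_new)
                    z = z * scale + w
                    acc = [a * scale + w * x for a, x in zip(acc, vj)]
                    m = m_new
                state[dev][r] = (m, z, acc)
        # Ring step: device dev receives the K/V block held by device dev - 1.
        held_kv = [held_kv[(dev - 1) % num_devices] for dev in range(num_devices)]

    # Normalize each row; device order matches the original query order.
    out = [[a / z for a in acc] for dev_state in state for _, z, acc in dev_state]
    return out, peak


if __name__ == "__main__":
    random.seed(0)
    n, d = 16, 4
    q, k, v = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(n)] for _ in range(3))
    ref = full_attention(q, k, v)
    for p in (1, 2, 4, 8):
        out, peak = ring_attention(q, k, v, p)
        err = max(abs(a - b) for ra, rb in zip(out, ref) for a, b in zip(ra, rb))
        print(f"P={p}: peak scalars per device={peak}, max abs error={err:.1e}")
```

In this toy the per-device peak halves each time P doubles, while the difference from full attention stays at floating-point round-off, because softmax can be accumulated online in any block order. A real implementation also needs buffers to send and receive K/V blocks while it computes (the paper overlaps this communication with blockwise computation). That adds only a constant factor and leaves the O((n/P) × d) bound unchanged.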