Ring Attention with Blockwise Transformers for Near-Infinite Context

The fundamental problem with Transformer attention is that it requires all key-value pairs to fit on a single GPU. For a sequence of 1 million tokens with standard attention, the KV cache alone would demand hundreds of gigabytes of memory — far beyond any single GPU.

Ring Attention solves this by distributing sequences across multiple GPUs arranged in a ring topology. Each GPU holds a chunk of the sequence; as computation proceeds, KV chunks circulate around the ring device-by-device. Computation and communication overlap, hiding latency. The result: context length scales with the number of GPUs.

Ring Attention enabled models like Gemini 1.5 Pro to handle 1-million-token contexts. It represents a fundamental shift: instead of asking “how long a sequence can fit on one GPU?”, ask “how many GPUs can we use, and scale context accordingly?”

What this paper did

The core contribution: A distributed attention algorithm where:

P GPUs are arranged in a ring topology
Each GPU holds a chunk of the KV cache (n/P tokens)
Each GPU computes attention for its query block against all KV blocks by circulating KV chunks around the ring
Computation overlaps with communication, hiding communication latency
Result: context length grows with P, not bounded by single-GPU memory
Sequence length can be 1 million or more tokens (tested up to 4 million)

Key equations (informal):

Memory per GPU: O((n/P) × d) — scales with sequence / num_GPUs
Computation: Total O(n²d) distributed across P devices
Communication: O(n × d × P) — each KV chunk traverses ring P times
Per-GPU work per iteration: O((n/P)² × d) locally + O(n × d) communication

Effective context = n, where n can be arbitrarily large as P increases

The Indian analogy

Imagine a very long cricket match scorecard (1 million records) that’s too long for one analyst to manage. You have 8 analysts sitting in a circle.

Without Ring Attention: All 8 analysts need a copy of the entire scorecard. Memory explosive.

With Ring Attention: Each analyst holds 1/8 of the scorecard. They pass their section to the next analyst clockwise. While analyst 1 is analyzing their section against the section analyst 8 just handed them, analyst 2 is simultaneously receiving analyst 1’s section. By the time the scorecard has circulated the full ring, every analyst has computed their piece of the answer — and no one was sitting idle waiting for data.

The key: synchronised passing and computing means communication latency is hidden by overlap.

Comparison: Context Handling Approaches

Approach	Context Size	Memory Per GPU	Latency	How It Works
Standard Attention	Limited by GPU VRAM	O(n × d)	O(n²)	All tokens on one GPU
Mistral SWA	W tokens (4–8K)	O(W × d)	O(n × W)	Local window only
Ring Attention	n (unbounded)	O((n/P) × d)	O(n² / P)	Distributed across P GPUs
Sparse Attention	n (unbounded)	O(n × d)	O(n log n)	Limited patterns
KV Compression	n (unbounded)	O(c × d), c << n	O(n × c)	Lossy compression

Ring Attention’s advantage: clean memory scaling (linear with P) + full attention expressiveness (not sparse, not windowed, not lossy).

Read in this order

Section	What you will learn	Difficulty	Time
01 Context	Why long sequences are hard; single-GPU limits	🟡 Intermediate	8 min
02 The Problem	Distributed attention is challenging; communication bottleneck	🔴 Advanced	10 min
03 The Idea	Ring topology; blockwise attention; compute-comm overlap	🔴 Advanced	12 min
04 The Math	Formal algorithm, complexity analysis, numerical examples	🔴 Advanced	14 min
05 Worked Example	Step-by-step ring circulation on 6-token sequence	🟡 Intermediate	10 min
06 The Code	Python simulation of ring attention; Colab-runnable	🟢 Beginner	6 min
07 Limitations	Ring topology constraints, heterogeneity, causal masking	🔴 Advanced	8 min
08 Impact	Gemini 1.5, context parallelism, million-token models	🟢 Beginner	8 min
09 Summary	One-sentence recap; what to read next	🟢 Beginner	2 min

Before you read: Math tutorials you need

Scaled Dot-Product Attention — core attention mechanism
Attention Complexity and KV Cache — why KV cache is the bottleneck
Blockwise Computation — how to compute attention in chunks
Online Softmax and Numerically Stable Attention — key to numerically correct blockwise attention
Distributed Computing Basics — GPUs, communication patterns, synchronisation

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│         Ring Attention: 4 GPUs Processing 1M Tokens          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Original Sequence: [t₁...t₂₅₀K, t₂₅₀K+1...t₅₀₀K, ...]    │
│                      (1,000,000 tokens)                      │
│                            ↓                                 │
│  Split across 4 GPUs (250K tokens each)                      │
│                            ↓                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  GPU Ring Topology:                                 │    │
│  │                                                     │    │
│  │      GPU 0 (tokens 0–249K)                          │    │
│  │         ↙              ↖                            │    │
│  │    GPU 1              GPU 3                         │    │
│  │    (tokens           (tokens                        │    │
│  │     250K–500K)       750K–1M)                       │    │
│  │         ↘              ↙                            │    │
│  │        GPU 2 (tokens 500K–750K)                     │    │
│  │                                                     │    │
│  │  During computation:                                │    │
│  │  Round 1: Each GPU computes Q @ KV[own]            │    │
│  │           KV chunks pass clockwise                  │    │
│  │  Round 2: GPU i receives KV from GPU i-1           │    │
│  │           Computes Q @ KV[i-1]                     │    │
│  │  Round 3, 4: Continue until each GPU has seen all  │    │
│  │           KV chunks                                │    │
│  └─────────────────────────────────────────────────────┘    │
│                            ↓                                 │
│  Output: Full attention computed; every Q attended to       │
│  every KV across entire 1M token sequence                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Key Innovations

1. Blockwise Attention: Compute attention in blocks (query block × KV block) using online softmax. Numerically equivalent to full attention but allows sequential processing.

2. Ring Topology: Arrange P GPUs in a ring. Each device passes its KV block to the next. No centralised bottleneck.

3. Compute-Communication Overlap: GPU i computes attention while simultaneously receiving the next KV block. Latency is hidden.

4. Memory Scaling: Instead of O(n) memory on a single GPU, O(n/P) per GPU. As P grows, you can handle arbitrarily long sequences.

← Paper 18: Mistral 7B | Paper 20: Gemini →

Ring Attention with Blockwise Transformers for Near-Infinite Context

Ring Attention with Blockwise Transformers for Near-Infinite Context

What this paper did

The Indian analogy

Comparison: Context Handling Approaches

Read in this order

Before you read: Math tutorials you need

Architecture Overview

Key Innovations

Navigation

Discussion