Blockwise Attention
Computing attention in blocks (query chunk × KV chunk) rather than over the whole sequence at once. With online softmax, the result is mathematically equivalent to full attention, differing only by floating-point rounding. Essential for Ring Attention, where each GPU processes its Q chunk against the KV chunks one at a time as they are passed around the ring.
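
The equivalence with full attention can be checked with a small NumPy sketch. It is illustrative only and makes several assumptions not stated in the entry: a single head, no batching, no causal mask, and scaling by 1/sqrt(d). The function names (`full_attention`, `blockwise_attention`) and the block-size parameters (`q_block`, `kv_block`) are hypothetical. For each query chunk it keeps a running row maximum, a running softmax denominator, and an unnormalized output, rescaling all three whenever a new KV chunk raises the maximum.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, computed all at once."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize exp
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def blockwise_attention(q, k, v, q_block=4, kv_block=4):
    """Same result, visiting one (query chunk, KV chunk) pair at a time."""
    n_q, d = q.shape
    d_v = v.shape[-1]
    out = np.empty((n_q, d_v), dtype=np.float64)

    for qs in range(0, n_q, q_block):
        q_blk = q[qs:qs + q_block]
        rows = q_blk.shape[0]

        # Online-softmax state for this query chunk.
        m = np.full((rows, 1), -np.inf)  # running row maximum
        l = np.zeros((rows, 1))          # running softmax denominator
        acc = np.zeros((rows, d_v))      # running unnormalized output

        for ks in range(0, k.shape[0], kv_block):
            k_blk = k[ks:ks + kv_block]
            v_blk = v[ks:ks + kv_block]

            s = q_blk @ k_blk.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))

            # Rescale earlier partial sums to the new maximum.
            # On the first KV chunk m is -inf, so scale is exactly 0.
            scale = np.exp(m - m_new)
            p = np.exp(s - m_new)

            l = l * scale + p.sum(axis=-1, keepdims=True)
            acc = acc * scale + p @ v_blk
            m = m_new

        out[qs:qs + rows] = acc / l

    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((10, 8))
    k = rng.standard_normal((13, 8))
    v = rng.standard_normal((13, 8))
    # Block sizes need not divide the sequence lengths evenly.
    assert np.allclose(blockwise_attention(q, k, v), full_attention(q, k, v))
```

In a Ring Attention setting, the outer loop would roughly correspond to the devices (one Q chunk each) and the inner loop to the KV chunks arriving in turn. This sketch runs both loops on a single host and does not model the communication.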