Parallel Scan
A hardware-friendly algorithm (e.g., Blelloch scan) that computes a recurrence y_t = f(x_{t-1}, u_t) in parallel by decomposing it into a tree of subproblems.
A hardware-friendly algorithm (e.g., Blelloch scan) that computes a recurrence y_t = f(x_{t-1}, u_t) in parallel by decomposing it into a tree of subproblems. Mamba uses this during training to make the recurrence parallelizable on GPUs, achieving O(n log n) training time.