Context: The Transformer Speed Problem
The Quadratic Complexity Bottleneck
By 2023, Transformers dominated AI. But there was a nagging problem: attention is O(n²).
For a sequence of length n tokens:
- Computing attention scores: n × n matrix multiplications
- For n = 2,000 (typical document): 4 million operations
- For n = 100,000 (long document): 10 billion operations
- For n = 1,000,000 (novel + code): 1 trillion operations
This is why:
- Inference is slow: Generating each new token requires attention over all previous tokens
- Memory explodes: You need to store the full attention matrix (n²) in GPU memory
- Long sequences become impractical: 100K-token context windows cost millions of times more compute than 100-token windows
By contrast, humans process language sequentially: word by word, forgetting irrelevant details, remembering key points. This is roughly O(n), not O(n²).
Prior Attempts to Escape Quadratic Attention
Before Mamba, researchers tried several approaches:
1. Sparse Attention Patterns
(Longformer, BigBird, Performer)
Instead of full attention, attend only to:
- Nearby tokens (sliding window)
- Periodic tokens (every 10th token)
- Randomly sampled tokens
Problem: Still slower than ideal, and the sparsity patterns are fixed (not adaptive to content).
2. Linear Attention Approximations
(Linear Transformers, FAVOR+)
Approximate attention using kernel tricks or low-rank factorization:
Attention(Q, K, V) ≈ φ(Q) @ φ(K)^T @ V (where φ is a feature map)
Problem: Loses expressiveness; doesn’t match Transformer quality on language modeling.
3. State Space Models (S4, H3)
(Gu et al., 2021–2022)
Use the SSM framework: x’(t) = Ax(t) + Bu(t), y(t) = Cx(t)
- ✓ Linear time O(n)
- ✓ Fixed memory per step (doesn’t grow with sequence length)
- ✗ Fixed A, B, C matrices — the model can’t decide what to remember
All tokens are treated equally. A token about “the capital of France is Paris” decays at the same rate as “the color of my socks.” The model can’t prioritize.
The State Space Model Framework (Refresher)
A linear dynamical system in continuous time:
x'(t) = A x(t) + B u(t) (state update)
y(t) = C x(t) + D u(t) (output)
Where:
x(t) ∈ ℝ^N (hidden state, size N)
u(t) ∈ ℝ (input)
A ∈ ℝ^(N×N) (state transition matrix)
B ∈ ℝ^(N×1) (input matrix)
C ∈ ℝ^(1×N) (output matrix)
D ∈ ℝ (feedthrough, usually 0)
The intuition:
- x(t) is the model’s “memory” of past inputs
- A controls how this memory decays (A’s eigenvalues control stability)
- B controls how new input u(t) affects the state
- C controls how the state is converted to output
- The system naturally runs in O(n) time (one step per input token)
Example: A River
x(t) = amount of water in the river at time t
u(t) = rainfall at time t
x'(t) = -0.1 × x(t) + 1.0 × u(t)
(water decreases naturally, rainfall increases it)
y(t) = 1.0 × x(t)
(output is the water level)
A = [-0.1] (controls decay rate)
B = [1.0] (rainfall effect)
C = [1.0] (water level is observable)
If it rains for 10 days, then stops:
- The water level rises (u increases x)
- Then gradually falls (A decays x)
- After a while, the river forgets the rainfall
Why Fixed SSMs Fail on Language
The problem: A, B, C are fixed throughout the sequence.
Imagine the river example again, but the decay rate A = -0.1 is fixed forever:
Scenario 1: "The capital of France is Paris"
(Important fact, should be remembered)
But A = -0.1 (fast decay)
The river forgets quickly!
Scenario 2: "It was a dark and stormy night"
(Atmospheric flavor text, not important)
Same A = -0.1 (fast decay)
The river forgets at the same rate.
The fixed-SSM model can’t distinguish between important and unimportant information.
In Transformers, the attention mechanism allows this distinction. The model can attend strongly to “Paris” and weakly to “stormy.” In SSMs, all tokens contribute equally to the hidden state.
The Key Insight: Selectivity
What if the model could decide, based on the current input, whether to:
- Remember (slow decay, high B)
- Forget (fast decay, low B)
- Update carefully (low B)
- Update aggressively (high B)
This is the core insight: make A, B, C functions of the input.
But wait — if A, B, C vary per token, how do we maintain O(n) efficiency?
That’s where Mamba’s hardware-aware algorithm comes in.
The 2023 Landscape: Why Mamba Now?
By late 2023:
- Transformers ruled but were slow for long sequences
- Gemini 1.5 promises 1M tokens (but still uses Transformers)
- SSMs (S4) existed but were underpowered due to fixed parameters
- Researchers wondered: Can we get the best of both worlds?
Mamba answers: Yes, with input-dependent state transitions and hardware optimization.