Context: The Transformer Speed Problem

The Quadratic Complexity Bottleneck

By 2023, Transformers dominated AI. But there was a nagging problem: attention is O(n²).

For a sequence of length n tokens:

Computing attention scores: n × n matrix multiplications
For n = 2,000 (typical document): 4 million operations
For n = 100,000 (long document): 10 billion operations
For n = 1,000,000 (novel + code): 1 trillion operations

This is why:

Inference is slow: Generating each new token requires attention over all previous tokens
Memory explodes: You need to store the full attention matrix (n²) in GPU memory
Long sequences become impractical: 100K-token context windows cost millions of times more compute than 100-token windows

By contrast, humans process language sequentially: word by word, forgetting irrelevant details, remembering key points. This is roughly O(n), not O(n²).

Prior Attempts to Escape Quadratic Attention

Before Mamba, researchers tried several approaches:

1. Sparse Attention Patterns

(Longformer, BigBird, Performer)

Instead of full attention, attend only to:

Nearby tokens (sliding window)
Periodic tokens (every 10th token)
Randomly sampled tokens

Problem: Still slower than ideal, and the sparsity patterns are fixed (not adaptive to content).

2. Linear Attention Approximations

(Linear Transformers, FAVOR+)

Approximate attention using kernel tricks or low-rank factorization:

Attention(Q, K, V) ≈ φ(Q) @ φ(K)^T @ V  (where φ is a feature map)

Problem: Loses expressiveness; doesn’t match Transformer quality on language modeling.

3. State Space Models (S4, H3)

(Gu et al., 2021–2022)

Use the SSM framework: x’(t) = Ax(t) + Bu(t), y(t) = Cx(t)

✓ Linear time O(n)
✓ Fixed memory per step (doesn’t grow with sequence length)
✗ Fixed A, B, C matrices — the model can’t decide what to remember

All tokens are treated equally. A token about “the capital of France is Paris” decays at the same rate as “the color of my socks.” The model can’t prioritize.

The State Space Model Framework (Refresher)

A linear dynamical system in continuous time:

x'(t) = A x(t) + B u(t)    (state update)
y(t) = C x(t) + D u(t)     (output)

Where:
  x(t) ∈ ℝ^N           (hidden state, size N)
  u(t) ∈ ℝ             (input)
  A ∈ ℝ^(N×N)          (state transition matrix)
  B ∈ ℝ^(N×1)          (input matrix)
  C ∈ ℝ^(1×N)          (output matrix)
  D ∈ ℝ                (feedthrough, usually 0)

The intuition:

x(t) is the model’s “memory” of past inputs
A controls how this memory decays (A’s eigenvalues control stability)
B controls how new input u(t) affects the state
C controls how the state is converted to output
The system naturally runs in O(n) time (one step per input token)

Example: A River

x(t) = amount of water in the river at time t
u(t) = rainfall at time t

x'(t) = -0.1 × x(t) + 1.0 × u(t)
  (water decreases naturally, rainfall increases it)

y(t) = 1.0 × x(t)
  (output is the water level)

A = [-0.1] (controls decay rate)
B = [1.0]  (rainfall effect)
C = [1.0]  (water level is observable)

If it rains for 10 days, then stops:

The water level rises (u increases x)
Then gradually falls (A decays x)
After a while, the river forgets the rainfall

Why Fixed SSMs Fail on Language

The problem: A, B, C are fixed throughout the sequence.

Imagine the river example again, but the decay rate A = -0.1 is fixed forever:

Scenario 1: "The capital of France is Paris"
  (Important fact, should be remembered)
  But A = -0.1 (fast decay)
  The river forgets quickly!

Scenario 2: "It was a dark and stormy night"
  (Atmospheric flavor text, not important)
  Same A = -0.1 (fast decay)
  The river forgets at the same rate.

The fixed-SSM model can’t distinguish between important and unimportant information.

In Transformers, the attention mechanism allows this distinction. The model can attend strongly to “Paris” and weakly to “stormy.” In SSMs, all tokens contribute equally to the hidden state.

The Key Insight: Selectivity

What if the model could decide, based on the current input, whether to:

Remember (slow decay, high B)
Forget (fast decay, low B)
Update carefully (low B)
Update aggressively (high B)

This is the core insight: make A, B, C functions of the input.

But wait — if A, B, C vary per token, how do we maintain O(n) efficiency?

That’s where Mamba’s hardware-aware algorithm comes in.

The 2023 Landscape: Why Mamba Now?

By late 2023:

Transformers ruled but were slow for long sequences
Gemini 1.5 promises 1M tokens (but still uses Transformers)
SSMs (S4) existed but were underpowered due to fixed parameters
Researchers wondered: Can we get the best of both worlds?

Mamba answers: Yes, with input-dependent state transitions and hardware optimization.

Next: The Problem: Why Fixed SSMs Can’t Compete