Summary: Mamba in one page

One Sentence

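Mamba trades attention’s exact lookback for linear-time O(n) sequence processing by learning which information to remember and which to forget. The paper reports language-modeling quality on par with similar-size Transformers at the scales it tested (up to about 3B parameters) and roughly 5× higher inference throughput.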

The Problem It Solved

Self-attention in Transformers costs O(n²) in sequence length.

For a sequence of n tokens:

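Memory: quadratic in n when the full attention matrix is materialized (fused kernels such as FlashAttention bring this down to linear)
Compute: quadratic in n
Inference: must keep all KV pairs on GPU, so the cache grows linearly with n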

At 100K tokens or beyond, standard Transformers become expensive to run: memory use balloons and throughput drops.
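
To make the scaling concrete, here is a rough back-of-the-envelope sketch. The layer count, head count, head dimension, and 2-byte (fp16) storage are illustrative assumptions, not figures for any particular model, and the first function assumes naive attention that materializes the full score matrix.

```python
def attention_score_bytes(n_tokens, n_heads=32, bytes_per_value=2):
    """Bytes needed to materialize one layer's full n x n score matrix (naive attention)."""
    return n_heads * n_tokens * n_tokens * bytes_per_value


def kv_cache_bytes(n_tokens, n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    """Bytes for the keys and values cached across all layers during generation."""
    return 2 * n_layers * n_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical sizes: the score matrix grows with n squared, the KV cache with n.
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: scores {attention_score_bytes(n) / 1e9:10.2f} GB, "
          f"KV cache {kv_cache_bytes(n) / 1e9:8.2f} GB")
```

Doubling the sequence length quadruples the first number and doubles the second, which is the gap the rest of this page is about.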

State Space Models (SSMs) are O(n), but prior SSMs (like S4) applied the same parameters at every time step, so every token was treated the same way. They couldn’t adapt to content.

Key Ideas

1. Selective State Space Model (Selective SSM)

Instead of fixed A, B, C matrices, B, C, and Δ (step size) become functions of the input
At each step, the model decides: “Is this token important? Write it into the state. Or is it noise? Skip it and keep what is already stored.”
Decay is set by the discretized transition exp(ΔA). A itself is learned but not input-dependent; the input-dependent step size Δ is what changes the effective decay from token to token
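
The recurrence described above can be sketched in a few lines of NumPy. This is a toy single-channel version under stated assumptions, not the paper's implementation: A is a fixed negative diagonal, the names `W_B`, `W_C`, and `w_delta` are hypothetical projection weights, and the input term uses the simplified discretization Δ·B.

```python
import numpy as np


def softplus(z):
    """Smooth positive function, used here to keep the step size above zero."""
    return np.logaddexp(0.0, z)


def selective_ssm(x, A, W_B, W_C, w_delta):
    """Run a toy single-channel selective SSM over a 1-D sequence.

    x: (n,) inputs. A: (d_state,) negative diagonal of the state matrix.
    W_B, W_C: (d_state,) hypothetical projection weights. w_delta: scalar weight.
    """
    h = np.zeros_like(A, dtype=float)
    y = np.empty(x.shape[0])
    for t in range(x.shape[0]):
        delta = softplus(w_delta * x[t])       # input-dependent step size
        B_t = W_B * x[t]                       # input-dependent input projection
        C_t = W_C * x[t]                       # input-dependent output projection
        A_bar = np.exp(delta * A)              # per-step decay, in (0, 1) when A < 0
        h = A_bar * h + delta * B_t * x[t]     # state update
        y[t] = C_t @ h                         # readout
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=16)
A = -np.arange(1, 5, dtype=float)
print(selective_ssm(x, A, np.ones(4), np.ones(4), 1.0).shape)
```

Because `delta`, `B_t`, and `C_t` are recomputed from `x[t]` at every step, the same weights produce a different effective decay and read/write pattern for each token.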

2. Content-Aware Memory

Mamba’s hidden state is a summary that selectively compresses information
Important tokens: large Δ, so the state is updated toward the new input
Noise: small Δ, so the token is mostly ignored and the existing state persists
Result: better use of a fixed-size state than SSMs with a fixed decay rate
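
A scalar toy example shows how the step size Δ acts like a gate. The assumptions here are a one-dimensional state, a = -1, b = 1, and a zero-order-hold discretization; `gated_update` is a hypothetical helper name.

```python
import math


def gated_update(h_prev, x, delta, a=-1.0):
    """One scalar SSM step with zero-order-hold discretization (a < 0, b = 1)."""
    decay = math.exp(delta * a)     # fraction of the old state that survives
    write = (decay - 1.0) / a       # weight on the new input; equals 1 - decay when a = -1
    return decay * h_prev + write * x


h_prev, x = 1.0, 5.0
print(gated_update(h_prev, x, delta=0.01))  # small step: stays close to the old state
print(gated_update(h_prev, x, delta=5.0))   # large step: moves close to the new input
```

With a = -1 the update is a convex blend of the old state and the new input, so choosing Δ per token amounts to choosing how much of that token to store.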

3. Hardware-Aware Algorithm

Uses a parallel scan (Blelloch scan) during training to parallelize the recurrence
Switches to recurrence mode during inference (O(1) per token)
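Relies on fused custom CUDA kernels that keep the expanded state in fast on-chip memory; a plain PyTorch implementation works but is much slower and uses more memory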
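
The reason a recurrence can be parallelized at all is that each step h → a·h + b composes associatively with the next. The sketch below illustrates this in NumPy with a simple doubling scan (Hillis-Steele style), which is swapped in here for readability: it does O(n log n) work, whereas a work-efficient Blelloch-style scan does O(n). It is not the fused GPU kernel.

```python
import numpy as np


def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first, then `right`."""
    a_l, b_l = left
    a_r, b_r = right
    return a_l * a_r, a_r * b_l + b_r


def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h, out = 0.0, np.empty_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out


def scan_doubling(a, b):
    """Same result in ceil(log2 n) vectorized passes."""
    a, b = a.copy(), b.copy()
    n, shift = len(b), 1
    while shift < n:
        # Element t absorbs the partial result that ends at position t - shift.
        a_new, b_new = a.copy(), b.copy()
        a_new[shift:], b_new[shift:] = combine(
            (a[:-shift], b[:-shift]), (a[shift:], b[shift:])
        )
        a, b = a_new, b_new
        shift *= 2
    return b


rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=37)
b = rng.normal(size=37)
print(np.allclose(scan_doubling(a, b), scan_sequential(a, b)))
```

Each pass of the `while` loop is a batch of independent element-wise operations, which is what lets a GPU process all positions at once during training.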

4. Linear Recurrence at Inference

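Training: O(n) work via a parallel scan (the FFT-based convolution that S4 uses no longer applies once the parameters depend on the input)
Inference: O(1) compute per generated token, with a state whose size does not depend on sequence length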
No KV cache needed; state stays constant size
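
The memory contrast can be sketched with two toy classes. Both are hypothetical stand-ins, not real Mamba or attention layers: one keeps a fixed-size state, the other appends a key and a value vector for every token it sees.

```python
import numpy as np


class RecurrentState:
    """Toy fixed-size state: h <- decay * h + x."""

    def __init__(self, d_state, decay=0.9):
        self.h = np.zeros(d_state)
        self.decay = decay

    def step(self, x):
        self.h = self.decay * self.h + x
        return self.h


class KVCache:
    """Toy attention cache: stores one key and one value vector per token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        return len(self.keys)


def stored_floats_after(n_tokens, d=16):
    """Return (floats held by the recurrent state, floats held by the KV cache)."""
    ssm, cache = RecurrentState(d), KVCache()
    rng = np.random.default_rng(0)
    for _ in range(n_tokens):
        x = rng.normal(size=d)
        ssm.step(x)
        cache.step(x, x)
    return ssm.h.size, 2 * d * len(cache.keys)


print(stored_floats_after(10))
print(stored_floats_after(1000))
```

The first number stays at `d` no matter how many tokens are generated, while the second grows in proportion to the sequence length.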

Key Numbers

Metric	Value
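Inference throughput vs. similar-size Transformer	About 5× higher (as reported in the paper)
Mamba state memory at 1M tokens	O(1), constant in sequence length
Transformer KV-cache memory at 1M tokens	O(n), grows linearly
Language modeling (Chinchilla-style scaling, up to ~2.8B parameters)	Matches or beats same-size Transformers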
HumanEval (code)	57% (Mamba) vs 55% (Transformer)
Largest open pure-Mamba models	Around 7B parameters (as of 2025)
Training time	Similar to Transformer (not faster)

Indian Analogy Recap

Picture a student working through a long river of information. Instead of stopping at every word to compare it with every previous word (O(n²), exhausting), the student learns to say “I’ll remember this concept deeply” (important for exams) or “I’ll skim over this filler” (safe to forget).

Result: comparable understanding, much faster processing.

What Came Next

Mamba-2 (Dao & Gu, 2024)

Theoretical reformulation connecting SSMs to attention (structured state space duality, SSD)
Core SSD layer reported as 2–8× faster than Mamba’s selective scan
Deeper understanding of why selective SSMs work

Jamba (AI21 Labs, 2024)

Hybrid model: interleaves Mamba and attention layers, with mixture-of-experts layers added
Presented by AI21 as the first production-grade Mamba-based LLM
Evidence that hybrids are practical at production scale
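
A hybrid stack of this kind can be described by a simple layer layout. The helper below is a hypothetical sketch: the one-in-eight default mirrors the 1:7 attention-to-Mamba ratio described for Jamba, but the position of the attention layer within each group is an assumption made here for illustration.

```python
def hybrid_layout(n_layers, attention_every=8):
    """Return layer types with one attention layer per `attention_every` layers.

    The attention layer is placed last in each group; that placement is an assumption.
    """
    if attention_every < 1:
        raise ValueError("attention_every must be at least 1")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]


layout = hybrid_layout(16)
print(layout.count("mamba"), layout.count("attention"))
```

Keeping only a few attention layers means only those layers need a KV cache, so most of the memory savings of the Mamba layers carry over to the hybrid.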

Hybrid Architectures

StripedHyena (Together AI): mixes attention with Hyena-style gated convolutions rather than Mamba layers
Upcoming LLaMA variants with Mamba layers
Future: strategic mixing, not pure Mamba or pure Attention

When to Use Mamba

Use Mamba (or Jamba hybrid):

Processing very long sequences (10K+ tokens)
On-device inference (limited memory)
Real-time systems where latency per token matters
Building efficient models with small compute budgets

Use Transformers:

Most NLP tasks (text under 4K tokens)
Retrieval-heavy problems (“needle in haystack”)
Vision-language (images + text)
Fine-tuning on large-scale instruction data
When you need mature ecosystems (Hugging Face, vLLM, etc.)

Limitations Worth Knowing

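Poor in-context recall: the fixed-size state is a lossy summary, so Mamba can’t look up token position 5000 exactly the way attention can
Sequential generation: decoding produces one token at a time, as it does for Transformers, so the recurrence adds no parallelism at inference
Training not faster: at common context lengths, the speed gain shows up mainly at inference on long sequences
Limited ecosystem: fewer libraries, tools, and pre-trained models than Transformers
Less proven at scale: the largest open pure-Mamba models are around 7B parameters, while Transformers have been scaled to hundreds of billions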

The Verdict

Mamba is not a Transformer killer. It’s a complementary approach that showed:

✓ Linear-time sequence modeling works
✓ Selectivity beats inflexibility
✓ Hybrid architectures are the practical future
✗ But it’s not universally better; context-specific trade-offs matter

In 2025, much of the field appears to be converging on hybrid models (Mamba + Attention blocks) for balanced efficiency and capability. Pure Mamba shines in niche (but important) use cases.
