Limitations: Where Mamba Falls Short
Mamba is innovative and faster than Transformers for long sequences, but it has real limitations that prevent it from replacing Transformers entirely.
1. Poor In-Context Recall
Mamba struggles with exact retrieval of facts from long context.
Example: Finding a Needle in a Haystack
Prompt: "Here are 10,000 random words: [cat, blue, shoe, river, ...]
Where is the word 'xylophone'? (it appears at position 5000)
Answer: "
Transformer (full attention):
Attends directly to position 5000 where "xylophone" appears.
Correctly answers: "The word 'xylophone' appears at position 5000."
Mamba:
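By the time the question is asked, thousands of later tokens have passed through the state, and the trace of "xylophone" from position 5000 has decayed significantly.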
The SSM state doesn't "point back" to the original occurrence.
Often fails to retrieve the exact answer.
Why does this happen? Mamba’s hidden state is a fixed-size summary, not a cache of all tokens.
In Transformers, the attention mechanism explicitly stores (via the KV cache) the keys and values of all previous tokens, so its memory grows with context length. Mamba’s state has a fixed size no matter how long the input is (the original Mamba uses a state dimension of 16 per channel by default), so it can’t perfectly retain all information.
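To make the contrast concrete, here is a rough sketch that counts how many numbers each approach has to keep around as the context grows. The layer count, widths, and state size are illustrative assumptions rather than figures from a specific model, the function names are hypothetical, and details such as grouped-query attention or Mamba's small convolution buffer are ignored.

```python
def kv_cache_elements(seq_len: int, n_layers: int, d_model: int) -> int:
    """Elements held by a Transformer KV cache: one key vector and one value
    vector of width d_model per token, per layer."""
    return 2 * seq_len * n_layers * d_model


def ssm_state_elements(n_layers: int, d_inner: int, d_state: int) -> int:
    """Elements held by a recurrent SSM state: d_inner channels, each with a
    d_state-sized state, per layer. Note there is no seq_len argument."""
    return n_layers * d_inner * d_state


# Assumed, illustrative configuration (not taken from a published model).
N_LAYERS, D_MODEL = 24, 1024
D_INNER, D_STATE = 2 * D_MODEL, 16

for seq_len in (1_000, 10_000, 100_000):
    kv = kv_cache_elements(seq_len, N_LAYERS, D_MODEL)
    ssm = ssm_state_elements(N_LAYERS, D_INNER, D_STATE)
    print(f"{seq_len:>7,} tokens: KV cache {kv:>13,} elements | SSM state {ssm:,} elements")
```

The KV cache column grows in proportion to the number of tokens, while the SSM column stays the same on every row. That constant is the "fixed-size summary": cheap to store, but everything the model remembers has to fit inside it.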
Real-World Impact
This is a problem for:
- Legal document review: Finding specific clauses in 100-page contracts
- Medical records: Locating specific symptoms or test results
- Scientific papers: Retrieving exact citations or experimental parameters
Transformers are better suited for these tasks.
2. Sequential Processing (Hard to Parallelize)
Mamba’s recurrent structure is inherently sequential, so running it in parallel takes more work than running attention in parallel.
Training vs. Inference
During training:
- Mamba can use a parallel scan algorithm to compute all timesteps efficiently
- Unlike earlier time-invariant SSMs such as S4, it cannot fall back on an FFT-based convolution (O(n log n)), because Mamba’s parameters depend on the input
- This parallelization is possible but requires custom implementations
During inference (generating one token at a time):
- Must compute x_t = f(x_{t-1}, u_t) sequentially
- Can’t compute multiple future tokens in parallel (Transformers also decode one token at a time, so this part is shared, not unique to Mamba)
- Each new token costs O(1) time with respect to context length, but tokens are still generated one by one
- For generating a 100-token response, Mamba takes 100 sequential steps
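The two modes above compute the same values in different ways. The sketch below assumes a scalar linear recurrence x_t = a_t * x_{t-1} + b_t as a stand-in for the update f (real Mamba uses per-channel state vectors and fused GPU kernels), and all function names are hypothetical. It runs the recurrence step by step, then again as a scan with an associative operator, which is the property that makes the parallel training path possible.

```python
def sequential_scan(a, b, x0=0.0):
    """Inference-style recurrence: x_t = a_t * x_{t-1} + b_t, one step at a time."""
    xs, x = [], x0
    for a_t, b_t in zip(a, b):
        x = a_t * x + b_t
        xs.append(x)
    return xs


def combine(left, right):
    """Compose two affine maps x -> a*x + b, with 'left' applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def parallel_style_scan(a, b, x0=0.0):
    """Inclusive scan using the associative 'combine' operator.

    Every position in one pass of the while-loop is independent of the
    others, so on parallel hardware a pass could run all at once. There are
    roughly log2(n) passes. Here the passes are simulated in plain Python.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # elems[t] is now the affine map that takes x0 directly to x_t.
    return [a_t * x0 + b_t for a_t, b_t in elems]


a = [0.9, 0.5, 0.8, 1.0, 0.7]   # per-step decay factors (made-up values)
b = [1.0, 2.0, 3.0, 4.0, 5.0]   # per-step input contributions (made-up values)
print(sequential_scan(a, b))
print(parallel_style_scan(a, b))
```

Both calls print the same sequence up to floating-point rounding. The scan only helps when all inputs are known in advance, as in training or prompt processing; during generation each input is the previous output, so the step-by-step form is the only option.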
Practical Impact
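For a user asking a question and waiting for a response:
- Transformer: Processes the whole prompt in one parallel pass, then generates the 100 tokens one at a time, streaming each as it is produced
- Mamba: Also generates the 100 tokens one at a time, and needs a scan kernel to process the prompt in parallel (a naive implementation steps through the prompt token by token)
Much of today’s inference serving tooling (batching, cache management) was built around attention, so fitting Mamba’s recurrent state into it can take extra work.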
3. Training Overhead
Although Mamba’s cost scales linearly with sequence length, training it is not necessarily faster than training a Transformer.
Why?
- Different operations: Training uses a parallel scan over the whole sequence, while inference uses the step-by-step recurrence. They’re different code paths, each needing its own optimization.
- Custom kernels required: Real Mamba needs specialized CUDA kernels. Generic PyTorch is slow. Transformers use well-optimized attention (Flash Attention, etc.)
- Memory layout: SSM computations may not use GPU memory as efficiently as optimized attention kernels
Reality
Mamba training time ≈ Transformer training time (for comparable model size and data)
Mamba inference (long sequence) ≈ up to 5x higher throughput than a Transformer (the figure reported by the Mamba authors)
Mamba inference (short sequence) ≈ Often no faster than a Transformer, and sometimes slower (custom kernels have overhead)
Mamba shines for long sequences, not for speeding up training.
4. Reduced Expressiveness
Full attention allows any token to attend to any other token. Mamba’s selectivity is more constrained.
Example: Complex Dependencies
Suppose the task requires:
- Comparing two facts separated by 1000 tokens
- Performing logical inference across multiple passages
- Tracking multiple independent threads
Transformers can (via attention) create a direct connection between any pair of tokens. Mamba must route this through the hidden state, which may lose information.
Known Weaknesses
- Copy tasks: “Repeat the 5th word in the input” — Mamba struggles because it can’t directly point to token 5
- Sorting: Requires comparing elements; attention is more direct
- In-context learning: Models like GPT learn from demonstrations in the prompt. Mamba’s limited state may cap this ability
5. Limited Adoption & Ecosystem
As of late 2024, Mamba is still not widely adopted in production.
Why?
- Transformers are the standard: Years of tooling, library support, best practices
- Custom kernels barrier: Mamba requires CUDA expertise to implement efficiently; most practitioners use Transformers
- Unclear advantage for most tasks: Mamba wins on very long sequences (>10K tokens). For typical tasks (texts under 4K tokens), Transformers are fine
- Hybrid models gaining traction: Jamba (AI21) and others use Mamba + Attention blocks, not pure Mamba
Practical Impact
If you want to build an AI product today:
- Transformers: Dozens of libraries (Hugging Face Transformers, llama.cpp, vLLM, etc.)
- Mamba: Few libraries, less community support
6. Stability and Tuning
Mamba has more hyperparameters and stability concerns than Transformers.
What Can Go Wrong?
- Exploding/vanishing state: If eigenvalues of A are chosen poorly, x_t can explode or vanish
- Δ distribution: The softplus output for Δ must be in a reasonable range. Too small → the state barely changes and the current input is nearly ignored. Too large → the state is overwritten at every step, and numerical issues can follow
- State dimension: A larger state costs more memory and compute; a smaller state has less capacity to retain context
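These failure modes are easy to see in a scalar toy version of the state update. The sketch below assumes the continuous system x' = a*x + u, discretized with the zero-order-hold rule so that x_t = a_bar * x_{t-1} + b_bar * u_t. The function names are hypothetical, and real Mamba applies this per channel with learned, input-dependent Δ.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable softplus, log(1 + exp(z)); always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def discretize(a: float, delta: float):
    """Zero-order-hold discretization of x' = a*x + u (scalar case, a != 0).

    Returns (a_bar, b_bar) for the update x_t = a_bar * x_{t-1} + b_bar * u_t.
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a
    return a_bar, b_bar


def state_after(a: float, delta: float, steps: int, x0: float = 1.0) -> float:
    """State after `steps` updates with zero input, starting from x0."""
    a_bar, _ = discretize(a, delta)
    return x0 * a_bar ** steps


# Sign of a: negative values decay, positive values blow up.
for a in (-1.0, 0.5):
    print(f"a={a:+.1f}: state after 100 steps = {state_after(a, 0.1, 100):.3g}")

# Size of delta (via softplus of a raw parameter z), with a fixed at -1.
for z in (-10.0, 0.0, 10.0):
    delta = softplus(z)
    a_bar, b_bar = discretize(-1.0, delta)
    print(f"z={z:+5.1f} delta={delta:.3g} a_bar={a_bar:.3g} b_bar={b_bar:.3g}")
```

With a negative `a`, the state shrinks toward zero; with a positive `a`, it grows without bound. For Δ, a tiny value gives `a_bar` close to 1 and `b_bar` close to 0, so the state persists and the input hardly registers. A large value gives `a_bar` close to 0, so the previous state is discarded at each step.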
Transformer Stability
Transformers have proven stable across many settings. Mamba requires more careful tuning.
7. Not Better on Many Benchmarks
Despite the hype, Mamba doesn’t universally beat Transformers.
Benchmark Results
Task: Language Modeling (Chinchilla 7B model size)
Mamba: Better perplexity
Transformer: Comparable
Task: Code (HumanEval)
Mamba: 57%
Transformer: 55%
(Mamba slightly better, but both are trained from scratch with same compute)
Task: Math (GSM8K)
Mamba: ~56%
Transformer: ~58%
(Transformer slightly better!)
Task: Multi-choice knowledge (MMLU)
Mamba: 62.5% (Mamba 370M)
Transformer: 70.5% (Llama 370M)
(Transformer wins significantly!)
Task: Instruction following (on long documents)
Mamba: Better (per-token cost stays constant, even at 1M tokens)
Transformer: Worse (OOM or very slow at 1M tokens)
Mamba is not better at everything, just at specific use cases (long context).
8. New and Unproven at Scale
Mamba was published December 2023. As of early 2025:
- Largest pure Mamba models: ~7B parameters (e.g., Falcon Mamba 7B)
- Largest Transformers: hundreds of billions of parameters (PaLM: 540B, Grok-1: 314B; GPT-4’s size is not public)
We don’t yet know if Mamba scales as well as Transformers to frontier scales. The scaling laws may be different.
Open Questions
- Does Mamba match Transformer quality at 100B+ parameters?
- Can Mamba be fine-tuned as easily as Transformers?
- Do Mamba models generalize as well to new domains?
These remain unanswered.
9. Difficult to Extend
Mamba’s architecture is less “hackable” than Transformers.
Examples
Adding features to Transformers:
- Cross-attention for retrieval-augmented generation (RAG): Easy (just modify attention)
- Expert layers (mixture of experts): Easy (replace the feed-forward block with a router and several expert feed-forward networks)
- Multi-modal (images + text): Easy (concatenate embeddings, apply same attention)
Adding features to Mamba:
- Cross-attention: Hard (not naturally expressed in SSM framework)
- Multi-modal: Hard (unclear how to integrate different input types into the selective mechanism)
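- Mixture of experts: Less established (experts replace the feed-forward layers rather than the SSM itself, as in Jamba, but there is far less accumulated practice than for Transformers)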
The linear-time property is elegant, but it makes Mamba less flexible for extensions.
When Mamba Wins
Despite these limitations, Mamba is valuable for:
- Very long sequences (10K+ tokens) where attention compute grows quadratically and the KV cache grows with every token
- Real-time generation where latency per token matters (Mamba: O(1), Transformer: O(n) with KV cache)
- On-device models where memory is constrained
- Theoretical understanding of alternatives to attention
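The O(1) versus O(n) per-token comparison in the list above can be turned into a rough cost model. This is a counting sketch under simplifying assumptions: one "unit" is either one cached position attended over or one state update, constants and kernel overheads are ignored, and the function names are hypothetical.

```python
def transformer_decode_cost(prompt_len: int, new_tokens: int) -> int:
    """Total cached positions attended over while generating `new_tokens`.

    Step i attends over the prompt plus the i tokens generated so far,
    so the per-token cost grows as the context gets longer.
    """
    return sum(prompt_len + i for i in range(new_tokens))


def mamba_decode_cost(new_tokens: int) -> int:
    """Total state updates while generating `new_tokens`: one per token,
    regardless of how long the prompt was."""
    return new_tokens


for prompt_len in (1_000, 100_000):
    t = transformer_decode_cost(prompt_len, 100)
    m = mamba_decode_cost(100)
    print(f"prompt {prompt_len:>7,}: transformer {t:>10,} units | mamba {m} units")
```

The units are not directly comparable across the two models, since one state update is not the same amount of work as attending to one position. The trend is what matters: the Transformer total scales with prompt length, and the Mamba total does not.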
The Verdict
Mamba is not a Transformer killer. It’s a complementary approach:
- Use Transformers: Most NLP tasks, vision-language models, knowledge-intensive tasks
- Use Mamba (or hybrid): Long documents, real-time systems, memory-constrained devices
The future likely involves hybrid architectures (Mamba + Attention blocks), not a complete replacement.