Mamba2 Architecture — How it works and where can we improve it? #193184

Kahabk · 2026-04-20T13:08:13Z

Kahabk
Apr 20, 2026

Discussion Type

Question

Discussion Content

Hey everyone 👋

I've been going through the Mamba2 paper and playing around with it
for a while now, and I genuinely find the architecture really
interesting. Wanted to start an open conversation about it because
I feel like there's a lot to unpack here.

So Mamba2 brought in this State Space Duality (SSD) idea which kind
of bridges the gap between SSMs and attention — which is honestly
a clever move. But I've been sitting with a few questions that I
can't fully answer on my own.

Like — how far can this actually scale? Transformers have been
pushed to hundreds of billions of parameters and we know their
breaking points. But with Mamba2, are we confident the SSD layer
holds up at that scale too?

Also the hybrid approach (mixing Mamba2 with attention layers) seems
to be gaining traction. Jamba does this. But does mixing defeat the
whole point of moving away from attention in the first place?
Curious what people think.

And long context — this is where Mamba2 should theoretically shine
over transformers. Has anyone actually stress tested it at 100k+
tokens in a real task? Would love to see real numbers not just
theoretical complexity arguments.

A few things I personally think could be improved:

Fine tuning story is still weak. LoRA works great for transformers
but nobody really has a solid recipe for Mamba2 yet.
State initialization feels underexplored. The defaults work but
I doubt they're optimal for every task.
Would love to see someone seriously try MoE routing inside Mamba2
layers. Feels like a natural next step.

Anyway I'm not an expert here, just someone genuinely curious and
trying to learn. If you've done any work on this or have strong
opinions — please jump in. Would love a real conversation about this.

abbosaliboev · 2026-04-20T17:53:08Z

abbosaliboev
Apr 20, 2026

Hi @Kahabk,

This is a great conversation starter! Mamba2 and SSD are definitely the "hot topics" in architecture right now. Here is a simplified take on the points you raised:

Can it scale to 100B+?
The main challenge with scaling Mamba2 isn't the number of parameters, but the "Notebook" (Hidden State) capacity.

Think of it this way: Transformers have "Photographic Memory" (they look at everything), while Mamba2 has a "Smart Notebook" (it summarizes everything).

As we go to 100B+, the summary in that notebook has to be incredibly perfect. If the "notebook" gets too crowded, the model might start losing fine details. We haven't seen a 100B pure Mamba yet because we're still figuring out how to keep that summary from getting "blurry."

Is the Hybrid approach "cheating"?
Actually, it’s just smart engineering—like a Hybrid Car.

Attention is the electric motor (perfect for precision but eats up battery/memory).

Mamba is the gas engine (extremely efficient for long distances).
By mixing them, models like Jamba get the best of both worlds: they can remember specific facts perfectly while running 10x faster and cheaper. It’s not moving away from the point; it’s making it practical for real-world use.

100k+ Token Stress Tests
In real tests, Mamba2 is the "King of Memory." It won't crash your GPU like a Transformer would at 100k tokens. However, it can suffer from the "Lost in the Middle" problem. It’s like reading a massive book in one sitting—you’ll remember the beginning and the end clearly, but the details on page 450 might get a bit fuzzy.

I hope this helps! Good luck with your research and exploring this further ))

0 replies

Gecko51 · 2026-04-20T20:18:56Z

Gecko51
Apr 20, 2026

A few things to add from a more technical angle.

On scaling: the SSD layer is theoretically sound, but the real bottleneck is that d_state needs to scale alongside parameter count, and there's no established recipe for doing that yet. Most experiments have used relatively small state sizes. The 100B question is open mainly because nobody's tried it with the necessary compute budget.

On hybrids: I don't think mixing defeats the point. The original goal wasn't "replace attention entirely," it was "match attention quality at a fraction of the cost." Jamba and Zamba both show you can hit that while keeping attention layers for the cases where precise lookup genuinely matters. The ratio of Mamba to attention layers is still being worked out empirically.

On fine-tuning: there is ongoing work on LoRA for Mamba models. The challenge is that SSM layers don't have weight matrices that map cleanly to what LoRA normally targets in transformers. Some groups have applied LoRA to the projection layers wrapping the SSM block and gotten decent results, but no standardized recipe has emerged yet.

On long context: the 100k+ results I've seen are mostly on synthetic retrieval benchmarks (needle-in-a-haystack style). Real-task performance is more mixed. The bounded state is the culprit. The architectural fix would be dynamically scaling d_state, but that complicates the efficiency argument considerably.

On MoE: MoE-Mamba was actually explored, worth reading if you haven't. The rough finding is that routing quality degrades when the state representation is lossy, which causes weird specialization patterns in the experts. Interesting direction but not cleanly solved yet.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Community

Mamba2 Architecture — How it works and where can we improve it? #193184

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

GitHub Community

Mamba2 Architecture — How it works and where can we improve it? #193184

Uh oh!

Kahabk Apr 20, 2026

Discussion Type

Discussion Content

Replies: 2 comments

Uh oh!

abbosaliboev Apr 20, 2026

Uh oh!

Gecko51 Apr 20, 2026

Kahabk
Apr 20, 2026

abbosaliboev
Apr 20, 2026

Gecko51
Apr 20, 2026