Mamba2 Architecture — How it works and where can we improve it? #193184
Replies: 2 comments
Hi @Kahabk, this is a great conversation starter! Mamba2 and SSD are definitely the "hot topics" in architecture right now. Here is a simplified take on the points you raised:
Think of it this way: Transformers have "Photographic Memory" (they look at everything), while Mamba2 has a "Smart Notebook" (it summarizes everything). As we go to 100B+, the summary in that notebook has to be incredibly perfect. If the "notebook" gets too crowded, the model might start losing fine details. We haven't seen a 100B pure Mamba yet because we're still figuring out how to keep that summary from getting "blurry."
Attention is the electric motor (perfect for precision but eats up battery/memory). Mamba is the gas engine (extremely efficient for long distances).
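The "smart notebook" point can be made concrete with a toy linear recurrence (purely illustrative, numpy-only, not the actual Mamba2 selective-scan kernel): however long the input gets, everything is squeezed into one fixed-size state vector.

```python
import numpy as np

# Toy illustration (not the real Mamba2 kernel): a linear recurrence
# compresses an arbitrarily long input into a fixed-size state h.
d_state = 16                 # the "notebook" size is constant...
seq_len = 10_000             # ...no matter how long the input is
rng = np.random.default_rng(0)

A = 0.99 * np.eye(d_state)   # decay: old information slowly fades
B = rng.standard_normal((d_state, 1)) * 0.01

h = np.zeros((d_state, 1))   # fixed memory budget for the whole context
for t in range(seq_len):
    x_t = rng.standard_normal((1, 1))
    h = A @ h + B @ x_t      # summarize token t into the state

# Memory used for context is O(d_state), independent of seq_len,
# whereas attention keeps keys/values for all 10_000 past tokens.
print(h.shape)  # (16, 1)
```

This is exactly why the summary can get "blurry": 10,000 tokens are competing for 16 slots, so detail is necessarily lost unless the state is large (or selective) enough.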
I hope this helps! Good luck with your research and exploring this further!
A few things to add from a more technical angle.

On scaling: the SSD layer is theoretically sound, but the real bottleneck is that d_state needs to scale alongside parameter count, and there's no established recipe for doing that yet. Most experiments have used relatively small state sizes. The 100B question is open mainly because nobody's tried it with the necessary compute budget.

On hybrids: I don't think mixing defeats the point. The original goal wasn't "replace attention entirely," it was "match attention quality at a fraction of the cost." Jamba and Zamba both show you can hit that while keeping attention layers for the cases where precise lookup genuinely matters. The ratio of Mamba to attention layers is still being worked out empirically.

On fine-tuning: there is ongoing work on LoRA for Mamba models. The challenge is that SSM layers don't have weight matrices that map cleanly to what LoRA normally targets in transformers. Some groups have applied LoRA to the projection layers wrapping the SSM block and gotten decent results, but no standardized recipe has emerged yet.

On long context: the 100k+ results I've seen are mostly on synthetic retrieval benchmarks (needle-in-a-haystack style). Real-task performance is more mixed. The bounded state is the culprit. The architectural fix would be dynamically scaling d_state, but that complicates the efficiency argument considerably.

On MoE: MoE-Mamba was actually explored; worth reading if you haven't. The rough finding is that routing quality degrades when the state representation is lossy, which causes weird specialization patterns in the experts. Interesting direction but not cleanly solved yet.
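To make the fine-tuning point concrete, here is a minimal numpy sketch of the "LoRA on the wrapping projections" idea. Everything here is hypothetical (the class name, the `in_proj`/`out_proj` naming, and the dimensions are illustrative, not a real Mamba2 API): the frozen dense projections get a trainable low-rank delta, while the SSM core itself is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Minimal LoRA wrapper (hypothetical sketch, not a real Mamba2 API):
    keeps the frozen weight W and learns a low-rank delta B @ A."""
    def __init__(self, w_frozen, r=8, alpha=16):
        d_out, d_in = w_frozen.shape
        self.w = w_frozen                               # frozen pretrained weight
        self.a = rng.standard_normal((r, d_in)) * 0.01  # trainable, rank-r
        self.b = np.zeros((d_out, r))                   # trainable, init to zero
        self.scale = alpha / r

    def __call__(self, x):
        # Effective weight is W + scale * (B @ A); the delta is rank-r,
        # so trainable parameters are r*(d_in + d_out) instead of d_in*d_out.
        return x @ (self.w + self.scale * self.b @ self.a).T

# Target only the dense projections around the SSM core: the selective
# scan has input-dependent dynamics rather than a plain weight matrix,
# so this recipe leaves it frozen as-is.
d_model, d_inner = 64, 128
in_proj  = LoRALinear(rng.standard_normal((d_inner, d_model)))
out_proj = LoRALinear(rng.standard_normal((d_model, d_inner)))

x = rng.standard_normal((1, d_model))
h = in_proj(x)     # (1, d_inner) -> would feed the SSM block here
y = out_proj(h)    # back to (1, d_model)
print(y.shape)  # (1, 64)
```

Since `b` starts at zero, the adapted model is exactly the pretrained model at initialization, which is the standard LoRA trick for stable fine-tuning starts.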
Discussion Type: Question
Discussion Content:
Hey everyone 👋
I've been going through the Mamba2 paper and playing around with it
for a while now, and I genuinely find the architecture really
interesting. Wanted to start an open conversation about it because
I feel like there's a lot to unpack here.
So Mamba2 brought in this State Space Duality (SSD) idea which kind
of bridges the gap between SSMs and attention — which is honestly
a clever move. But I've been sitting with a few questions that I
can't fully answer on my own.
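The SSD duality in the paper can actually be checked numerically in a few lines. Below is a toy numpy version (scalar per-token decay, one channel, made-up dimensions) showing the two equivalent views: the linear-time recurrent form with a fixed-size state, and the quadratic attention-like form `y = (L * C @ B.T) @ x` with a causal decay mask.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state size (toy values)
a = rng.uniform(0.5, 1.0, T)     # per-token scalar decay (SSD's scalar A)
B = rng.standard_normal((T, N))  # input projections  (play the role of keys)
C = rng.standard_normal((T, N))  # output projections (play the role of queries)
x = rng.standard_normal(T)       # single channel for simplicity

# Recurrent (linear-time) form: fixed-size state updated token by token.
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Dual (quadratic, attention-like) form: y = (L * (C @ B.T)) @ x, where
# L[t, s] = a[s+1] * ... * a[t] for s <= t, else 0 (causal decay mask).
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1:t + 1])  # empty product = 1 when s == t
M = L * (C @ B.T)
y_dual = M @ x

print(np.allclose(y_rec, y_dual))  # True
```

Same output, two cost profiles: the recurrence is O(T·N), the masked-matrix form is O(T²) but maps onto the same hardware-friendly matmuls as attention, which is exactly the bridge SSD exploits.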
Like — how far can this actually scale? Transformers have been
pushed to hundreds of billions of parameters and we know their
breaking points. But with Mamba2, are we confident the SSD layer
holds up at that scale too?
Also the hybrid approach (mixing Mamba2 with attention layers) seems
to be gaining traction. Jamba does this. But does mixing defeat the
whole point of moving away from attention in the first place?
Curious what people think.
And long context — this is where Mamba2 should theoretically shine
over transformers. Has anyone actually stress tested it at 100k+
tokens in a real task? Would love to see real numbers not just
theoretical complexity arguments.
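Even without real benchmark numbers, the memory side of the argument is easy to quantify. A back-of-envelope comparison (the model dimensions below are made up for illustration, fp16 throughout): attention must cache K/V for every past token, while an SSM carries one fixed-size state per layer.

```python
# Back-of-envelope inference-memory comparison (illustrative numbers, fp16).
seq_len    = 100_000   # context length
n_layers   = 48        # hypothetical model depth
d_model    = 4096      # hypothetical hidden size
d_state    = 128       # hypothetical SSM state size
bytes_fp16 = 2

# Attention: K and V cached for every token, every layer.
kv_cache = seq_len * n_layers * 2 * d_model * bytes_fp16

# SSM: one (d_state x d_model)-shaped state per layer, independent of seq_len.
ssm_state = n_layers * d_state * d_model * bytes_fp16

print(f"KV cache:  {kv_cache / 2**30:.1f} GiB")
print(f"SSM state: {ssm_state / 2**20:.1f} MiB")
```

The gap is several orders of magnitude at 100k tokens, which is the whole appeal; the open question raised above is whether quality survives the compression, not whether the memory math works out.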
A few things I personally think could be improved:
- Fine-tuning: adapter methods like LoRA are well established for transformers, but nobody really has a solid recipe for Mamba2 yet.
- Default hyperparameters: choices like d_state mostly come from small-scale experiments; I doubt they're optimal for every task.
- Mixture-of-experts: combining Mamba2 with MoE layers. Feels like a natural next step.
Anyway I'm not an expert here, just someone genuinely curious and
trying to learn. If you've done any work on this or have strong
opinions — please jump in. Would love a real conversation about this.