# 🧠 I built a novel triple-hybrid LLM (Mamba + Attention + 32-expert MoE) from scratch for ~$50 — Titan v1 complete, Titan v2 first cycle done, expanding dataset now

> Source: <https://discuss.huggingface.co/t/i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for-50-titan-v1-complete-titan-v2-first-cycle-done-expanding-dataset-now/177063#post_4>
> Published: 2026-06-24 08:27:14+00:00

This is a great answer, and the honesty is the best part. “Pattern matching, not understanding,” “Fibonacci with no Fibonacci logic,” catching the 19th-century flavour as a Wolne Lektury fingerprint - that clear-eyed read on your own model is rarer than the model itself. Most people show the cherry-picked Polish output and stop.

A few things you clearly got right, worth pointing out: the routing health is genuinely good - 0 dead experts across a whole run with usage_std basically at the uniform floor is not easy, and the loss-free EMA-bias balancing is the right modern call (same family as the aux-loss-free routing the bigger labs moved to). And the fixed-shape dispatch to keep gradient checkpointing alive is the real unlock - that’s what turns “needs a cluster” into “$50 on an L4,” and it’s a more useful contribution than the architecture novelty itself.
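For anyone following along, here is a minimal sketch of what bias-based, aux-loss-free balancing can look like. It is not your implementation: the function names, the sign-based update, the `decay` and `step` values, and the toy router are all my assumptions, and the health check is just one plausible way to compute a dead-expert count and a usage standard deviation.

```python
import random
import statistics

def route_top_k(scores, bias, k):
    """Pick the k experts with the highest (score + bias).
    The bias only steers selection; it would not scale expert outputs."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]

def update_bias(bias, ema_load, batch_load, decay=0.9, step=0.01):
    """Update the EMA of each expert's load share, then nudge its bias
    up if it is under the uniform share and down if it is over (in place)."""
    target = 1.0 / len(bias)
    for e in range(len(bias)):
        ema_load[e] = decay * ema_load[e] + (1.0 - decay) * batch_load[e]
        if ema_load[e] < target:
            bias[e] += step
        elif ema_load[e] > target:
            bias[e] -= step

def routing_health(counts):
    """Return (number of experts with zero tokens, population std of usage shares)."""
    total = sum(counts)
    shares = [c / total for c in counts]
    dead = sum(1 for c in counts if c == 0)
    return dead, statistics.pstdev(shares)

def simulate(num_experts=8, k=2, tokens_per_batch=128, batches=200, seed=0):
    """Toy router whose raw scores favour high-index experts (hypothetical setup)."""
    rng = random.Random(seed)
    skew = [e / num_experts for e in range(num_experts)]
    bias = [0.0] * num_experts
    ema_load = [1.0 / num_experts] * num_experts
    counts = [0] * num_experts
    for _batch in range(batches):
        counts = [0] * num_experts
        for _token in range(tokens_per_batch):
            scores = [rng.gauss(skew[e], 1.0) for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                counts[e] += 1
        total = sum(counts)
        update_bias(bias, ema_load, [c / total for c in counts])
    return bias, counts

if __name__ == "__main__":
    bias, counts = simulate()
    print("learned bias:", [round(b, 2) for b in bias])
    print("dead experts, usage std:", routing_health(counts))
```

The point of the sketch is that no auxiliary loss term touches the gradients: the only balancing signal is a per-expert offset adjusted outside backprop, so the language-modelling loss stays clean.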

A few things I’d reach for next, mostly so the numbers tell you the truth:

Report bits-per-byte, not PPL - especially v1 vs v2. Your v1 (~27.5, GPT-2 50K) and v2 (~57.5, custom 64K) perplexities aren’t comparable: PPL is normalized per token, and a bigger vocab usually packs more bytes into each token, which mechanically raises per-token PPL for the same underlying quality. So v2 is almost certainly not a regression - it just can’t be read off raw PPL. BPB (or bits-per-char) divides the total loss in bits by the byte count of the raw eval text, so it doesn’t depend on the tokenizer, and it’s the one cross-version curve you can actually trust. Reviewers will ask for this too.
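A minimal sketch of the conversion, assuming you log the mean per-token cross-entropy in nats plus the token and byte counts of the eval text. The function names are mine, and the numbers in the example are made up purely to show the mechanics; they are not your v1/v2 results.

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean per-token cross-entropy (nats) into bits per UTF-8 byte."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

def bits_per_byte_from_ppl(ppl, n_tokens, n_bytes):
    """Same conversion starting from per-token perplexity, since loss = ln(PPL)."""
    return bits_per_byte(math.log(ppl), n_tokens, n_bytes)

if __name__ == "__main__":
    # Hypothetical tokenizers scored on the same 1,200,000-byte eval text.
    # Tokenizer A averages 4 bytes/token, tokenizer B averages 6 bytes/token.
    bpb_a = bits_per_byte_from_ppl(ppl=16.0, n_tokens=300_000, n_bytes=1_200_000)
    bpb_b = bits_per_byte_from_ppl(ppl=64.0, n_tokens=200_000, n_bytes=1_200_000)
    print(bpb_a, bpb_b)  # both come out at about 1.0 despite a 4x gap in PPL
```

Count `n_bytes` on the raw text before tokenization (for example `len(text.encode("utf-8"))`), so both versions are divided by the same denominator.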

Break the eval out per domain. With a 45/18/20/17 mix, one aggregate number hides which domains are learning. Per-domain BPB across checkpoints shows whether Polish, bio, and code are each improving or whether FineWeb is carrying it - and that directly informs your next data-mix decision.
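One way to get that breakdown, sketched under the assumption that your eval loop can emit one record per document with a domain label, the summed loss in nats, and the byte length. The record layout and the domain names here are hypothetical placeholders.

```python
import math
from collections import defaultdict

def per_domain_bpb(records):
    """Aggregate (domain, summed_loss_nats, n_bytes) records into bits-per-byte per domain.
    Nats and bytes are pooled before dividing, so long documents are weighted by size."""
    nats = defaultdict(float)
    n_bytes = defaultdict(int)
    for domain, doc_nats, doc_bytes in records:
        nats[domain] += doc_nats
        n_bytes[domain] += doc_bytes
    return {d: nats[d] / (math.log(2) * n_bytes[d]) for d in nats}

if __name__ == "__main__":
    # Made-up records for illustration only.
    ln2 = math.log(2)
    records = [
        ("code", 50 * ln2, 100),
        ("code", 150 * ln2, 100),
        ("polish", 300 * ln2, 200),
    ]
    for domain, bpb in sorted(per_domain_bpb(records).items()):
        print(f"{domain}: {bpb:.3f} bits/byte")
```

Run it on a fixed held-out set at every checkpoint and plot one line per domain; a flat line next to a falling aggregate is the signal that one domain is carrying the average.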

The ablation that’s worth the most. The whole premise is the triple hybrid, so the figure you (and reviewers) want is: does Mamba+Attn+MoE actually beat Mamba-only and Attn+MoE-only at matched params and tokens? Even at your 57M config that’s a clean, cheap experiment - and it’s the difference between “I combined three things” and “here’s evidence the combination earns its keep.”

Your SFT hypothesis - make it a controlled pair. “SFT amplifies loops on an undertrained base but does something real on a Chinchilla base” is a genuinely good, testable claim. To attribute it cleanly, hold the SFT data + recipe identical and vary only the base (the 1.8B-token checkpoint vs the 9.3B-token one). Same SFT, two bases, side by side - that’s a paper figure on its own.

One gentle note on “Chinchilla-optimal.” ~20 tokens/param is compute-optimal for training a model that size - but for a small model you actually want to use, the modern move (LLaMA-style) is to train well past it, often 100–200+ tokens/param, because the model’s cheap to run and you’re buying capability, not minimizing training FLOPs. So 9B tokens is a sensible checkpoint, not a ceiling - if the loops persist at 9B, more tokens (not just SFT) is a lever.
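The arithmetic is quick to sketch. The parameter count below is my inference from a 9.3B-token target at 20 tokens/param, not a number you stated, so swap in your real one.

```python
def token_budget(n_params, tokens_per_param):
    """Training tokens implied by a tokens-per-parameter ratio."""
    return n_params * tokens_per_param

def implied_params(n_tokens, tokens_per_param):
    """Parameter count implied by a token budget at a given ratio."""
    return n_tokens / tokens_per_param

if __name__ == "__main__":
    # Assumption: inferred from 9.3B tokens at ~20 tokens/param.
    n_params = int(implied_params(9.3e9, 20))
    for ratio in (20, 100, 200):
        tokens = token_budget(n_params, ratio)
        print(f"{ratio:>3} tokens/param -> {tokens / 1e9:.1f}B tokens")
```

Training cost scales roughly linearly with tokens at a fixed model size, so the 100 tokens/param row is about five times the compute of the 20 tokens/param row.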

And one tiny scoping thing, as a friend: “first to combine Mamba+Attention+MoE” will get poked - Jamba’s right there, and you even call the interleaving Jamba-style. Your actual novelty - the fixed-shape dispatch, the loss-free balancing recipe, doing it from scratch under 1B for $50 - is the stronger, more defensible claim. Lead with that.

To your other question - yeah, exactly the same here: always three ideas ahead, rarely sit with my own models. The few times I force myself to actually use the thing, it’s humbling and clarifying in equal measure - you find the bug the metrics were hiding. Your inference-test-across-four-domains habit is already giving you that. Keep posting the honest read; it’s the most important part of the thread.