[ARTICLE · art-37476] src=discuss.huggingface.co ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

🧠 I built a novel triple-hybrid LLM (Mamba + Attention + 32-expert MoE) from scratch for ~$50 — Titan v1 complete, Titan v2 first cycle done, expanding dataset now

A developer built a novel triple-hybrid LLM combining Mamba, Attention, and a 32-expert Mixture of Experts architecture from scratch for approximately $50, completing Titan v1 and the first training cycle of Titan v2. The run finished with zero dead experts under loss-free EMA-bias balancing, and a fixed-shape dispatch kept gradient checkpointing usable, which made training on an L4 GPU affordable. This reply praises the developer's candid assessment of the model's pattern-matching limits and recommends evaluation and training changes for the next iteration.

Published Jun 24, 2026

This is a great answer, and honestly the honesty is the best part. “Pattern matching, not understanding,” “Fibonacci with no Fibonacci logic,” catching the 19th-century flavour as a Wolne Lektury fingerprint - that clear-eyed read on your own model is rarer than the model itself. Most people show the cherry-picked Polish output and stop.

A few things you clearly got right, worth pointing out: the routing health is genuinely good - 0 dead experts across a whole run with usage_std basically at the uniform floor is not easy, and the loss-free EMA-bias balancing is the right modern call (same family as the aux-loss-free routing the bigger labs moved to). And the fixed-shape dispatch to keep gradient checkpointing alive is the real unlock - that’s what turns “needs a cluster” into “$50 on an L4,” and it’s a more useful contribution than the architecture novelty itself.
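For anyone who wants to track the same routing-health numbers, here is a minimal sketch. It assumes you can log the chosen expert id for every routed token. The function name `routing_health` and the definition of usage std (population standard deviation of per-expert usage fractions) are my assumptions, not necessarily how Titan computes usage_std.

```python
from collections import Counter
import statistics

def routing_health(assignments, n_experts):
    """Return (dead_expert_count, usage_std) for one batch or one eval run.

    assignments: iterable of expert ids, one per routed token (assumed logging format).
    usage_std:   population std of per-expert usage fractions; 0.0 means perfectly uniform.
    """
    assignments = list(assignments)
    if not assignments:
        raise ValueError("no routed tokens to analyse")
    counts = Counter(assignments)
    total = len(assignments)
    usage = [counts.get(e, 0) / total for e in range(n_experts)]
    dead = sum(1 for u in usage if u == 0)
    return dead, statistics.pstdev(usage)
```

With finite token counts the std never reaches exactly zero, so it helps to compare against the value you get from sampling expert ids uniformly at random over the same number of tokens.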

A few things I’d reach for next, mostly so the numbers tell you the truth:

Report bits-per-byte, not PPL - especially v1 vs v2. Your v1 (~27.5, GPT-2 50K) and v2 (~57.5, custom 64K) perplexities aren’t comparable: a bigger vocab mechanically raises PPL for the same underlying quality, so v2 is almost certainly not a regression - it just can’t be read off raw PPL. BPB (or bits-per-char) is tokenizer-invariant and is the one cross-version curve you can actually trust. Reviewers will ask for this too.
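A rough sketch of the conversion, in case it helps. It assumes you can sum the per-token negative log-likelihood (in nats) over the eval set and count the UTF-8 bytes of the same raw text. The function names are mine, and the bytes-per-token figures in the example are made up purely to show the arithmetic; they are not measurements of either tokenizer.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Summed token NLL (nats) over an eval set, normalised by its UTF-8 byte count."""
    return total_nll_nats / (math.log(2) * n_bytes)

def ppl_to_bpb(ppl: float, bytes_per_token: float) -> float:
    """Convert per-token perplexity to bits-per-byte.

    bytes_per_token is the average number of UTF-8 bytes each token covers
    on the eval text, measured with that model's own tokenizer.
    """
    return math.log2(ppl) / bytes_per_token

# Hypothetical compression ratios, for illustration only.
v1_bpb = ppl_to_bpb(27.5, bytes_per_token=4.0)
v2_bpb = ppl_to_bpb(57.5, bytes_per_token=5.0)
```

The point of the second function is that the same perplexity means different things at different bytes-per-token, so only the normalised number can be compared across v1 and v2.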

Break the eval out per domain. With a 45/18/20/17 mix, one aggregate number hides which domains are learning. Per-domain BPB across checkpoints shows whether Polish, bio, and code are each improving or whether FineWeb is carrying it - and that directly informs your next data-mix decision.
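The per-domain breakdown could look something like this. It is a sketch assuming each eval document is logged as a (domain, summed NLL in nats, UTF-8 byte count) record; that record shape and any domain labels you feed it are my assumptions about your eval harness.

```python
import math
from collections import defaultdict

def per_domain_bpb(records):
    """Aggregate bits-per-byte per domain.

    records: iterable of (domain, nll_nats, n_bytes) tuples, one per eval document.
    Returns a dict mapping domain -> bits-per-byte over all of that domain's bytes.
    """
    nll = defaultdict(float)
    n_bytes = defaultdict(int)
    for domain, doc_nll, doc_bytes in records:
        nll[domain] += doc_nll
        n_bytes[domain] += doc_bytes
    # Skip domains with no bytes to avoid dividing by zero.
    return {d: nll[d] / (math.log(2) * n_bytes[d]) for d in nll if n_bytes[d] > 0}
```

Running it on the same held-out set at every checkpoint gives one curve per domain, which is what shows whether a single domain is carrying the aggregate.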

The ablation that’s worth the most. The whole premise is the triple hybrid, so the figure you (and reviewers) want is: does Mamba+Attn+MoE actually beat Mamba-only and Attn+MoE-only at matched params and tokens? Even at your 57M config that’s a clean, cheap experiment - and it’s the difference between “I combined three things” and “here’s evidence the combination earns its keep.”
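One cheap guard for that ablation is to check the matching before spending any GPU time. This is a sketch under assumptions: `Variant`, `is_matched`, and the 2% tolerance are all hypothetical, and whether you match total or active parameters for the MoE variants is a choice you would need to state alongside the results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    name: str
    n_params: int   # parameter count being matched (total or active: pick one and say which)
    n_tokens: int   # training token budget

def is_matched(variants, rel_tol=0.02):
    """True if every variant is within rel_tol of the first variant's
    parameter count and uses exactly the same token budget."""
    ref = variants[0]
    return all(
        abs(v.n_params - ref.n_params) <= rel_tol * ref.n_params
        and v.n_tokens == ref.n_tokens
        for v in variants
    )
```

If a variant fails the check, widen or narrow its layers until it passes, so that any gap in the final BPB can be attributed to the architecture and not to size or data.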

Your SFT hypothesis - make it a controlled pair. “SFT amplifies loops on an undertrained base but does something real on a Chinchilla base” is a genuinely good, testable claim. To attribute it cleanly, hold the SFT data + recipe identical and vary only the base (1.8B checkpoint vs the 9.3B one). Same SFT, two bases, side by side - that’s a paper figure on its own.

One gentle note on “Chinchilla-optimal.” ~20 tokens/param is compute-optimal for training a model that size - but for a small model you actually want to use, the modern move (LLaMA-style) is to train well past it, often 100–200+ tokens/param, because the model’s cheap to run and you’re buying capability, not minimizing training FLOPs. So 9B tokens is a sensible checkpoint, not a ceiling - if the loops persist at 9B tokens, more tokens (not just SFT) is a lever.
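The arithmetic behind that, as a small sketch. The helper names are mine, and the parameter count in the example is a placeholder; plug in your real one.

```python
def tokens_per_param(n_tokens: int, n_params: int) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def token_budget(n_params: int, ratio: float) -> int:
    """Tokens needed to reach a given tokens-per-parameter ratio."""
    return round(n_params * ratio)

# Placeholder model size, for illustration only.
n_params = 500_000_000
chinchilla = token_budget(n_params, 20)    # roughly the compute-optimal point
overtrained = token_budget(n_params, 100)  # low end of the "train well past it" range
```

Comparing `chinchilla` with `overtrained` for your actual size shows how much room is left above the compute-optimal checkpoint before you would call the base properly trained for inference use.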

And one tiny scoping thing, as a friend: “first to combine Mamba+Attention+MoE” will get poked - Jamba’s right there, and you even call the interleaving Jamba-style. Your actual novelty - the fixed-shape dispatch, the loss-free balancing recipe, doing it from scratch under 1B for $50 - is the stronger, more defensible claim. Lead with that.

To your other question - yeah, exactly the same here: always three ideas ahead, rarely sit with my own models. The few times I force myself to actually use the thing, it’s humbling and clarifying in equal measure - you find the bug the metrics were hiding. Your inference-test-across-four-domains habit is already giving you that. Keep posting the honest read; it’s the most important part of the thread.
