{"slug": "i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for", "title": "🧠 I built a novel triple-hybrid LLM (Mamba + Attention + 32-expert MoE) from scratch for ~$50 — Titan v1 complete, Titan v2 first cycle done, expanding dataset now", "summary": "A developer built a novel triple-hybrid LLM combining Mamba, Attention, and a 32-expert Mixture of Experts architecture from scratch for approximately $50, completing Titan v1 and the first training cycle of Titan v2. The model achieved zero dead experts and loss-free EMA-bias balancing, with the fixed-shape dispatch enabling cost-effective training on an L4 GPU. The developer's honest assessment highlights pattern matching limitations and provides recommendations for future improvements.", "body_md": "This is a great answer, and honestly the honesty is the best part. “Pattern matching, not understanding,” “Fibonacci with no Fibonacci logic,” catching the 19th-century flavour as a Wolne Lektury fingerprint - that clear-eyed read on your own model is rarer than the model itself. Most people show the cherry-picked Polish output and stop.\n\nA few things you clearly got right, worth pointing out: the routing health is genuinely good - 0 dead experts across a whole run with usage_std basically at the uniform floor is not easy, and the loss-free EMA-bias balancing is the right modern call (same family as the aux-loss-free routing the bigger labs moved to). And the fixed-shape dispatch to keep gradient checkpointing alive is the real unlock - that’s what turns “needs a cluster” into “$50 on an L4,” and it’s a more useful contribution than the architecture novelty itself.\n\nA few things I’d reach for next, mostly so the numbers tell you the truth:\n\nReport bits-per-byte, not PPL - especially v1 vs v2. 
Your v1 (~27.5, GPT-2 50K) and v2 (~57.5, custom 64K) perplexities aren’t comparable: a bigger vocab packs more text into each token, which mechanically raises per-token PPL at the same underlying quality, so v2 is almost certainly not a regression - it just can’t be read off raw PPL. BPB (or bits-per-char) normalizes by bytes rather than tokens, so it is tokenizer-invariant and is the one cross-version curve you can actually trust. Reviewers will ask for this too.\n\nBreak the eval out per domain. With a 45/18/20/17 mix, one aggregate number hides which domains are learning. Per-domain BPB across checkpoints shows whether Polish, bio, and code are each improving or whether FineWeb is carrying it - and that directly informs your next data-mix decision.\n\nThe ablation that’s worth the most. The whole premise is the triple hybrid, so the figure you (and reviewers) want is: does Mamba+Attn+MoE actually beat Mamba-only and Attn+MoE-only at matched params and tokens? Even at your 57M config that’s a clean, cheap experiment - and it’s the difference between “I combined three things” and “here’s evidence the combination earns its keep.”\n\nYour SFT hypothesis - make it a controlled pair. “SFT amplifies loops on an undertrained base but does something real on a Chinchilla base” is a genuinely good, testable claim. To attribute it cleanly, hold the SFT data + recipe identical and vary only the base (1.8B checkpoint vs the 9.3B one). Same SFT, two bases, side by side - that’s a paper figure on its own.\n\nOne gentle note on “Chinchilla-optimal.” ~20 tokens/param is compute-optimal for training a model that size - but for a small model you actually want to use, the modern move (LLaMA-style) is to train well past it, often 100–200+ tokens/param, because the model’s cheap to run and you’re buying capability, not minimizing training FLOPs. 
So 9B is a sensible checkpoint, not a ceiling - if the loops persist at 9B, more tokens (not just SFT) is a lever.\n\nAnd one tiny scoping thing, as a friend: “first to combine Mamba+Attention+MoE” will get poked - Jamba’s right there, and you even call the interleaving Jamba-style. Your actual novelty - the fixed-shape dispatch, the loss-free balancing recipe, doing it from scratch under 1B for $50 - is the stronger, more defensible claim. Lead with that.\n\nTo your other question - yeah, exactly the same here: always three ideas ahead, rarely sit with my own models. The few times I force myself to actually use the thing, it’s humbling and clarifying in equal measure - you find the bug the metrics were hiding. Your inference-test-across-four-domains habit is already giving you that. Keep posting the honest read; it’s the most important part of the thread.", "url": "https://wpnews.pro/news/i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for", "canonical_source": "https://discuss.huggingface.co/t/i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for-50-titan-v1-complete-titan-v2-first-cycle-done-expanding-dataset-now/177063#post_4", "published_at": "2026-06-24 08:27:14+00:00", "updated_at": "2026-06-24 08:49:57.981819+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-startups", "ai-infrastructure", "ai-products"], "entities": ["Titan", "Mamba", "Mixture of Experts", "GPT-2", "FineWeb", "Chinchilla", "LLaMA", "Jamba"], "alternates": {"html": "https://wpnews.pro/news/i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for", "markdown": "https://wpnews.pro/news/i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for.md", "text": "https://wpnews.pro/news/i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for.txt", "jsonld": 
"https://wpnews.pro/news/i-built-a-novel-triple-hybrid-llm-mamba-attention-32-expert-moe-from-scratch-for.jsonld"}}