{"slug": "hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look", "title": "Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look", "summary": "A developer traced a hybrid Mamba-Transformer MoE inference run and found that MoE all-to-all collective stalls dominate the tail latency, with a 69x tail ratio, despite dashboards showing 96% GPU utilization. The per-layer decomposition reveals that these stalls are invisible to standard metrics like nvidia-smi and vLLM TTFT, which aggregate over the run. The developer suggests engine-side fixes such as expert-aware batching, communication-computation overlap, or dynamic expert placement to mitigate the issue.", "body_md": "*A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE all-to-all collective stalls dominate the tail. The dashboards saw 96% GPU utilization the entire window.*\n\nHybrid Mamba-Transformer architectures (Nemotron 3 Nano Omni, Jamba and friends) shipped at speed in late April. These models break the assumptions vLLM and SGLang dashboards make about prefill/decode shape: Mamba state-space layers have one runtime profile, Transformer attention has another, MoE router blocks have a third (with all-to-all collective comm). The aggregate looks fine on a duty-cycle counter; the per-layer tail is full of hybrid MoE stalls nobody is decomposing. We trace one and decompose it.\n\n[NVIDIA Nemotron 3 Nano Omni](https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/) (Apr 28, open multimodal MoE) is the most prominent recent shipment, but it is one of several. 
The shape is consistent: a hybrid Mamba-Transformer backbone with mixture-of-experts routing, tuned to claim higher throughput than pure-Transformer baselines at comparable parameter counts.\n\nOn the inference engine side, vLLM and SGLang already track per-request metrics: TTFT, ITL, throughput. They do not yet decompose those metrics by layer type. For pure-Transformer models, the decomposition is mostly uninteresting (every layer has roughly the same runtime profile). For hybrid MoE, the decomposition is the entire story.\n\nWe captured a 60-second inference trace on a TensorDock H100 running a hybrid Mamba-Transformer MoE checkpoint and broke the kernel-launch events down by layer type:\n\n```\nlayer type      n calls   p50 (us)   p99 (us)   tail ratio\n----------------------------------------------------------\nMamba SSM         3,840         42         95         2.3x\nTransformer attn  1,920         88        320         3.6x\nMoE all-to-all      640        180     12,400          69x\n```\n\nThe aggregate runtime distribution looks moderate: median 50us, p99 300us. The decomposition shows that the MoE all-to-all calls have a 69x p99-to-p50 tail ratio, dominating wall time despite making up only one-tenth of the calls (640 of 6,400). The Mamba layers are tight and predictable. The Transformer attention is bursty because of variable-length prefill. The MoE all-to-all is where the model spends its tail.\n\nThroughout the same 60-second window, nvidia-smi reported 95-97% GPU utilization. DCGM SM_ACTIVE was at 92% mean. The vLLM-style metrics showed median TTFT 220ms – within target. None of those signals captured the per-layer-type variance, because they are all duty-cycle or end-to-end measurements that aggregate over the run.\n\nThe MoE all-to-all stall pattern is a classic case of throughput bottlenecked by the slowest participant: when one expert routing pattern produces an unbalanced communication step, the entire batch waits. 
The eBPF trace catches it because every `cudaLaunchKernel` and `cudaStreamSync` is recorded with a timestamp and caller stack, so the per-layer decomposition is just a SQL query over the captured events.\n\nOnce the decomposition is in front of you, the engine choices change: expert-aware batching, communication-computation overlap, or dynamic expert placement.\n\nAll three are reasonable engine-side fixes. None of them is reachable without per-layer-type runtime data.\n\nCapture a trace under load:\n\n```\nsudo ingero check\nsudo ingero trace --duration 60s --db /tmp/hybrid.db\n\n# Aggregate per-kernel-name runtime distribution\ningero query --db /tmp/hybrid.db \\\n  \"SELECT name, count(*), percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_us) AS p50, percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_us) AS p99 FROM events WHERE source='cuda' GROUP BY name ORDER BY p99 DESC LIMIT 20\"\n```\n\nThe kernel names will give away which layer type each row belongs to (Mamba layers reach `conv1d` and `selective_scan_fwd`, Transformer attention reaches `fused_attention`, and MoE all-to-all reaches `nccl_all_to_all` or framework-specific dispatch wrappers).\n\nThree public references for the hybrid-architecture regime: [NVIDIA Nemotron 3 Nano Omni](https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/) (April 28, 2026) is the most prominent recent open hybrid Mamba-Transformer MoE checkpoint and the source of the kernel-name patterns shown above; [the Mamba paper (arXiv 2312.00752)](https://arxiv.org/abs/2312.00752) describes the state-space layer’s structural difference from Transformer attention; and the [vLLM documentation](https://docs.vllm.ai/en/latest/) explains the prefill/decode batching model whose assumptions these hybrid models break.\n\nWhen the architecture stops being uniform, the metrics that aggregate across the architecture stop being useful. 
Hybrid MoEs need per-layer-type decomposition to surface the regimes where one layer type dominates tail latency. eBPF gives the decomposition for free; the only thing missing is the SQL query that asks the right question. As more hybrid architectures ship, the dashboard layer will need to catch up – or the engineers running them will keep going under the dashboard with kernel-level traces instead.\n\n*Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. **[GitHub ⭐](https://github.com/ingero-io/ingero)** · **Open an issue** if you are running hybrid Mamba-Transformer MoE inference and seeing tail latency the dashboards do not explain.*", "url": "https://wpnews.pro/news/hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look", "canonical_source": "https://dev.to/ingero/hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look-56dg", "published_at": "2026-06-15 13:00:00+00:00", "updated_at": "2026-06-15 13:06:36.065510+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-infrastructure", "developer-tools"], "entities": ["NVIDIA", "Nemotron 3 Nano Omni", "vLLM", "SGLang", "TensorDock", "H100", "Mamba", "MoE"], "alternates": {"html": "https://wpnews.pro/news/hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look", "markdown": "https://wpnews.pro/news/hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look.md", "text": "https://wpnews.pro/news/hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look.txt", "jsonld": "https://wpnews.pro/news/hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look.jsonld"}}