Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look

A developer traced a hybrid Mamba-Transformer MoE inference run and found that MoE all-to-all collective stalls dominate tail latency, with a p99/p50 tail ratio of 69x, even though dashboards showed 96% GPU utilization. The per-layer decomposition shows that these stalls are invisible to standard metrics such as nvidia-smi utilization and vLLM TTFT, which aggregate over the run. The developer suggests engine-side fixes such as expert-aware batching, communication-computation overlap, or dynamic expert placement.

A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE all-to-all collective stalls dominate the tail. The dashboards saw 96% GPU utilization the entire window.

Hybrid Mamba-Transformer architectures (Nemotron 3 Nano Omni, Jamba and friends) shipped at speed in late April. These models break the assumptions vLLM and SGLang dashboards make about prefill/decode shape: Mamba state-space layers have one runtime profile, Transformer attention has another, and MoE router blocks have a third, with all-to-all collective communication. The aggregate looks fine on a duty-cycle counter; the per-layer tail is full of hybrid MoE stalls nobody is decomposing. We trace one and decompose it.

NVIDIA Nemotron 3 Nano Omni (https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/, Apr 28, open multimodal MoE) is the most prominent recent shipment, but it is one of several. The shape is consistent: a hybrid Mamba-Transformer backbone with mixture-of-experts routing, tuned to claim higher throughput than pure-Transformer baselines at comparable parameter counts.

On the inference engine side, vLLM and SGLang already track per-request metrics: TTFT, ITL, throughput. They do not yet decompose those metrics by layer type. For pure-Transformer models, the decomposition is mostly uninteresting (every layer has roughly the same runtime profile). For hybrid MoE, the decomposition is the entire story.

We captured a 60-second inference trace on a TensorDock H100 running a hybrid Mamba-Transformer MoE checkpoint and broke the kernel-launch events down by layer type:

    layer type         n calls   p50 (us)   p99 (us)   tail ratio
    -------------------------------------------------------------
    Mamba SSM            3,840         42         95         2.3x
    Transformer attn     1,920         88        320         3.6x
    MoE all-to-all         640        180     12,400          69x

The aggregate runtime distribution looks moderate: median 50us, p99 300us.
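The tail ratio column in the table above appears to be p99 divided by p50. As a minimal sketch, the snippet below recomputes it from the three rows; the variable and function names are illustrative only and are not part of any tool mentioned here.

```python
# Rows copied from the table above: (layer type, p50 in us, p99 in us).
ROWS = [
    ("Mamba SSM", 42, 95),
    ("Transformer attn", 88, 320),
    ("MoE all-to-all", 180, 12_400),
]


def tail_ratio(p50_us, p99_us):
    """Return p99 / p50, the tail-heaviness measure assumed for the table."""
    return p99_us / p50_us


# Map each layer type to its tail ratio and pick the most tail-heavy one.
ratios = {name: tail_ratio(p50, p99) for name, p50, p99 in ROWS}
worst = max(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name:<18} {ratio:5.1f}x")
print("worst tail:", worst)
```

A ratio close to 1 means the layer is predictable; a ratio in the tens means a small fraction of its calls run far longer than the typical call.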
The decomposition shows that the MoE all-to-all calls are 69x tail-heavy, dominating wall time despite making up only a tenth of the calls (640 of 6,400). The Mamba layers are tight and predictable. The Transformer attention is bursty because of variable-length prefill. The MoE all-to-all is where the model spends its tail.

Throughout the same 60-second window, nvidia-smi reported 95-97% GPU utilization. DCGM SM_ACTIVE was at 92% mean. The vLLM-style metrics showed median TTFT of 220ms, within target. None of those signals captured the per-layer-type variance, because they are all duty-cycle or end-to-end measurements that aggregate over the run.

The MoE all-to-all stall pattern is a classic case of throughput bottlenecked by the slowest participant: when one expert routing pattern produces an unbalanced communication step, the entire batch waits. The eBPF trace catches it because every cudaLaunchKernel and cudaStreamSynchronize call is recorded with a timestamp and caller stack, so the per-layer decomposition is just a SQL query over the captured events.

Once the decomposition is in front of you, the engine choices change: expert-aware batching, communication-computation overlap, or dynamic expert placement. All three are reasonable engine-side fixes. None of them is reachable without per-layer-type runtime data.

Capture a trace under load:

    sudo ingero check
    sudo ingero trace --duration 60s --db /tmp/hybrid.db

    # Aggregate per-kernel-name runtime distribution
    ingero query --db /tmp/hybrid.db \
      "SELECT name, count(*),
              percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_us) AS p50,
              percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_us) AS p99
       FROM events
       WHERE source='cuda'
       GROUP BY name
       ORDER BY p99 DESC
       LIMIT 20"

The kernel names will give away which layer type each row belongs to (Mamba layers reach conv1d and selective_scan_fwd, Transformer attention reaches fused_attention, MoE all-to-all reaches nccl_all_to_all or framework-specific dispatch wrappers).
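Mapping kernel names to layer types and then computing per-layer percentiles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Ingero's implementation: the substring patterns are modeled on the kernel names mentioned above and will differ across frameworks and versions, the function names are hypothetical, and the events are synthetic rather than taken from the captured trace.

```python
import math
from collections import defaultdict

# Substring -> layer type. These patterns are assumptions; adjust them to the
# kernel names that actually appear in your own trace.
LAYER_PATTERNS = [
    ("selective_scan", "Mamba SSM"),
    ("conv1d", "Mamba SSM"),
    ("attention", "Transformer attn"),
    ("all_to_all", "MoE all-to-all"),
]


def layer_type(kernel_name):
    """Classify a kernel name by the first matching substring."""
    lowered = kernel_name.lower()
    for needle, layer in LAYER_PATTERNS:
        if needle in lowered:
            return layer
    return "other"


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 1] over a non-empty sorted list."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def decompose(events):
    """Group (kernel_name, duration_us) pairs by layer type and summarize.

    Durations are assumed to be positive, so p99 / p50 is well defined.
    """
    by_layer = defaultdict(list)
    for name, duration_us in events:
        by_layer[layer_type(name)].append(duration_us)
    summary = {}
    for layer, durations in by_layer.items():
        durations.sort()
        p50 = percentile(durations, 0.50)
        p99 = percentile(durations, 0.99)
        summary[layer] = {
            "n": len(durations),
            "p50": p50,
            "p99": p99,
            "tail_ratio": p99 / p50,
        }
    return summary


# Synthetic events for illustration only, not the captured trace.
events = (
    [("selective_scan_fwd_kernel", 40)] * 6
    + [("fused_attention_kernel", 90)] * 3
    + [("nccl_all_to_all", 180)] * 9
    + [("nccl_all_to_all", 12_000)]
)
summary = decompose(events)
for layer, stats in summary.items():
    print(layer, stats)
```

Rows exported from the per-kernel query can be fed to `decompose` in the same `(name, duration_us)` shape. Anything that lands in the "other" bucket is a hint that the pattern list needs another entry.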
Three public references for the hybrid-architecture regime: NVIDIA Nemotron 3 Nano Omni (https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/, April 28, 2026) is the most prominent recent open hybrid Mamba-Transformer MoE checkpoint and the source of the kernel-name patterns shown above; the Mamba paper (arXiv 2312.00752, https://arxiv.org/abs/2312.00752) describes the state-space layer's structural difference from Transformer attention; and the vLLM documentation (https://docs.vllm.ai/en/latest/) explains the prefill/decode batching model that the per-layer decomposition above breaks against.

When the architecture stops being uniform, the metrics that aggregate across the architecture stop being useful. Hybrid MoEs need per-layer-type decomposition to surface the regimes where one layer type dominates tail latency. eBPF gives the decomposition for free; the only thing missing is the SQL query that asks the right question. As more hybrid architectures ship, the dashboard layer will need to catch up, or the engineers running them will keep going under the dashboard with kernel-level traces instead.

Ingero: open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub: https://github.com/ingero-io/ingero

Open an issue if you are running hybrid Mamba-Transformer MoE inference and seeing tail latency the dashboards do not explain.