# Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look

> Source: <https://dev.to/ingero/hybrid-mamba-transformer-moes-hide-their-stalls-in-places-dashboards-do-not-look-56dg>
> Published: 2026-06-15 13:00:00+00:00

*A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE all-to-all collective stalls dominate the tail. The dashboards saw 96% GPU utilization the entire window.*

Hybrid Mamba-Transformer architectures (Nemotron 3 Nano Omni, Jamba and friends) shipped at speed in late April. These models break the assumptions vLLM and SGLang dashboards make about prefill/decode shape: Mamba state-space layers have one runtime profile, Transformer attention has another, MoE router blocks have a third (with all-to-all collective comm). The aggregate looks fine on a duty-cycle counter; the per-layer tail is full of hybrid MoE stalls nobody is decomposing. We trace one and decompose it.

[NVIDIA Nemotron 3 Nano Omni](https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/) (Apr 28, open multimodal MoE) is the most prominent recent shipment, but it is one of several. The shape is consistent: a hybrid Mamba-Transformer backbone with mixture-of-experts routing, tuned to claim higher throughput than pure-Transformer baselines at comparable parameter counts.

On the inference engine side, vLLM and SGLang already track per-request metrics: TTFT, ITL, throughput. They do not yet decompose those metrics by layer type. For pure-Transformer models, the decomposition is mostly uninteresting (every layer has roughly the same runtime profile). For hybrid MoE, the decomposition is the entire story.

We captured a 60-second inference trace on a TensorDock H100 running a hybrid Mamba-Transformer MoE checkpoint and broke the kernel-launch events down by layer type:

```
layer type      n calls   p50 (us)   p99 (us)   tail ratio
----------------------------------------------------------
Mamba SSM         3,840         42         95         2.3x
Transformer attn  1,920         88        320         3.6x
MoE all-to-all      640        180     12,400        69x
```

The aggregate runtime distribution looks moderate: median 50us, p99 300us. The decomposition shows that the MoE all-to-all calls have a p99 that is 69x their p50, dominating wall time despite making up only 640 of the 6,400 calls (10%). The Mamba layers are tight and predictable. The Transformer attention is bursty because of variable-length prefill. The MoE all-to-all is where the model spends its tail.
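The derived columns are easy to recompute by hand. As a minimal sketch, the Python below takes the three rows from the table above and derives the tail ratio (p99 divided by p50) and each layer type's share of the call count. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Rows copied from the table above: (layer type, n calls, p50 us, p99 us).
rows = [
    ("Mamba SSM", 3840, 42, 95),
    ("Transformer attn", 1920, 88, 320),
    ("MoE all-to-all", 640, 180, 12400),
]

total_calls = sum(n for _, n, _, _ in rows)


def summarize(rows):
    """Return {layer: (tail_ratio, share_of_calls)} for each table row."""
    out = {}
    for layer, n, p50, p99 in rows:
        out[layer] = (p99 / p50, n / total_calls)
    return out


summary = summarize(rows)
for layer, (tail, share) in summary.items():
    print(f"{layer:<18} tail ratio {tail:5.1f}x   share of calls {share:5.1%}")
```

Note that the tail ratio says nothing about how often a layer type runs, which is why a small share of calls can still own the tail.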

Throughout the same 60-second window, nvidia-smi reported 95-97% GPU utilization. DCGM SM_ACTIVE was at 92% mean. The vLLM-style metrics showed median TTFT 220ms – within target. None of those signals captured the per-layer-type variance, because they are all duty-cycle or end-to-end measurements that aggregate over the run.

The MoE all-to-all stall pattern is a classic case of throughput bottlenecked by the slowest variant: when one expert routing pattern produces an unbalanced communication step, the entire batch waits. The eBPF trace catches it because every `cudaLaunchKernel` and `cudaStreamSync` is recorded with timestamp + caller stack, so the per-layer decomposition is just a SQL query over the captured events.
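To make the "entire batch waits" point concrete, here is a toy Python model of one all-to-all step. It is a sketch under a stated assumption, not a measurement: step cost is taken to scale with the busiest rank's token count, since every rank waits for the slowest peer. The token counts and the function name `all_to_all_step_cost` are hypothetical.

```python
def all_to_all_step_cost(tokens_per_rank):
    """Model one all-to-all step as gated by the busiest rank.

    Assumption: step time scales with the largest per-rank token count,
    because every rank waits for the slowest peer before continuing.
    Returns (gating load, imbalance factor = max load / mean load).
    """
    mean_load = sum(tokens_per_rank) / len(tokens_per_rank)
    max_load = max(tokens_per_rank)
    return max_load, max_load / mean_load


# Hypothetical routing outcomes for the same 1,024 tokens across 4 ranks.
balanced = [256, 256, 256, 256]  # router spreads tokens evenly
skewed = [64, 64, 64, 832]       # one hot expert rank receives most tokens

for label, loads in (("balanced", balanced), ("skewed", skewed)):
    gate, imbalance = all_to_all_step_cost(loads)
    print(f"{label:<9} gating load {gate:4d} tokens, imbalance {imbalance:.2f}x")
```

In this model the total work is identical in both cases; only the distribution changes, and that alone stretches the step. This is the kind of variance an averaged utilization counter cannot show.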

Once the decomposition is in front of you, the engine choices change:

All three are reasonable engine-side fixes. None of them is reachable without per-layer-type runtime data.

Capture a trace under load:

```
sudo ingero check
sudo ingero trace --duration 60s --db /tmp/hybrid.db

# Aggregate per-kernel-name runtime distribution
ingero query --db /tmp/hybrid.db \
  "SELECT name, count(*),
          percentile_cont(0.5)  WITHIN GROUP (ORDER BY duration_us) AS p50,
          percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_us) AS p99
   FROM events
   WHERE source='cuda'
   GROUP BY name
   ORDER BY p99 DESC
   LIMIT 20"
```

The kernel names will give away which layer type each row belongs to: Mamba layers show up as `conv1d` and `selective_scan_fwd`, Transformer attention as `fused_attention`, and MoE all-to-all as `nccl_all_to_all` or framework-specific dispatch wrappers.
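Rolling the per-kernel rows up into per-layer-type rows can be sketched in a few lines of Python. The sketch assumes the events have already been exported as `(kernel_name, duration_us)` pairs, and it matches on the kernel-name substrings mentioned above. The pattern table, the function names, and the sample rows are all hypothetical; a real trace will likely need more patterns.

```python
from collections import defaultdict
from statistics import quantiles

# Substring -> layer type, using only the kernel names mentioned in the text.
# Assumption: substring matching is enough; real traces may need more patterns.
LAYER_PATTERNS = [
    ("selective_scan_fwd", "Mamba SSM"),
    ("conv1d", "Mamba SSM"),
    ("fused_attention", "Transformer attn"),
    ("nccl_all_to_all", "MoE all-to-all"),
]


def classify(kernel_name):
    """Map a kernel name to a layer type, or 'other' if nothing matches."""
    for pattern, layer in LAYER_PATTERNS:
        if pattern in kernel_name:
            return layer
    return "other"


def per_layer_percentiles(events):
    """events: iterable of (kernel_name, duration_us).

    Returns {layer: (n_calls, p50_us, p99_us)}.
    """
    buckets = defaultdict(list)
    for name, duration_us in events:
        buckets[classify(name)].append(duration_us)

    result = {}
    for layer, durations in buckets.items():
        if len(durations) < 2:
            # quantiles() needs at least two points on older Python versions.
            result[layer] = (len(durations), durations[0], durations[0])
            continue
        cuts = quantiles(durations, n=100, method="inclusive")
        result[layer] = (len(durations), cuts[49], cuts[98])  # p50, p99
    return result


# Hypothetical sample rows, only to exercise the functions.
sample = [
    ("selective_scan_fwd_kernel", 40), ("selective_scan_fwd_kernel", 44),
    ("causal_conv1d_fwd", 42),
    ("fused_attention_fwd", 80), ("fused_attention_fwd", 96),
    ("nccl_all_to_all", 170), ("nccl_all_to_all", 190), ("nccl_all_to_all", 12000),
    ("elementwise_add", 5),
]
stats = per_layer_percentiles(sample)
for layer, (n, p50, p99) in stats.items():
    print(f"{layer:<18} n={n}  p50={p50:.0f}us  p99={p99:.0f}us")
```

Keeping an explicit `other` bucket is deliberate: if it grows large, the pattern table is missing kernels and the per-layer numbers should not be trusted yet.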

Three public references for the hybrid-architecture regime: [NVIDIA Nemotron 3 Nano Omni](https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/) (April 28, 2026) is the most prominent recent open hybrid Mamba-Transformer MoE checkpoint and the source of the kernel-name patterns shown above; [the Mamba paper (arXiv 2312.00752)](https://arxiv.org/abs/2312.00752) describes the state-space layer’s structural difference from Transformer attention; and the [vLLM documentation](https://docs.vllm.ai/en/latest/) explains the prefill/decode batching model the per-layer decomposition above breaks against.

When the architecture stops being uniform, the metrics that aggregate across the architecture stop being useful. Hybrid MoEs need per-layer-type decomposition to surface the regimes where one layer type dominates tail latency. eBPF gives the decomposition for free; the only thing missing is the SQL query that asks the right question. As more hybrid architectures ship, the dashboard layer will need to catch up – or the engineers running them will keep going under the dashboard with kernel-level traces instead.

*Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. **[GitHub ⭐](https://github.com/ingero-io/ingero)** · **Open an issue** if you are running hybrid Mamba-Transformer MoE inference and seeing tail latency the dashboards do not explain.*
