Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look

Source: dev.to · Published 2026-06-15T13:00Z · Topic: large-language-models

A developer traced a hybrid Mamba-Transformer MoE inference run and found that MoE all-to-all collective stalls dominate the tail latency, with a 69x tail ratio, despite dashboards showing 96% GPU utilization. The per-layer decomposition reveals that these stalls are invisible to standard metrics like nvidia-smi and vLLM TTFT, which aggregate over the run. The developer suggests engine-side fixes such as expert-aware batching, communication-computation overlap, or dynamic expert placement to mitigate the issue.

A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE all-to-all collective stalls dominate the tail. The dashboards saw 96% GPU utilization the entire window.

Hybrid Mamba-Transformer architectures (Nemotron 3 Nano Omni, Jamba and friends) shipped at speed in late April. These models break the assumptions vLLM and SGLang dashboards make about prefill/decode shape: Mamba state-space layers have one runtime profile, Transformer attention has another, MoE router blocks have a third (with all-to-all collective comm). The aggregate looks fine on a duty-cycle counter; the per-layer tail is full of hybrid MoE stalls nobody is decomposing. We trace one and decompose it.

NVIDIA Nemotron 3 Nano Omni (Apr 28, open multimodal MoE) is the most prominent recent shipment, but it is one of several. The shape is consistent: a hybrid Mamba-Transformer backbone with mixture-of-experts routing, tuned to claim higher throughput than pure-Transformer baselines at comparable parameter counts.

On the inference engine side, vLLM and SGLang already track per-request metrics: TTFT, ITL, throughput. They do not yet decompose those metrics by layer type. For pure-Transformer models, the decomposition is mostly uninteresting (every layer has roughly the same runtime profile). For hybrid MoE, the decomposition is the entire story.

We captured a 60-second inference trace on a TensorDock H100 running a hybrid Mamba-Transformer MoE checkpoint and broke the kernel-launch events down by layer type:

layer type      n calls   p50 (us)   p99 (us)   tail ratio
----------------------------------------------------------
Mamba SSM         3,840         42         95         2.3x
Transformer attn  1,920         88        320         3.6x
MoE all-to-all      640        180     12,400        69x

The aggregate runtime distribution looks moderate: median 50us, p99 300us. The decomposition shows that the MoE all-to-all calls have a 69x tail ratio (p99 over p50), dominating wall time despite making up only 640 of the 6,400 calls, one-tenth of the total. The Mamba layers are tight and predictable. The Transformer attention is bursty because of variable-length prefill. The MoE all-to-all is where the model spends its tail.
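The tail ratios and call shares can be re-derived from the table with a few lines of Python. This is only a sketch of the arithmetic: the figures are copied from the table above, and the helper name `summarize` is ours, not part of any tool.

```python
# Re-derive tail ratio (p99 / p50) and call share from the table above.
# The figures are copied from the article's table; `summarize` is a
# hypothetical helper name used only for this illustration.
TABLE = {
    # layer type: (n calls, p50 in us, p99 in us)
    "Mamba SSM": (3840, 42, 95),
    "Transformer attn": (1920, 88, 320),
    "MoE all-to-all": (640, 180, 12400),
}


def summarize(table):
    """Return {layer: (share_of_calls, tail_ratio)} for each row."""
    total_calls = sum(n for n, _, _ in table.values())
    return {
        layer: (n / total_calls, p99 / p50)
        for layer, (n, p50, p99) in table.items()
    }


if __name__ == "__main__":
    for layer, (share, ratio) in summarize(TABLE).items():
        print(f"{layer:<18} {share:6.1%} of calls   tail ratio {ratio:5.1f}x")
```

The MoE row works out to 640 of 6,400 calls with a p99 roughly 69 times its median, while the other two layer types stay below 4x.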

Throughout the same 60-second window, nvidia-smi reported 95-97% GPU utilization. DCGM SM_ACTIVE was at 92% mean. The vLLM-style metrics showed median TTFT 220ms – within target. None of those signals captured the per-layer-type variance, because they are all duty-cycle or end-to-end measurements that aggregate over the run.
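To see why a run-wide percentile can miss this, consider a deliberately synthetic mix of durations, shaped to loosely echo the table rather than taken from the trace. The sketch uses a nearest-rank percentile; the sample values and function names are assumptions made for illustration.

```python
# Synthetic illustration: a run-wide p99 can sit far below the p99 of a
# minority group. The samples below are made up to echo the table's shape;
# they are NOT the captured trace.


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Durations in microseconds, grouped by (hypothetical) layer type.
groups = {
    "mamba": [42] * 3780 + [95] * 60,
    "attn": [88] * 1890 + [320] * 30,
    "moe": [180] * 630 + [12400] * 10,
}
everything = [d for durations in groups.values() for d in durations]

if __name__ == "__main__":
    print("aggregate p99:", percentile(everything, 99))
    for name, durations in groups.items():
        print(f"{name} p99:", percentile(durations, 99))
```

In this toy mix the aggregate p99 is 180 while the MoE group's p99 is 12,400, because the slow calls are well under 1% of all samples even though they exceed 1% of the MoE samples.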

The MoE all-to-all stall pattern is a classic case of throughput bottlenecked by the slowest variant: when one expert routing pattern produces an unbalanced communication step, the entire batch waits. The eBPF trace catches it because every cudaLaunchKernel and cudaStreamSync is recorded with timestamp + caller stack, so the per-layer decomposition is just a SQL query over the captured events.
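The "entire batch waits" effect can be shown with a toy model in which a synchronous exchange finishes only when its most heavily loaded rank finishes. The token counts, the per-token cost, and the function names are hypothetical; real all-to-all cost also depends on topology and message sizes.

```python
# Toy model of a synchronous all-to-all step: every rank waits for the
# slowest one, so step time follows the maximum load, not the mean.
# All numbers and names here are hypothetical, for illustration only.


def step_time_us(tokens_per_rank, us_per_token):
    """Time for one collective step under the 'slowest rank wins' model."""
    return max(tokens_per_rank) * us_per_token


def imbalance(tokens_per_rank):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(tokens_per_rank) / len(tokens_per_rank)
    return max(tokens_per_rank) / mean


balanced = [256, 256, 256, 256]
skewed = [64, 64, 64, 832]  # same 1,024 tokens, one hot expert rank

if __name__ == "__main__":
    for name, load in (("balanced", balanced), ("skewed", skewed)):
        print(name, step_time_us(load, 0.5), "us, imbalance", imbalance(load))
```

Both batches carry the same number of tokens, yet under this model the skewed one takes 3.25 times as long, which is the shape of stall a mean-based counter does not show.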

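Once the decomposition is in front of you, the engine choices change. The candidates are expert-aware batching, communication-computation overlap, and dynamic expert placement.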

All three are reasonable engine-side fixes. None of them is reachable without per-layer-type runtime data.

Capture a trace under load:

sudo ingero check
sudo ingero trace --duration 60s --db /tmp/hybrid.db

ingero query --db /tmp/hybrid.db \
  "SELECT name, count(*), percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_us) AS p50, percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_us) AS p99 FROM events WHERE source='cuda' GROUP BY name ORDER BY p99 DESC LIMIT 20"

The kernel names will give away which layer type each row belongs to (Mamba layers reach conv1d and selective_scan_fwd, Transformer attention reaches fused_attention, and MoE all-to-all reaches nccl_all_to_all or framework-specific dispatch wrappers).
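One way to turn the query output into a per-layer table is to bucket rows by kernel-name substring. The sketch below assumes the substrings named above actually appear in the kernel names; the bucket labels, the `classify` and `bucket` helpers, and the sample rows are illustrative, and framework-specific dispatch wrappers would need their own patterns.

```python
# Bucket kernel names into layer types by substring match.
# The patterns come from the kernel names mentioned in the text; the
# helpers, bucket labels, and sample rows are hypothetical.
from collections import defaultdict

PATTERNS = [
    ("MoE all-to-all", ("nccl_all_to_all",)),
    ("Mamba SSM", ("selective_scan_fwd", "conv1d")),
    ("Transformer attn", ("fused_attention",)),
]


def classify(kernel_name):
    """Return the layer-type label for a kernel name, or 'other'."""
    lowered = kernel_name.lower()
    for layer, needles in PATTERNS:
        if any(needle in lowered for needle in needles):
            return layer
    return "other"


def bucket(rows):
    """Group (kernel_name, duration_us) rows into {layer: [durations]}."""
    buckets = defaultdict(list)
    for name, duration_us in rows:
        buckets[classify(name)].append(duration_us)
    return dict(buckets)


if __name__ == "__main__":
    sample = [("selective_scan_fwd_kernel", 42), ("nccl_all_to_all", 180), ("gemm", 10)]
    print(bucket(sample))
```

Anything that lands in "other" is a hint that a pattern is missing, which is worth checking before trusting the per-layer percentiles.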

Three public references for the hybrid-architecture regime: NVIDIA Nemotron 3 Nano Omni (April 28, 2026) is the most prominent recent open hybrid Mamba-Transformer MoE checkpoint and the source of the kernel-name patterns shown above; the Mamba paper (arXiv 2312.00752) describes the state-space layer’s structural difference from Transformer attention; and the vLLM documentation explains the prefill/decode batching model the per-layer decomposition above breaks against.

When the architecture stops being uniform, the metrics that aggregate across the architecture stop being useful. Hybrid MoEs need per-layer-type decomposition to surface the regimes where one layer type dominates tail latency. eBPF gives the decomposition for free; the only thing missing is the SQL query that asks the right question. As more hybrid architectures ship, the dashboard layer will need to catch up – or the engineers running them will keep going under the dashboard with kernel-level traces instead.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. Open an issue on GitHub if you are running hybrid Mamba-Transformer MoE inference and seeing tail latency the dashboards do not explain.

Read original on dev.to → dev.to/ingero/hybrid-mamba-transformer-moes-hide…
