{"slug": "portable-vllm-model-inference-kernels-in-helion", "title": "Portable vLLM Model Inference Kernels in Helion", "summary": "Helion kernels were integrated into vLLM for FP8 inference using Qwen3 models and evaluated across NVIDIA H100 and B200 GPUs. The experiments demonstrated that Helion provides a productive PyTorch-native workflow for developing fused GPU kernels while delivering performance improvements for quantization, normalization, and fusion-heavy inference kernels. End-to-end benchmarks showed throughput gains across multiple serving scenarios, with additional optimization work underway for GEMM performance on Blackwell GPUs.", "body_md": "### TL;DR\n\n*Helion kernels were integrated into vLLM for FP8 inference using Qwen3 models and evaluated across NVIDIA H100 and B200 GPUs. The experiments show that Helion provides a productive PyTorch-native workflow for developing fused GPU kernels while delivering performance improvements for many quantization, normalization, and fusion-heavy inference kernels. End-to-end benchmarks demonstrated throughput gains across multiple serving scenarios, with additional optimization work underway for GEMM performance on Blackwell GPUs.*\n\n## Brief Background on vLLM and Helion\n\n[vLLM](https://docs.vllm.ai/en/latest/) is a high-performance inference and serving framework for large language models (LLMs). It is widely used for production LLM serving due to its strong throughput performance, efficient KV-cache management, continuous batching architecture, and support for advanced inference features such as speculative decoding, quantization, and distributed serving. 
Internally, vLLM relies heavily on custom GPU kernels, TorchInductor fusion, and optimized GEMM backends such as CUTLASS and DeepGEMM to achieve high inference efficiency across different hardware platforms.\n\n[Helion](https://helionlang.com/index.html) is a PyTorch-native, hardware-agnostic kernel DSL designed for writing high-performance kernels using a tile-programming model. Unlike lower-level CUDA programming, Helion provides a more natural PyTorch-syntax-centric development experience while still exposing low-level control over memory layout, tiling strategy, and kernel scheduling. You can think of it as PyTorch with tiles. If you know PyTorch or Triton, you already know most of Helion. Beyond a smooth authoring experience, another strength of Helion is its powerful ahead-of-time (AOT) autotuning infrastructure, which can explore a large kernel configuration space and automatically select optimized implementations for specific workloads and hardware targets.\n\n## vLLM Model Inference with Helion Kernels\n\nWe began by focusing on inference without tensor parallelism, using the Qwen3 model family with FP8 activation quantization enabled.\n\nOur goal was to evaluate whether Helion kernels can improve inference performance compared to the existing vLLM implementations.\n\nFor this experiment, we replaced nearly all forward-pass kernels involved in quantized inference with Helion implementations and benchmarked them at both the kernel level and the end-to-end serving level.\n\n### vLLM Forward Pass Fusion Pattern\n\nFor Qwen3 models, the unfused forward pass in vLLM executes the following sequence of kernels:\n\n- input_norm\n- fp8_quant\n- scaled_mm (qkv_proj)\n- split_qkv\n- q_norm\n- k_norm\n- rope\n- attention\n- fp8_quant\n- scaled_mm (out_proj)\n- post_attention_norm\n- fp8_quant\n- scaled_mm (gate_up)\n- silu_and_mul\n- fp8_quant\n- scaled_mm (down_proj)\n\n**Dynamic Per-Token Activation Quantization**\n\nAfter torch.compile and TorchInductor fusion passes are 
applied, the execution pattern becomes:\n\n- rms_norm + fp8_quant\n- scaled_mm (qkv_proj)\n- split_qkv + q_norm + k_norm\n- rope\n- attention\n- fp8_quant\n- scaled_mm (out_proj)\n- rms_norm + fp8_quant\n- scaled_mm (gate_up)\n- silu_and_mul + fp8_quant\n- scaled_mm (down_proj)\n\nNote that both `scaled_mm` and attention are registered as [PyTorch Custom Operators](https://docs.pytorch.org/tutorials/advanced/custom_ops_landing_page.html). Since these operators are opaque to TorchInductor, they form hard boundaries that prevent further compiler-side fusion.\n\n**Dynamic Per-Group Activation Quantization**\n\nWhen dynamic per-group activation quantization is enabled and DeepGEMM is selected for `scaled_mm_blockwise`, the execution pattern changes to:\n\n- rms_norm\n- fp8_quant (ue8m0)\n- scaled_mm (qkv_proj, DeepGEMM)\n- split_qkv + q_norm + k_norm\n- rope\n- attention\n- fp8_quant (ue8m0)\n- scaled_mm (out_proj, DeepGEMM)\n- rms_norm\n- fp8_quant (ue8m0)\n- scaled_mm (gate_up, DeepGEMM)\n- silu_and_mul\n- fp8_quant (ue8m0)\n- scaled_mm (down_proj, DeepGEMM)\n\nDeepGEMM uses UE8M0 activation quantization internally. 
In the current vLLM implementation, the `fuse_act_quant` and `fuse_norm_quant` passes are not supported for UE8M0 quantization, which prevents these additional fusions from occurring.\n\nIf DeepGEMM is unavailable and CUTLASS-based kernels are used instead, the execution pattern becomes similar to the dynamic per-token quantization case.\n\n### Helion Kernels Implementation\n\nFor this work, we implemented the following Helion kernels:\n\n- dynamic_per_token_scaled_fp8_quant\n- rms_norm_dynamic_per_token_quant\n- silu_and_mul_dynamic_per_token_quant\n- fused_qk_norm_rope\n- per_token_group_fp8_quant\n- rms_norm_per_block_quant\n- silu_and_mul_per_block_quant\n- scaled_mm\n- scaled_mm_blockwise\n\nThe `scaled_mm` and `scaled_mm_blockwise` kernels follow the existing Triton implementations in vLLM ([triton_scaled_mm](https://github.com/vllm-project/vllm/blob/v0.21.1rc0/vllm/model_executor/layers/quantization/compressed_tensors/triton_scaled_mm.py#L141), [w8a8_triton_block_scaled_mm](https://github.com/vllm-project/vllm/blob/v0.21.1rc0/vllm/model_executor/layers/quantization/utils/fp8_utils.py#L835)). `silu_and_mul_dynamic_per_token_quant` is a new fused kernel that combines `silu_and_mul` and `dynamic_per_token_quant` into a single kernel launch. The remaining kernels are Helion reimplementations of the existing `torch.ops._C` CUDA kernels used by vLLM.\n\n### vLLM Helion Kernel Integration\n\nWe integrated these kernels using the [vLLM Helion kernel integration framework](https://github.com/vllm-project/vllm/issues/32219), which provided:\n\n- Autotuning infrastructure\n- Config management\n- Kernel registration\n- Runtime dispatching\n\nTo enable the Helion kernels, we manually updated the vLLM fusion passes to replace the default kernels with their Helion fused counterparts. 
After fusion, the forward-pass execution patterns became the following:\n\nFor per-token activation quantization:\n\n- rms_norm_dynamic_per_token_quant (helion)\n- scaled_mm (helion)\n- fused_qk_norm_rope (helion)\n- attention (default)\n- dynamic_per_token_scaled_fp8_quant (helion)\n- scaled_mm (helion)\n- rms_norm_dynamic_per_token_quant (helion)\n- scaled_mm (helion)\n- silu_and_mul_dynamic_per_token_quant (helion)\n- scaled_mm (helion)\n\nFor per-group activation quantization:\n\n- rms_norm_per_block_quant (helion)\n- scaled_mm_blockwise (helion)\n- fused_qk_norm_rope (helion)\n- attention (default)\n- per_token_group_fp8_quant (helion)\n- scaled_mm_blockwise (helion)\n- rms_norm_per_block_quant (helion)\n- scaled_mm_blockwise (helion)\n- silu_and_mul_per_block_quant (helion)\n- scaled_mm_blockwise (helion)\n\n### Autotuning\n\nWe used Helion’s default [LFBOTreeSearch](https://helionlang.com/api/autotuner.html#helion.autotuner.surrogate_pattern_search.LFBOTreeSearch) algorithm with the following configuration:\n\n```\ninitial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0\n```\n\nTo maximize performance, we autotuned kernels using shapes that exactly match the compile-time static dimensions of each model, such as hidden size and intermediate size. 
This is an advantage of the vLLM-Helion integration: it allows Helion to autotune, store, and dispatch configs for many different shapes, and the same advantage applies to real-world production use cases.\n\nFor the dynamic dimension (`num_tokens`), we autotuned across power-of-two values ranging from 1 to 8192.\n\nFor example, we autotuned the `scaled_mm` kernel for input tensors `[M, K] x [K, N]`, where\n\n- M ranges from 1 to 8192\n- (K, N) pairs correspond to the projection layers of each Qwen3 model.\n\n| Model | qkv_proj | out_proj | gate_up | down_proj |\n|---|---|---|---|---|\n| Qwen3-1.7B | [2048, 4096] | [2048, 2048] | [2048, 12288] | [6144, 2048] |\n| Qwen3-8B | [4096, 6144] | [4096, 4096] | [4096, 24576] | [12288, 4096] |\n| Qwen3-32B | [5120, 10240] | [5120, 5120] | [5120, 51200] | [25600, 5120] |\n\n*Tab. 1: Projection layer [K, N] dimensions for each Qwen3 model.*\n\nWe independently autotuned all kernels for each hardware platform under test.\n\n### Runtime Dispatching\n\nAt runtime, the [Helion integration framework](https://github.com/vllm-project/vllm/issues/32219) dispatched requests to the autotuned config most appropriate for the input shape.\n\nFor example, `scaled_mm` dispatching is based on the shapes of the two input matrices (M, K, N), where M is the runtime `num_tokens` of each batch of requests, rounded up to the next power of two. A similar strategy is applied to the other kernels.\n\n## Performance Evaluation – Kernel Level\n\nKernel-level benchmarking aims to evaluate the local speedup of each individual Helion kernel against its baseline. Specifically, we used CUTLASS as the baseline for `scaled_mm` and `scaled_mm_blockwise`. The other ops are compared against both the `torch.compile`'d vLLM implementation and the existing `torch.ops._C` kernels. 
This is because:\n\n- per-token quantization in vLLM uses `torch.compile` by default,\n- per-group quantization uses `torch.ops._C` CUDA implementations by default due to this [performance issue](https://github.com/vllm-project/vllm/issues/25094).\n\nFor the torch.compile baseline, we matched the vLLM compilation setup:\n\n```\ntorch.compile(\n    native_torch_impl,\n    fullgraph=True,\n    dynamic=False,\n    backend=\"inductor\",\n    options={\n        'enable_auto_functionalized_v2': False,\n        'size_asserts': False,\n        'alignment_asserts': False,\n        'scalar_asserts': False,\n        'combo_kernels': True,\n        'benchmark_combo_kernel': True\n    }\n)\n```\n\nNotably, enabling `'combo_kernels': True` is important because it allows TorchInductor to fuse multiple independent kernels into a single launch.\n\nFor kernel-level benchmarking, we enabled `CudaGraph` mode via `triton.testing.do_bench_cudagraph`, with proper warmup and repeated measurements, to reduce noise such as dispatch overhead, cold caches, and timing variation.\n\n| Kernel | Speedup against torch.compile (H100) | Speedup against torch.ops._C (H100) | Speedup against CUTLASS (H100) | Speedup against torch.compile (B200) | Speedup against torch.ops._C (B200) | Speedup against CUTLASS (B200) |\n|---|---|---|---|---|---|---|\n| dynamic_per_token_scaled_fp8_quant | 1.237x | 1.405x | N/A | 1.311x | 1.495x | N/A |\n| rms_norm_dynamic_per_token_quant | 1.180x | 1.802x | N/A | 1.240x | 1.969x | N/A |\n| silu_and_mul_dynamic_per_token_quant | 1.256x | N/A | N/A | 1.420x | N/A | N/A |\n| fused_qk_norm_rope | 1.383x | 1.204x | N/A | 1.133x | 1.155x | N/A |\n| per_token_group_fp8_quant | 1.423x | 1.408x | N/A | 1.150x | 1.446x | N/A |\n| rms_norm_per_block_quant | 1.674x | 2.055x | N/A | 1.424x | 2.128x | N/A |\n| silu_and_mul_per_block_quant | 1.731x | 2.269x | N/A | 1.483x | 2.325x | N/A |\n| scaled_mm | N/A | N/A | 1.080x | N/A | N/A | 0.739x |\n| 
scaled_mm_blockwise | N/A | N/A | 0.957x | N/A | N/A | 0.782x |\n\n*Tab. 2: A summary of the geometric-mean speedups achieved by Helion kernels.*\n\nFor non-GEMM kernels, Helion consistently demonstrates strong performance and outperforms both TorchInductor-generated kernels and the existing vLLM CUDA implementations.\n\nFor GEMM workloads (`scaled_mm` and `scaled_mm_blockwise`), results were more mixed:\n\n- On H100, `scaled_mm` outperformed CUTLASS (1.080x), while `scaled_mm_blockwise` was slightly slower (0.957x).\n- On B200, both GEMM kernels currently lagged behind CUTLASS.\n\nThe primary limiting factor for B200 is the performance of Triton-generated GEMM kernels on Blackwell GPUs rather than the Helion programming model itself. Helion currently relies on Triton code generation for these kernels, and the observed performance gap largely reflects the current state of Triton GEMM performance on Blackwell hardware. Ongoing work on Helion’s CuteDSL backend is expected to further improve GEMM performance on Blackwell.\n\n## Performance Evaluation – End-to-End Model Level\n\nEnd-to-end model-level benchmarking, on the other hand, highlights the user-visible impact of Helion kernels. 
We picked three variants of the Qwen3 model for this purpose:\n\n- Qwen3-1.7B\n- Qwen3-8B\n- Qwen3-32B\n\n`CudaGraph` is enabled for all model-level benchmarks. The traffic patterns vary `num_tokens` from 1 to 8192 in powers of two for all three Qwen3 models.\n\nTo construct the traffic pattern, we used the built-in vLLM serving benchmark with random input data.\n\nTo minimize noise from prefix caching effects, we:\n\n- disabled prompt shuffling,\n- restarted the vLLM server before each benchmark run.\n\nHere is an example command:\n\n```\nvllm serve --model $MODEL --max-num-seqs $BATCH_SIZE --tensor-parallel-size 1 --compilation-config '{\"max_cudagraph_capture_size\": 8192, \"custom_ops\": [\"+quant_fp8\"], \"pass_config\": {\"fuse_norm_quant\": true, \"fuse_act_quant\": true, \"enable_qk_norm_rope_fusion\": true}}' \n\nvllm bench serve \\\n  --backend vllm \\\n  --model $MODEL \\\n  --endpoint /v1/completions \\\n  --dataset-name random \\\n  --num-prompts $NUM_PROMPTS \\\n  --max-concurrency $BATCH_SIZE \\\n  --input-len 512 \\\n  --output-len 600 \\\n  --num-warmups $NUM_WARMUPS \\\n  --disable-shuffle\n```\n\n`max_cudagraph_capture_size` was set to 8192 to match the default `max_num_batched_tokens`, ensuring all execution paths were CUDA-graph captured.\n\nAll workloads are evaluated on two NVIDIA GPU platforms:\n\n- NVIDIA H100\n- NVIDIA B200\n\nTo gain more insight into where performance improvements come from, we grouped the Helion kernels into three categories and benchmarked them independently as well as in combinations.\n\n- **fp8_quant**: FP8 quantization kernels and fused quant kernels\n- **qk_norm_rope**: the `fused_qk_norm_rope` kernel\n- **scaled_mm**: the `scaled_mm` or `scaled_mm_blockwise` kernel\n\n#### Dynamic per-token activation quantization\n\nWe used the following checkpoints:\n\n- RedHatAI/Qwen3-1.7B-FP8-dynamic\n- RedHatAI/Qwen3-8B-FP8-dynamic\n- RedHatAI/Qwen3-32B-FP8-dynamic\n\n*Fig. 
1: Total throughput speedup on H100 with per-token activation quantization enabled, using the default vLLM setup as the baseline.*\n\nFor the 1.7B model, the results show approximately 1.05x end-to-end throughput improvement on H100 when all Helion kernel groups are enabled. For the 8B model, the improvement is most pronounced around batch size 32, which aligns with the kernel-level observations where Helion `scaled_mm` achieves its strongest performance around `num_tokens = 32`.\n\nWe also evaluated speculative decoding scenarios where the effective decode-phase `num_tokens` naturally falls into this performance sweet spot.\n\nUsing:\n\n- RedHatAI/Qwen3-8B-speculator.eagle3\n- RedHatAI/Qwen3-32B-speculator.eagle3\n\nwe observed up to approximately 1.09x end-to-end throughput improvement when all Helion kernels were enabled.\n\n| Batch Size | Model | # Speculative Tokens (per-pos acc rate) | Helion TTFT (mean, ms) | Default TTFT (mean, ms) | TTFT Speedup | Helion TPOT (mean, ms) | Default TPOT (mean, ms) | TPOT Speedup | Helion Total Throughput (tok/s) | Default Total Throughput (tok/s) | Total Throughput Speedup |\n|---|---|---|---|---|---|---|---|---|---|---|---|\n| 16 | Qwen3-8B | 1 (47%) | 34.75 | 39.93 | 1.15x | 4.63 | 5.01 | 1.08x | 6,314.86 | 5,817.23 | 1.09x |\n| 16 | Qwen3-8B | 3 (35%, 25%, 15%) | 38.46 | 51.18 | 1.33x | 4.40 | 4.63 | 1.05x | 6,616.60 | 6,261.1 | 1.06x |\n| 8 | Qwen3-32B | 2 (24%, 10%) | 81.92 | 100.93 | 1.23x | 13.29 | 14.37 | 1.08x | 1,101.61 | 1,018.32 | 1.08x |\n| 8 | Qwen3-32B | 3 (24%, 10%, 4%) | 83.01 | 104.73 | 1.26x | 13.33 | 14.21 | 1.07x | 1,100.04 | 1,030.51 | 1.07x |\n\n*Tab. 3: End-to-end benchmark results on H100 with per-token activation quantization and speculative decoding enabled. Acceptance rates for speculative tokens are reported in parentheses.*\n\nOn NVIDIA B200, we enabled only the `fp8_quant` kernel group during end-to-end evaluation. 
The remaining kernel groups either:\n\n- underperformed relative to the baseline (a Triton limitation for Blackwell GEMMs), or\n- showed inconsistent gains across traffic patterns.\n\nEven with only the quantization-related kernels enabled, we still observed meaningful throughput improvements across all tested Qwen3 model sizes.\n\n*Fig. 2: Total throughput speedup on B200 with per-token activation quantization enabled, using the default vLLM setup as the baseline.*\n\n#### Dynamic per-group activation quantization\n\nFor per-group activation quantization, we used the following checkpoints:\n\n- Qwen/Qwen3-1.7B-FP8\n- Qwen/Qwen3-8B-FP8\n- Qwen/Qwen3-32B-FP8\n\nFor per-group activation quantization, DeepGEMM is the default backend for blockwise FP8 GEMM on both H100 and B200. However, our current per-group Helion quantization kernels are not yet compatible with the UE8M0 quantization format required by DeepGEMM. Therefore, for this experiment, we forced vLLM to use CUTLASS as the linear backend.\n\nThis means the baseline in this section is **not** the default vLLM configuration. However, the comparison is still meaningful because all runs use the same CUTLASS kernels for the linear layers. As a result, the measured differences come from the non-GEMM kernels being evaluated, such as FP8 quantization and fused quantization kernels, rather than from changes in the linear backend.\n\nThe following figures show that enabling only the small Helion kernels still produced approximately 1.05x end-to-end throughput improvement across all workloads.\n\n*Fig. 3: Total throughput speedup on H100 and B200 with per-group activation quantization enabled, using the default vLLM setup with the linear layer backend replaced by CUTLASS as the baseline.*\n\n## Resources\n\nFor reproducibility and further exploration, all Helion kernel implementations discussed in this post are linked in the corresponding GitHub [issue](https://github.com/vllm-project/vllm/issues/32962). 
The same issue also includes the vLLM branches used in our experiments for reproducing the reported end-to-end benchmark results.\n\n## Caveats\n\nDuring our experiments, the majority of engineering time was spent on kernel autotuning. For large kernels such as `scaled_mm`, running a full-effort autotuning sweep across all three model sizes, covering a total of [168](https://github.com/xiaohongchen1991/vllm/blob/91142591ec0b2da967c600599421ee60fed4f6ca/vllm/kernels/helion/ops/scaled_mm.py#L33-L50) distinct input shapes, can take an entire day, as Helion automatically generates and benchmarks thousands of candidate kernel implementations for each shape. Initial [research](https://github.com/vllm-project/vllm/commit/5bc478ccee9bae4056aeae9953861fe587265e3f#diff-be77e79f35962c7bc20c44638613a5fdca7bb745b987888b4c63dd7557dd4207) suggests that exhaustive per-shape autotuning and dispatching may not always be necessary, and that reducing the number of specialization buckets may lower autotuning cost with minimal performance degradation at runtime. The Helion team is actively exploring additional techniques to further reduce tuning time, including search-space reduction strategies and LLM-guided autotuning approaches.\n\nAnother caveat is that Helion runtime dispatching itself introduces tens of microseconds of CPU overhead per kernel launch. For small kernels, this overhead can dominate the end-to-end latency. As a result, CUDA graph capture and replay are essential for achieving optimal performance with Helion kernels. The Helion team is actively working to reduce dispatch latency for runs that do not use CudaGraph mode.\n\n## Conclusion\n\nHelion provides a natural, PyTorch-syntax-centric approach for writing kernels in a tile-programming style. It significantly simplifies kernel development and reduces implementation effort. 
In our experiments, most kernels could be implemented and validated within a single day, demonstrating that Helion is a practical DSL for rapidly developing new kernels and exploring kernel fusion opportunities.\n\nCombined with its powerful AOT autotuning capability, Helion demonstrated strong potential for achieving high performance. Our experiments show that Helion kernels outperform the default vLLM implementations in most cases. For GEMM kernels, there is still room for improvement to match or exceed CUTLASS performance, particularly on Blackwell GPUs. The teams are actively working on this by improving Triton code generation and introducing alternative backends such as CuteDSL.\n\n## Acknowledgments\n\nThis work was supported by many contributors across the OCTO and vLLM teams at Red Hat, as well as the Helion team at Meta. In particular, we would like to thank our colleagues Luka Govedič, Richard Zou, and Will Feng for their feedback and support throughout this work.", "url": "https://wpnews.pro/news/portable-vllm-model-inference-kernels-in-helion", "canonical_source": "https://pytorch.org/blog/portable-vllm-model-inference-kernels-in-helion/", "published_at": "2026-06-10 17:00:13+00:00", "updated_at": "2026-06-11 17:38:43.296442+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-chips", "machine-learning", "ai-tools"], "entities": ["vLLM", "Helion", "NVIDIA", "H100", "B200", "Qwen3", "CUTLASS", "DeepGEMM"], "alternates": {"html": "https://wpnews.pro/news/portable-vllm-model-inference-kernels-in-helion", "markdown": "https://wpnews.pro/news/portable-vllm-model-inference-kernels-in-helion.md", "text": "https://wpnews.pro/news/portable-vllm-model-inference-kernels-in-helion.txt", "jsonld": "https://wpnews.pro/news/portable-vllm-model-inference-kernels-in-helion.jsonld"}}