# Maybe the agents shouldn’t write the kernels

> Source: <https://ianbarber.blog/2026/05/27/maybe-the-agents-shouldnt-write-the-kernels/>
> Published: 2026-05-27 15:47:02+00:00

A thing you can do is take the most performance and correctness sensitive part of your stack and just ask a chatbot to write it for you. They will sometimes get it right!

Back towards the end of 2024 Ouyang et al at Stanford attempted to benchmark how often that happened with [KernelBench](https://arxiv.org/abs/2502.10517). DeepSeek R1 could one-shot 12% of simple ops, 36% of fused operators, and 2% of whole architectures. Still, things have moved on a bit in the past 18 months, and Han, Zhang et al. 1 extended the idea in

[KernelBench-X](https://arxiv.org/abs/2605.04956v2). They found:

- Writing correct kernels and performant kernels is somewhat decoupled. You can refine kernels and that mostly helps with
*correctness*: the models got more of the tasks compiling, but drops the average speedup on the way. - What you ask for matters more than how you ask. The category of the task explained 3x more of the variance than switching between different agents or other method varieties.

“Together, these results indicate that the capability boundary of current LLM-based kernel generation is not a single wall but a sequence of distinct barriers – compilability, semantic correctness, hardware efficiency and performance portability – each requiring different mechanisms to clear”

In one particular area they tried getting the models to write quantization kernels, an area known for needing numerical precision. They got 0 out of 30. The models produced running kernels, just not kernels that were, you know, right.

One thing that did stand out to me was that a lot of the baselines were eager PyTorch, so I decided to run an [experiment](https://github.com/ianbarber/kernelbench-agents-vs-compile) myself. How do these models do against a compiler, not just eager?

I took a popular model (Qwen, naturally), ran it through torch.compile with minimal settings on my DGX Spark and identified three kernels that were eating big chunks of the wall time: SwiGLU, residual+RMSNorm and the SDPA prelude. I then had ChatGPT, Claude and Kimi 2 take a run at writing those kernels for the hardware.

The results were an absolute blowout: SwiGLU 1.06x, RMSNorm 1.21x, SDPA prelude 3.91x. For the latter, Kimi stacked up three different weight matrices into one and fused multiple matmuls together. It was very impressive stuff.

Then, another suspiciously well-timed paper arrived, FASTKERNELS from Snowflake. Rather than benchmarking against PT Eager or against single-operator references, they wanted to test how the models did on real model-serving problems, with a focus on the end-to-end speedups. Their takeaway:

”agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems.”

All of those issues hurt end-to-end performance when you put them in a real model-serving context. Of the three strong kernel-generation agents 3 they tried, none beat the production baselines

“in contrast to the supra-unity numbers these agents have reported on operator-level benchmarks whose reference is PyTorch eager.”

Taking another look at my vibed-up experiment results, as the FASTKERNELS folks may have suggested, there was a catch. Several catches.

The baseline, it turned out, spent an awful lot of time doing… kernel dispatch. Even getting 3.91x speedup on SDPA prelude led to an end-to-end model speedup of… 1.007x. Not quite as exciting.

You also had to be very, very careful about *how* the agents were getting speedups.

For example, the initial correctness check accepted anything within `cos_sim >= 0.95`

of the reference kernel. Codex “won” the SwiGLU round by replacing `sigmoid(x)`

with `clamp(0.21*x + 0.5, 0, 1)`

, a straight line which diverges from sigmoid everywhere except a narrow band near zero.

It turns out this kind of thing is pretty common. The FASTKERNELS folks found a case where an agent needed to write an all-reduce kernel for cross-GPU synchronization. The test harness they were using was single GPU so the agent just no-op’d it, replacing the all-reduce with a straight tensor copy. This *“passed its checker but produces the wrong sum on every scenario of our 4-rank NCCL+Gloo harness.”*

Even when the kernels are right, and fast, it doesn’t mean they are… good? Several of the generated kernels in my experiment were somewhat unshippable due to hardcoded shapes or silent global mutations. FASTKERNELS found similar things:

“L2 failures are dominated by syntactically valid kernels that respect the per-tensor signature but violate the surrounding production contract.”

Which I think is the academic way of saying they wouldn’t ship those either.

If you get your verification of the problem wrong in the harness, you will get a kernel optimized for *the harness*. Use the wrong contract and your kernel will be wrong in exactly the shape of your wrongness.

Still, a small win is a win, right! My original run had agents outperforming `torch.compile`

by about 2.6%. At that point I had a friend take a look who immediately pointed out that I had hampered the compiler unrealistically, and suggested running on `max-autotune`

. This was especially unfair since the agents each got several cracks at the problem. Turns out, with that baseline the agents *lost* by 4.6%.

And, that’s pretty similar to what FASTKERNELS found. Across 88 tasks and three agents the best of theirs landed at about 0.94x 4 the performance of the production stack.

Fairly late in the day I decided to replicate the experiments I had run on the Spark on a 3090. That’s Ampere, `sm_86`

, an elder statesman of consumer GPUs at this point. It turns out that once again, some of the wins were just worse baselines. For example, Kimi tried the same SDPA-prelude matrix stacking as on the GB10, but on Ampere the 3.91x speedup turned into a 0.74x loss. The difference was *cuBLAS*: it was simply better tuned for the 3090 than the GB10, and did a much better job of utilizing all 82 SMs. The baseline Kimi had to beat was (relatively) higher.

The question of “do agents beat compilers” is hard to answer because what we are (roughly!) measuring is compiler maturity. Agents are most useful in exactly the window where a compiler is weakest: new silicon, untuned heuristics, and libraries that are still evolving 5.

As libraries improve, hardware is better understood, and compilers mature, the value of exploratory search diminishes: there are “right ways” and it’s better to just use them than create custom solutions. If an agent is identifying patterns reliably and repeatably, it may as well author a compiler pass and spend more tokens on the areas that *can’t* be as cleanly captured.

- I think these folks are associated with Tsinghua, but to be honest I am not entirely sure!
[↩︎](#7defa4b0-922b-46b5-b3eb-f6724c22e2de-link) - Each model ran in their respective coding harnesses. One fun takeaway was the wall-time for generating the kernels was a factor too. Kimi took at least 3x longer than the other agents, spending a lot of tokens on the way, but also generated the most performant kernels of the three on every task on Blackwell, which was not what I was expecting.
[↩︎](#b899f72d-8654-4c38-887e-0ccdc9d53d5a-link) - Codex, KernelAgent and Dr. Kernel, the latter of which I hadn’t heard of but has by far the best name.
[↩︎](#155c0e21-96c6-4e20-8827-d71c50442547-link) - Codex landed at 0.94x. KernelAgent at 0.78x. Dr. Kernel got 0.53x, but still billed my insurance.
[↩︎](#dcdfa1c0-e778-406d-ba96-9a4ab8ded5d6-link) - I suspect this is particularly pointed for the GB10, which is an unusual piece of hardware, and in particular has a lowish number of SMs.
[↩︎](#d4c2d701-97bf-4805-a7eb-75be4fa0532b-link)
