Show HN: Auto GPU Kernel – Autonomous GPU-kernel discovery and optimizer


Source: github.com · Published: 2026-05-26T04:23Z · Topic: machine-learning

A new open-source tool, Auto GPU Kernel, autonomously discovers and optimizes GPU kernels, achieving a 34.93x average speedup to rank first in the DeepSeek Sparse Attention track of the MLSys 2026 FlashInfer AI Kernel Generation Contest. The tool uses an agentic loop with Claude Code CLI to iteratively profile, research, and optimize kernels, logging each experiment and adapting its optimization frequency as improvements diminish.


Autonomous GPU-kernel discovery & optimizer.

Ranked #1 in the MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track, with an average speedup of 34.93x. Submissions can be found at:

Kernel	Runtime (ms)
dsa_topk_indexer_fp8_h64_d128_topk2048_ps64 (DSA TopK Indexer)

Copy the template directory into a separate folder or git repository so that your agents work in an isolated environment.

The kernel agent is compatible with the FlashInfer format and can run in the cloud on Modal, without a local GPU. It requires the Claude Code CLI.

conda create -n fi-bench python=3.12
conda activate fi-bench
pip install flashinfer-bench modal

modal setup
modal volume create flashinfer-trace
modal volume put flashinfer-trace /path/to/flashinfer-trace/

To get started, clone the MLSys-2026 Contest Dataset. To change the kernel you are implementing, refer to the FlashInfer-Trace - Bring Your Own Kernel guide.

Important

Make sure you update CLAUDE.md to describe the kernel you are optimizing. The example in the template is customized for sparse attention. optimize.md and benchmark.md also have some parameters tuned for sparse attention, such as the number of test cases to run as a sanity check. You can ask an agent to help you adjust them.

To run one iteration:

claude --dangerously-skip-permissions -p "/optimize"
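For scripting, the one-shot command above could be wrapped in a small Python helper. This is a minimal sketch and not part of the project: the function names and the `workdir` parameter are hypothetical, and it assumes the `claude` executable is on your PATH. It uses only the flags shown in the command above.

```python
import subprocess
from pathlib import Path


def build_optimize_command(prompt: str = "/optimize") -> list[str]:
    """Return the argv for one non-interactive run (flags copied from the command above)."""
    return ["claude", "--dangerously-skip-permissions", "-p", prompt]


def run_one_iteration(workdir: Path) -> int:
    """Run a single /optimize iteration inside `workdir` and return the exit code.

    `workdir` is assumed to be your isolated copy of the template directory.
    """
    completed = subprocess.run(build_optimize_command(), cwd=workdir, check=False)
    return completed.returncode
```

Passing the command as a list avoids shell quoting issues around the `/optimize` prompt.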

Alternatively, launch interactive mode by running claude --dangerously-skip-permissions, select the model and thinking mode you want, and enter:

/loop Run /optimize every 15 minutes

That's it. The loop runs indefinitely: each iteration picks one optimization, benchmarks it, logs an experiment folder, and continues. Stop with Ctrl+C when you want to step in. As the agent struggles to find new optimizations, it will start scheduling iterations less frequently.
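The source does not say how the agent decides to slow down. As an illustration only, one simple policy with this behavior is a geometric backoff that resets after an improvement. Everything here except the 15-minute starting interval is an assumption: the function name, the multiplier, and the ceiling are hypothetical.

```python
def next_interval_minutes(
    current: float,
    improved: bool,
    base: float = 15.0,      # starting interval from the /loop example
    factor: float = 2.0,     # hypothetical backoff multiplier
    ceiling: float = 240.0,  # hypothetical upper bound
) -> float:
    """Return the wait before the next iteration.

    Reset to `base` after an improvement; otherwise back off geometrically,
    capped at `ceiling`.
    """
    if improved:
        return base
    return min(current * factor, ceiling)


# Simulate a run where only the first iteration finds a speedup.
interval = 15.0
history = []
for improved in [True, False, False, False, False, False]:
    interval = next_interval_minutes(interval, improved)
    history.append(interval)
print(history)  # [15.0, 30.0, 60.0, 120.0, 240.0, 240.0]
```

In the actual tool the agent adjusts its own schedule, so treat this as a mental model rather than a description of its logic.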

For more details on the agentic loop, please refer to the technical report.

Agents:

Profiler
Research
Workload inspector

Command	Purpose
`/optimize`	Main loop
`/benchmark <quick|stride N|...>`	One-shot Modal run
`/log-experiment`	Snapshot + write `result.md` + update index

See CLAUDE.md for rules and .claude/commands/ for full command specs.

solution/triton/sparse_fused.py - the kernel being optimized (overwritten each iteration)
experiments/exp_N/ - snapshot + results for iteration N
experiments/summary.md - master index, one row per iteration
experiments/LESSONS.md - durable cross-experiment findings


Read original on github.com → github.com/Dogacel/auto-gpu-kernel
