{"slug": "show-hn-auto-gpu-kernel-autonomous-gpu-kernel-discovery-and-optimizer", "title": "Show HN: Auto GPU Kernel – Autonomous GPU-kernel discovery and optimizer", "summary": "A new open-source tool, Auto GPU Kernel, autonomously discovers and optimizes GPU kernels, achieving a 34.93x average speedup to rank first in the DeepSeek Sparse Attention track of the MLSys 2026 FlashInfer AI Kernel Generation Contest. The tool uses an agentic loop with Claude Code CLI to iteratively profile, research, and optimize kernels, logging each experiment and adapting its optimization frequency as improvements diminish.", "body_md": "Autonomous GPU-kernel discovery & optimizer.\n\nRanked #1 on [MLSys 2026 - FlashInfer AI Kernel Generation Contest](https://mlsys26.flashinfer.ai/) for the *DeepSeek Sparse Attention (DSA)* track with an average speedup of 34.93x. Submissions can be found at:\n\n| Kernel | Runtime (ms) |\n|---|---|\n| [dsa_topk_indexer_fp8_h64_d128_topk2048_ps64](/Dogacel/auto-gpu-kernel/blob/main/dsa_topk_indexer_fp8_h64_d128_topk2048_ps64): DSA TopK Indexer | |\n\nCopy the `template` directory into a separate folder or git repository so that your agents work in an isolated environment.\n\nThe kernel agent is compatible with the [FlashInfer](https://github.com/flashinfer-ai/flashinfer) format and can run in the cloud on [Modal](https://modal.com/) without a local GPU. It requires the [Claude Code CLI](https://use-claude.com/index).\n\n```\n# Python env\nconda create -n fi-bench python=3.12\nconda activate fi-bench\npip install flashinfer-bench modal\n\n# One-time environment setup\nmodal setup\nmodal volume create flashinfer-trace\nmodal volume put flashinfer-trace /path/to/flashinfer-trace/\n```\n\nTo get started, clone the [MLSys-2026 Contest Dataset](https://huggingface.co/datasets/flashinfer-ai/mlsys26-contest). 
To change the kernel you are implementing, please refer to the [FlashInfer-Trace - Bring Your Own Kernel](https://bench.flashinfer.ai/docs/tutorials/bring-your-own-kernel) guide.\n\nImportant\n\nMake sure you update `CLAUDE.md` to describe the kernel you are optimizing. The example in the template is customized for sparse attention. `optimize.md` and `benchmark.md` also have some parameters tuned for sparse attention, such as the number of test cases to run as a sanity check. You can ask an agent to help you adjust them.\n\nTo run one iteration:\n\n```\nclaude --dangerously-skip-permissions -p \"/optimize\"\n```\n\nOr you can launch interactive mode by running `claude --dangerously-skip-permissions`, selecting the right model and thinking mode, and entering `/loop Run /optimize every 15 minutes`.\n\nThat's it. The loop runs indefinitely: each iteration picks one optimization, benchmarks it, logs an experiment folder, and continues. Stop with `Ctrl+C` when you want to step in. As the agent struggles to find new optimizations, it will make its schedule less frequent.\n\nFor more details on the agentic loop, please refer to the technical report.\n\nAgents:\n\n- Profiler\n- Research\n- Workload inspector\n\n| Command | Purpose |\n|---|---|\n| `/optimize` | Main loop |\n| `/benchmark <quick\\|stride N\\|full>` | One-shot Modal run |\n| `/log-experiment` | Snapshot + write `result.md` + update index |\n\nSee `CLAUDE.md` for rules and `.claude/commands/` for full command specs.\n\n- `solution/triton/sparse_fused.py`: the kernel being optimized (overwritten each iteration)\n- `experiments/exp_N/`: snapshot + results for iteration N\n- `experiments/summary.md`: master index, one row per iteration\n- `experiments/LESSONS.md`: durable cross-experiment findings", "url": "https://wpnews.pro/news/show-hn-auto-gpu-kernel-autonomous-gpu-kernel-discovery-and-optimizer", "canonical_source": "https://github.com/Dogacel/auto-gpu-kernel", "published_at": 
"2026-05-26 04:23:12+00:00", "updated_at": "2026-05-26 04:37:47.029355+00:00", "lang": "en", "topics": ["machine-learning", "artificial-intelligence", "ai-research", "ai-infrastructure", "ai-tools"], "entities": ["FlashInfer", "MLSys 2026", "DeepSeek", "Modal", "Claude Code CLI", "Hugging Face", "Dogacel"], "alternates": {"html": "https://wpnews.pro/news/show-hn-auto-gpu-kernel-autonomous-gpu-kernel-discovery-and-optimizer", "markdown": "https://wpnews.pro/news/show-hn-auto-gpu-kernel-autonomous-gpu-kernel-discovery-and-optimizer.md", "text": "https://wpnews.pro/news/show-hn-auto-gpu-kernel-autonomous-gpu-kernel-discovery-and-optimizer.txt", "jsonld": "https://wpnews.pro/news/show-hn-auto-gpu-kernel-autonomous-gpu-kernel-discovery-and-optimizer.jsonld"}}