{"slug": "compiles-any-huggingface-model-into-a-single-persistent-megakernel", "title": "Compiles any HuggingFace model into a single persistent megakernel", "summary": "A developer open-sourced AutoMegakernel, a tool that compiles any HuggingFace model into a single persistent megakernel, reducing overhead by launching one kernel per forward pass. It includes a static validator to prevent deadlocks and races, and achieves up to 1.33x speedup on L4 GPUs for batch-1 int8 inference compared to CUDA-graphed cuBLAS bf16, though it loses on A100/H100.", "body_md": "i open-sourced automegakernel -- compiles any huggingface model into a single persistent megakernel\nbatch-1 decode is bandwidth-bound. normal execution launches one kernel per op and round-trips activations through HBM dozens of times a layer. that overhead is the whole problem\nautomegakernel compiles the entire forward pass into one launch. one launch = one forward = one token\nthe hard part: a single kernel across every SM, synced only by counters, is a deadlock/race minefield. so the core piece is a static validator that proves any schedule deadlock-free + race-free before launch. an agent can edit the schedule freely and can't ship a hanging kernel. 7160 adversarial schedules, 6091 unsafe, zero false accepts\none source retargets sm_80 / sm_90 / sm_120. 
reproduces huggingface greedy decode token-for-token on real smollm2-135m\nsearch-found int8 megakernel beats cuda-graphed cuBLAS bf16 at batch-1:\nL4 up to 1.33x\nL40S 1.25-1.27x.\nit loses on A100/H100 and we say so\nllama-family only for now:p\nsc:", "url": "https://wpnews.pro/news/compiles-any-huggingface-model-into-a-single-persistent-megakernel", "canonical_source": "https://twitter.com/Akashi203/status/2067379010762338681", "published_at": "2026-06-17 22:55:25+00:00", "updated_at": "2026-06-17 23:22:36.841036+00:00", "lang": "en", "topics": ["machine-learning", "ai-tools", "ai-infrastructure", "developer-tools"], "entities": ["HuggingFace", "AutoMegakernel", "cuBLAS", "L4", "L40S", "A100", "H100", "smollm2-135m"], "alternates": {"html": "https://wpnews.pro/news/compiles-any-huggingface-model-into-a-single-persistent-megakernel", "markdown": "https://wpnews.pro/news/compiles-any-huggingface-model-into-a-single-persistent-megakernel.md", "text": "https://wpnews.pro/news/compiles-any-huggingface-model-into-a-single-persistent-megakernel.txt", "jsonld": "https://wpnews.pro/news/compiles-any-huggingface-model-into-a-single-persistent-megakernel.jsonld"}}