# Compiles any HuggingFace model into a single persistent megakernel

> Source: <https://twitter.com/Akashi203/status/2067379010762338681>
> Published: 2026-06-17 22:55:25+00:00

i open-sourced automegakernel -- compiles any huggingface model into a single persistent megakernel
batch-1 decode is bandwidth-bound. normal execution launches one kernel per op and round-trips activations through HBM dozens of times a layer. that overhead is the whole problem
automegakernel fuses the entire forward pass into one launch. one launch = one forward = one token
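a back-of-envelope sketch of why launch count matters when decode is bandwidth-bound. this is not from the repo: `decode_time_us` is a hypothetical helper, and the bandwidth, launch-overhead, and launch-count values are assumed placeholders chosen only to show the arithmetic. it also ignores activation round-trips through HBM, so it understates the per-op cost.

```python
def decode_time_us(n_params, bytes_per_param, bandwidth_gb_s,
                   n_launches, launch_overhead_us):
    """Rough per-token time: stream every weight once, plus fixed launch cost.

    Simplified model (assumption): weights are read exactly once per token
    and each kernel launch adds a constant overhead.
    """
    weight_bytes = n_params * bytes_per_param
    stream_us = weight_bytes / (bandwidth_gb_s * 1e9) * 1e6
    return stream_us + n_launches * launch_overhead_us


# All numbers below are illustrative assumptions, not measurements.
N_PARAMS = 135e6          # parameter count implied by the model name
BANDWIDTH_GB_S = 300.0    # assumed memory bandwidth
LAUNCH_US = 5.0           # assumed per-launch overhead
OPS_PER_TOKEN = 300       # assumed number of per-op kernels per forward pass

per_op_bf16 = decode_time_us(N_PARAMS, 2, BANDWIDTH_GB_S, OPS_PER_TOKEN, LAUNCH_US)
fused_bf16 = decode_time_us(N_PARAMS, 2, BANDWIDTH_GB_S, 1, LAUNCH_US)
fused_int8 = decode_time_us(N_PARAMS, 1, BANDWIDTH_GB_S, 1, LAUNCH_US)

print(f"per-op bf16: {per_op_bf16:.0f} us/token")
print(f"fused  bf16: {fused_bf16:.0f} us/token")
print(f"fused  int8: {fused_int8:.0f} us/token")
```

under this toy model the fixed launch cost can rival the time spent streaming weights for a small model, and halving bytes per parameter halves the streaming term.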
the hard part: a single kernel running across every SM, synced only by counters, is a deadlock/race minefield. so the core piece is a static validator that proves a schedule deadlock-free + race-free before launch. an agent can edit the schedule freely and can't ship a hanging kernel. tested on 7160 adversarial schedules: 6091 were unsafe, zero false accepts
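to make the idea concrete, here is a minimal sketch of what a static schedule check can look like. it is not automegakernel's validator. it assumes a much simpler model than real counter-based sync: each task names the tasks it waits on, tasks on the same SM run in program order, and task names are unique. `Task` and `validate` are hypothetical names. deadlock shows up as a wait on a missing task or a cycle in the happens-before graph, and a race as two conflicting buffer accesses with no ordering between them.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Task:
    name: str
    sm: int                           # SM whose instruction stream runs this task
    waits: tuple = ()                 # names of tasks that must finish first
    reads: frozenset = frozenset()    # buffers read
    writes: frozenset = frozenset()   # buffers written


def validate(schedule):
    """Return a list of error strings; an empty list means the schedule passed."""
    names = {t.name for t in schedule}
    errors = [f"deadlock: {t.name} waits on unknown task {d}"
              for t in schedule for d in t.waits if d not in names]
    if errors:
        return errors

    # Happens-before edges: program order within an SM, plus explicit waits.
    edges = {t.name: set() for t in schedule}
    last_on_sm = {}
    for t in schedule:
        if t.sm in last_on_sm:
            edges[last_on_sm[t.sm]].add(t.name)
        last_on_sm[t.sm] = t.name
        for dep in t.waits:
            edges[dep].add(t.name)

    # Kahn's algorithm: if some task never becomes ready, there is a wait cycle.
    indeg = {n: 0 for n in edges}
    for src in edges:
        for dst in edges[src]:
            indeg[dst] += 1
    ready = [n for n, d in indeg.items() if d == 0]
    finished = 0
    while ready:
        n = ready.pop()
        finished += 1
        for m in edges[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if finished != len(edges):
        return ["deadlock: wait cycle detected"]

    # Transitive successors of each task, used to test ordering.
    def successors(start):
        seen, stack = set(), [start]
        while stack:
            for m in edges[stack.pop()]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    after = {n: successors(n) for n in edges}

    # Race: conflicting accesses (at least one write) with no ordering either way.
    for a, b in combinations(schedule, 2):
        conflict = (a.writes & (b.reads | b.writes)) or (b.writes & a.reads)
        ordered = b.name in after[a.name] or a.name in after[b.name]
        if conflict and not ordered:
            errors.append(f"race: {a.name} and {b.name} on {sorted(conflict)}")
    return errors


safe = [
    Task("rmsnorm", sm=0, writes=frozenset({"x_norm"})),
    Task("qkv", sm=1, waits=("rmsnorm",),
         reads=frozenset({"x_norm"}), writes=frozenset({"qkv"})),
]
racy = [
    Task("rmsnorm", sm=0, writes=frozenset({"x_norm"})),
    Task("qkv", sm=1, reads=frozenset({"x_norm"}), writes=frozenset({"qkv"})),
]
hung = [
    Task("a", sm=0, waits=("b",)),
    Task("b", sm=1, waits=("a",)),
]
print(validate(safe), validate(racy), validate(hung))
```

the point of checking statically is that an edited schedule gets rejected before anything is launched, instead of hanging the GPU at runtime.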
one source retargets sm_80 / sm_90 / sm_120. it reproduces huggingface greedy decode token-for-token on the real smollm2-135m weights
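a token-for-token check reduces to comparing two lists of token ids. the helper below is a hypothetical sketch, not code from the repo. it assumes the reference ids come from a greedy huggingface run (for example `generate` with `do_sample=False`) and the candidate ids from the megakernel, both for the same prompt and length.

```python
def first_mismatch(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (ref, got) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != got:
            return i
    # One sequence is a strict prefix of the other: they diverge at the shorter length.
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None


# Toy token ids for illustration only.
print(first_mismatch([11, 42, 7], [11, 42, 7]))
print(first_mismatch([11, 42, 7], [11, 99, 7]))
```

reporting the first divergent position, not just pass/fail, is useful because greedy decode diverges permanently after one wrong token.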
search-found int8 megakernel beats cuda-graphed cuBLAS bf16 at batch-1:
L4: up to 1.33x
L40S: 1.25-1.27x
it loses on A100/H100 and we say so
llama-family only for now :p
sc:
