i open-sourced automegakernel -- it compiles any huggingface model into a single persistent megakernel.

batch-1 decode is bandwidth-bound. normal execution launches one kernel per op and round-trips activations through HBM dozens of times a layer. that overhead is the whole problem.

so put the entire forward pass into one launch. one launch = one forward = one token.

the hard part: a single kernel across every SM, synced only by counters, is a deadlock/race minefield. so the core piece is a static validator that proves any schedule deadlock-free + race-free before launch. an agent can edit the schedule freely and can't ship a hanging kernel. 7160 adversarial schedules, 6091 unsafe, zero false accepts.

one source retargets sm_80 / sm_90 / sm_120. reproduces huggingface greedy decode token-for-token on real smollm2-135m.

search-found int8 megakernel beats cuda-graphed cuBLAS bf16 at batch-1: L4 up to 1.33x, L40S 1.25-1.27x. it loses on A100/H100 and we say so.

llama-family only for now.
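
a toy sketch of what "statically prove a schedule deadlock-free + race-free" can look like. this is not the automegakernel validator or its API. everything here is an assumption made for illustration: the `Op` record, the `validate` function, the per-SM op-list representation, and the model itself (counters only ever increase, a wait passes once its counter is `>=` a threshold). under that model one greedy run decides whether the schedule can get stuck, and races are pairs of conflicting buffer accesses with no happens-before path between them. the race check is deliberately conservative: it only trusts a wait whose threshold equals the total number of signals on that counter, so it may reject a safe schedule but should not accept an unsafe one.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Op:
    kind: str       # "wait", "signal", "read", or "write"
    target: str     # counter name for wait/signal, buffer name for read/write
    value: int = 0  # threshold for "wait"; unused otherwise


def validate(schedule):
    """schedule: one op list per SM (hypothetical format). Returns (ok, reason)."""
    # Total number of signals ever issued on each counter.
    total = {}
    for ops in schedule:
        for op in ops:
            if op.kind == "signal":
                total[op.target] = total.get(op.target, 0) + 1

    # 1. Deadlock check: run every SM as far as it can go, repeat until no progress.
    #    Counters never decrease, so a wait that passes once stays passable.
    pc = [0] * len(schedule)
    count = {c: 0 for c in total}
    progressed = True
    while progressed:
        progressed = False
        for sm, ops in enumerate(schedule):
            while pc[sm] < len(ops):
                op = ops[pc[sm]]
                if op.kind == "wait" and count.get(op.target, 0) < op.value:
                    break
                if op.kind == "signal":
                    count[op.target] += 1
                pc[sm] += 1
                progressed = True
    for sm, ops in enumerate(schedule):
        if pc[sm] < len(ops):
            return False, f"deadlock: SM {sm} stuck at op {pc[sm]}"

    # 2. Race check: build happens-before edges, then look for unordered conflicts.
    nodes = [(sm, i) for sm, ops in enumerate(schedule) for i in range(len(ops))]
    succ = {n: set() for n in nodes}
    signals = {}
    for sm, i in nodes:
        if i + 1 < len(schedule[sm]):
            succ[(sm, i)].add((sm, i + 1))  # program order within one SM
        if schedule[sm][i].kind == "signal":
            signals.setdefault(schedule[sm][i].target, []).append((sm, i))
    for sm, i in nodes:
        op = schedule[sm][i]
        # Only a wait for *all* signals proves each of them happened first.
        if op.kind == "wait" and op.value == total.get(op.target, 0):
            for s in signals.get(op.target, []):
                succ[s].add((sm, i))

    def reaches(a, b):
        stack, seen = [a], set()
        while stack:
            n = stack.pop()
            if n == b:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n])
        return False

    accesses = [n for n in nodes if schedule[n[0]][n[1]].kind in ("read", "write")]
    for a, b in combinations(accesses, 2):
        oa, ob = schedule[a[0]][a[1]], schedule[b[0]][b[1]]
        if a[0] == b[0] or oa.target != ob.target:
            continue
        if oa.kind == "read" and ob.kind == "read":
            continue
        if not (reaches(a, b) or reaches(b, a)):
            return False, f"race on {oa.target}: SM {a[0]} op {a[1]} vs SM {b[0]} op {b[1]}"
    return True, "ok"


# Producer writes x then signals; consumer waits for the signal before reading.
safe = [[Op("write", "x"), Op("signal", "c")], [Op("wait", "c", 1), Op("read", "x")]]
# Same accesses with no synchronization at all.
racy = [[Op("write", "x")], [Op("read", "x")]]
# Each SM waits on a counter that only the other one signals, after its own wait.
stuck = [[Op("wait", "a", 1), Op("signal", "b")], [Op("wait", "b", 1), Op("signal", "a")]]

for name, sched in [("safe", safe), ("racy", racy), ("stuck", stuck)]:
    print(name, validate(sched))
```

the point of the design is the one in the post: because the check runs on the schedule before anything launches, whoever edits the schedule (a person or an agent) gets a rejection instead of a hung GPU.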
Flash-WAM: Modality-Aware Distillation for World Action Models