Compiles any HuggingFace model into a single persistent megakernel

[ARTICLE · art-31850] src=twitter.com ↗ pub=2026-06-17T22:55Z topic=machine-learning verified=true sentiment=↑ positive

A developer open-sourced AutoMegakernel, a tool that compiles any HuggingFace model into a single persistent megakernel, reducing overhead by launching one kernel per forward pass. It includes a static validator to prevent deadlocks and races, and achieves up to 1.33x speedup on L4 GPUs for batch-1 int8 inference compared to CUDA-graphed cuBLAS bf16, though it loses on A100/H100.
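A rough way to read the int8 versus bf16 comparison: at batch 1, each decoded token has to stream roughly every weight from memory once, so bytes moved divided by memory bandwidth gives a floor on time per token. The sketch below works that arithmetic through. The parameter count is the nominal figure from the smollm2-135m name, and the bandwidth value is a placeholder assumption to replace with the figure for your own GPU. It ignores activations, the KV cache, and launch overhead, so it is a lower bound and not a prediction of the reported speedups.

```python
def min_ms_per_token(n_params: float, bytes_per_weight: int, bandwidth_gb_s: float) -> float:
    """Lower bound on decode time per token if every weight is read once."""
    bytes_moved = n_params * bytes_per_weight
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


N_PARAMS = 135e6        # nominal size taken from the model name (assumption)
BANDWIDTH_GB_S = 300.0  # placeholder memory bandwidth in GB/s (assumption)

bf16_ms = min_ms_per_token(N_PARAMS, 2, BANDWIDTH_GB_S)  # 2 bytes per weight
int8_ms = min_ms_per_token(N_PARAMS, 1, BANDWIDTH_GB_S)  # 1 byte per weight

print(f"bf16 floor: {bf16_ms:.2f} ms/token")
print(f"int8 floor: {int8_ms:.2f} ms/token")
print(f"ideal ratio: {bf16_ms / int8_ms:.1f}x")
```

Halving the bytes per weight halves the floor, which is the most a weight-only format change could buy under this model; anything the kernel spends on synchronization or dequantization eats into that.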

1 min read · published Jun 17, 2026

i open-sourced automegakernel: it compiles any huggingface model into a single persistent megakernel.

batch-1 decode is bandwidth-bound. normal execution launches one kernel per op and round-trips activations through HBM dozens of times a layer. that overhead is the whole problem. automegakernel compiles the entire forward pass into one launch: one launch = one forward = one token.

the hard part: a single kernel across every SM, synced only by counters, is a deadlock/race minefield. so the core piece is a static validator that proves any schedule deadlock-free and race-free before launch. an agent can edit the schedule freely and can't ship a hanging kernel. across 7160 adversarial schedules, 6091 were unsafe, with zero false accepts.

one source retargets sm_80 / sm_90 / sm_120. it reproduces huggingface greedy decode token-for-token on real smollm2-135m.

a search-found int8 megakernel beats cuda-graphed cuBLAS bf16 at batch-1: up to 1.33x on L4 and 1.25-1.27x on L40S. it loses on A100/H100, and we say so. llama-family only for now :p
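To make the validator idea concrete, here is a minimal sketch of a static check over a counter-synchronized schedule. It is not AutoMegakernel's validator; the `Op` record, the `validate` function, and the buffer and counter names are all hypothetical. It assumes counters only ever increase, waits block until a counter is at least a threshold, and a signal publishes the op's writes to any op that later passes a wait on that counter. Under those assumptions enabled ops stay enabled, so a greedy simulation that finishes shows no interleaving can hang. The race check is conservative: it orders two ops only when the waiter's threshold cannot be met without the signaler, so it may reject a safe schedule but should not accept an unsafe one.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Op:
    name: str
    waits: tuple = ()    # ((counter, threshold), ...): block until counter >= threshold
    reads: tuple = ()    # buffer names read
    writes: tuple = ()   # buffer names written
    signals: tuple = ()  # counters incremented by 1 once the writes are done


def validate(schedule):
    """schedule maps worker name -> ordered list of Op. Returns a list of errors."""
    errors = []
    ops = [(w, i, op) for w, prog in schedule.items() for i, op in enumerate(prog)]
    total = {}
    for _, _, op in ops:
        for c in op.signals:
            total[c] = total.get(c, 0) + 1

    # 1) Deadlock check: run every worker as far as it can go, repeat until stuck.
    pc = {w: 0 for w in schedule}
    count = {c: 0 for c in total}
    progressed = True
    while progressed:
        progressed = False
        for w, prog in schedule.items():
            while pc[w] < len(prog) and all(
                count.get(c, 0) >= k for c, k in prog[pc[w]].waits
            ):
                for c in prog[pc[w]].signals:
                    count[c] += 1
                pc[w] += 1
                progressed = True
    for w, prog in schedule.items():
        if pc[w] < len(prog):
            errors.append(f"deadlock: {w} blocks at {prog[pc[w]].name}")

    # 2) Happens-before edges: program order, plus signal -> wait edges that
    #    are forced (the threshold is unreachable without that signaler).
    succ = {(w, i): set() for w, i, _ in ops}
    for w, i, op in ops:
        if i + 1 < len(schedule[w]):
            succ[(w, i)].add((w, i + 1))
        for c, k in op.waits:
            for w2, j, op2 in ops:
                n = op2.signals.count(c)
                if n and total[c] - n < k:
                    succ[(w2, j)].add((w, i))

    def reachable(src):
        seen, stack = set(), [src]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    hb = {node: reachable(node) for node in succ}

    # 3) Race check: conflicting accesses must be ordered one way or the other.
    for (w1, i, a), (w2, j, b) in combinations(ops, 2):
        conflict = (set(a.writes) & (set(b.reads) | set(b.writes))) | (
            set(b.writes) & set(a.reads)
        )
        if conflict and (w2, j) not in hb[(w1, i)] and (w1, i) not in hb[(w2, j)]:
            errors.append(f"race: {a.name} / {b.name} on {sorted(conflict)}")
    return errors


producer = Op("qkv", writes=("qkv_buf",), signals=("qkv_done",))
safe = {
    "sm0": [producer],
    "sm1": [Op("attn", waits=(("qkv_done", 1),), reads=("qkv_buf",), writes=("attn_out",))],
}
racy = {
    "sm0": [producer],
    "sm1": [Op("attn", reads=("qkv_buf",), writes=("attn_out",))],  # wait removed
}
print(validate(safe))
print(validate(racy))
```

The first schedule is accepted with an empty error list; the second, with the wait removed, is flagged because nothing orders the read of `qkv_buf` after its write. Raising the wait threshold above the number of signals that will ever arrive is reported as a deadlock instead.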

source & further reading

twitter.com — original article

Read original on twitter.com → twitter.com/Akashi203/status/2067379010762338681
