17:26
2026-06-02
kyrieblunders.bearblog.dev
machine-learning
I made a kernel 2.2x faster. It made my training loop 3x slower
A developer wrote a fused decode-attention kernel that ran 2.2× faster than the baseline in microbenchmarks, but when integrated into a HuggingFace `generate` call for an RL training loop, the decode …