CuteDSL

mentions: 1 · type: Organization

// recent coverage 1 mentions

2026-06-02 17:26

kyrieblunders.bearblog.dev

machine-learning

I made a kernel 2.2x faster. It made my training loop 3x slower

A developer wrote a fused decode-attention kernel that ran 2.2× faster than the baseline in microbenchmarks, but when integrated into a HuggingFace `generate` call for an RL training loop, the decode …

// co-occurs with top 6 entities

HuggingFace (1), Qwen2.5-0.5B-Instruct (1), Dr. GRPO (1), GSM8K (1), A10G (1), SDPA (1)

// topics top 5 topics

machine learning (1), large language models (1), artificial intelligence (1), ai research (1), ai infrastructure (1)