23:22
2026-06-11
modal.com
large-language-models
Making FlashAttention-4 faster for inference
Modal AI engineers Charles Frye and David Wang optimized FlashAttention-4 for large language model inference, focusing on decode-heavy workloads dominated by memory-bandwidth-bound token generation.…
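Why decode is memory-bandwidth-bound can be checked with back-of-the-envelope arithmetic intensity (FLOPs per byte of memory traffic). The sketch below is not taken from the Modal post. The function name and the example shapes (32 query heads, head dimension 128, a 4096-token context, 2-byte elements) are illustrative assumptions. It counts only the two attention matmuls and a single read of Q, K, V plus a single write of O.

```python
def attention_arithmetic_intensity(q_len, kv_len, head_dim, n_q_heads, n_kv_heads,
                                   bytes_per_elem=2):
    """Rough FLOPs per byte of memory traffic for one attention call (hypothetical helper).

    Assumptions: only Q @ K^T and P @ V are counted (softmax FLOPs ignored),
    and Q, K, V are each read once while O is written once (no re-reads).
    """
    # Q @ K^T and P @ V each cost 2 * q_len * kv_len * head_dim FLOPs per query head.
    flops = 4 * q_len * kv_len * head_dim * n_q_heads
    # K and V come from the KV cache; with grouped-query attention they have fewer heads.
    kv_bytes = 2 * kv_len * head_dim * n_kv_heads * bytes_per_elem
    # Q is read and O is written once per query head.
    qo_bytes = 2 * q_len * head_dim * n_q_heads * bytes_per_elem
    return flops / (kv_bytes + qo_bytes)


# Decode: one new query token attends to a 4096-token KV cache.
decode = attention_arithmetic_intensity(1, 4096, 128, 32, 32)
# Prefill: all 4096 prompt tokens are processed at once.
prefill = attention_arithmetic_intensity(4096, 4096, 128, 32, 32)
# Decode with grouped-query attention (32 query heads sharing 8 KV heads).
decode_gqa = attention_arithmetic_intensity(1, 4096, 128, 32, 8)

print(f"decode: {decode:.3f} FLOP/byte")       # about 1
print(f"prefill: {prefill:.1f} FLOP/byte")     # 2048
print(f"decode+GQA: {decode_gqa:.2f} FLOP/byte")  # about 4
```

Under these assumptions, decode performs roughly one FLOP per byte loaded, while modern datacenter GPUs need on the order of hundreds of FLOPs per byte to keep their tensor cores busy. Decode attention therefore spends most of its time streaming the KV cache from memory, and prefill does not. Grouped-query attention raises decode intensity only by about the ratio of query heads to KV heads.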