04:00
2026-06-15
arxiv.org
large-language-models
Efficient On-Device Diffusion LLM Inference with Mobile NPU
Researchers introduced llada.cpp, the first NPU-aware inference framework for accelerating diffusion large language models on smartphones, achieving 17x-42x latency reduction over CPU baselines while …