LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

Researchers have developed LazyAttention, an attention mechanism that defers positional encoding to the attention kernels themselves, enabling zero-copy, position-agnostic reuse of the key-value (KV) cache in large language models. In retrieval-augmented generation workloads with skewed document distributions, the system reduces time-to-first-token by a factor of 1.37 and increases inference throughput by a factor of 1.40 relative to Block-Attention, the prior state of the art, while maintaining comparable output quality. The approach removes the memory materialization bottleneck that previously limited KV cache reuse in long-context applications.

arXiv:2606.04302v1. Abstract: Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37× and increases inference throughput by 1.40× compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
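To make the idea of deferred positional encoding concrete, the following is a minimal pure-Python sketch and not the authors' kernel. It assumes rotary positional embeddings (RoPE), which the abstract does not name, and every function and variable name here (`rope`, `attend`, `eager_attention`, `lazy_attention`, `doc_k`, `doc_v`) is hypothetical. Keys are cached without any positional rotation; the eager baseline builds a re-encoded copy of the keys before attending, while the lazy variant rotates each key only as the score loop consumes it, so the one cached copy can be placed at any logical offset.

```python
import math

def rope(vec, pos, base=10000.0):
    """Standard RoPE: rotate each consecutive pair of `vec` by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2 * pair_index, so this is base^(-2*pair_index/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def attend(q, keys, values):
    """Single-query softmax attention; `keys` may be any iterable of key vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[t] for e, v in zip(exps, values)) for t in range(dim)]

def eager_attention(query, q_pos, cache_k, cache_v, start):
    """Baseline: materialize a position-encoded copy of the cached keys, then attend."""
    rotated = [rope(k, start + j) for j, k in enumerate(cache_k)]  # extra key copy
    return attend(rope(query, q_pos), rotated, cache_v)

def lazy_attention(query, q_pos, cache_k, cache_v, start):
    """Deferred encoding: each key is rotated only when the score loop reads it.

    The generator stands in for rotation performed inside an attention kernel;
    no re-encoded copy of the cache is stored.
    """
    lazily_rotated = (rope(k, start + j) for j, k in enumerate(cache_k))
    return attend(rope(query, q_pos), lazily_rotated, cache_v)

# One physical, position-free cache for a retrieved document (toy values).
doc_k = [[0.1, 0.2, -0.3, 0.4], [0.5, -0.1, 0.2, 0.0], [-0.2, 0.3, 0.1, 0.6]]
doc_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.3, -0.2, 0.1, 0.4]

# Two logical requests place the same document at different offsets.
req_a = lazy_attention(query, 3, doc_k, doc_v, start=0)
req_b = lazy_attention(query, 103, doc_k, doc_v, start=100)
print(req_a, req_b)
```

Both requests read the same `doc_k` list, which is never modified. Because RoPE scores depend only on the distance between query and key positions, shifting the document and the query by the same offset leaves the attention output unchanged up to floating-point error. A real implementation would apply the rotation on the fly inside a GPU attention kernel rather than in a Python generator.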