PagedAttention is more than virtual memory
PagedAttention, a memory optimization technique in the vLLM inference server, applies virtual memory concepts to manage the KV cache in large language models, improving throughput by reducing fragment…
Engineers achieved up to 67% cost savings and 2.7x better goodput by using Prefill-Decode disaggregation with Ray and vLLM on AMD MI325X GPUs, separating prefill and decode phases onto dedicated hardw…
Researchers introduced CacheWise, a KVCache management layer for LLM coding agents, reducing evictions by 2-2.6x and improving session completion time by up to 3.5x in vLLM, according to a June 2026 a…
NVIDIA researchers developed an agentic system for deploying machine learning models to ephemeral SageMaker endpoints, generating runtime code at deployment time from prose artifacts rather than reusa…
Phlox, an open-source self-hosted agentic web chat application, has been released on GitHub. It supports any model provider including AWS Bedrock and OpenAI-compatible endpoints, and features agentic …
A developer has released Synapse AI Gateway, an open-source governance-first AI gateway designed for regulated teams that need audit trails and policy enforcement without waiting for enterprise procur…
A developer traced a hybrid Mamba-Transformer MoE inference run and found that MoE all-to-all collective stalls dominate the tail latency, with a 69x tail ratio, despite dashboards showing 96% GPU uti…
A developer published token-sec-calc, an open-source Python CLI tool that benchmarks LLM inference throughput, latency, time-to-first-token, and queue wait against any OpenAI-compatible endpoint. The …
A new guide details how to monitor LLM inference in production using Prometheus and Grafana, covering metrics like tokens/sec, queue duration, and KV cache pressure for servers such as vLLM, TGI, and …
Independent researcher Haijun Wen of Light Ark Technologies is seeking a cs.CL endorser on arXiv to post a preprint on a model-agnostic framework for persistent AI personality, addressing memory and p…
Slopsome.com launched a free VRAM fit-calculator and real tokens-per-second database for local LLMs, enabling users to check if a model runs on specific GPUs with given quantization and context length…
Agentify has forked the archived TensorZero project, which raised $7.3M, and released Agentify Gateway, an open-source LLM gateway with observability, evaluation, optimization, and experimentation fea…
Google DeepMind released DiffusionGemma on June 10, 2026, a 26B open-weight text diffusion model that generates 256 tokens simultaneously, achieving up to 1,008 tokens per second on an H100—4-5x faste…
LMCache introduces a novel KV cache optimization layer to accelerate LLM inference, enabling faster local deployment on consumer hardware. AllenAI releases olmo-eval, a workbench for evaluating open l…
Google's DiffusionGemma, an experimental open model using discrete diffusion for text generation, offers a parallel approach that can outperform token-by-token LLMs in throughput-sensitive workloads. …
MiniMax released its M3 multimodal model on NVIDIA's accelerated infrastructure, offering a free public endpoint via NVIDIA's API catalog. The 428-billion-parameter model processes text, images, and v…
Ray Serve LLM and vLLM on AMD MI325X achieve up to 67% cost savings by disaggregating prefill and decode phases in LLM serving, separating them onto dedicated GPUs to eliminate interference and improv…
Redb.Route.Llm 3.1.1 adds seven nullable audit fields to every persisted message, capturing effective sampling parameters, a SHA-256 hash of the tool set, and the provider's system fingerprint for com…
A developer achieved 775 tokens per second running the full BF16 DiffusionGemma model on an NVIDIA RTX 6000 Pro using a Red Hat fork of vLLM, demonstrating extremely fast local AI inference at short c…
Google has released DiffusionGemma, an experimental text-generation model built on the Gemma 4 architecture that generates text in parallel blocks rather than token-by-token, enabling faster inference…