vLLM

mentions 154 · type Organization

// recent coverage 154 mentions

08:45

2026-06-16

thecomputersciencebook.com

large-language-models

PagedAttention is more than virtual memory

PagedAttention, a memory optimization technique in the vLLM inference server, applies virtual memory concepts to manage the KV cache in large language models, improving throughput by reducing fragment…

06:07

2026-06-16

anyscale.com

artificial-intelligence

67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

Engineers achieved up to 67% cost savings and 2.7x better goodput by using Prefill-Decode disaggregation with Ray and vLLM on AMD MI325X GPUs, separating prefill and decode phases onto dedicated hardw…

05:21

2026-06-16

letsdatascience.com

large-language-models

CacheWise Improves KVCache Reuse for LLM Coding Agents

Researchers introduced CacheWise, a KVCache management layer for LLM coding agents, reducing evictions by 2-2.6x and improving session completion time by up to 3.5x in vLLM, according to a June 2026 a…

23:01

2026-06-15

pub.towardsai.net

machine-learning

Agentic Inference Deployment: From Prose Skills to Deployed Endpoints

NVIDIA researchers developed an agentic system for deploying machine learning models to ephemeral SageMaker endpoints, generating runtime code at deployment time from prose artifacts rather than reusa…

21:59

2026-06-15

github.com

ai-agents

Show HN: Phlox – Open-source self-hosted agentic web chat

Phlox, an open-source self-hosted agentic web chat application, has been released on GitHub. It supports any model provider including AWS Bedrock and OpenAI-compatible endpoints, and features agentic …

17:30

2026-06-15

dev.to

ai-infrastructure

Governance-first AI gateway for teams that aren't ready for enterprise tooling

A developer has released Synapse AI Gateway, an open-source governance-first AI gateway designed for regulated teams that need audit trails and policy enforcement without waiting for enterprise procur…

13:00

2026-06-15

dev.to

large-language-models

Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look

A developer traced a hybrid Mamba-Transformer MoE inference run and found that MoE all-to-all collective stalls dominate the tail latency, with a 69x tail ratio, despite dashboards showing 96% GPU uti…

07:21

2026-06-15

corti.com

large-language-models

Measuring LLM Inference: A Practical Look at token-sec-calc I published on GitHub.

A developer published token-sec-calc, an open-source Python CLI tool that benchmarks LLM inference throughput, latency, time-to-first-token, and queue wait against any OpenAI-compatible endpoint. The …

02:34

2026-06-15

glukhov.org

large-language-models

Monitoring LLM Inference with Prometheus and Grafana (vLLM, TGI, Llama.cpp)

A new guide details how to monitor LLM inference in production using Prometheus and Grafana, covering metrics like tokens/sec, queue duration, and KV cache pressure for servers such as vLLM, TGI, and …

19:18

2026-06-14

discuss.huggingface.co

large-language-models

[cs.CL] arXiv endorsement request — production compound-AI persona & memory architecture (independent researcher)

Independent researcher Haijun Wen of Light Ark Technologies is seeking a cs.CL endorser on arXiv to post a preprint on a model-agnostic framework for persistent AI personality, addressing memory and p…

14:03

2026-06-14

discuss.huggingface.co

large-language-models

Slopsome.com - a free VRAM fit-calculator + real tokens/sec database for local LLMs

Slopsome.com launched a free VRAM fit-calculator and real tokens-per-second database for local LLMs, enabling users to check if a model runs on specific GPUs with given quantization and context length…

04:43

2026-06-14

github.com

large-language-models

Forked TensorZero after it was archived after raising $7.3M

Agentify has forked the archived TensorZero project, which raised $7.3M, and released Agentify Gateway, an open-source LLM gateway with observability, evaluation, optimization, and experimentation fea…

01:13

2026-06-14

byteiota.com

large-language-models

DiffusionGemma: Google’s 4x Faster Text Diffusion Model

Google DeepMind released DiffusionGemma on June 10, 2026, a 26B open-weight text diffusion model that generates 256 tokens simultaneously, achieving up to 1,008 tokens per second on an H100—4-5x faste…

21:33

2026-06-12

dev.to

large-language-models

LLM KV Cache Optimization, Open Model Evaluation, & Agent Engineering Skills for Local Deployment

LMCache introduces a novel KV cache optimization layer to accelerate LLM inference, enabling faster local deployment on consumer hardware. AllenAI releases olmo-eval, a workbench for evaluating open l…

18:01

2026-06-12

pub.towardsai.net

large-language-models

DiffusionGemma Developer Guide: When Parallel Text Generation Beats Token-by-Token LLMs

Google's DiffusionGemma, an experimental open model using discrete diffusion for text generation, offers a parallel approach that can outperform token-by-token LLMs in throughput-sensitive workloads. …

17:51

2026-06-12

testingcatalog.com

artificial-intelligence

MiniMax M3 launches on NVIDIA platform with Free Endpoint

MiniMax released its M3 multimodal model on NVIDIA's accelerated infrastructure, offering a free public endpoint via NVIDIA's API catalog. The 428-billion-parameter model processes text, images, and v…

00:00

2026-06-12

anyscale.com

large-language-models

Achieving Up to 67% Cost Savings with Prefill-Decode Disaggregation Using Ray + vLLM on AMD MI325X

Ray Serve LLM and vLLM on AMD MI325X achieve up to 67% cost savings by disaggregating prefill and decode phases in LLM serving, separating them onto dedicated GPUs to eliminate interference and improv…

22:48

2026-06-11

dev.to

large-language-models

redb.Route.Llm 3.1.1 — per-message audit fields for LLM compliance / replay

Redb.Route.Llm 3.1.1 adds seven nullable audit fields to every persisted message, capturing effective sampling parameters, a SHA-256 hash of the tool set, and the provider's system fingerprint for com…

21:33

2026-06-11

twitter.com

artificial-intelligence

Local AI: 775 tok/s, DiffusionGemma (BF16) on Nvidia RTX 6000 Pro

A developer achieved 775 tokens per second running the full BF16 DiffusionGemma model on an Nvidia RTX 6000 Pro using a Red Hat fork of vLLM, demonstrating extremely fast local AI inference at short c…

17:20

2026-06-11

developers.googleblog.com

large-language-models

DiffusionGemma: The Developer Guide

Google has released DiffusionGemma, an experimental text-generation model built on the Gemma 4 architecture that generates text in parallel blocks rather than token-by-token, enabling faster inference…

// co-occurs with top 8 entities

SGLang 33 · NVIDIA 28 · Hugging Face 24 · llama.cpp 22 · Ollama 17 · OpenAI 17 · Qwen 13 · Anthropic 12