grep -l @vllm /news/*.json | wc -l → 154

vLLM

mentions 154 · type Organization

// recent coverage 154 mentions

04:00

2026-06-18

arxiv.org

large-language-models

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Researchers from Hao AI Lab introduced JetFlow, a speculative decoding framework that breaks the scaling ceiling of autoregressive LLMs by combining one-forward drafting efficiency with branch-wise ca…

02:50

2026-06-18

discuss.huggingface.co

large-language-models

Local-LLM-Launcher-GUI: For those who hate CLI flags

A new open-source GUI tool, Local-LLM-Launcher-GUI, lets users run large language models locally via vLLM or llama.cpp without memorizing command-line flags. The browser-based interface provides hardw…

01:08

2026-06-18

byteiota.com

large-language-models

MiniMax M3: Open-Weight Frontier Model at 5% of Opus Cost

MiniMax released the M3 open-weight model, claiming it costs 5% of Claude Opus per task, achieves 59% on SWE-Bench Pro, and supports a 1-million-token context window at one-twentieth the compute of it…

00:00

2026-06-18

techstackups.com

large-language-models

GLM-5.2 vs Claude Opus

Z.ai released GLM-5.2, an open-weights AI model under an MIT license, positioning it between Claude Opus 4.7 and 4.8 in performance while costing less than a fifth of Opus on output tokens. The model …

18:21

2026-06-17

dev.to

machine-learning

AI Workloads Are Reshaping Kubernetes in 2026: GPU Scheduling, MLOps, and the Platform Engineering Reckoning

By 2026, AI workloads will consume roughly 40% of enterprise Kubernetes clusters, but the default scheduler is ill-suited for GPU-intensive tasks, leading to 30-45% GPU utilization rates and wasted co…

15:49

2026-06-17

pytorch.org

machine-learning

Nominations Open for the 2026 PyTorch Foundation Contributor Awards

The PyTorch Foundation has opened nominations for its 2026 Contributor Awards, recognizing individuals who strengthen projects like PyTorch, vLLM, DeepSpeed, Ray, Helion, and Safetensors through techn…

15:00

2026-06-17

hiraditya.github.io

large-language-models

vLLM's op IR, or: where the inference engine meets the compiler

vLLM, a model-serving engine for large language models, introduced a small op-level IR to resolve the tension between acting as a compiler target and a hand-tuned kernel dispatcher. The IR allows vLLM…

12:00

2026-06-17

github.com

large-language-models

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Researchers released IndexCache, a patch for SGLang and vLLM that accelerates sparse attention in DeepSeek-V3.2 and GLM-5 models by reusing index computations across layers, achieving up to 1.82× pref…

06:33

2026-06-17

arxiv.org

machine-learning

Fearless Concurrency on the GPU

Researchers introduced cuTile Rust, a tile-based system for safe, idiomatic GPU kernel authoring in Rust that extends Rust's ownership discipline to GPU kernels. On the NVIDIA B200 GPU, cuTile Rust ac…

04:00

2026-06-17

arxiv.org

large-language-models

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

Researchers have discovered that the key/value cache in large language models can be edited and composed like notebook notes, enabling direct modification of model outputs without full recomputation. …

03:58

2026-06-17

anyscale.com

large-language-models

Ray Data LLM enables 2x throughput over vLLM’s synchronous LLM engine at production-scale

Ray Data LLM, a library for large-scale batch inference, achieves 2x throughput over vLLM's synchronous LLM engine in production-scale workloads by optimizing hardware utilization and providing fault …

20:16

2026-06-16

aws.amazon.com

artificial-intelligence

Introducing container caching in Amazon SageMaker AI for faster model scaling

Amazon Web Services announced container image caching for Amazon SageMaker AI inference, reducing end-to-end latency by up to 2x for generative AI models during scale-out events by eliminating contain…

20:08

2026-06-16

byteiota.com

large-language-models

Mellum2: JetBrains Open-Sources a 12B MoE Coding Model

JetBrains open-sourced Mellum2, a 12B Mixture-of-Experts coding model under Apache 2.0, designed for air-gapped and compliance-locked environments where external API calls are prohibited. The model us…

19:42

2026-06-16

dev.to

large-language-models

Deploying vLLM on OKE with NVIDIA A10 GPUs: The 20-Minute Setup Nobody Talks About

A developer deployed a Llama 3 inference endpoint on Oracle Cloud Infrastructure's OKE cluster using vLLM and NVIDIA A10 GPUs in about 20 minutes. The setup cost $1.52/hr on-demand or $0.46/hr preempt…

19:37

2026-06-16

dev.to

large-language-models

Serving any LLM using a single command line with Flama

Flama 2.0 introduces first-class support for generative AI, enabling users to download, package, and serve large language models (LLMs) via a single command line. The framework allows fetching models …

18:11

2026-06-16

the-ai-corner.com

ai-infrastructure

Inference engineering is the 80% cost cut most teams miss

Inference engineering, the craft of optimizing GPU operations during AI model inference, can cut costs by up to 80% by addressing the split between prefill and decode phases. Two teams using the same …

18:03

2026-06-16

dev.to

developer-tools

Building a self-hosted, AI-native workflow engine in Rust (180 node types, no SDK bloat)

A developer built Trigix, an open-source (MIT), self-hostable workflow automation platform with a Rust execution engine and AI nodes that can run against local models. The platform features ~180 node …

17:32

2026-06-16

newsletter.semianalysis.com

machine-learning

RL Systems Mind the Gap: Matching Trainer and Generator Throughput

Anthropic CEO Dario Amodei said reinforcement learning shows the same log-linear scaling as pre-training, but RL system efficiency is critical to afford enough training. Experiments on open models sho…

17:15

2026-06-16

github.com

large-language-models

Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

A new Rust-based reverse proxy called Sors reorders prompt content to maximize prefix cache hits in LLM inference engines like vLLM and SGLang, improving latency by placing static content before dynam…

12:31

2026-06-16

pub.towardsai.net

large-language-models

The Inference Reckoning: How to Stop Burning Millions on Cloud LLM Tokens

Enterprises are burning millions on cloud LLM tokens due to inefficient agentic systems, prompting a shift to open-weight models on dedicated infrastructure to eliminate marginal token costs and achie…

// co-occurs with top 8 entities

SGLang 33 · NVIDIA 28 · Hugging Face 24 · llama.cpp 22 · Ollama 17 · OpenAI 17 · Qwen 13 · Anthropic 12