grep -l @vllm /news/*.json | wc -l → 154

vLLM

mentions 154 · type Organization

// recent coverage 154 mentions

16:09

2026-06-21

portal.neuralwatt.com

ai-infrastructure

Neuralwatt: Energy-based pricing for AI inference. Efficient prompts cost less

Neuralwatt launched the first AI inference API with energy-based pricing, charging per kilowatt-hour instead of per token to provide transparency into power consumption and cost. The platform offers r…

13:00

2026-06-21

vettedconsumer.com

large-language-models

Serving a Local LLM as an API: From Ollama's Endpoint to vLLM Throughput (and When to Rent Instead)

Local AI serving engines like Ollama and vLLM offer different trade-offs between ease of use and throughput, with Ollama ideal for single users and vLLM for high-concurrency production workloads. The …

11:47

2026-06-21

vettedconsumer.com

large-language-models

Show HN: Local LLM Hardware Calculator

A new Local LLM Hardware Calculator helps users estimate memory requirements for running large language models on their own hardware, factoring in weights, KV cache, and overhead. The tool also compar…

19:13

2026-06-20

gist.github.com

large-language-models

Running GLM-5.2 (753B DeepSeek-Sparse-Attention MoE) on 8x A100 80GB with vLLM — TRITON_MLA_SPARSE backend (PR #38476), no-recompile install, benchmarks

A developer confirmed that GLM-5.2, a 753B-parameter DeepSeek-Sparse-Attention MoE model, runs on 8x A100 80GB GPUs using vLLM PR #38476, which adds a Triton sparse-MLA backend for Ampere architecture…

14:24

2026-06-20

news.ycombinator.com

ai-infrastructure

How to become an AI infrastructure engineer?

An AI infrastructure engineer at a major industrial company seeks advice on transitioning from SRE-focused work to a proper software engineering role in AI infrastructure, asking for skills, resources…

01:36

2026-06-20

dev.to

large-language-models

KV cache and PagedAttention: what they do and why they matter

A developer explains that the KV cache is the biggest operational bottleneck in production LLM serving on GPUs, consuming more memory than model weights for workloads with high concurrency or long con…

00:17

2026-06-20

modal.com

large-language-models

Speculation Is All You Need

Modal Labs released state-of-the-art DFlash speculators for Qwen 3.5 and Qwen 3.6 models on Hugging Face, achieving 5-20% additional speedups and enabling Qwen 3.5 122B-A10B to run at over 1000 tok/s …

15:00

2026-06-19

hiraditya.github.io

large-language-models

Building vLLM from Source: A Field Guide (with all the pitfalls)

A developer building vLLM from source on an AWS g5 instance with Ubuntu 26.04 and Python 3.14 encountered multiple version-skew, driver, and toolchain issues, including a pitfall where missing nvidia-…

10:44

2026-06-19

discuss.huggingface.co

large-language-models

Gemma 4 bug fixes and Research Request

A critical bug in Google's Gemma 4 causes it to emit malformed tool calls under real load, affecting vLLM, llama.cpp, Ollama, and oobabooga. A developer open-sourced a diagnosis, repair, and experimental LoR…

08:04

2026-06-19

corti.com

large-language-models

Connecting OpenCode to a Self-Hosted LLM (vLLM + Nemotron 3 Super)

OpenCode, an open-source terminal-first coding agent, now supports self-hosted large language models via OpenAI-compatible APIs, enabling users to connect it to a vLLM server running NVIDIA's Nemotron…

04:23

2026-06-19

github.com

ai-infrastructure

Profile(v2.1.4) physics-aware optimizer for vLLM (31→470 tok/s on A100)

Profile v2.1.4, a physics-aware optimizer for vLLM inference servers, achieved a 15x throughput increase from 31 to 470 tok/s and a 93% cost reduction on an NVIDIA A100 GPU. The tool uses roofline mat…

00:00

2026-06-19

depot.dev

developer-tools

Now available: SOCI v2 support for Depot container builds

Depot now supports SOCI v2 for container builds, enabling lazy-pulling of images to drastically reduce startup times. The feature generates a SOCI index during the build process, allowing containers t…

19:23

2026-06-18

mstar.stanford.edu

machine-learning

M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

M* (M-Star), a new serving system for multimodal models, matches specialized systems or outperforms them by up to 2.7× on speech and image serving and 12.5× on world-model rollouts. It uses a Walk Graph abstract…

18:18

2026-06-18

dev.to

developer-tools

IDE fixes, TS 5.9 beta, Claude tool use explained

The Continue plugin v1.2.20 patches memory leaks, unhandled exceptions, and JCEF message chunking crashes across JetBrains and VS Code adapters, fixing crash vectors that cause sidebar hangs and autoc…

16:29

2026-06-18

devashish.me

large-language-models

Two Qwen3 models on one DGX Spark: the residency math

Alibaba's Qwen3-80B and Qwen3-4B models were successfully co-located on a single NVIDIA DGX Spark using vLLM containers behind a LiteLLM proxy, but the 80B model's inability to emit tool calls in auto…

16:14

2026-06-18

dev.to

artificial-intelligence

7 Open-Source AI Projects Developers Need [June 2026]

Seven open-source AI projects—Ollama, Open WebUI, Browser Use, vLLM, Unsloth, CrewAI, and Continue—are reshaping production software development in June 2026. Ollama, with 174,000+ GitHub stars, now o…

16:00

2026-06-18

cloud.google.com

large-language-models

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

Google Cloud and Anyscale announced optimizations for Ray Serve LLM on Google Kubernetes Engine (GKE) that deliver up to 5x higher throughput and 8x lower latency for large language model inference. T…

10:16

2026-06-18

dev.to

large-language-models

What GLM-5.2 Changes for Long-Horizon Coding

Zhipu AI released GLM-5.2, a large language model with a 1M-token context window, flexible effort levels, and an MIT license, targeting long-horizon coding tasks. The model introduces IndexShare, an a…

09:00

2026-06-18

anyscale.com

large-language-models

High Performance Distributed Inference with Ray Serve LLM

Ray Serve LLM, in partnership with Google Kubernetes Engine, announced major performance improvements achieving up to 4.4x higher throughput on prefill-heavy workloads and 24x higher on decode-heavy w…

06:31

2026-06-18

dev.to

large-language-models

Speculative decoding shifted our output distribution and evals missed it

Nexus Labs enabled speculative decoding in vLLM for a fine-tuned 8B model, achieving a 1.9x throughput gain, but discovered that greedy decoding with a draft model is not bit-identical to greedy decod…

// co-occurs with top 8 entities

SGLang 33 · NVIDIA 28 · Hugging Face 24 · llama.cpp 22 · Ollama 17 · OpenAI 17 · Qwen 13 · Anthropic 12