vLLM

mentions: 154 · type: Organization · page: 2/8

// recent coverage 154 mentions

20:01

2026-06-26

pub.towardsai.net

large-language-models

GOSIM Paris: This Is What Open Source AI Looks Like in 2026

GOSIM Paris 2025, held at Station F on May 5-6, showcased open-source AI developments including LLMs advancing in mathematical reasoning, a call for transparency over speed, and the introduction of Ta…

14:23

2026-06-26

github.com

ai-infrastructure

Show HN: ZeroGate – API gateway to scale cloud GPUs to zero when idle

ZeroGate, an open-source event-driven cross-cloud GPU orchestration fabric, eliminates idle hardware costs in multi-tenant inference pipelines by scaling dedicated infrastructure pools to zero when de…

13:10

2026-06-26

byteiota.com

ai-infrastructure

DGX Spark June 2026: Four Nodes, 700B Models Locally

NVIDIA's June 2026 DGX Spark update introduces automated four-node clustering via Cluster Assistant, enabling local inference of models up to 700B parameters. The update also delivers a 2.6x throughpu…

12:26

2026-06-26

3hcloud.com

ai-agents

How to Set Up and Deploy an OpenClaw AI Agent on a VPS

A new guide walks users through deploying an OpenClaw AI agent on a virtual private server, balancing cost, availability, and privacy. The tutorial covers server configuration, system requirements, an…

10:30

2026-06-26

aazar.me

large-language-models

Stop generating what you already have

A developer reduced LLM extraction latency from 42 seconds to 6 seconds by replacing verbatim text copying with pointer-based extraction and splitting a single large call into multiple parallel calls.…

20:42

2026-06-25

huggingface.co

ai-infrastructure

Run a vLLM Server on HF Jobs in One Command

Hugging Face launched a one-command method to run a vLLM server on its Jobs infrastructure, enabling users to quickly deploy models for testing, evaluation, or batch generation. The feature uses the o…

12:04

2026-06-25

devclubhouse.com

large-language-models

The Real Cost of the Open-Weight Price Collapse

The launch of Z.ai's GLM 5.2 and DeepSeek V4 Flash has created a 50x price gap between open-weight APIs and closed frontier models, reshaping the build-versus-buy calculus for developers. While open-w…

11:08

2026-06-25

flama.dev

large-language-models

LLM APIs with built-in chatbot in 1 line of code

Flama 2.0 introduces a CLI tool that allows users to download, package, and serve large language models from HuggingFace with a single command, including a built-in chat interface and production-ready…

10:57

2026-06-25

github.com

ai-tools

Show HN: ParseHawk – 100% Local Document AI with API, CLI, and Web UI

ParseHawk, a new local-first document AI tool, launched on Hacker News, enabling users to extract structured JSON from PDFs, scans, and images entirely on their own hardware without sending data to th…

10:01

2026-06-25

discuss.huggingface.co

large-language-models

Deepseek? Qwen?

A single H200 GPU with 141GB HBM3e cannot comfortably run DeepSeek V4 Flash (284B total, 13B active parameters) due to VRAM constraints, even with 2TB system RAM for offloading. The model requires an …

09:50

2026-06-25

oracomputing.com

large-language-models

ORA: Smaller Models. Same Intelligence

Ora Computing launched an automated LLM compression engine that reduces model size by up to 70% with minimal accuracy loss, enabling deployment on edge devices, on-prem servers, or cloud infrastructur…

20:51

2026-06-24

blog.crossplane.io

ai-infrastructure

I built a fleet-scale inference control plane using Crossplane

A developer built Modelplane, an open-source inference control plane using Crossplane, to manage GPU fleets across clouds, neoclouds, and on-premise environments. The platform allows platform teams to…

12:31

2026-06-24

pub.towardsai.net

large-language-models

Stop Crashing and Start Cooking with vLLM on AMD and Lemonade Server

A developer achieved 3x better batch throughput with Qwen3.5 by fixing vLLM on AMD's Strix Halo with Lemonade Server, enabling more efficient inference on AMD hardware.…

07:21

2026-06-24

marktechpost.com

large-language-models

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

UC San Diego researchers developed DFlash, a speculative decoding method that uses a lightweight block diffusion model to draft entire token blocks in parallel, achieving up to 6.08x speedup on Qwen3-…

06:36

2026-06-24

dev.to

large-language-models

I built an interactive 11-chapter guide to how LLM inference actually works

A developer built an 11-chapter interactive guide explaining how LLM inference works, centered around nano-vLLM, a 1,200-line Python reimplementation of the vLLM serving engine. The guide covers algor…

22:57

2026-06-23

dev.to

large-language-models

Running a LangGraph ReAct Agent in Production: OpenAI-Compatible API + Multi-Model Gateway + One-Line Tracing

A developer built a production-ready LangGraph ReAct agent that exposes an OpenAI-compatible API, supports multi-model switching via a gateway, and includes one-line tracing with Langfuse. The deploym…

18:00

2026-06-23

research.ibm.com

artificial-intelligence

Running AI on mixed hardware for speed and affordability

IBM Research, Red Hat, and NxtGen Cloud Technologies demonstrated that using llm-d to serve AI models on mixed GPU hardware can boost inference speeds by 3 to 5 times and double throughput, enabling e…

17:42

2026-06-23

devclubhouse.com

large-language-models

Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance

Developers can deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, achieving thousands of output tokens per second through continuous batching. The tutorial covers instal…

16:00

2026-06-23

blog.skypilot.co

large-language-models

SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own

SkyPilot launched SkyPilot Endpoints, a production-ready LLM inference system that deploys and manages inference across multiple Kubernetes clusters from a single YAML specification. The system handle…

15:00

2026-06-22

hiraditya.github.io

large-language-models

Crossing the Boundary: Custom Kernels and the C++/Python ABI in vLLM

vLLM, a large-model inference serving framework, uses Python for control flow but pushes arithmetic into compiled C++ and CUDA kernels to avoid interpreter overhead. The Python/C++ boundary crossing i…

// co-occurs with top 8 entities

SGLang 33 · NVIDIA 28 · Hugging Face 24 · llama.cpp 22 · Ollama 17 · OpenAI 17 · Qwen 13 · Anthropic 12