grep -l @vllm /news/*.json | wc -l → 154

vLLM

mentions 154 · type Organization

// recent coverage 154 mentions

03:40

2026-07-04

github.com

ai-tools

Show HN: SmolSignal – signal copilot for Flipper Zero files

SmolSignal, an AI signal copilot for Flipper Zero, launches to help users analyze capture files with plain-English explanations and safety classifications. The tool reads multiple file formats, detect…

20:44

2026-07-03

tigera.io

ai-agents

Six AI agent SDKs for enterprise Kubernetes, compared

Six AI agent SDKs—LangGraph, CrewAI, Google ADK, and others—are compared for enterprise Kubernetes deployment, with most being model-agnostic and containerizable for on-premise use, though Anthropic's…

09:07

2026-07-01

glukhov.org

large-language-models

Speculative Decoding: 20-50% Faster LLM Inference

Speculative decoding accelerates large language model inference by 20-50% without quality loss, using a draft-verify mechanism that generates multiple tokens per forward pass. The technique amortizes …

03:09

2026-07-01

byteiota.com

large-language-models

MiniMax M3: Open-Weight Model That Beats GPT-5.5 on Coding

MiniMax released M3, a 428-billion-parameter open-weight model, on June 7, achieving 59.0% on SWE-Bench Pro—slightly outperforming GPT-5.5's 58.6%—at $0.30 per million input tokens, making it 16 times…

00:49

2026-07-01

cryptobriefing.com

artificial-intelligence

Agents-A1 achieves 1T-model performance through long-task training, not bigger parameters

Shanghai AI Laboratory's Agents-A1, a 35-billion-parameter Mixture-of-Experts model, matches or outperforms trillion-parameter rivals by training on longer task sequences instead of increasing paramet…

20:52

2026-06-30

news.ycombinator.com

ai-agents

Show HN: Morph Reflexes – Multi-head classifiers for agent traces

Morph Reflexes launches a multi-head classifier API that analyzes agent traces for behavioral failures like looping and user frustration. The system uses a shared LLM backbone with reused KV cache to …

20:04

2026-06-30

letsdatascience.com

large-language-models

Article Compares Continuous and Static Batching in LLM Inference

A new article compares continuous batching and static batching in LLM inference, explaining how techniques in vLLM and TGI improve throughput and reduce latency. The choice of batching strategy affect…

15:08

2026-06-30

dev.to

artificial-intelligence

Seeking Guidance on AI Platform Engineering: Distributed Systems, Scheduling, and GPU Technologies

A developer exploring AI platform engineering found that the most challenging problems lie in distributed systems and scheduling rather than machine learning itself. The analysis of GPUs, Ray, vLLM, a…

14:00

2026-06-30

blog.r-lopes.com

large-language-models

Thematic Brief — How the KV cache accelerates LLM inference on GPUs

The KV cache accelerates LLM inference on GPUs by storing prior token key/value projections instead of recomputing them, reducing per-step attention cost from quadratic to linear. Decode is memory-ban…

00:00

2026-06-30

jasonrobert.dev

artificial-intelligence

News Summary for June 30, 2026

Agentic AI systems are maturing from prototypes into production-grade infrastructure, with vLLM's Micro-Agent framework demonstrating that serving-layer orchestration can match or beat frontier models…

00:00

2026-06-30

aclanthology.org

artificial-intelligence

CUHKSZ Simultaneous Speech Translation System for IWSLT 2026

The CUHKSZ team submitted a simultaneous speech translation system to IWSLT 2026, built on Qwen3-Omni-30B-A3B with LoRA adaptation, achieving 40.5 BLEU for English→Chinese and 27.7 BLEU for English→Ge…

18:03

2026-06-29

vllm.ai

ai-infrastructure

Micro-Agent: Beat Frontier Models with Collaboration Inside Model API

vLLM Semantic Router introduces a new open-source serving primitive that turns a single model API call into a bounded collaboration among multiple micro-agents, enabling cost savings, safety enforceme…

14:18

2026-06-29

pytorch.org

developer-tools

Introducing Cross-Repository CI Relay: Scalable CI for PyTorch’s Out-of-Tree Backends

PyTorch introduced the Cross-Repository CI Relay (CRCR), an automated pipeline that triggers and tracks CI in downstream repositories whenever changes are made to pytorch/pytorch. Results are displaye…

07:05

2026-06-29

dev.to

large-language-models

"LLM Inference Optimization: The Line Item That Decides If Your AI Ships"

LLM inference optimization can reduce serving costs by 5-10x and latency by 3-5x, often determining whether an AI feature ships. The bottleneck is memory bandwidth during autoregressive decoding, and …

23:06

2026-06-28

dev.to

large-language-models

KV Cache Is Eating Your VRAM — Here's How to Estimate It Before You Run Out

An engineer provides a formula to estimate KV cache memory consumption for large language models, showing that the KV cache often becomes the bottleneck before model weights. For Llama 3.1 70B at 128K…

00:46

2026-06-28

github.com

artificial-intelligence

AMD Strix Halo RDMA Cluster Setup Guide

An AMD Strix Halo cluster setup guide details how to configure a two-node system linked via Intel E810 RoCE v2 for distributed vLLM inference using tensor parallelism. The guide covers hardware prerequis…

15:27

2026-06-27

cefboud.com

large-language-models

Distributed LLM Inference with LLM-d

A new open-source tool called llm-d acts as an LLM-aware load balancer for distributed inference, intelligently routing requests across vLLM instances based on KV cache locality and GPU utilization. B…

10:10

2026-06-27

dev.to

ai-agents

DeerFlow 2.0 Review: ByteDance's Open SuperAgent Harness

ByteDance open-sourced DeerFlow 2.0, a long-horizon agent runtime that orchestrates sub-agents, sandboxes, persistent memory, and an extensible skill system. The project reached 74,960 GitHub stars an…

08:06

2026-06-27

github.com

ai-tools

Show HN: Brytlog – AI logger

A developer released Brytlog, an open-source AI logger that replaces raw terminal output with concise AI summaries to save developers time and money. The tool acts as a pre-processor for agentic workflo…

22:35

2026-06-26

cmart.blog

large-language-models

Inference Cards

A new plaintext markup format called Inference Cards aims to standardize how self-hosted LLM performance claims are communicated, requiring details like model variant, quantization, hardware, inferenc…

// co-occurs with top 8 entities

SGLang (33), NVIDIA (28), Hugging Face (24), llama.cpp (22), Ollama (17), OpenAI (17), Qwen (13), Anthropic (12)