
vLLM

mentions 154 type Organization

// recent coverage 154 mentions

02:50
2026-06-18
discuss.huggingface.co
large-language-models

Local-LLM-Launcher-GUI: For those who hate CLI flags

A new open-source GUI tool, Local-LLM-Launcher-GUI, lets users run large language models locally via vLLM or llama.cpp without memorizing command-line flags. The browser-based interface provides hardw…

01:08
2026-06-18
byteiota.com
large-language-models

MiniMax M3: Open-Weight Frontier Model at 5% of Opus Cost

MiniMax released the M3 open-weight model, claiming it costs 5% of Claude Opus per task, achieves 59% on SWE-Bench Pro, and supports a 1-million-token context window at one-twentieth the compute of it…

00:00
2026-06-18
techstackups.com
large-language-models

GLM-5.2 vs Claude Opus

Z.ai released GLM-5.2, an open-weights AI model under an MIT license, positioning it between Claude Opus 4.7 and 4.8 in performance while costing less than a fifth of Opus on output tokens. The model …

15:00
2026-06-17
hiraditya.github.io
large-language-models

vLLM's op IR, or: where the inference engine meets the compiler

vLLM, a model-serving engine for large language models, introduced a small op-level IR to resolve the tension between acting as a compiler target and a hand-tuned kernel dispatcher. The IR allows vLLM…

06:33
2026-06-17
arxiv.org
machine-learning

Fearless Concurrency on the GPU

Researchers introduced cuTile Rust, a tile-based system for safe, idiomatic GPU kernel authoring in Rust that extends Rust's ownership discipline to GPU kernels. On the NVIDIA B200 GPU, cuTile Rust ac…

20:08
2026-06-16
byteiota.com
large-language-models

Mellum2: JetBrains Open-Sources a 12B MoE Coding Model

JetBrains open-sourced Mellum2, a 12B Mixture-of-Experts coding model under Apache 2.0, designed for air-gapped and compliance-locked environments where external API calls are prohibited. The model us…

19:37
2026-06-16
dev.to
large-language-models

Serving any LLM using a single command line with Flama

Flama 2.0 introduces first-class support for generative AI, enabling users to download, package, and serve large language models (LLMs) via a single command line. The framework allows fetching models …

18:11
2026-06-16
the-ai-corner.com
ai-infrastructure

Inference engineering is the 80% cost cut most teams miss

Inference engineering, the craft of optimizing GPU operations during AI model inference, can cut costs by up to 80% by addressing the split between prefill and decode phases. Two teams using the same …

17:32
2026-06-16
newsletter.semianalysis.com
machine-learning

RL Systems Mind the Gap: Matching Trainer and Generator Throughput

Anthropic CEO Dario Amodei said reinforcement learning shows the same log-linear scaling as pre-training, but RL system efficiency is critical to afford enough training. Experiments on open models sho…
