Show HN: VLMs Can Respond Twice as Fast Without Losing Quality


Source: github.com · Published: 2026-06-20T22:09Z · Topic: artificial-intelligence

A new scheduling technique called TurboPrefill reduces waiting time for Vision Language Models by nearly half, from 9.0 to 4.6 seconds, without changing model weights or architecture. The optimization, validated on Qwen2.5-VL-72B-Instruct across 4 RTX 5060 Ti GPUs, doubles prefill throughput while keeping generation speed unchanged.


Validation of the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models (VLMs).

TurboPrefill cut the waiting time before answer generation nearly in half: from 9.0 s to 4.6 s.

Question:

What is happening in this image? Describe the animals, their approximate number, activity, environment, and colors. Which animal appears to be the leader of the group, and what five visual clues made you reach that conclusion? Use no more than 50 words.

Example answer:

Eight giraffes are walking across a grassy wetland near a river. The animals are light brown with darker patches. The leading giraffe appears to guide the group. Clues: front position, direction of movement, spacing, head orientation, and group alignment.

Validation on Vision Language Models demonstrates that Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill can significantly reduce user waiting time before answer generation without changing model weights, architecture, quantization, prompts, or inference mathematics.

The observed improvement was achieved solely through changes in execution scheduling during the prefill stage.
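The text above does not describe the scheduler's internals, so the following is only an idealized model of what pipelining the micro-batches of a single prompt across layer-split GPUs could look like. It assumes every stage takes one equal time step and ignores transfer and synchronization costs. The function names and the step model are illustrative assumptions, not the TurboPrefill implementation.

```python
from typing import List, Optional


def sequential_steps(n_ubatches: int, n_stages: int) -> int:
    """Each micro-batch passes through every stage before the next one starts,
    so only one GPU is busy at any time."""
    return n_ubatches * n_stages


def pipelined_steps(n_ubatches: int, n_stages: int) -> int:
    """Stage s starts micro-batch i at step i + s, so stages overlap."""
    return n_ubatches + n_stages - 1


def pipeline_schedule(n_ubatches: int, n_stages: int) -> List[List[Optional[int]]]:
    """For each time step, list the micro-batch index each stage works on
    (None means the stage is idle)."""
    schedule = []
    for t in range(pipelined_steps(n_ubatches, n_stages)):
        row = []
        for s in range(n_stages):
            i = t - s  # micro-batch that reaches stage s at step t
            row.append(i if 0 <= i < n_ubatches else None)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    # Hypothetical prompt of 22 micro-batches on 4 layer-split GPUs.
    n_ubatches, n_stages = 22, 4
    seq = sequential_steps(n_ubatches, n_stages)
    pipe = pipelined_steps(n_ubatches, n_stages)
    print(f"sequential: {seq} steps, pipelined: {pipe} steps, "
          f"ideal speedup: {seq / pipe:.2f}x")
```

Under these assumptions the ideal speedup approaches the number of stages as the prompt grows. The roughly 2× gain reported here is below that bound, plausibly because real runs include inter-GPU transfers, synchronization, and work that is not pipelined.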

Parameter	Value
Model	Qwen2.5-VL-72B-Instruct-Q4_K_M
Task	Vision-language question answering
Input	Single Full HD image (1920×1080)
GPUs	4× RTX 5060 Ti 16 GB
UBatch size	128
Split mode	Layer

Metric	Baseline	TurboPrefill
Waiting time before the response started	9.0 s	4.6 s
Prefill throughput	303 tok/s	604 tok/s
Generation throughput	8.6 tok/s	8.6 tok/s

TurboPrefill nearly halved the waiting time before the model started responding, while leaving answer generation speed unchanged.
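The ratios behind these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the table; the answer length of 64 tokens is a hypothetical value chosen to show how the time-to-first-token gain translates into end-to-end latency when generation speed stays the same.

```python
# Values copied from the metrics table.
BASELINE = {"wait_s": 9.0, "prefill_tps": 303.0, "gen_tps": 8.6}
TURBO = {"wait_s": 4.6, "prefill_tps": 604.0, "gen_tps": 8.6}


def total_latency(wait_s: float, gen_tps: float, n_answer_tokens: int) -> float:
    """Waiting time before the first token plus time to generate the answer."""
    return wait_s + n_answer_tokens / gen_tps


wait_speedup = BASELINE["wait_s"] / TURBO["wait_s"]
prefill_speedup = TURBO["prefill_tps"] / BASELINE["prefill_tps"]

# Hypothetical answer length; not a measured value.
N_ANSWER_TOKENS = 64
base_total = total_latency(BASELINE["wait_s"], BASELINE["gen_tps"], N_ANSWER_TOKENS)
turbo_total = total_latency(TURBO["wait_s"], TURBO["gen_tps"], N_ANSWER_TOKENS)

if __name__ == "__main__":
    print(f"waiting-time reduction: {wait_speedup:.2f}x")      # 1.96x
    print(f"prefill throughput gain: {prefill_speedup:.2f}x")  # 1.99x
    print(f"end-to-end for {N_ANSWER_TOKENS} tokens: "
          f"{base_total:.1f} s -> {turbo_total:.1f} s")
```

Because generation throughput is unchanged, the end-to-end gain shrinks as answers get longer; the benefit is concentrated in the wait before the first token.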

Validation on NVIDIA Pascal GPUs also demonstrated an approximately 2.2× reduction in prefill latency, suggesting that this optimization opportunity is not tied to a particular class of hardware and will likely remain relevant for future GPU generations.

The original scheduling mechanism was proposed in:

[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill

The original proof-of-concept implementation is available at:

https://github.com/sergey-automation/TurboPrefill

This repository validates the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models.

The objective is not to introduce a new scheduling mechanism, but to demonstrate that the original mechanism is applicable beyond text-only LLM workloads.

Reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

The original TurboPrefill PoC intentionally used a conservative dispatcher and left some eligible workloads on the standard llama.cpp execution path.

The current validation implementation enables additional workloads that are still within the original concept of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, but were not enabled in the first PoC.

Additional workloads currently enabled for the TurboPrefill execution path:

Execution of Text LLM workloads.
Execution of Vision Language Model (VLM) workloads.
Execution of multiple concurrent requests in multi-user server mode, provided that requests from different users are not mixed within the same TurboPrefill batch.
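The constraint in the last item, that requests from different users are never mixed within one TurboPrefill batch, can be illustrated with a small sketch. This is not the dispatcher from the validation branch (which lives in the modified llama.cpp sources); `split_into_ubatches` and the request IDs are hypothetical names used only to show the idea, with the micro-batch size taken from the UBatch setting in the parameter table.

```python
from typing import Dict, List, Tuple

UBATCH_SIZE = 128  # UBatch size from the validation setup


def split_into_ubatches(
    prompts: Dict[str, List[int]],
    ubatch_size: int = UBATCH_SIZE,
) -> List[Tuple[str, List[int]]]:
    """Chunk each request's prompt tokens on its own.

    A trailing partial micro-batch is left short rather than being topped up
    with tokens from another request, so no micro-batch spans two requests.
    """
    ubatches = []
    for request_id, tokens in prompts.items():
        for start in range(0, len(tokens), ubatch_size):
            ubatches.append((request_id, tokens[start:start + ubatch_size]))
    return ubatches


if __name__ == "__main__":
    # Two hypothetical concurrent requests with dummy token IDs.
    prompts = {"user-a": list(range(300)), "user-b": list(range(50))}
    for request_id, chunk in split_into_ubatches(prompts):
        print(request_id, len(chunk))
```

Keeping micro-batches single-request costs some padding efficiency on short tails, but it keeps the pipelined path free of cross-request bookkeeping.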

Work in progress.

Implementation files, scripts, input samples, and benchmark logs are published in this repository.

Experimental work in progress.

The reported results are based on the current prototype implementation. Text-model validation has been completed successfully. VLM support is still under active investigation, and additional correctness validation is required before drawing final conclusions.

files/: modified llama.cpp source files used for the validation branch.

scripts/: scripts used to run the VLM server and resolution tests.

resolution_samples/: input images used for validation.

benchmarks/: raw benchmark reports and server logs.

The validation was performed using the following reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

git clone https://github.com/sergey-automation/llama.cpp.git
cd llama.cpp
git checkout turboprefill-vlm-support

Build the reference implementation.

The validation uses the following model files:

Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf

mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf

Create the expected model directory:

mkdir -p /workspace/models/Qwen2.5-VL-72B
cd /workspace/models/Qwen2.5-VL-72B

Download the main model:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf"

Download the multimodal projector:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf"

Check the files:

ls -lh /workspace/models/Qwen2.5-VL-72B
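As an optional alternative to eyeballing the `ls` output, a short script can confirm that both files are present and non-empty before the server is started. This is a minimal sketch that assumes the directory and file names given above; `missing_model_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import List

MODEL_DIR = Path("/workspace/models/Qwen2.5-VL-72B")
EXPECTED_FILES = [
    "Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf",
    "mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf",
]


def missing_model_files(model_dir: Path = MODEL_DIR) -> List[str]:
    """Return the expected files that are absent or empty in model_dir."""
    missing = []
    for name in EXPECTED_FILES:
        path = model_dir / name
        # An interrupted download can leave a zero-byte file behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("Both model files are present.")
```

This only checks presence and size; it does not verify that a download completed or that the file contents are valid.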

Start the VLM server with TurboPrefill disabled:

TURBOPREFILL=0 ./run_vlm_server.sh

Run the benchmark:

python3 run_vlm_resolution.py

Start the VLM server with TurboPrefill enabled:

TURBOPREFILL=1 ./run_vlm_server.sh

Run the benchmark:

python3 run_vlm_resolution.py

Input images:

resolution_samples/

Reference benchmark reports and logs:

benchmarks/

Compare generated reports against the published benchmark logs included in this repository.
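The format of the generated reports is not described here, so the comparison below is only a sketch: it assumes the three headline metrics have been read out of a report by hand and checks them against the published table within a relative tolerance. The `deviations` helper, the metric keys, and the 10% tolerance are all assumptions made for illustration.

```python
from typing import Dict

# Reference values from the published metrics table.
REFERENCE = {
    "baseline": {"wait_s": 9.0, "prefill_tps": 303.0, "gen_tps": 8.6},
    "turboprefill": {"wait_s": 4.6, "prefill_tps": 604.0, "gen_tps": 8.6},
}


def deviations(
    measured: Dict[str, float],
    reference: Dict[str, float],
    rel_tol: float = 0.10,
) -> Dict[str, float]:
    """Return the metrics whose relative deviation from the reference
    exceeds rel_tol, mapped to that deviation."""
    out = {}
    for name, ref_value in reference.items():
        rel = abs(measured[name] - ref_value) / ref_value
        if rel > rel_tol:
            out[name] = rel
    return out


if __name__ == "__main__":
    # Hypothetical numbers from a local TurboPrefill run.
    my_run = {"wait_s": 4.8, "prefill_tps": 590.0, "gen_tps": 8.5}
    off = deviations(my_run, REFERENCE["turboprefill"])
    print("within tolerance" if not off else f"deviating metrics: {off}")
```

Results on different hardware will differ from the reference numbers, so on other GPUs the ratio between the baseline and TurboPrefill runs is likely a more meaningful comparison than absolute values.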

If this work is useful for future implementations of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, please cite the original RFC proposal, "[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill".


Read original on github.com → github.com/sergey-automation/TurboPrefill-VLM-Va…
