
Show HN: VLMs Can Respond Twice as Fast Without Losing Quality

A scheduling technique called TurboPrefill cuts the wait before a Vision Language Model starts responding nearly in half, from 9.0 to 4.6 seconds, without changing model weights or architecture. In a test with Qwen2.5-VL-72B-Instruct (Q4_K_M) on four RTX 5060 Ti GPUs, prefill throughput roughly doubled (303 to 604 tok/s) while generation speed stayed at 8.6 tok/s.

Published: Jun 20, 2026

Validation of the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models (VLMs).

TurboPrefill cut the waiting time before answer generation nearly in half: from 9.0 s to 4.6 s.

Question:

What is happening in this image? Describe the animals, their approximate number, activity, environment, and colors. Which animal appears to be the leader of the group, and what five visual clues made you reach that conclusion? Use no more than 50 words.

Example answer:

Eight giraffes are walking across a grassy wetland near a river. The animals are light brown with darker patches. The leading giraffe appears to guide the group. Clues: front position, direction of movement, spacing, head orientation, and group alignment.

Validation on Vision Language Models demonstrates that Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill can significantly reduce user waiting time before answer generation without changing model weights, architecture, quantization, prompts, or inference mathematics.

The observed improvement was achieved solely through changes in execution scheduling during the prefill stage.
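The source does not describe the scheduler's internals, so the following is only a toy model of what pipelining micro-batches of a single prompt across layer-split GPUs could look like. It assumes every stage takes one equal time step per micro-batch and that the unpipelined case runs the stages with no overlap; both are simplifications, and the function names are hypothetical rather than taken from TurboPrefill or llama.cpp.

```python
from typing import List, Optional


def sequential_steps(stages: int, micro_batches: int) -> int:
    """Time steps when only one stage is busy at any moment (assumed baseline)."""
    return stages * micro_batches


def pipelined_steps(stages: int, micro_batches: int) -> int:
    """Time steps when each stage starts a micro-batch as soon as the previous stage hands it over."""
    return micro_batches + stages - 1


def pipeline_schedule(stages: int, micro_batches: int) -> List[List[Optional[int]]]:
    """schedule[step][stage] is the micro-batch index processed, or None if the stage is idle."""
    total = pipelined_steps(stages, micro_batches)
    return [
        [step - stage if 0 <= step - stage < micro_batches else None
         for stage in range(stages)]
        for step in range(total)
    ]


if __name__ == "__main__":
    # Illustrative numbers only: 4 stages (one per GPU) and 8 micro-batches.
    for step, row in enumerate(pipeline_schedule(4, 8)):
        print(step, row)
    print("ideal speedup:", sequential_steps(4, 8) / pipelined_steps(4, 8))
```

In this idealized model the speedup approaches the number of stages as the prompt grows. It is an upper bound under the stated assumptions, not a prediction of the measured results.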

| Parameter | Value |
| --- | --- |
| Model | Qwen2.5-VL-72B-Instruct-Q4_K_M |
| Task | Vision-language question answering |
| Input | Single Full HD image (1920×1080) |
| GPUs | 4× RTX 5060 Ti 16 GB |
| UBatch size | 128 |
| Split mode | Layer |

| Metric | Baseline | TurboPrefill |
| --- | --- | --- |
| Waiting time before the response started | 9.0 s | 4.6 s |
| Prefill throughput | 303 tok/s | 604 tok/s |
| Generation throughput | 8.6 tok/s | 8.6 tok/s |

TurboPrefill nearly halved the waiting time before the model started responding, while leaving answer generation speed unchanged.
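As a quick check of the reported figures, the ratios can be computed directly from the published numbers. The variable names are illustrative, and the "implied tokens" line assumes the waiting time is dominated by prefill, which the source does not state.

```python
# Figures copied from the results reported above.
baseline = {"wait_s": 9.0, "prefill_tok_s": 303.0, "generation_tok_s": 8.6}
turbo = {"wait_s": 4.6, "prefill_tok_s": 604.0, "generation_tok_s": 8.6}

wait_speedup = baseline["wait_s"] / turbo["wait_s"]
prefill_speedup = turbo["prefill_tok_s"] / baseline["prefill_tok_s"]
wait_reduction_pct = 100.0 * (1.0 - turbo["wait_s"] / baseline["wait_s"])

# Assumption: wait time is roughly prompt_tokens / prefill_throughput.
implied_tokens_baseline = baseline["wait_s"] * baseline["prefill_tok_s"]
implied_tokens_turbo = turbo["wait_s"] * turbo["prefill_tok_s"]

print(f"wait speedup:       {wait_speedup:.2f}x")        # 1.96x
print(f"prefill speedup:    {prefill_speedup:.2f}x")     # 1.99x
print(f"wait reduction:     {wait_reduction_pct:.1f}%")  # 48.9%
print(f"implied tokens:     {implied_tokens_baseline:.0f} vs {implied_tokens_turbo:.0f}")
```

The two implied token counts differ by about 2%, which is roughly what one would expect if both runs processed the same prompt and the wait is mostly prefill.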

Validation on NVIDIA Pascal GPUs also demonstrated an approximately 2.2× reduction in prefill latency, suggesting that this optimization opportunity is not tied to a particular class of hardware and will likely remain relevant for future GPU generations.

The original scheduling mechanism was proposed in:

[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill

The original proof-of-concept implementation is available at:

https://github.com/sergey-automation/TurboPrefill

This repository validates the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models.

The objective is not to introduce a new scheduling mechanism, but to demonstrate that the original mechanism is applicable beyond text-only LLM workloads.

Reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

The original TurboPrefill PoC intentionally used a conservative dispatcher and left some eligible workloads on the standard llama.cpp execution path.

The current validation implementation enables additional workloads that are still within the original concept of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, but were not enabled in the first PoC.

Additional workloads currently enabled for the TurboPrefill execution path:

  • Execution of Text LLM workloads.
  • Execution of Vision Language Model (VLM) workloads.
  • Execution of multiple concurrent requests in multi-user server mode, provided that requests from different users are not mixed within the same TurboPrefill batch.
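The constraint in the last item, that requests from different users never share a batch, can be illustrated with a small sketch. This is not the dispatcher from the repository: `plan_chunks`, the request ids, and the chunking policy are hypothetical, and only the micro-batch size of 128 comes from the test configuration.

```python
from typing import Dict, List, Tuple


def plan_chunks(prompt_lengths: Dict[str, int], ubatch: int = 128) -> List[Tuple[str, int, int]]:
    """Split each request's prompt into chunks of at most `ubatch` tokens.

    Each chunk is (request_id, start, end) and belongs to exactly one
    request, so no chunk mixes tokens from different users.
    """
    if ubatch <= 0:
        raise ValueError("ubatch must be positive")
    chunks: List[Tuple[str, int, int]] = []
    for request_id, length in prompt_lengths.items():
        for start in range(0, length, ubatch):
            chunks.append((request_id, start, min(start + ubatch, length)))
    return chunks


if __name__ == "__main__":
    # Two hypothetical concurrent requests with 300 and 100 prompt tokens.
    for chunk in plan_chunks({"user-a": 300, "user-b": 100}):
        print(chunk)
```

The last chunk of a request is left short instead of being topped up with another user's tokens, which is the trade-off the constraint implies.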

Work in progress.

Implementation files, scripts, input samples, and benchmark logs are published in this repository.

Experimental work in progress.

The reported results are based on the current prototype implementation. Text-model validation has been completed successfully. VLM support is still under active investigation, and additional correctness validation is required before drawing final conclusions.

  • files/: modified llama.cpp source files used for the validation branch.
  • scripts/: scripts used to run the VLM server and resolution tests.
  • resolution_samples/: input images used for validation.
  • benchmarks/: raw benchmark reports and server logs.

The validation was performed using the following reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

git clone https://github.com/sergey-automation/llama.cpp.git
cd llama.cpp
git checkout turboprefill-vlm-support

Build the reference implementation.

The validation uses the following model files:

Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf

mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf

Create the expected model directory:

mkdir -p /workspace/models/Qwen2.5-VL-72B
cd /workspace/models/Qwen2.5-VL-72B

Download the main model:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf"

Download the multimodal projector:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf"

Check the files:

ls -lh /workspace/models/Qwen2.5-VL-72B
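For a scripted version of the same check, a minimal sketch could confirm that both files are present and non-empty. The helper name `missing_model_files` is illustrative; the file names and directory are the ones listed above.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = (
    "Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf",
    "mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf",
)


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected file names that are absent or empty in model_dir."""
    root = Path(model_dir)
    return [
        name for name in EXPECTED_FILES
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_model_files("/workspace/models/Qwen2.5-VL-72B")
    print("missing or empty:", missing if missing else "none")
```

This only detects missing or zero-byte files; an interrupted download that left a partial file would still pass.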

Start the VLM server with TurboPrefill disabled:

TURBOPREFILL=0 ./run_vlm_server.sh

Run the benchmark:

python3 run_vlm_resolution.py

Start the VLM server with TurboPrefill enabled:

TURBOPREFILL=1 ./run_vlm_server.sh

Run the benchmark again:

python3 run_vlm_resolution.py
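The headline metric is the wait before the response starts. Independently of the repository's benchmark script, whose internals are not shown here, that wait could be approximated by timing the first streamed event of a request. The sketch below assumes the server exposes an OpenAI-compatible streaming endpoint that emits server-sent events; the function names are hypothetical and the URL and payload are left to the caller.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Optional


def seconds_to_first_data_line(
    lines: Iterable[bytes],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from `start` until the first SSE 'data:' line, or None if none arrives."""
    for raw in lines:
        line = raw.strip()
        if line.startswith(b"data:") and line != b"data: [DONE]":
            return clock() - start
    return None


def measure(url: str, payload: dict) -> Optional[float]:
    """POST a streaming request and time the first streamed event (assumed SSE response)."""
    body = json.dumps({**payload, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        # The HTTP response object iterates line by line as bytes.
        return seconds_to_first_data_line(response, start)
```

Running `measure` once against the server started with `TURBOPREFILL=0` and once with `TURBOPREFILL=1`, using the same image and question, would give a rough comparison. The first event may carry no visible text, so treat the result as an approximation of time to first token.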

Input images:

resolution_samples/

Reference benchmark reports and logs:

benchmarks/

Compare generated reports against the published benchmark logs included in this repository.
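The report format is not described here, so the comparison below works on plain dictionaries of metrics that you would fill in by hand from your own report. The metric names and the 10% tolerance are assumptions chosen for illustration; the reference values are the TurboPrefill figures reported above.

```python
import math
from typing import Dict


def compare_metrics(
    measured: Dict[str, float],
    reference: Dict[str, float],
    rel_tol: float = 0.10,
) -> Dict[str, bool]:
    """Map each reference metric to True if the measured value is within rel_tol of it."""
    return {
        name: name in measured and math.isclose(measured[name], ref, rel_tol=rel_tol)
        for name, ref in reference.items()
    }


if __name__ == "__main__":
    reference = {"wait_s": 4.6, "prefill_tok_s": 604.0, "generation_tok_s": 8.6}
    measured = {"wait_s": 4.8, "prefill_tok_s": 590.0, "generation_tok_s": 8.5}  # example values
    for name, ok in compare_metrics(measured, reference).items():
        print(f"{name}: {'within tolerance' if ok else 'differs'}")
```

Absolute numbers will vary with hardware, drivers, and build options, so the ratio between baseline and TurboPrefill runs on the same machine is likely a more meaningful comparison than a match against the published values.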

If this work is useful for future implementations of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, please cite the original RFC proposal, "[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill".
