{"slug": "show-hn-vlms-can-respond-twice-as-fast-without-losing-quality", "title": "Show HN: VLMs Can Respond Twice as Fast Without Losing Quality", "summary": "A new scheduling technique called TurboPrefill reduces waiting time for Vision Language Models by nearly half, from 9.0 to 4.6 seconds, without changing model weights or architecture. The optimization, validated on Qwen2.5-VL-72B-Instruct across 4 RTX 5060 Ti GPUs, doubles prefill throughput while keeping generation speed unchanged.", "body_md": "Validation of the applicability of **Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill** to Vision Language Models (VLMs).\n\nTurboPrefill cut the waiting time before answer generation nearly in half: from 9.0 s to 4.6 s.\n\nQuestion:\n\nWhat is happening in this image? Describe the animals, their approximate number, activity, environment, and colors. Which animal appears to be the leader of the group, and what five visual clues made you reach that conclusion? Use no more than 50 words.\n\nExample answer:\n\nEight giraffes are walking across a grassy wetland near a river. The animals are light brown with darker patches. The leading giraffe appears to guide the group. 
Clues: front position, direction of movement, spacing, head orientation, and group alignment.\n\nValidation on Vision Language Models demonstrates that **Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill** can significantly reduce user waiting time before answer generation without changing model weights, architecture, quantization, prompts, or inference mathematics.\n\nThe observed improvement was achieved solely through changes in execution scheduling during the prefill stage.\n\n| Parameter | Value |\n|---|---|\n| Model | Qwen2.5-VL-72B-Instruct-Q4_K_M |\n| Task | Vision-language question answering |\n| Input | Single Full HD image (1920×1080) |\n| GPUs | 4× RTX 5060 Ti 16 GB |\n| UBatch size | 128 |\n| Split mode | Layer |\n\n| Metric | Baseline | TurboPrefill |\n|---|---|---|\n| Waiting time before the response started | 9.0 s | 4.6 s |\n| Prefill throughput | 303 tok/s | 604 tok/s |\n| Generation throughput | 8.6 tok/s | 8.6 tok/s |\n\nTurboPrefill nearly halved the waiting time before the model started responding, while leaving answer generation speed unchanged.\n\nValidation on NVIDIA Pascal GPUs also demonstrated an approximately 2.2× reduction in prefill latency, suggesting that this optimization opportunity is not tied to a particular class of hardware and will likely remain relevant for future GPU generations.\n\nThe original scheduling mechanism was proposed in:\n\n**[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill**\n\nThe original proof-of-concept implementation is available at:\n\n[https://github.com/sergey-automation/TurboPrefill](https://github.com/sergey-automation/TurboPrefill)\n\nThis repository validates the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models.\n\nThe objective is not to introduce a new scheduling mechanism, but to demonstrate that the original mechanism is applicable beyond text-only LLM workloads.\n\nReference implementation 
branch:\n\n[https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support](https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support)\n\nThe original TurboPrefill PoC intentionally used a conservative dispatcher and left some eligible workloads on the standard llama.cpp execution path.\n\nThe current validation implementation enables additional workloads that fall within the original concept of **Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill** but were not enabled in the first PoC.\n\nAdditional workloads currently enabled for the TurboPrefill execution path:\n\n- Text LLM workloads.\n- Vision Language Model (VLM) workloads.\n- Multiple concurrent requests in multi-user server mode, provided that requests from different users are not mixed within the same TurboPrefill batch.\n\nImplementation files, scripts, input samples, and benchmark logs are published in this repository.\n\nExperimental work in progress.\n\nThe reported results are based on the current prototype implementation. Text-model validation has been completed successfully. 
VLM support is still under active investigation, and additional correctness validation is required before drawing final conclusions.\n\n- `files/`: modified llama.cpp source files used for the validation branch.\n- `scripts/`: scripts used to run the VLM server and resolution tests.\n- `resolution_samples/`: input images used for validation.\n- `benchmarks/`: raw benchmark reports and server logs.\n\nThe validation was performed using the following reference implementation branch:\n\n[https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support](https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support)\n\n```\ngit clone https://github.com/sergey-automation/llama.cpp.git\ncd llama.cpp\ngit checkout turboprefill-vlm-support\n```\n\nBuild the reference implementation.\n\nThe validation uses the following model files:\n\n`Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf`\n\n`mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf`\n\nCreate the expected model directory:\n\n```\nmkdir -p /workspace/models/Qwen2.5-VL-72B\ncd /workspace/models/Qwen2.5-VL-72B\n```\n\nDownload the main model:\n\n```\nwget -c --content-disposition \\\n\"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf\"\n```\n\nDownload the multimodal projector:\n\n```\nwget -c --content-disposition \\\n\"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf\"\n```\n\nCheck the files:\n\n```\nls -lh /workspace/models/Qwen2.5-VL-72B\n```\n\nStart the VLM server with TurboPrefill disabled:\n\n```\nTURBOPREFILL=0 ./run_vlm_server.sh\n```\n\nRun the benchmark:\n\n```\npython3 run_vlm_resolution.py\n```\n\nStart the VLM server with TurboPrefill enabled:\n\n```\nTURBOPREFILL=1 ./run_vlm_server.sh\n```\n\nRun the benchmark:\n\n```\npython3 run_vlm_resolution.py\n```\n\nInput images:\n\n```\nresolution_samples/\n```\n\nReference benchmark reports and 
logs:\n\n```\nbenchmarks/\n```\n\nCompare generated reports against the published benchmark logs included in this repository.\n\nIf this work is useful for future implementations of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, please cite the original RFC proposal:", "url": "https://wpnews.pro/news/show-hn-vlms-can-respond-twice-as-fast-without-losing-quality", "canonical_source": "https://github.com/sergey-automation/TurboPrefill-VLM-Validation", "published_at": "2026-06-20 22:09:28+00:00", "updated_at": "2026-06-20 22:37:44.244527+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "computer-vision", "ai-infrastructure"], "entities": ["TurboPrefill", "Qwen2.5-VL-72B-Instruct", "RTX 5060 Ti", "NVIDIA", "llama.cpp", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/show-hn-vlms-can-respond-twice-as-fast-without-losing-quality", "markdown": "https://wpnews.pro/news/show-hn-vlms-can-respond-twice-as-fast-without-losing-quality.md", "text": "https://wpnews.pro/news/show-hn-vlms-can-respond-twice-as-fast-without-losing-quality.txt", "jsonld": "https://wpnews.pro/news/show-hn-vlms-can-respond-twice-as-fast-without-losing-quality.jsonld"}}