Validation of the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models (VLMs).
TurboPrefill cut the waiting time before answer generation nearly in half: from 9.0 s to 4.6 s.
Question:
What is happening in this image? Describe the animals, their approximate number, activity, environment, and colors. Which animal appears to be the leader of the group, and what five visual clues made you reach that conclusion? Use no more than 50 words.
Example answer:
Eight giraffes are walking across a grassy wetland near a river. The animals are light brown with darker patches. The leading giraffe appears to guide the group. Clues: front position, direction of movement, spacing, head orientation, and group alignment.
Validation on Vision Language Models demonstrates that Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill can significantly reduce user waiting time before answer generation without changing model weights, architecture, quantization, prompts, or inference mathematics.
The observed improvement was achieved solely through changes in execution scheduling during the prefill stage.
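As a rough intuition for why scheduling alone can change prefill time, the sketch below counts time steps for two idealized schedules on a layer-split model. In the fully serialized schedule only one GPU works at a time. In the fully overlapped schedule each GPU starts the next chunk as soon as it hands off the current one. This is a toy model, not a description of the llama.cpp baseline or of the TurboPrefill implementation. It assumes equal per-stage cost, no transfer overhead, and a hypothetical 2048-token prompt. The 4 stages and the micro-batch size of 128 mirror the configuration reported in this document.

```python
import math


def prefill_steps(n_tokens: int, ubatch: int, n_stages: int) -> tuple[int, int]:
    """Return (serialized_steps, overlapped_steps).

    One step is the time for one micro-batch to pass through one stage
    (assumed identical for every stage and every micro-batch).
    """
    n_chunks = math.ceil(n_tokens / ubatch)
    # Serialized: every chunk visits every stage while the other stages idle.
    serialized = n_chunks * n_stages
    # Ideal overlap: (n_stages - 1) steps to fill the pipeline,
    # then one chunk completes per step.
    overlapped = n_chunks + n_stages - 1
    return serialized, overlapped


# 2048 tokens is a hypothetical prompt length chosen for illustration.
serial, piped = prefill_steps(n_tokens=2048, ubatch=128, n_stages=4)
print(serial, piped, round(serial / piped, 2))  # prints: 64 19 3.37
```

In this model the ratio approaches the number of stages as the prompt grows. Real gains can be smaller because stage costs are uneven and inter-GPU transfers are not free.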
| Parameter | Value |
|---|---|
| Model | Qwen2.5-VL-72B-Instruct-Q4_K_M |
| Task | Vision-language question answering |
| Input | Single Full HD image (1920×1080) |
| GPUs | 4× RTX 5060 Ti 16 GB |
| UBatch size | 128 |
| Split mode | Layer |

| Metric | Baseline | TurboPrefill |
|---|---|---|
| Waiting time before the response started | 9.0 s | 4.6 s |
| Prefill throughput | 303 tok/s | 604 tok/s |
| Generation throughput | 8.6 tok/s | 8.6 tok/s |
TurboPrefill nearly halved the waiting time before the model started responding, while leaving answer generation speed unchanged.
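The ratios behind "nearly halved" can be reproduced from the results table. The snippet below only restates the published numbers; the dictionary keys are illustrative names, not fields from the benchmark reports.

```python
# Values copied from the results table; key names are illustrative.
baseline = {"wait_s": 9.0, "prefill_tps": 303.0}
turbo = {"wait_s": 4.6, "prefill_tps": 604.0}

# Latency: lower is better, so speedup is baseline / optimized.
wait_speedup = baseline["wait_s"] / turbo["wait_s"]
wait_reduction = 1.0 - turbo["wait_s"] / baseline["wait_s"]
# Throughput: higher is better, so gain is optimized / baseline.
tps_gain = turbo["prefill_tps"] / baseline["prefill_tps"]

print(f"{wait_speedup:.2f}x shorter wait")        # 1.96x shorter wait
print(f"{wait_reduction:.1%} less waiting")       # 48.9% less waiting
print(f"{tps_gain:.2f}x prefill throughput")      # 1.99x prefill throughput
```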
Validation on NVIDIA Pascal GPUs also demonstrated an approximately 2.2× reduction in prefill latency, suggesting that this optimization opportunity is not tied to a particular class of hardware and will likely remain relevant for future GPU generations.
The original scheduling mechanism was proposed in:
[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill
The original proof-of-concept implementation is available at:
https://github.com/sergey-automation/TurboPrefill
This repository validates the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models.
The objective is not to introduce a new scheduling mechanism, but to demonstrate that the original mechanism is applicable beyond text-only LLM workloads.
Reference implementation branch:
https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support
The original TurboPrefill PoC intentionally used a conservative dispatcher and left some eligible workloads on the standard llama.cpp execution path.
The current validation implementation enables additional workloads that are still within the original concept of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, but were not enabled in the first PoC.
Additional workloads currently enabled for the TurboPrefill execution path:
- Execution of Text LLM workloads.
- Execution of Vision Language Model (VLM) workloads.
- Execution of multiple concurrent requests in multi-user server mode, provided that requests from different users are not mixed within the same TurboPrefill batch.
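The constraint in the last item, that requests from different users are never mixed in one batch, can be illustrated with a minimal sketch. The function name and the request dictionary are hypothetical; the dispatcher in the validation branch may be organized quite differently.

```python
from typing import Dict, Iterator, List, Tuple


def per_request_ubatches(
    requests: Dict[str, List[int]], ubatch: int = 128
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (request_id, token_chunk) pairs.

    Each chunk holds at most `ubatch` tokens and never spans two requests.
    """
    for request_id, tokens in requests.items():
        for start in range(0, len(tokens), ubatch):
            yield request_id, tokens[start:start + ubatch]


# Two hypothetical pending prompts of 300 and 100 tokens.
pending = {"user-a": list(range(300)), "user-b": list(range(100))}
for request_id, chunk in per_request_ubatches(pending):
    print(request_id, len(chunk))
# user-a 128
# user-a 128
# user-a 44
# user-b 100
```

The short final chunk of `user-a` is not topped up with tokens from `user-b`, which trades some batch utilization for strict separation.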
Implementation files, scripts, input samples, and benchmark logs are published in this repository.
Experimental work in progress.
The reported results are based on the current prototype implementation. Text-model validation has been completed successfully. VLM support is still under active investigation, and additional correctness validation is required before drawing final conclusions.
- `files/`: modified llama.cpp source files used for the validation branch.
- `scripts/`: scripts used to run the VLM server and resolution tests.
- `resolution_samples/`: input images used for validation.
- `benchmarks/`: raw benchmark reports and server logs.
The validation was performed using the following reference implementation branch:
https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support
```bash
git clone https://github.com/sergey-automation/llama.cpp.git
cd llama.cpp
git checkout turboprefill-vlm-support
```
Build the reference implementation.
The validation uses the following model files:
- `Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf`
- `mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf`
Create the expected model directory:
```bash
mkdir -p /workspace/models/Qwen2.5-VL-72B
cd /workspace/models/Qwen2.5-VL-72B
```
Download the main model:
```bash
wget -c --content-disposition \
  "https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf"
```
Download the multimodal projector:
```bash
wget -c --content-disposition \
  "https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf"
```
Check the files:
```bash
ls -lh /workspace/models/Qwen2.5-VL-72B
```
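The same check can be scripted so that a missing or empty download is caught before the server starts. This is an optional sketch, not part of the repository; the function name is hypothetical, while the directory and file names are the ones listed above. It does not verify checksums, so a truncated but non-empty download would still pass.

```python
from pathlib import Path

MODEL_DIR = Path("/workspace/models/Qwen2.5-VL-72B")
EXPECTED = [
    "Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf",
    "mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf",
]


def missing_model_files(model_dir: Path, names=EXPECTED) -> list[str]:
    """Return the expected files that are absent or empty in model_dir."""
    missing = []
    for name in names:
        path = model_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_model_files(MODEL_DIR)
    print("OK" if not problems else f"Missing or empty: {problems}")
```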
Start the VLM server with TurboPrefill disabled:
```bash
TURBOPREFILL=0 ./run_vlm_server.sh
```
Run the benchmark:
```bash
python3 run_vlm_resolution.py
```
Start the VLM server with TurboPrefill enabled:
```bash
TURBOPREFILL=1 ./run_vlm_server.sh
```
Run the benchmark:
```bash
python3 run_vlm_resolution.py
```
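The headline metric in this document is the waiting time before the response starts. For readers who want to measure it independently of the provided script, the sketch below times the first non-empty chunk of any streamed response. It is not taken from `run_vlm_resolution.py`; the function name is hypothetical, and `fake_stream` is a stand-in for a real streaming client.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> Tuple[Optional[float], float, int]:
    """Consume a stream of response chunks.

    Returns (seconds until the first non-empty chunk, total seconds,
    number of non-empty chunks). The first value is None if nothing arrived.
    """
    start = clock()
    first = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore keep-alive or empty deltas
        if first is None:
            first = clock() - start
        count += 1
    return first, clock() - start, count


def fake_stream():
    # Stand-in for a streaming server response (hypothetical).
    time.sleep(0.05)  # simulated prefill delay before the first token
    for word in ["Eight", " giraffes", " are", " walking"]:
        yield word
        time.sleep(0.01)  # simulated per-token generation time


ttft, total, n = measure_stream(fake_stream())
print(f"first chunk after {ttft:.3f} s, {n} chunks in {total:.3f} s")
```

Measuring on the client side includes network and HTTP overhead, so the values will not match server-side logs exactly.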
Input images:
resolution_samples/
Reference benchmark reports and logs:
benchmarks/
Compare generated reports against the published benchmark logs included in this repository.
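One way to make that comparison repeatable is a tolerance check on the headline metrics. The sketch below is an assumption-laden example: the report format is not described here, so the metric names are hypothetical, the "measured" values are made up for illustration, and the 10% tolerance is an arbitrary choice. The reference values are the TurboPrefill column of the results table.

```python
import math


def compare_metrics(measured: dict, reference: dict, rel_tol: float = 0.10) -> dict:
    """Return {metric: (measured, reference)} for every reference metric
    that is missing from `measured` or differs by more than rel_tol."""
    out_of_range = {}
    for name, ref in reference.items():
        got = measured.get(name)
        if got is None or not math.isclose(got, ref, rel_tol=rel_tol):
            out_of_range[name] = (got, ref)
    return out_of_range


# Published TurboPrefill results; key names are hypothetical.
reference = {"wait_s": 4.6, "prefill_tps": 604.0, "gen_tps": 8.6}
# Made-up numbers standing in for a local rerun.
measured = {"wait_s": 4.9, "prefill_tps": 571.0, "gen_tps": 8.5}

print(compare_metrics(measured, reference))  # prints: {}
```

An empty result means every metric is within tolerance. Run-to-run variance depends on hardware and drivers, so the tolerance should be tuned to the local setup.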
If this work is useful for future implementations of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, please cite the original RFC proposal: