{"slug": "turboprefill-2-7x-faster-than-llama-cpp-pipeline-parallel-on-llama-3-70b", "title": "TurboPrefill: 2.7× faster than llama.cpp Pipeline Parallel on Llama-3-70B", "summary": "TurboPrefill introduces intra-prompt pipeline scheduling for multi-GPU prefill, achieving up to 2.7× faster performance than llama.cpp on Llama-3-70B by overlapping GPU stage execution. The PoC shows gains of 87-135% across various GPU configurations without altering the computational model.", "body_md": "# [RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill #24219\n\n[sergey-automation](/sergey-automation) wants to merge 1 commit into\n\n## 1. Status: Proof of Concept (PoC)\n\nThis RFC describes an experimental scheduling mechanism implemented in the TurboPrefill project and intended for discussion of the architectural approach.\n\n## 2. The Problem\n\nWith `split_mode = layer`, processing of a single request during the `prefill` phase is performed by passing each `ubatch` sequentially through the GPU pipeline.\n\nWhen processing a single `ubatch`, GPU stages are utilized sequentially: early GPUs become idle after completing their work, while later GPUs remain idle waiting for input data.\n\nThis limits GPU utilization efficiency and reduces performance scaling as the number of devices increases.\n\n## 3. 
Proposed Approach\n\nWithin the scope of this RFC, the term **Intra-Prompt Pipeline Scheduling** refers to the request-internal scheduling mechanism for the `prefill` phase described below.\n\nThe proposed approach does not modify the division of a request into `ubatch` units.\n\nDuring the `prefill` phase, `ubatch` instances are pre-classified. `ubatch` instances requiring the standard execution order continue to be processed by the existing scheduling mechanism. `ubatch` instances suitable for pipeline execution are routed to the Intra-Prompt Pipeline Scheduling mode implemented in TurboPrefill.\n\nIn this mode, `ubatch` instances are accumulated until the final `ubatch` of the current batch is received, after which they are executed sequentially through the layer-split GPU pipeline.\n\nThe next `ubatch` begins processing immediately after the previous `ubatch` completes execution on the corresponding GPU stage. It is therefore not necessary to wait for the previous `ubatch` to pass through all GPUs before starting execution of the next one.\n\nAs a result, the number of GPUs simultaneously performing useful work increases.\n\n## 4. Scope\n\nIntra-Prompt Pipeline Scheduling is intended to accelerate the `prefill` phase when processing large single requests on multi-GPU configurations.\n\nThe approach is designed for **Single-User Mode**, where compute resources are dedicated to a single active request.\n\n## 5. Limitations\n\nIntra-Prompt Pipeline Scheduling is a request-internal scheduling mechanism. 
It does not introduce parallelism between independent requests and does not replace existing multi-user batching mechanisms.\n\nThe approach is intended to accelerate the `prefill` phase and does not affect `decode` performance.\n\nIntra-Prompt Pipeline Scheduling is applied only to `ubatch` instances that do not contain logits output requests.\n\nFor this reason, the standard `llama-bench` tool is not suitable for evaluating this mode, since its prefill tests contain logits output requests and therefore continue to be processed by the standard scheduling path.\n\n## 6. PoC Results\n\nResults for a context length of 16,373 tokens. Full results for all tested context lengths are provided below and are available in the repository.\n\n```\nPlatform                      GPUs  Baseline   TurboPrefill   Gain\n------------------------------------------------------------------\n10x P104-100                  10       77          181        +135%\n4x RTX 3090                    4     1477         2758         +87%\n5x RTX 5060 Ti                 5     1993         3886         +95%\n8x RTX 5060 Ti                 8     1963         4380        +123%\n```\n\n## 7. Computational Correctness\n\nIntra-Prompt Pipeline Scheduling does not modify the mathematical computation model.\n\nThe execution order of individual layers and the order of data propagation through each layer remain unchanged.\n\nOnly the execution scheduling order between GPU stages is modified.\n\n## 8. Discussion Topics\n\n- Possible integration into `ggml-backend-sched`.\n- Applicability to other split modes.\n- Applicability to systems with P2P/NVLink support.\n\nThis RFC is published for discussion of the architectural approach and for collecting feedback on possible future directions of development.\n\n## 9. 
Implementation Status\n\nThe current implementation is provided as an isolated overlay module and does not require modifications to the model architecture.\n\nThe mode can be enabled through the `TURBOPREFILL=1` environment variable.\n\nSource code, installation scripts, benchmark data, and measurement results are available in the separate TurboPrefill repository.\n\n## 10. Full Benchmark Results\n\nAll measurements in this section were performed using GPT-OSS-120B Q4_K_M.\n\n### Multi-GPU Scaling with and without Intra-Prompt Pipeline Scheduling\n\n*5× and 8× NVIDIA RTX 5060 Ti 16GB, UB=768.*\n\n### Prefill Throughput: Baseline vs Intra-Prompt Pipeline Scheduling (4× RTX 3090)\n\n*4× NVIDIA RTX 3090, UB=768.*\n\n### Prefill Throughput on 10× NVIDIA P104-100 (Pascal)\n\n*10× NVIDIA P104-100 (Pascal). Baseline UB=32, Baseline UB=512, Intra-Prompt Pipeline Scheduling UB=32.*\n\nAuthor: Serhii Trykhlieb\n\nProposed mechanism: Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill\n\nPoC implementation: TurboPrefill\n\nRepository: [https://github.com/sergey-automation/TurboPrefill](https://github.com/sergey-automation/TurboPrefill)\n\n[github-actions](/apps/github-actions) bot added the [ggml](/ggml-org/llama.cpp/issues?q=state%3Aopen%20label%3Aggml) label on Jun 5, 2026.\n\nTurboPrefill-VLM-Validation represents a subsequent implementation of the proposed mechanism.\n\nApplication of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill nearly halved the waiting time before answer generation.", "url": "https://wpnews.pro/news/turboprefill-2-7x-faster-than-llama-cpp-pipeline-parallel-on-llama-3-70b", "canonical_source": "https://github.com/ggml-org/llama.cpp/pull/24219", "published_at": 
"2026-06-30 07:58:24+00:00", "updated_at": "2026-06-30 08:20:04.820139+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research"], "entities": ["TurboPrefill", "llama.cpp", "Llama-3-70B", "P104-100", "RTX 3090", "RTX 5060 Ti"], "alternates": {"html": "https://wpnews.pro/news/turboprefill-2-7x-faster-than-llama-cpp-pipeline-parallel-on-llama-3-70b", "markdown": "https://wpnews.pro/news/turboprefill-2-7x-faster-than-llama-cpp-pipeline-parallel-on-llama-3-70b.md", "text": "https://wpnews.pro/news/turboprefill-2-7x-faster-than-llama-cpp-pipeline-parallel-on-llama-3-70b.txt", "jsonld": "https://wpnews.pro/news/turboprefill-2-7x-faster-than-llama-cpp-pipeline-parallel-on-llama-3-70b.jsonld"}}