llama-bench skipped FA on capable GPUs — b9437 corrects it

Build b9437 of llama.cpp fixes two default-value bugs in llama-bench that caused flash attention to be skipped on capable GPUs and GPU-layer count to use a legacy sentinel. The flash attention flag now defaults to 'auto', matching llama-server and llama-cli, and the GPU-layer count defaults to -1. Pre-b9437 benchmarks on FA-capable hardware require flag-matched re-runs to remain valid.

[Build b9437](https://github.com/ggml-org/llama.cpp/releases), published on May 30, 2026 at 20:56 UTC, ships two targeted default-value corrections to `llama-bench`. Flash attention (`-fa`) shifts from a hard-coded `off` to `auto` (`LLAMA_FLASH_ATTN_TYPE_AUTO`), and the GPU-layer count (`-ngl`) changes from the legacy sentinel `99` to `-1`. Both values now match what `llama-server` and `llama-cli` already used; the bench tool was simply never updated to track them until this build.

Quick Answer: Before b9437 (published May 30, 2026), `llama-bench` hard-coded `-fa off`, silently skipping flash attention even on CUDA, Metal, and Vulkan hardware. Build b9437 sets the defaults to `-fa auto` and `-ngl -1`, matching `llama-server` and `llama-cli`. Any pre-b9437 baseline on FA-capable hardware needs a flag-matched re-run to remain valid.

[PR 23714](https://github.com/ggml-org/llama.cpp/pull/23714), reviewed and merged by maintainers JohannesGaessler and pwilkin, adds the same `-fa auto|off|on` tri-state flag to `llama-bench` that the rest of the toolchain already supported. With `LLAMA_FLASH_ATTN_TYPE_AUTO` as the new default, flash attention activates automatically when the runtime detects a capable backend (CUDA, Metal, Vulkan); on CPU-only hosts it stays off with no error and no output change.

| Parameter | Before b9437 | After b9437 | Behavioral impact |
|---|---|---|---|
| `-fa` | `off` (hard-coded) | `auto` (`LLAMA_FLASH_ATTN_TYPE_AUTO`) | GPU-capable hosts bench with FA active by default; pre/post comparisons require explicit flag-matching |
| `-ngl` | `99` (offload-all sentinel) | `-1` (runtime decides) | CPU-only builds no longer attempt full GPU offload; eliminates spurious CUDA errors when no GPU is present |

The following script is a simplified model of the default selection, not `llama-bench`'s own code. It demonstrates the behavioral gap in concrete terms: on a capable GPU, the pre-b9437 defaults schedule zero FA rows while the b9437 defaults schedule one.

```python
def old_llama_bench(device):
    """Before b9437, the default bench matrix used FA=0, so FA rows were skipped."""
    # ngl=99 is the legacy offload-all sentinel described in the table above.
    return [{"device": device["name"], "ngl": 99, "fa": 0}]


def b9437_llama_bench(device):
    """b9437: default ngl=-1 and -fa auto, which enables FA on capable GPUs."""
    fa = 1 if device["kind"] == "gpu" and device["flash_attn"] else 0
    return [{"device": device["name"], "ngl": -1, "fa": fa}]


gpu = {"name": "CUDA0", "kind": "gpu", "flash_attn": True}
old = old_llama_bench(gpu)
new = b9437_llama_bench(gpu)

print(f"capable GPU: {gpu['name']} flash_attn={gpu['flash_attn']}")
print(f"pre-b9437 scheduled FA rows: {sum(r['fa'] for r in old)}")
print(f"b9437 scheduled FA rows: {sum(r['fa'] for r in new)}")

assert sum(r["fa"] for r in old) == 0
assert sum(r["fa"] for r in new) == 1
```

Before compiling, confirm you have Git, CMake 3.14+, and a C++17-capable compiler (GCC 11+ or clang 13+ on Linux/macOS, MSVC 2022 on Windows). These are current project minimums; newer versions work fine.

You also need a GGUF model file. A practical starting point is `qwen3-8b-q4_k_m.gguf`: fetch it with `huggingface-cli download` or let `llama-server`'s `-hf` flag pull it at startup. The path goes into `llama-bench`'s `-m` argument.

A GPU is optional, but `-fa auto` only activates flash attention when one is present. Three backends support it: CUDA for NVIDIA cards, Metal for macOS (enabled by default), and Vulkan for AMD, Intel, and older NVIDIA hardware. On a CPU-only host, `-fa auto` stays off: no error, no change to the output format, just standard attention.

These steps target Linux/macOS. On macOS, where `nproc` is not installed by default, use `-j$(sysctl -n hw.ncpu)` in place of `-j$(nproc)`. On Windows, substitute `-j$(nproc)` with `-j%NUMBER_OF_PROCESSORS%` and run from a Developer Command Prompt for MSVC builds. Full platform-specific options are in [docs/build.md](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git log --oneline -1
```

The top commit should reference the `-fa` bench PR or belong to build b9437 or later. Continuous builds don't carry semantic version tags; cross-check against the [releases page](https://github.com/ggml-org/llama.cpp/releases) if you're unsure.
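
As a small aid for the cross-check above, the comparison against b9437 can be sketched in Python. This is a minimal sketch that assumes build tags have the form `b` followed by an increasing integer, as in the b9436 to b9439 range this article covers; `build_number` and `has_bench_defaults_fix` are hypothetical helper names, not part of llama.cpp.

```python
import re


def build_number(tag: str) -> int:
    """Extract the integer from a build tag such as 'b9437' (assumed format)."""
    match = re.fullmatch(r"b(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"not a build tag: {tag!r}")
    return int(match.group(1))


def has_bench_defaults_fix(tag: str, first_fixed: str = "b9437") -> bool:
    """Return True when `tag` is at or after the build that changed the defaults."""
    return build_number(tag) >= build_number(first_fixed)


print(has_bench_defaults_fix("b9436"))  # False
print(has_bench_defaults_fix("b9439"))  # True
```

Feed it the tag shown on the releases page for the binary or commit you are about to benchmark.
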
CPU-only:

```bash
cmake -B build && cmake --build build --config Release -j$(nproc)
```

CUDA (NVIDIA):

```bash
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)
```

Metal is on by default on macOS; no extra flag is needed.

Vulkan (cross-platform AMD/Intel/NVIDIA):

```bash
cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)
```

Run the benchmark with the new defaults spelled out:

```bash
./build/bin/llama-bench -m ./models/qwen3-8b-q4_k_m.gguf \
  -ngl -1 -fa auto -p 512 -n 128 -r 3
```

`-p 512` sets prompt tokens (prefill throughput), `-n 128` sets generated tokens (generation throughput), and `-r 3` repeats the run three times and averages. Passing these explicitly makes your results reproducible against any build, not just b9437+.

To confirm that flash attention is active, add `--verbose`:

```bash
./build/bin/llama-bench -m ./models/qwen3-8b-q4_k_m.gguf \
  -ngl -1 -fa auto -p 512 -n 128 -r 3 --verbose
```

Look for `flash_attn = 1` in the model load output. If you see `flash_attn = 0` on a CUDA host, the most likely cause is a backend compiled without `-DGGML_CUDA=ON`; delete your build directory and recompile with the flag.

To reproduce the legacy behavior:

```bash
./build/bin/llama-bench -m ./models/qwen3-8b-q4_k_m.gguf \
  -fa off -ngl 99
```

This resets both flags to their pre-b9437 defaults, giving you an apples-to-apples baseline if you have historical numbers to compare against.

Any `llama-bench` run before b9437 used `-fa off` as the implicit default, even on hardware that fully supports flash attention. If you have recorded t/s numbers from those builds and your hardware supports FA, those figures captured the slower attention path without indicating it.

To align old results with new defaults, either re-run old baselines with `-fa off -ngl 99` (matching the original behavior) or re-run everything with `-fa auto` to get forward-comparable numbers. In either case, make the `-fa` state an explicit column in your benchmark output going forward.

The `-ngl 99` legacy default also caused a quiet footgun on CPU-only hosts: with no `-ngl` flag set, the runtime attempted to load all 99 layers to GPU, triggering CUDA initialization errors even with no GPU present.
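
The flag-matching advice above can be sketched as a small Python helper that always spells out `-fa` and `-ngl`, so a run never depends on whichever defaults the installed build happens to use. This is a minimal sketch: `bench_args` is a hypothetical helper, the binary and model paths are the ones used in this article, and only flags already shown here are emitted.

```python
from typing import List


def bench_args(model: str, fa: str, ngl: int,
               prompt: int = 512, gen: int = 128, reps: int = 3) -> List[str]:
    """Build an explicit llama-bench argument list (no reliance on defaults)."""
    if fa not in ("auto", "off", "on"):
        raise ValueError(f"unsupported -fa value: {fa!r}")
    return [
        "./build/bin/llama-bench",
        "-m", model,
        "-ngl", str(ngl),
        "-fa", fa,
        "-p", str(prompt),
        "-n", str(gen),
        "-r", str(reps),
    ]


MODEL = "./models/qwen3-8b-q4_k_m.gguf"
legacy = bench_args(MODEL, fa="off", ngl=99)    # pre-b9437 defaults, spelled out
current = bench_args(MODEL, fa="auto", ngl=-1)  # b9437 defaults, spelled out

for label, args in (("legacy", legacy), ("current", current)):
    print(f"{label}: {' '.join(args)}")
```

Each list can be passed to `subprocess.run` unchanged, and storing the `fa` and `ngl` values next to the measured t/s gives the explicit column recommended above.
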
With `-ngl -1`, the runtime skips GPU offload when no backend is detected, removing those initialization errors from logs entirely.

Multi-Token Prediction gains for Qwen 3.6 27B dense (approximately 77 to 96 t/s on an RTX 4090, a roughly 24% throughput increase via PR 22673) were measured in a separate context from b9437's defaults change. If you're trying to reproduce those figures, verify the `-fa` state from the original run; a mismatch gives you a result that is neither the clean MTP baseline nor a combined MTP+FA measurement.

Three nearby builds are worth pulling alongside b9437: … `--mtp-n-draft` and confirm your GGUF quant is compatible. MoE variants (Qwen 3.6 35B-A3B) show mixed results: expert-union verifier overhead can negate the gains on consumer hardware.

`-fa auto` sets flash attention to `LLAMA_FLASH_ATTN_TYPE_AUTO`, telling the runtime to enable flash attention when the backend supports it. Before b9437, `llama-bench` always defaulted to `-fa off`, unlike `llama-server` and `llama-cli`, which already had the tri-state `auto|off|on` flag. After b9437, all three tools use the same flag semantics.

Pre-b9437 results can still be compared against new runs, with caveats. If your original run explicitly passed `-fa off`, or the host hardware does not support flash attention, the numbers remain comparable. If you relied on the default and ran on FA-capable hardware (CUDA, Metal, or Vulkan), those measurements were taken without flash attention even though the GPU supported it. Re-run with matched flags to produce a clean, apples-to-apples comparison.

The old `-ngl` default of `99` was a legacy sentinel meaning "offload all layers to GPU." The project later standardized `-1` as the runtime-decides value across the toolchain. `llama-bench` was simply never updated to match until b9437 brought it into alignment with `llama-server` and `llama-cli`.

Getting the new defaults requires a rebuild if you use a local source build: pull the latest commit from [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) and recompile. Tagged binary releases lag the continuous builds.
Check the [GitHub releases page](https://github.com/ggml-org/llama.cpp/releases) for a pre-built artifact if you want to skip compilation, but verify the build number includes the b9437 changes before treating it as current.

b9437 closes the gap between `llama-bench` and the rest of the toolchain. `llama-cli` and `llama-server` already supported the `-fa auto|off|on` tri-state, so flag semantics are now consistent across all three tools. A flag value you validated in `llama-server` means exactly the same thing when passed to `llama-bench`.

After pulling b9437 or later, the immediate action is straightforward: re-baseline any `llama-bench` results used for regression tracking, and make the `-fa` state an explicit column in your output going forward. The default change is a minor toolchain alignment, but its effect on benchmark validity is concrete: any pre-b9437 run on CUDA, Metal, or Vulkan that relied on the default was silently measuring the slower attention path.

If you're on a multi-GPU system, pull at least b9439 alongside for the iGPU default fix. And if Qwen 3.6 throughput is in your test matrix, keep Multi-Token Prediction's `--mtp-n-draft` flag in scope. The roughly 24% gain on dense 27B is worth measuring, but MoE variant results vary enough that you'll want numbers from your own hardware and quant configuration.

Last updated: 2026-06-01. Based on llama.cpp continuous builds b9436–b9439 (May 30–31, 2026).