{"slug": "llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it", "title": "llama-bench skipped FA on capable GPUs — b9437 corrects it", "summary": "Build b9437 of llama.cpp fixes two default-value bugs in llama-bench that caused flash attention to be skipped on capable GPUs and GPU-layer count to use a legacy sentinel. The flash attention flag now defaults to 'auto', matching llama-server and llama-cli, and the GPU-layer count defaults to -1. Pre-b9437 benchmarks on FA-capable hardware require flag-matched re-runs to remain valid.", "body_md": "Build [b9437](https://github.com/ggml-org/llama.cpp/releases), published on May 30, 2026 at 20:56 UTC, ships two targeted default-value corrections to `llama-bench`. Flash attention (`-fa`) shifts from a hard-coded `off` to `auto` (`LLAMA_FLASH_ATTN_TYPE_AUTO`), and the GPU-layer count (`-ngl`) changes from the legacy sentinel `99` to `-1`. Both values now match what `llama-server` and `llama-cli` already used; the bench tool was simply never updated to track them until this build.\n\n**Quick Answer:** Before b9437 (published May 30, 2026), `llama-bench` hard-coded `-fa off`, silently skipping flash attention even on CUDA, Metal, and Vulkan hardware. Build b9437 sets the defaults to `-fa auto` and `-ngl -1`, matching `llama-server` and `llama-cli`. Any pre-b9437 baseline on FA-capable hardware needs a flag-matched re-run to remain valid.\n\n[PR #23714](https://github.com/ggml-org/llama.cpp/pull/23714), reviewed and merged by maintainers JohannesGaessler and pwilkin, adds the same `-fa auto|off|on` tri-state flag to `llama-bench` that the rest of the toolchain already supported. 
With `LLAMA_FLASH_ATTN_TYPE_AUTO` as the new default, flash attention activates automatically when the runtime detects a capable backend (CUDA, Metal, Vulkan); on CPU-only hosts it stays off with no error and no output change.\n\n| Parameter | Before b9437 | After b9437 | Behavioral impact |\n|---|---|---|---|\n| `-fa` | `off` (hard-coded) | `auto` (`LLAMA_FLASH_ATTN_TYPE_AUTO`) | GPU-capable hosts bench with FA active by default; pre/post comparisons require explicit flag-matching |\n| `-ngl` | `99` (offload-all sentinel) | `-1` (runtime decides) | CPU-only builds no longer attempt full GPU offload; eliminates spurious CUDA errors when no GPU is present |\n\nThe following Python sketch models the behavioral gap in concrete terms. It is a toy model of how the defaults are chosen, not llama.cpp code: on a capable GPU, the pre-b9437 defaults schedule zero FA rows while the b9437 defaults schedule one:\n\n```python\ndef old_llama_bench(device):\n    # Before b9437, the defaults were ngl=99 and FA off, so no FA rows were scheduled.\n    return [{\"device\": device[\"name\"], \"ngl\": 99, \"fa\": 0}]\n\ndef b9437_llama_bench(device):\n    # b9437: default ngl=-1 and -fa auto, which enables FA on capable GPUs.\n    fa = 1 if device[\"kind\"] == \"gpu\" and device[\"flash_attn\"] else 0\n    return [{\"device\": device[\"name\"], \"ngl\": -1, \"fa\": fa}]\n\ngpu = {\"name\": \"CUDA0\", \"kind\": \"gpu\", \"flash_attn\": True}\n\nold = old_llama_bench(gpu)\nnew = b9437_llama_bench(gpu)\n\nprint(f\"capable GPU: {gpu['name']} flash_attn={gpu['flash_attn']}\")\nprint(f\"pre-b9437 scheduled FA rows: {sum(r['fa'] for r in old)}\")\nprint(f\"b9437 scheduled FA rows: {sum(r['fa'] for r in new)}\")\nassert sum(r[\"fa\"] for r in old) == 0\nassert sum(r[\"fa\"] for r in new) == 1\n```\n\nBefore compiling, confirm you have Git, CMake 3.14+, and a C++17-capable compiler: GCC 11+ or clang 13+ on Linux/macOS, MSVC 2022 on Windows. 
These are current project minimums; newer versions work fine.\n\nYou also need a GGUF model file. A practical starting point is `qwen3-8b-q4_k_m.gguf`: fetch it with `huggingface-cli download` or let `llama-server`'s `-hf` flag pull it at startup. The path goes into `llama-bench`'s `-m` argument.\n\nA GPU is optional, but `-fa auto` only activates flash attention when one is present. Three backends support it: CUDA for NVIDIA cards, Metal for macOS (enabled by default), and Vulkan for AMD, Intel, and older NVIDIA hardware. On a CPU-only host, `-fa auto` stays off: no error, no change to the output format, just standard attention.\n\nThese steps target Linux/macOS. Stock macOS does not ship `nproc`, so use `-j$(sysctl -n hw.ncpu)` there. On Windows, replace `-j$(nproc)` with `-j%NUMBER_OF_PROCESSORS%` and run from a Developer Command Prompt for MSVC builds. Full platform-specific options are in [docs/build.md](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).\n\nClone the repository and check the latest commit:\n\n```\n   git clone https://github.com/ggml-org/llama.cpp\n   cd llama.cpp\n   git log --oneline -1\n```\n\nThe top commit should reference the `-fa bench` PR or show a hash at or after b9437. Continuous builds don't carry semantic version tags; cross-check against the [releases page](https://github.com/ggml-org/llama.cpp/releases) if you're unsure.\n\nThen build for your backend. CPU-only:\n\n```\n   cmake -B build && cmake --build build --config Release -j$(nproc)\n```\n\nCUDA (NVIDIA):\n\n```\n   cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)\n```\n\nMetal is on by default on macOS, so no extra flag is needed. Vulkan (cross-platform AMD/Intel/NVIDIA):\n\n```\n   cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)\n```\n\nRun the benchmark with explicit flags:\n\n```\n   ./build/bin/llama-bench -m ./models/qwen3-8b-q4_k_m.gguf \\\n     -ngl -1 -fa auto -p 512 -n 128 -r 3\n```\n\n`-p 512` sets prompt tokens (prefill throughput), `-n 128` sets generated tokens (generation throughput), and `-r 3` repeats each test three times and averages the results. 
Passing these explicitly keeps your results reproducible if the defaults change again. Builds before b9437 predate the tri-state flag in `llama-bench`, so `-fa auto` may not be accepted there.\n\nTo confirm that flash attention is active, add `--verbose`:\n\n```\n   ./build/bin/llama-bench -m ./models/qwen3-8b-q4_k_m.gguf \\\n     -ngl -1 -fa auto -p 512 -n 128 -r 3 --verbose\n```\n\nLook for `flash_attn = 1` in the model load output. If you see `flash_attn = 0` on a CUDA host, the most likely cause is a build configured without `-DGGML_CUDA=ON`; delete your build directory and recompile with the flag.\n\nTo reproduce the pre-b9437 defaults:\n\n```\n   ./build/bin/llama-bench -m ./models/qwen3-8b-q4_k_m.gguf \\\n     -fa off -ngl 99\n```\n\nThis resets both flags to their pre-b9437 defaults, giving you an apples-to-apples baseline if you have historical numbers to compare against.\n\nAny `llama-bench` run before b9437 used `-fa off` as the implicit default, even on hardware that fully supports flash attention. If you have recorded t/s numbers from those builds and your hardware supports FA, those figures captured the slower attention path without indicating it. To align old results with new defaults, either re-run old baselines with `-fa off -ngl 99` (matching the original behavior) or re-run everything with `-fa auto` to get forward-comparable numbers. In either case, make the `-fa` state an explicit column in your benchmark output going forward.\n\nThe `-ngl 99` legacy default also caused a quiet footgun on CPU-only hosts: with no `-ngl` flag set, the runtime attempted to offload up to 99 layers to the GPU, triggering CUDA initialization errors even with no GPU present. With `-ngl -1`, the runtime skips GPU offload when no backend is detected, removing that noise from logs entirely.\n\nMulti-Token Prediction gains for Qwen 3.6 27B dense (approximately 77 to 96 t/s on an RTX 4090, about a 24% throughput increase via PR #22673) were measured in a separate context from b9437's defaults change. 
If you're trying to reproduce those figures, verify the `-fa` state from the original run; a mismatch gives you a result that is neither the clean MTP baseline nor a combined MTP+FA measurement.\n\nThree nearby builds are worth pulling alongside b9437:\n\n`--mtp-n-draft` and confirm your GGUF quant is compatible. MoE variants (Qwen 3.6 35B-A3B) show mixed results; expert-union verifier overhead can negate the gains on consumer hardware.\n\n`-fa auto` sets flash attention to `LLAMA_FLASH_ATTN_TYPE_AUTO`, telling the runtime to enable flash attention when the backend supports it. Before b9437, `llama-bench` always defaulted to `-fa off`, unlike `llama-server` and `llama-cli`, which already had the tri-state `auto|off|on` flag. After b9437, all three tools use the same flag semantics.\n\nYes, with caveats. If your original run explicitly passed `-fa off`, or the host hardware does not support flash attention, the numbers remain comparable. If you relied on the default and ran on FA-capable hardware (CUDA, Metal, or Vulkan), those measurements were taken without flash attention even though the GPU supported it. Re-run with matched flags to produce a clean, apples-to-apples comparison.\n\n`99` was a legacy sentinel meaning \"offload all layers to GPU.\" The project later standardized `-1` as the runtime-decides value across the toolchain. `llama-bench` was simply never updated to match until b9437 brought it into alignment with `llama-server` and `llama-cli`.\n\nYes, for a local source build: pull the latest commit from [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) and recompile. Tagged binary releases lag the continuous builds. Check the [GitHub releases page](https://github.com/ggml-org/llama.cpp/releases) for a pre-built artifact if you want to skip compilation, but verify the build number includes the b9437 changes before treating it as current.\n\nYes, b9437 closes the gap. 
`llama-cli` and `llama-server` already supported the `-fa auto|off|on` tri-state. b9437 brings `llama-bench` into parity, so flag semantics are now consistent across all three tools. A flag value you validated in `llama-server` means exactly the same thing when passed to `llama-bench`.\n\nAfter pulling b9437 or later, the immediate action is straightforward: re-baseline any `llama-bench` results used for regression tracking, and make the `-fa` state an explicit column in your output going forward. The default change is a minor toolchain alignment, but its effect on benchmark validity is concrete: any pre-b9437 run on CUDA, Metal, or Vulkan that relied on the defaults was silently measuring the slower attention path.\n\nIf you're on a multi-GPU system, pull at least b9439 alongside for the iGPU default fix. And if Qwen 3.6 throughput is in your test matrix, keep Multi-Token Prediction's `--mtp-n-draft` flag in scope: the roughly 24% gain on dense 27B is worth measuring, but MoE variant results vary enough that you'll want numbers from your own hardware and quant configuration.\n\n*Last updated: 2026-06-01. 
Based on llama.cpp continuous builds b9436–b9439 (May 30–31, 2026).*", "url": "https://wpnews.pro/news/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it", "canonical_source": "https://dev.to/creeta/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it-42ik", "published_at": "2026-06-18 09:36:49+00:00", "updated_at": "2026-06-18 09:51:25.815303+00:00", "lang": "en", "topics": ["developer-tools", "machine-learning", "large-language-models"], "entities": ["llama.cpp", "llama-bench", "llama-server", "llama-cli", "JohannesGaessler", "pwilkin", "CUDA", "Metal"], "alternates": {"html": "https://wpnews.pro/news/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it", "markdown": "https://wpnews.pro/news/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it.md", "text": "https://wpnews.pro/news/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it.txt", "jsonld": "https://wpnews.pro/news/llama-bench-skipped-fa-on-capable-gpus-b9437-corrects-it.jsonld"}}