{"slug": "run-llama-cpp-on-a-mac-pro-61-with-dual-firepro-d700-gpus-on-ubuntu", "title": "Run Llama.cpp on a Mac Pro 6,1 with Dual FirePro D700 GPUs on Ubuntu", "summary": "A user has published a guide for running llama.cpp with Vulkan on a 2013 Mac Pro 6,1 equipped with dual FirePro D700 GPUs running Ubuntu. The guide explains that the machine's 12 GB of aggregate VRAM must be treated as two separate 6 GB pools, enabling 7-billion-parameter Q4 models to run with useful context sizes while 13-billion-parameter models remain impractical due to the dual-card architecture. The guide provides hardware specifications, model-fitting recommendations, and driver stack instructions specific to the D700's GCN 1.0 architecture.", "body_md": "Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu\n\nA D700-specific guide to running llama.cpp with Vulkan on the 2013 Mac Pro: dual 6 GB FirePro cards, Ubuntu, RADV, full GPU offload, cooling, and the traps that make old GCN hardware look slower than it is.\n\nMay 26, 2026\n\n12 min read\n\nThe 2013 Mac Pro is still a strange machine: thermally dense, beautifully overbuilt, and awkwardly dependent on two workstation GPUs that most modern ML stacks have forgotten. The D700 version is the most interesting one for local LLM work because it gives you dual AMD FirePro D700 cards with 6 GB of GDDR5 each.\n\nThat is 12 GB of aggregate VRAM, but it is not a single 12 GB GPU. 
Treat it as two separate 6 GB pools that llama.cpp can use well when the Vulkan backend is configured correctly.\n\n```\nMac Pro 6,1 D700 memory shape\n\n             llama.cpp Vulkan backend\n                       |\n              split-mode: layer\n                       |\n        +--------------+--------------+\n        |                             |\n  FirePro D700 0                 FirePro D700 1\n  Tahiti / GCN 1.0               Tahiti / GCN 1.0\n  6 GB GDDR5                     6 GB GDDR5\n```\n\nThe practical outcome is simple: the D700 machine can comfortably run the class of models that are annoying on a D300. Seven-billion-parameter Q4 models become realistic with useful context sizes. Thirteen-billion-parameter models are still a poor fit if you expect full GPU offload, because the Mac Pro's dual cards do not behave like one contiguous accelerator.\n\nThis guide is a D700-specific rewrite of Edward Chalupa's excellent D300 guide. The main flow is the same: Ubuntu, the amdgpu kernel driver, Mesa RADV, llama.cpp built with Vulkan, and a few settings that matter much more than they look.\n\nHardware target\n\nApple shipped three GPU tiers in the Mac Pro 6,1. The D700 is the top configuration: each card has 6 GB of GDDR5, 2048 stream processors, a 384-bit memory bus, and 264 GB/s of memory bandwidth.\n\n| GPU | Architecture family | VRAM per card | Aggregate VRAM | Practical llama.cpp target |\n| --- | --- | --- | --- | --- |\n| FirePro D300 | GCN 1.0 / Pitcairn-class | 2 GB | 4 GB | 3B and small 4B models |\n| FirePro D500 | GCN 1.0 / Tahiti-class | 3 GB | 6 GB | 4B and some compact 7B quants |\n| FirePro D700 | GCN 1.0 / Tahiti-class | 6 GB | 12 GB | 7B Q4/Q5, sometimes 8B Q4 |\n\nThe important difference is not raw TFLOPS. It is memory headroom. A 7B Q4_K_M GGUF is usually around 4.0-4.5 GB before runtime buffers and KV cache. On a D300 that is a non-starter. 
On a D700 pair, layer splitting gives the model enough room.\n\nWhat fits\n\nUse these as planning numbers, not promises. Exact memory depends on architecture, quantization, context size, batch settings, and llama.cpp version.\n\n| Model class | Quant | Typical GGUF size | D700 verdict |\n| --- | --- | --- | --- |\n| 3B | Q8_0 | ~3.0-3.5 GB | Easy, but underuses the hardware |\n| 7B | Q4_K_M | ~4.0-4.5 GB | Good default target |\n| 7B | Q5_K_M | ~5.0-5.5 GB | Good with conservative context |\n| 8B | Q4_K_M | ~4.5-5.0 GB | Usually workable |\n| 13B | Q4_K_M | ~7.5-8.5 GB | Usually not worth it on this bus |\n\nThe trap is reading \"12 GB VRAM\" as \"anything under 12 GB fits.\" It does not. llama.cpp can distribute layers across devices, but each card still has a 6 GB ceiling and the runtime needs additional memory for compute buffers and KV cache.\n\n```\nWhy a 13B Q4 model is awkward\n\n  Model weights + buffers + KV cache\n  +----------------------------------+\n  | more than one D700 can hold well |\n  +----------------------------------+\n\n  Splitting helps with layers, but the old PCIe path and sync cost\n  make CPU/GPU mixed inference unattractive once full offload fails.\n```\n\nFor this machine, optimize for models that fully offload. If the model does not fit with --n-gpu-layers 99, the fallback should usually be CPU-only, not partial offload.\n\nThe driver stack\n\nThe D700 is old GCN hardware. The old radeon kernel driver can drive displays, but it is the wrong foundation for Vulkan inference. You want this stack:\n\n```\nllama-server\n  |\n  |  GGML Vulkan backend\n  v\nMesa RADV Vulkan driver\n  |\n  |  userspace Vulkan implementation\n  v\nLinux amdgpu kernel driver\n  |\n  v\nDual FirePro D700 GPUs\n```\n\nMesa documents RADV as the Vulkan driver for AMD GCN/RDNA GPUs, with the caveat that GCN 1-2 hardware may need amdgpu explicitly enabled instead of radeon. 
Ubuntu 24.04 often does the right thing on this Mac Pro, but you should verify rather than assume.\n\nStep 1: verify both GPUs use amdgpu\n\nStart with PCI detection:\n\n```\nlspci -nnk | grep -A3 -E \"VGA|Display|FirePro|AMD\"\n```\n\nYou want both D700 devices to report:\n\n```\nKernel driver in use: amdgpu\n```\n\nIf either card is bound to radeon, add the Southern Islands amdgpu flags:\n\n```\nsudoedit /etc/default/grub\n```\n\nSet or extend GRUB_CMDLINE_LINUX_DEFAULT:\n\n```\nradeon.si_support=0 amdgpu.si_support=1\n```\n\nThen update GRUB and reboot:\n\n```\nsudo update-grub\nsudo reboot\n```\n\nAfter reboot, check again. Do not continue until both cards are on amdgpu.\n\nStep 2: install and test Vulkan\n\nInstall the Vulkan userspace pieces and the headers llama.cpp needs during build:\n\n```\nsudo apt update\nsudo apt install -y \\\n  build-essential \\\n  cmake \\\n  curl \\\n  git \\\n  glslc \\\n  libvulkan-dev \\\n  mesa-vulkan-drivers \\\n  spirv-headers \\\n  vulkan-tools\n```\n\nNow check what Vulkan sees:\n\n```\nvulkaninfo --summary\n```\n\nFor a working D700 setup you should see two RADV devices. They may be labelled as RADV TAHITI, AMD FirePro D700, or similar depending on Mesa and kernel versions.\n\n```\nExpected shape, not exact text:\n\nDevices:\n  GPU0: RADV TAHITI / AMD FirePro D700\n  GPU1: RADV TAHITI / AMD FirePro D700\n```\n\nIf vulkaninfo sees one card, fix that before building llama.cpp. llama.cpp can only use devices exposed by the Vulkan loader.\n\nStep 3: install llama.cpp with Vulkan\n\nYou have two good options here. Start with the prebuilt Vulkan release unless you specifically need a local patch, a known commit, or a custom compiler setup.\n\nOption A: download the prebuilt Vulkan binary\n\nllama.cpp publishes release builds on GitHub, including an Ubuntu x64 Vulkan package. 
Download the latest one from the releases page:\n\n```\nhttps://github.com/ggml-org/llama.cpp/releases\n```\n\nLook for:\n\n```\nLinux -> Ubuntu x64 (Vulkan)\n```\n\nOn the machine itself, you can fetch the newest Ubuntu x64 Vulkan tarball with the GitHub API:\n\n```\n# /opt is root-owned on a stock Ubuntu install, so create the directory with sudo\nsudo mkdir -p /opt/llama.cpp\nsudo chown \"$USER\" /opt/llama.cpp\ncd /opt/llama.cpp\n\nrelease_url=$(\n  curl -fsSL https://api.github.com/repos/ggml-org/llama.cpp/releases/latest |\n    grep \"browser_download_url\" |\n    grep \"ubuntu-vulkan-x64.tar.gz\" |\n    cut -d '\"' -f 4\n)\n\ncurl -L \"$release_url\" -o llama-vulkan.tar.gz\ntar -xzf llama-vulkan.tar.gz\n```\n\nThe extracted archive contains the runnable binaries. Depending on the release layout, they may be directly under the extracted directory rather than under build/bin. Confirm where llama-server landed:\n\n```\nfind /opt/llama.cpp -type f -name \"llama-server\" -print\n```\n\nUse that path in the systemd unit below. If it prints /opt/llama.cpp/build/bin/llama-server, the later examples can be used unchanged.\n\nOption B: build from source\n\nBuild from source when you want a specific commit or want to prove exactly which backend options are compiled in:\n\n```\ngit clone https://github.com/ggml-org/llama.cpp\ncd llama.cpp\n\ncmake -B build \\\n  -DGGML_VULKAN=ON \\\n  -DLLAMA_CURL=ON \\\n  -DCMAKE_BUILD_TYPE=Release\n\ncmake --build build --config Release -j\"$(nproc)\"\n```\n\nConfirm the binary can see backend devices:\n\n```\n./build/bin/llama-server --list-devices\n```\n\nIf your llama.cpp build is older and does not expose --list-devices, use a short llama-cli smoke test and read the startup log for ggml_vulkan.\n\nStep 4: run for full offload\n\nThe default D700 command should be something like:\n\n```\nGGML_VK_VISIBLE_DEVICES=0,1 \\\nRADV_PERFTEST=aco,gpl \\\n./build/bin/llama-server \\\n  --model /models/qwen2.5-7b-instruct-q4_k_m.gguf \\\n  --n-gpu-layers 99 \\\n  --split-mode layer \\\n  --threads 2 \\\n  --parallel 1 \\\n  --host 0.0.0.0 \\\n  --port 
8088\n```\n\nOne thing worth noting if you are new to llama.cpp is the --model option. If you omit it, llama-server starts in router mode, where it makes any models you have locally available and loads one into memory the first time you use it via the web UI. However, a CLI harness like Pi does not know to tell the server to unload the current model when you switch to a new one, which will probably crash the server. To avoid that, add the --models-max 1 option.\n\nThe settings that look optional but are not:\n\n| Setting | Why it matters |\n| --- | --- |\n| GGML_VK_VISIBLE_DEVICES=0,1 | Keeps both D700s visible to llama.cpp |\n| --split-mode layer | Lets llama.cpp distribute transformer layers across the two GPUs |\n| --threads 2 | Avoids wasting CPU on sync-heavy Vulkan submission |\n| RADV_PERFTEST=aco,gpl | Uses RADV's faster shader compiler and pipeline path |\n\nDo not blindly set --threads to the number of Xeon threads. Once all layers are on the GPUs, extra CPU threads mostly wait on Vulkan synchronization. On this machine, high thread counts can make the desktop feel broken without improving tokens per second.\n\nStep 5: make it a service\n\nCreate a dedicated model directory and service user if you want this machine to be an always-on endpoint. 
Then create the unit file:\n\n```\nsudoedit /etc/systemd/system/llama-server.service\n```\n\n```\n[Unit]\nDescription=llama.cpp Vulkan inference server\nAfter=network-online.target\nWants=network-online.target\n\n[Service]\nType=simple\nUser=llama\nWorkingDirectory=/opt/llama.cpp\nEnvironment=\"GGML_VK_VISIBLE_DEVICES=0,1\"\nEnvironment=\"RADV_PERFTEST=aco,gpl\"\nExecStart=/opt/llama.cpp/build/bin/llama-server \\\n  --model /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \\\n  --n-gpu-layers 99 \\\n  --split-mode layer \\\n  --threads 2 \\\n  --parallel 1 \\\n  --host 0.0.0.0 \\\n  --port 8080\nRestart=on-failure\nRestartSec=5\n\n[Install]\nWantedBy=multi-user.target\n```\n\nRemember to omit the --model option if you want it to run in router mode.\n\nEnable it:\n\n```\nsudo systemctl daemon-reload\nsudo systemctl enable --now llama-server\nsudo systemctl status llama-server\n```\n\nCheck the HTTP endpoint:\n\n```\ncurl http://localhost:8080/health\n```\n\nThen confirm VRAM is actually being used on both cards:\n\n```\nfor card in /sys/class/drm/card*/device/mem_info_vram_used; do\n  printf \"%s: \" \"$card\"\n  awk '{ printf \"%.1f MiB\\n\", $1 / 1024 / 1024 }' \"$card\"\ndone\n```\n\nThe exact numbers depend on the model, but both D700s should move substantially above idle after the model loads.\n\nCooling matters\n\nThe Mac Pro 6,1 has one thermal core and one fan. That design is elegant until both GPUs sit under sustained compute load. Install macfanctld and make the fan curve less timid:\n\n```\nsudo apt install -y macfanctld\nsudoedit /etc/macfanctl.conf\n```\n\nA reasonable starting point:\n\n```\nfan_min: 1200\ntemp_avg_floor: 45\ntemp_avg_ceiling: 58\nlog_level: 1\n```\n\nRestart and watch the log:\n\n```\nsudo systemctl restart macfanctld\nsudo tail -f /var/log/macfanctl.log\n```\n\nUnder sustained inference, you want stable temperatures, not silence. 
The D700s have more memory headroom than the D300s, but they also put more heat into the same small chassis.\n\nThings to avoid\n\nFlash attention\n\nDo not assume --flash-attn helps. GCN 1.0 predates the FP16 throughput assumptions that make flash attention compelling on modern hardware. Test it if you want, but make the default \"off\" until benchmarks prove otherwise.\n\n```\n# Baseline first\n./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2\n\n# Only then compare (llama-bench takes an explicit 0/1 value for this flag)\n./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 --flash-attn 1\n```\n\nPartial offload\n\nAvoid half-on-GPU, half-on-CPU configurations for models that exceed VRAM:\n\n```\n# Prefer this when it fits\n--n-gpu-layers 99\n\n# Prefer this when it does not fit\n--n-gpu-layers 0\n\n# Be suspicious of this on the Mac Pro 6,1\n--n-gpu-layers 20\n```\n\nThe D700 cards are connected through an old workstation design, not a modern high-bandwidth multi-GPU fabric. Once inference has to bounce across CPU and GPU layers, the bus and synchronization overhead can erase the benefit of acceleration.\n\nGiant context windows\n\nThe D700 memory budget looks generous until you increase context. KV cache grows with context size, layer count, embedding size, and cache precision.\n\n```\nVRAM pressure = model weights + compute buffers + KV cache\n\nKV cache roughly grows with:\n  context length x number of layers x hidden size x cache precision\n```\n\nStart at --ctx-size 4096. Move to 8192 only after watching VRAM on both cards during real prompts. 
Alternatively, remove this option and let llama.cpp decide for you: it will pick the largest context that fits in the VRAM left over after loading the model.\n\nBenchmarking\n\nStop the service before benchmarking:\n\n```\nsudo systemctl stop llama-server\n```\n\nConfirm the cards are back near idle:\n\n```\ncat /sys/class/drm/card*/device/mem_info_vram_used\n```\n\nThen benchmark one variable at a time:\n\n```\nGGML_VK_VISIBLE_DEVICES=0,1 RADV_PERFTEST=aco,gpl \\\n./build/bin/llama-bench \\\n  -m /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \\\n  -ngl 99 \\\n  -t 2 \\\n  -c 4096\n```\n\nExample output (this particular log comes from a run with a 9B Q4_K_M model):\n\n```\nload_backend: loaded RPC backend from /home/altitudelabs/llama-b9305/libggml-rpc.so\nggml_vulkan: Found 2 Vulkan devices:\nggml_vulkan: 0 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none\nggml_vulkan: 1 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none\nload_backend: loaded Vulkan backend from /home/altitudelabs/llama-b9305/libggml-vulkan.so\nload_backend: loaded CPU backend from /home/altitudelabs/llama-b9305/libggml-cpu-ivybridge.so\nDownloading Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf ───────────────────── 100%\n| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |\n| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |\n| qwen35 9B Q4_K - Medium        |   5.37 GiB |     9.20 B | Vulkan     |  99 |       2 |           pp512 |         40.11 ± 0.30 |\n| qwen35 9B Q4_K - Medium        |   5.37 GiB |     9.20 B | Vulkan     |  99 |       2 |           tg128 |         18.85 ± 0.02 |\n```\n\nRecord:\n\n| Run | Model | Quant | Context | Threads | Flash attention | Decode tok/s |\n| --- | --- | --- | --- | --- | --- | --- |\n| 1 | Qwopus3.5-9B-Coder-MTP | Q4_K_M | 4096 | 2 | off | 18.85 |\n| 2 | Qwen3.5-9B-MTP | Q4_K_XL | 4096 | 2 | off | 9.17 |\n| 3 | Qwen3.5-9B-MTP | Q4_K_M | 4096 | 2 | off | 19.04 |\n| 4 | Qwen2.5-Coder-7B-Instruct | Q4_K_M | 4096 | 2 | off | 21.39 |\n\nDo not compare llama-bench directly to llama-server under real API traffic. The server has slot management, sampling, tokenization, and HTTP overhead. Use bench numbers to compare configurations, not to predict production throughput.\n\nThe use case for these machines\n\nThe D700 Mac Pro is not a cheap alternative to an H100 and it is not a modern gaming GPU box (although it can actually run very well now that Vulkan is enabled). It is still useful, though, despite being a bit power hungry compared to modern options:\n\n| Use case | Fit |\n| --- | --- |\n| Local coding assistant fallback | Good with a 7B Q4/Q5 model |\n| Private summarization endpoint | Good with conservative context |\n| Multi-user chat service | Poor |\n| 13B+ experimentation | CPU-only or use newer hardware |\n| Always-on home lab inference | Good if power cost is acceptable |\n\nThe point of the D700 is not that it wins benchmarks. It is that a sunk-cost workstation can still be a reliable local inference endpoint when the model is sized correctly and the Vulkan path is configured well.\n\nOne thing worth thinking about, however, is the running costs. These old machines can draw 250-300 W under full load, so if you are doing full-time inference on them it might actually be cheaper to get a Codex / Claude subscription. 
Do the math and do what's best for you.\n\nChief Technology Officer writing about AI systems, software architecture, cyber security, cryptography, and the practical realities of technology leadership.", "url": "https://wpnews.pro/news/run-llama-cpp-on-a-mac-pro-61-with-dual-firepro-d700-gpus-on-ubuntu", "canonical_source": "https://matthewgribben.com/blog/mac-pro-6-1-llama-cpp-firepro-d700-vulkan-ubuntu", "published_at": "2026-05-27 03:46:05+00:00", "updated_at": "2026-05-27 03:57:34.509648+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "machine-learning", "ai-chips"], "entities": ["llama.cpp", "Mac Pro 6,1", "FirePro D700", "AMD", "RADV", "Vulkan", "Tahiti", "GCN 1.0"], "alternates": {"html": "https://wpnews.pro/news/run-llama-cpp-on-a-mac-pro-61-with-dual-firepro-d700-gpus-on-ubuntu", "markdown": "https://wpnews.pro/news/run-llama-cpp-on-a-mac-pro-61-with-dual-firepro-d700-gpus-on-ubuntu.md", "text": "https://wpnews.pro/news/run-llama-cpp-on-a-mac-pro-61-with-dual-firepro-d700-gpus-on-ubuntu.txt", "jsonld": "https://wpnews.pro/news/run-llama-cpp-on-a-mac-pro-61-with-dual-firepro-d700-gpus-on-ubuntu.jsonld"}}