Run Llama.cpp on a Mac Pro 6,1 with Dual FirePro D700 GPUs on Ubuntu

A user has published a guide for running llama.cpp with Vulkan on a 2013 Mac Pro 6,1 equipped with dual FirePro D700 GPUs running Ubuntu. The guide explains that the machine's 12 GB of aggregate VRAM must be treated as two separate 6 GB pools, enabling 7-billion-parameter Q4 models to run with useful context sizes while 13-billion-parameter models remain impractical due to the dual-card architecture. The guide provides hardware specifications, model-fitting recommendations, and driver stack instructions specific to the D700's GCN 1.0 architecture.

Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu A D700-specific guide to running llama.cpp with Vulkan on the 2013 Mac Pro: dual 6 GB FirePro cards, Ubuntu, RADV, full GPU offload, cooling, and the traps that make old GCN hardware look slower than it is. May 26, 202612 min read Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu The 2013 Mac Pro is still a strange machine: thermally dense, beautifully overbuilt, and awkwardly dependent on two workstation GPUs that most modern ML stacks have forgotten. The D700 version is the most interesting one for local LLM work because it gives you dual AMD FirePro D700 cards with 6 GB of GDDR5 each. That is 12 GB of aggregate VRAM, but it is not a single 12 GB GPU. Treat it as two separate 6 GB pools that llama.cpp can use well when the Vulkan backend is configured correctly. Mac Pro 6,1 D700 memory shape llama.cpp Vulkan backend | split-mode: layer | +--------------+--------------+ | | FirePro D700 0 FirePro D700 1 Tahiti / GCN 1.0 Tahiti / GCN 1.0 6 GB GDDR5 6 GB GDDR5 The practical outcome is simple: the D700 machine can comfortably run the class of models that are annoying on a D300. Seven billion parameter Q4 models become realistic with useful context sizes. Thirteen billion parameter models are still a poor fit if you expect full GPU offload, because the Mac Pro's dual cards do not behave like one contiguous accelerator. This guide is a D700-specific rewrite of Edward Chalupa's excellent D300 guide. The main flow is the same: Ubuntu, the amdgpu kernel driver, Mesa RADV, llama.cpp built with Vulkan, and a few settings that matter much more than they look. Hardware target Apple shipped three GPU tiers in the Mac Pro 6,1. The D700 is the top configuration: each card has 6 GB of GDDR5, 2048 stream processors, a 384-bit memory bus, and 264 GB/s of memory bandwidth. GPU Architecture family VRAM per card Aggregate VRAM Practical llama.cpp target FirePro D300 GCN 1.0 / Pitcairn-class 2 GB 4 GB 3B and small 4B models FirePro D500 GCN 1.0 / Tahiti-class 3 GB 6 GB 4B and some compact 7B quants FirePro D700 GCN 1.0 / Tahiti-class 6 GB 12 GB 7B Q4/Q5, sometimes 8B Q4 The important difference is not raw TFLOPS. It is memory headroom. A 7B Q4 K M GGUF is usually around 4.0-4.5 GB before runtime buffers and KV cache. On a D300 that is a non-starter. On a D700 pair, layer splitting gives the model enough room. What fits Use these as planning numbers, not promises. Exact memory depends on architecture, quantization, context size, batch settings, and llama.cpp version. Model class Quant Typical GGUF size D700 verdict 3B Q8 0 ~3.0-3.5 GB Easy, but underuses the hardware 7B Q4 K M ~4.0-4.5 GB Good default target 7B Q5 K M ~5.0-5.5 GB Good with conservative context 8B Q4 K M ~4.5-5.0 GB Usually workable 13B Q4 K M ~7.5-8.5 GB Usually not worth it on this bus The trap is reading "12 GB VRAM" as "anything under 12 GB fits." It does not. llama.cpp can distribute layers across devices, but each card still has a 6 GB ceiling and the runtime needs additional memory for compute buffers and KV cache. Why a 13B Q4 model is awkward Model weights + buffers + KV cache +----------------------------------+ | more than one D700 can hold well | +----------------------------------+ Splitting helps with layers, but the old PCIe path and sync cost make CPU/GPU mixed inference unattractive once full offload fails. For this machine, optimize for models that fully offload. If the model does not fit with --n-gpu-layers 99, the fallback should usually be CPU-only, not partial offload. The driver stack The D700 is old GCN hardware. The old radeon kernel driver can drive displays, but it is the wrong foundation for Vulkan inference. You want this stack: llama-server | | GGML Vulkan backend v Mesa RADV Vulkan driver | | userspace Vulkan implementation v Linux amdgpu kernel driver | v Dual FirePro D700 GPUs Mesa documents RADV as the Vulkan driver for AMD GCN/RDNA GPUs, with the caveat that GCN 1-2 hardware may need amdgpu explicitly enabled instead of radeon. Ubuntu 24.04 often does the right thing on this Mac Pro, but you should verify rather than assume. Step 1: verify both GPUs use amdgpu Start with PCI detection: lspci -nnk | grep -A3 -E "VGA|Display|FirePro|AMD" You want both D700 devices to report: Kernel driver in use: amdgpu If either card is bound to radeon, add the Southern Islands amdgpu flags: sudoedit /etc/default/grub Set or extend GRUB CMDLINE LINUX DEFAULT: radeon.si support=0 amdgpu.si support=1 Then update GRUB and reboot: sudo update-grub sudo reboot After reboot, check again. Do not continue until both cards are on amdgpu. Step 2: install and test Vulkan Install the Vulkan userspace pieces and the headers llama.cpp needs during build: sudo apt update sudo apt install -y \ build-essential \ cmake \ curl \ git \ glslc \ libvulkan-dev \ mesa-vulkan-drivers \ spirv-headers \ vulkan-tools Now check what Vulkan sees: vulkaninfo --summary For a working D700 setup you should see two RADV devices. They may be labelled as RADV TAHITI, AMD FirePro D700, or similar depending on Mesa and kernel versions. Expected shape, not exact text: Devices: GPU0: RADV TAHITI / AMD FirePro D700 GPU1: RADV TAHITI / AMD FirePro D700 If vulkaninfo sees one card, fix that before building llama.cpp. llama.cpp can only use devices exposed by the Vulkan loader. Step 3: install llama.cpp with Vulkan You have two good options here. Start with the prebuilt Vulkan release unless you specifically need a local patch, a known commit, or a custom compiler setup. Option A: download the prebuilt Vulkan binary llama.cpp publishes release builds on GitHub, including an Ubuntu x64 Vulkan package. Download the latest one from the releases page: https://github.com/ggml-org/llama.cpp/releases Look for: php Linux - Ubuntu x64 Vulkan On the machine itself, you can fetch the newest Ubuntu x64 Vulkan tarball with the GitHub API: mkdir -p /opt/llama.cpp cd /opt/llama.cpp release url=$ curl -fsSL https://api.github.com/repos/ggml-org/llama.cpp/releases/latest | grep "browser download url" | grep "ubuntu-vulkan-x64.tar.gz" | cut -d '"' -f 4 curl -L "$release url" -o llama-vulkan.tar.gz tar -xzf llama-vulkan.tar.gz The extracted archive contains the runnable binaries. Depending on the release layout, they may be directly under the extracted directory rather than under build/bin. Confirm where llama-server landed: find /opt/llama.cpp -type f -name "llama-server" -print Use that path in the systemd unit below. If it prints /opt/llama.cpp/build/bin/llama-server, the later examples can be used unchanged. Option B: build from source Build from source when you want a specific commit or want to prove exactly which backend options are compiled in: git clone https://github.com/ggml-org/llama.cpp cd llama.cpp cmake -B build \ -DGGML VULKAN=ON \ -DLLAMA CURL=ON \ -DCMAKE BUILD TYPE=Release cmake --build build --config Release -j"$ nproc " Confirm the binary can see backend devices: ./build/bin/llama-server --list-devices If your llama.cpp build is older and does not expose --list-devices, use a short llama-cli smoke test and read the startup log for ggml vulkan. Step 4: run for full offload The default D700 command should be something like: GGML VK VISIBLE DEVICES=0,1 \ RADV PERFTEST=aco,gpl \ ./build/bin/llama-server \ --model /models/qwen2.5-7b-instruct-q4 k m.gguf \ --n-gpu-layers 99 \ --split-mode layer \ --threads 2 \ --parallel 1 \ --host 0.0.0.0 \ --port 8088 One thing worth noting if you are new to llama cpp is the --model option. If you omit this then it'll now start in router mode where it attempts to make available any models you have locally, when you first try to use one via the web ui it'll load it into memory and get it ready. However, if you are using a CLI harness like Pi, this doesn't know to tell the server to unload the model when you switch to a new one and will probably crash the server. To avoid that you can add the --models-max 1 The two settings that look optional but are not: Setting Why it matters GGML VK VISIBLE DEVICES=0,1 Keeps both D700s visible to llama.cpp --split-mode layer Lets llama.cpp distribute transformer layers across the two GPUs --threads 2 Avoids wasting CPU on sync-heavy Vulkan submission RADV PERFTEST=aco,gpl Uses RADV's faster shader compiler and pipeline path Do not blindly set --threads to the number of Xeon threads. Once all layers are on the GPUs, extra CPU threads mostly wait on Vulkan synchronization. On this machine, high thread counts can make the desktop feel broken without improving tokens per second. Step 5: make it a service Create a dedicated model directory and service user if you want this machine to be an always-on endpoint. Then create: sudoedit /etc/systemd/system/llama-server.service Unit Description=llama.cpp Vulkan inference server After=network-online.target Wants=network-online.target Service Type=simple User=llama WorkingDirectory=/opt/llama.cpp Environment="GGML VK VISIBLE DEVICES=0,1" Environment="RADV PERFTEST=aco,gpl" ExecStart=/opt/llama.cpp/build/bin/llama-server \ --model /srv/models/qwen2.5-7b-instruct-q4 k m.gguf \ --n-gpu-layers 99 \ --split-mode layer \ --threads 2 \ --parallel 1 \ --host 0.0.0.0 \ --port 8080 Restart=on-failure RestartSec=5 Install WantedBy=multi-user.target Remember to omit the --model option if you want it to run in router mode Enable it: sudo systemctl daemon-reload sudo systemctl enable --now llama-server sudo systemctl status llama-server Check the HTTP endpoint: curl http://localhost:8080/health Then confirm VRAM is actually being used on both cards: for card in /sys/class/drm/card /device/mem info vram used; do printf "%s: " "$card" awk '{ printf "%.1f MiB\n", $1 / 1024 / 1024 }' "$card" done The exact numbers depend on the model, but both D700s should move substantially above idle after the model loads. Cooling matters The Mac Pro 6,1 has one thermal core and one fan. That design is elegant until both GPUs sit under sustained compute load. Install macfanctld and make the fan curve less timid: sudo apt install -y macfanctld sudoedit /etc/macfanctl.conf A reasonable starting point: fan min: 1200 temp avg floor: 45 temp avg ceiling: 58 log level: 1 Restart and watch the log: sudo systemctl restart macfanctld sudo tail -f /var/log/macfanctl.log Under sustained inference, you want stable temperatures, not silence. The D700s have more memory headroom than the D300s, but they also put more heat into the same small chassis. Things to avoid Flash attention Do not assume --flash-attn helps. GCN 1.0 predates the FP16 throughput assumptions that make flash attention compelling on modern hardware. Test it if you want, but make the default "off" until benchmarks prove otherwise. Baseline first ./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 Only then compare ./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 --flash-attn Partial offload Avoid half-on-GPU, half-on-CPU configurations for models that exceed VRAM: Prefer this when it fits --n-gpu-layers 99 Prefer this when it does not fit --n-gpu-layers 0 Be suspicious of this on the Mac Pro 6,1 --n-gpu-layers 20 The D700 cards are connected through an old workstation design, not a modern high-bandwidth multi-GPU fabric. Once inference has to bounce across CPU and GPU layers, the bus and synchronization overhead can erase the benefit of acceleration. Giant context windows The D700 memory budget looks generous until you increase context. KV cache grows with context size, layer count, embedding size, and cache precision. VRAM pressure = model weights + compute buffers + KV cache KV cache roughly grows with: context length x number of layers x hidden size x cache precision Start at --ctx-size 4096. Move to 8192 only after watching VRAM on both cards during real prompts. You can alternatively just remove this option and allow llama cpp to decide for you, it'll pick the maximum it can fit in what VRAM is left over from loading the model. Benchmarking Stop the service before benchmarking: sudo systemctl stop llama-server Confirm the cards are back near idle: cat /sys/class/drm/card /device/mem info vram used Then benchmark one variable at a time: GGML VK VISIBLE DEVICES=0,1 RADV PERFTEST=aco,gpl \ ./build/bin/llama-bench \ -m /srv/models/qwen2.5-7b-instruct-q4 k m.gguf \ -ngl 99 \ -t 2 \ -c 4096 load backend: loaded RPC backend from /home/altitudelabs/llama-b9305/libggml-rpc.so ggml vulkan: Found 2 Vulkan devices: ggml vulkan: 0 = AMD Radeon R9 200 / HD 7900 Series RADV TAHITI radv | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none ggml vulkan: 1 = AMD Radeon R9 200 / HD 7900 Series RADV TAHITI radv | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none load backend: loaded Vulkan backend from /home/altitudelabs/llama-b9305/libggml-vulkan.so load backend: loaded CPU backend from /home/altitudelabs/llama-b9305/libggml-cpu-ivybridge.so Downloading Qwopus3.5-9B-Coder-MTP-Q4 K M.gguf ───────────────────── 100% | model | size | params | backend | ngl | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: | | qwen35 9B Q4 K - Medium | 5.37 GiB | 9.20 B | Vulkan | 99 | 2 | pp512 | 40.11 ± 0.30 | | qwen35 9B Q4 K - Medium | 5.37 GiB | 9.20 B | Vulkan | 99 | 2 | tg128 | 18.85 ± 0.02 | Record: Run Model Quant Context Threads Flash attention Decode tok/s 1 Qwopus3.5-9B-Coder-MTP Q4 K M 4096 2 off 18.85 2 Qwen3.5-9B-MTP Q4 K XL 4096 2 off 9.17 3 Qwen3.5-9B-MTP Q4 K M 4096 2 off 19.04 4 Qwen2.5-Coder-7B-Instruct Q4 K M 4096 2 off 21.39 Do not compare llama-bench directly to llama-server under real API traffic. The server has slot management, sampling, tokenization, and HTTP overhead. Use bench numbers to compare configurations, not to see production throughput. The use case for these machines The D700 Mac Pro is not a cheap alternative to a H100 and it is not a modern gaming GPU box although, it can actually run very well not Vulkan is enabled . Its still useful though, despite it being a bit power hungry compared to modern options: Use case Fit Local coding assistant fallback Good with a 7B Q4/Q5 model Private summarization endpoint Good with conservative context Multi-user chat service Poor 13B+ experimentation CPU-only or use newer hardware Always-on home lab inference Good if power cost is acceptable The point of the D700 is not that it wins benchmarks. It is that a sunk-cost workstation can still be a reliable local inference endpoint when the model is sized correctly and the Vulkan path is configured well. One this worth thinking about however is the running costs, these old machines can suck up 250-300w under full load, so if you are doing full time inference on them it might actually be cheaper to get a Codex / Claude subscription. You do the math and do whats best for you. Chief Technology Officer writing about AI systems, software architecture, cyber security, cryptography, and the practical realities of technology leadership.