Run Llama.cpp on a Mac Pro 6,1 with Dual FirePro D700 GPUs on Ubuntu

Running llama.cpp on a Mac Pro 6,1 with Dual FirePro D700s on Ubuntu

A D700-specific guide to running llama.cpp with Vulkan on the 2013 Mac Pro: dual 6 GB FirePro cards, Ubuntu, RADV, full GPU offload, cooling, and the traps that make old GCN hardware look slower than it is.

May 26, 2026 · 12 min read

The 2013 Mac Pro is still a strange machine: thermally dense, beautifully overbuilt, and awkwardly dependent on two workstation GPUs that most modern ML stacks have forgotten. The D700 version is the most interesting one for local LLM work because it gives you dual AMD FirePro D700 cards with 6 GB of GDDR5 each.

That is 12 GB of aggregate VRAM, but it is not a single 12 GB GPU. Treat it as two separate 6 GB pools that llama.cpp can use well when the Vulkan backend is configured correctly.

Mac Pro 6,1 D700 memory shape

             llama.cpp Vulkan backend
                       |
              split-mode: layer
                       |
        +--------------+--------------+
        |                             |
  FirePro D700 0                 FirePro D700 1
  Tahiti / GCN 1.0               Tahiti / GCN 1.0
  6 GB GDDR5                     6 GB GDDR5

The practical outcome is simple: the D700 machine can comfortably run the class of models that are annoying on a D300. Seven billion parameter Q4 models become realistic with useful context sizes. Thirteen billion parameter models are still a poor fit if you expect full GPU offload, because the Mac Pro's dual cards do not behave like one contiguous accelerator.

This guide is a D700-specific rewrite of Edward Chalupa's excellent D300 guide. The main flow is the same: Ubuntu, the amdgpu kernel driver, Mesa RADV, llama.cpp built with Vulkan, and a few settings that matter much more than they look.

Hardware target

Apple shipped three GPU tiers in the Mac Pro 6,1. The D700 is the top configuration: each card has 6 GB of GDDR5, 2048 stream processors, a 384-bit memory bus, and 264 GB/s of memory bandwidth.

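| GPU          | Architecture family      | VRAM per card | Aggregate VRAM | Practical llama.cpp target    |
| ------------ | ------------------------ | ------------- | -------------- | ----------------------------- |
| FirePro D300 | GCN 1.0 / Pitcairn-class | 2 GB          | 4 GB           | 3B and small 4B models        |
| FirePro D500 | GCN 1.0 / Tahiti-class   | 3 GB          | 6 GB           | 4B and some compact 7B quants |
| FirePro D700 | GCN 1.0 / Tahiti-class   | 6 GB          | 12 GB          | 7B Q4/Q5, sometimes 8B Q4     |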

The important difference is not raw TFLOPS. It is memory headroom. A 7B Q4_K_M GGUF is usually around 4.0-4.5 GB before runtime buffers and KV cache. On a D300 that is a non-starter. On a D700 pair, layer splitting gives the model enough room.

What fits

Use these as planning numbers, not promises. Exact memory depends on architecture, quantization, context size, batch settings, and llama.cpp version.

| Model class | Quant  | Typical GGUF size | D700 verdict                     |
| ----------- | ------ | ----------------- | -------------------------------- |
| 3B          | Q8_0   | ~3.0-3.5 GB       | Easy, but underuses the hardware |
| 7B          | Q4_K_M | ~4.0-4.5 GB       | Good default target              |
| 7B          | Q5_K_M | ~5.0-5.5 GB       | Good with conservative context   |
| 8B          | Q4_K_M | ~4.5-5.0 GB       | Usually workable                 |
| 13B         | Q4_K_M | ~7.5-8.5 GB       | Usually not worth it on this bus |

The trap is reading "12 GB VRAM" as "anything under 12 GB fits." It does not. llama.cpp can distribute layers across devices, but each card still has a 6 GB ceiling and the runtime needs additional memory for compute buffers and KV cache.

Why a 13B Q4 model is awkward

  Model weights + buffers + KV cache
  +----------------------------------+
  | more than one D700 can hold well |
  +----------------------------------+

  Splitting helps with layers, but the old PCIe path and sync cost
  make CPU/GPU mixed inference unattractive once full offload fails.

For this machine, optimize for models that fully offload. If the model does not fit with --n-gpu-layers 99, the fallback should usually be CPU-only, not partial offload.

The driver stack

The D700 is old GCN hardware. The old radeon kernel driver can drive displays, but it is the wrong foundation for Vulkan inference. You want this stack:

llama-server
  |
  |  GGML Vulkan backend
  v
Mesa RADV Vulkan driver
  |
  |  userspace Vulkan implementation
  v
Linux amdgpu kernel driver
  |
  v
Dual FirePro D700 GPUs

Mesa documents RADV as the Vulkan driver for AMD GCN/RDNA GPUs, with the caveat that GCN 1-2 hardware may need amdgpu explicitly enabled instead of radeon. Ubuntu 24.04 often does the right thing on this Mac Pro, but you should verify rather than assume.

Step 1: verify both GPUs use amdgpu

Start with PCI detection:

lspci -nnk | grep -A3 -E "VGA|Display|FirePro|AMD"

You want both D700 devices to report:

Kernel driver in use: amdgpu

If either card is bound to radeon, add the Southern Islands amdgpu flags:

sudoedit /etc/default/grub

Set or extend GRUB_CMDLINE_LINUX_DEFAULT:

radeon.si_support=0 amdgpu.si_support=1

Then update GRUB and reboot:

sudo update-grub
sudo reboot

After reboot, check again. Do not continue until both cards are on amdgpu.
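
The driver check above can be scripted. The following is a minimal sketch rather than part of the original guide: the function names gpu_drivers and all_amdgpu are hypothetical, and it assumes lspci -nnk prints each device line at column zero with its "Kernel driver in use:" line indented beneath it.

```shell
# Print the kernel driver bound to each VGA/Display PCI device.
# Reads `lspci -nnk` output on stdin so it can be run against saved text.
gpu_drivers() {
  awk '
    # Device lines start with a hex bus address; detail lines are indented.
    /^[0-9a-f]/ { is_gpu = ($0 ~ /VGA compatible controller|Display controller/) }
    is_gpu && /Kernel driver in use:/ { print $NF }
  '
}

# Succeed only if at least one GPU was found and every GPU is on amdgpu.
all_amdgpu() {
  drivers=$(gpu_drivers)
  [ -n "$drivers" ] && ! printf '%s\n' "$drivers" | grep -qv '^amdgpu$'
}

# Example use on the Mac Pro itself:
#   lspci -nnk | all_amdgpu && echo "all GPUs on amdgpu" || echo "fix the driver binding first"
```

Reading from stdin keeps the helper independent of the machine it runs on, so the same function can be pointed at a saved lspci dump when debugging remotely.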

Step 2: install and test Vulkan

Install the Vulkan userspace pieces and the headers llama.cpp needs during build:

sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  curl \
  git \
  glslc \
  libvulkan-dev \
  mesa-vulkan-drivers \
  spirv-headers \
  vulkan-tools

Now check what Vulkan sees:

vulkaninfo --summary

For a working D700 setup you should see two RADV devices. They may be labelled as RADV TAHITI, AMD FirePro D700, or similar depending on Mesa and kernel versions.

Expected shape, not exact text:

Devices:
  GPU0: RADV TAHITI / AMD FirePro D700
  GPU1: RADV TAHITI / AMD FirePro D700

If vulkaninfo sees only one card, fix that before building llama.cpp. llama.cpp can only use the devices that Vulkan exposes.
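
To make the two-device check explicit, the summary output can be counted rather than eyeballed. This is a minimal sketch under stated assumptions: the function name count_radv_devices is hypothetical, and it assumes each device in vulkaninfo --summary has a "deviceName = ..." line whose value contains "RADV" for the AMD cards. Filtering on RADV matters because Mesa may also expose a software (llvmpipe) device that should not be counted.

```shell
# Count RADV devices in `vulkaninfo --summary` output read on stdin.
count_radv_devices() {
  grep -E '^[[:space:]]*deviceName[[:space:]]*=' | grep -c 'RADV'
}

# Example use on the Mac Pro itself (a working D700 pair should print 2):
#   vulkaninfo --summary | count_radv_devices
```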

Step 3: install llama.cpp with Vulkan

You have two good options here. Start with the prebuilt Vulkan release unless you specifically need a local patch, a known commit, or a custom compiler setup.

Option A: download the prebuilt Vulkan binary

llama.cpp publishes release builds on GitHub, including an Ubuntu x64 Vulkan package. Download the latest one from the releases page:

https://github.com/ggml-org/llama.cpp/releases

Look for:

Linux -> Ubuntu x64 (Vulkan)

On the machine itself, you can fetch the newest Ubuntu x64 Vulkan tarball with the GitHub API:

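# /opt is root-owned, so create the directory with sudo and hand it to your user
sudo mkdir -p /opt/llama.cpp
sudo chown "$USER" /opt/llama.cpp
cd /opt/llama.cpp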

release_url=$(
  curl -fsSL https://api.github.com/repos/ggml-org/llama.cpp/releases/latest |
    grep "browser_download_url" |
    grep "ubuntu-vulkan-x64.tar.gz" |
    cut -d '"' -f 4
)

curl -L "$release_url" -o llama-vulkan.tar.gz
tar -xzf llama-vulkan.tar.gz

The extracted archive contains the runnable binaries. Depending on the release layout, they may be directly under the extracted directory rather than under build/bin. Confirm where llama-server landed:

find /opt/llama.cpp -type f -name "llama-server" -print

Use that path in the systemd unit below. If it prints /opt/llama.cpp/build/bin/llama-server, the later examples can be used unchanged.
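
If you would rather capture the path than copy it by hand, a small helper can do the lookup. This is a sketch only: the function name find_llama_server and the LLAMA_SERVER variable are hypothetical, and it simply takes the first match if the archive contains more than one.

```shell
# Print the first llama-server binary found under a root directory
# (defaults to /opt/llama.cpp, the directory used in this guide).
find_llama_server() {
  find "${1:-/opt/llama.cpp}" -type f -name "llama-server" -print 2>/dev/null | head -n 1
}

# Example use:
#   LLAMA_SERVER=$(find_llama_server)
#   "$LLAMA_SERVER" --version
```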

Option B: build from source

Build from source when you want a specific commit or want to prove exactly which backend options are compiled in:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build \
  -DGGML_VULKAN=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j"$(nproc)"

Confirm the binary can see backend devices:

./build/bin/llama-server --list-devices

If your llama.cpp build is older and does not expose --list-devices, use a short llama-cli smoke test and read the startup log for ggml_vulkan.

Step 4: run for full offload

The default D700 command should be something like:

GGML_VK_VISIBLE_DEVICES=0,1 \
RADV_PERFTEST=aco,gpl \
./build/bin/llama-server \
  --model /models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --threads 2 \
  --parallel 1 \
  --host 0.0.0.0 \
  --port 8088

One thing worth noting if you are new to llama.cpp is the --model option. If you omit it, llama-server starts in router mode: it makes any models you have locally available, and the first time you use one through the web UI it loads that model into memory. However, a CLI harness such as Pi does not know to tell the server to unload the current model when you switch to a new one, which will probably crash the server. To avoid that, add --models-max 1.

The settings that look optional but are not:

| Setting                     | Why it matters                                                   |
| --------------------------- | ---------------------------------------------------------------- |
| GGML_VK_VISIBLE_DEVICES=0,1 | Keeps both D700s visible to llama.cpp                            |
| --split-mode layer          | Lets llama.cpp distribute transformer layers across the two GPUs |
| --threads 2                 | Avoids wasting CPU on sync-heavy Vulkan submission               |
| RADV_PERFTEST=aco,gpl       | Uses RADV's faster shader compiler and pipeline path             |

Do not blindly set --threads to the number of Xeon threads. Once all layers are on the GPUs, extra CPU threads mostly wait on Vulkan synchronization. On this machine, high thread counts can make the desktop feel broken without improving tokens per second.

Step 5: make it a service

Create a dedicated model directory and service user if you want this machine to be an always-on endpoint. Then create:

sudoedit /etc/systemd/system/llama-server.service

With these contents:

[Unit]
Description=llama.cpp Vulkan inference server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=llama
WorkingDirectory=/opt/llama.cpp
Environment="GGML_VK_VISIBLE_DEVICES=0,1"
Environment="RADV_PERFTEST=aco,gpl"
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  --model /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --threads 2 \
  --parallel 1 \
  --host 0.0.0.0 \
  --port 8080
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Remember to omit the --model option if you want the service to run in router mode.

Enable it:

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
sudo systemctl status llama-server

Check the HTTP endpoint:

curl http://localhost:8080/health

Then confirm VRAM is actually being used on both cards:

for card in /sys/class/drm/card*/device/mem_info_vram_used; do
  printf "%s: " "$card"
  awk '{ printf "%.1f MiB\n", $1 / 1024 / 1024 }' "$card"
done

The exact numbers depend on the model, but both D700s should move substantially above idle after the model loads.
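
Raw MiB figures are easier to judge against the 6 GB ceiling as a percentage. The sketch below assumes the amdgpu driver also exposes mem_info_vram_total next to mem_info_vram_used in sysfs; the function name vram_percent is hypothetical.

```shell
# Percentage of VRAM in use, given used and total byte counts.
vram_percent() {
  awk -v used="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * used / total }'
}

# Report every DRM device that exposes both counters; others are skipped.
for dev in /sys/class/drm/card*/device; do
  [ -r "$dev/mem_info_vram_used" ] && [ -r "$dev/mem_info_vram_total" ] || continue
  printf '%s: %s%% used\n' "$dev" \
    "$(vram_percent "$(cat "$dev/mem_info_vram_used")" "$(cat "$dev/mem_info_vram_total")")"
done
```

If one card sits near idle while the other is close to full, the layer split is not doing what you expect and the device visibility settings are worth rechecking.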

Cooling matters

The Mac Pro 6,1 has one thermal core and one fan. That design is elegant until both GPUs sit under sustained compute load. Install macfanctld and make the fan curve less timid:

sudo apt install -y macfanctld
sudoedit /etc/macfanctl.conf

A reasonable starting point:

fan_min: 1200
temp_avg_floor: 45
temp_avg_ceiling: 58
log_level: 1

Restart and watch the log:

sudo systemctl restart macfanctld
sudo tail -f /var/log/macfanctl.log

Under sustained inference, you want stable temperatures, not silence. The D700s have more memory headroom than the D300s, but they also put more heat into the same small chassis.

Things to avoid

Flash attention

Do not assume --flash-attn helps. GCN 1.0 predates the FP16 throughput assumptions that make flash attention compelling on modern hardware. Test it if you want, but make the default "off" until benchmarks prove otherwise.

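# Baseline: flash attention off
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 --flash-attn 0

# Comparison: flash attention on (llama-bench expects an explicit 0 or 1)
./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 --flash-attn 1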

Partial offload

Avoid half-on-GPU, half-on-CPU configurations for models that exceed VRAM:

--n-gpu-layers 99    # good: full offload

--n-gpu-layers 0     # acceptable fallback: CPU-only

--n-gpu-layers 20    # avoid: partial offload

The D700 cards are connected through an old workstation design, not a modern high-bandwidth multi-GPU fabric. Once inference has to bounce across CPU and GPU layers, the bus and synchronization overhead can erase the benefit of acceleration.

Giant context windows

The D700 memory budget looks generous until you increase context. KV cache grows with context size, layer count, embedding size, and cache precision.

VRAM pressure = model weights + compute buffers + KV cache

KV cache roughly grows with:
  context length x number of layers x hidden size x cache precision

Start at --ctx-size 4096. Move to 8192 only after watching VRAM on both cards during real prompts. Alternatively, remove this option and let llama.cpp decide for you: it will pick the largest context that fits in the VRAM left over after the model loads.
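
A worked estimate makes the context cost concrete. This is a rough sketch, not llama.cpp's own accounting: it assumes the cache stores one key and one value vector per layer per token, sized by the number of KV heads times the head dimension, and the function name kv_cache_mib is hypothetical. The example shape (32 layers, 32 KV heads, head dimension 128, 2-byte f16 cache) is illustrative of an older 7B model without grouped-query attention; models that use grouped-query attention have fewer KV heads and a much smaller cache.

```shell
# Estimate KV cache size in MiB.
# Usage: kv_cache_mib <n_layers> <n_kv_heads> <head_dim> <ctx_size> [bytes_per_element]
kv_cache_mib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="${5:-2}" 'BEGIN {
    # Factor of 2 covers keys and values.
    printf "%.0f\n", 2 * l * h * d * c * b / 1048576
  }'
}

kv_cache_mib 32 32 128 4096    # prints 2048
kv_cache_mib 32 32 128 8192    # prints 4096
```

For that illustrative shape, doubling the context from 4096 to 8192 adds about 2 GiB across the two cards, which is why context should be raised only after checking real VRAM use.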

Benchmarking

Stop the service before benchmarking:

sudo systemctl stop llama-server

Confirm the cards are back near idle:

cat /sys/class/drm/card*/device/mem_info_vram_used

Then benchmark one variable at a time:

GGML_VK_VISIBLE_DEVICES=0,1 RADV_PERFTEST=aco,gpl \
./build/bin/llama-bench \
  -m /srv/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  -t 2 \
  -c 4096

Example output (this run used the Qwopus3.5-9B-Coder-MTP Q4_K_M model):

load_backend: loaded RPC backend from /home/altitudelabs/llama-b9305/libggml-rpc.so
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon R9 200 / HD 7900 Series (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/altitudelabs/llama-b9305/libggml-vulkan.so
load_backend: loaded CPU backend from /home/altitudelabs/llama-b9305/libggml-cpu-ivybridge.so
Down Qwopus3.5-9B-Coder-MTP-Q4_K_M.gguf ───────────────────── 100%
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.37 GiB |     9.20 B | Vulkan     |  99 |       2 |           pp512 |         40.11 ± 0.30 |
| qwen35 9B Q4_K - Medium        |   5.37 GiB |     9.20 B | Vulkan     |  99 |       2 |           tg128 |         18.85 ± 0.02 |

Record:

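| Run | Model                     | Quant   | Context | Threads | Flash attention | Decode tok/s |
| --- | ------------------------- | ------- | ------- | ------- | --------------- | ------------ |
| 1   | Qwopus3.5-9B-Coder-MTP    | Q4_K_M  | 4096    | 2       | off             | 18.85        |
| 2   | Qwen3.5-9B-MTP            | Q4_K_XL | 4096    | 2       | off             | 9.17         |
| 3   | Qwen3.5-9B-MTP            | Q4_K_M  | 4096    | 2       | off             | 19.04        |
| 4   | Qwen2.5-Coder-7B-Instruct | Q4_K_M  | 4096    | 2       | off             | 21.39        |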

Do not compare llama-bench directly to llama-server under real API traffic. The server adds slot management, sampling, tokenization, and HTTP overhead. Use bench numbers to compare configurations, not to predict production throughput.
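
When comparing several configurations, it helps to pull just the test name and tokens per second out of each llama-bench table. The helper below is a minimal sketch that assumes the pipe-delimited table layout shown in the sample output above, with the test name and t/s as the last two columns; the function name bench_tps is hypothetical.

```shell
# Extract "<test> <tokens/s>" pairs from llama-bench table output on stdin.
bench_tps() {
  awk -F'|' '
    !/^\|/            { next }   # ignore log lines; table rows start with a pipe
    $(NF-1) ~ /t\/s/  { next }   # header row
    $2 ~ /^ *-+ *$/   { next }   # separator row
    {
      test = $(NF-2)
      gsub(/^ +| +$/, "", test)
      split($(NF-1), parts, " ") # first token is the mean, before the +/- spread
      print test, parts[1]
    }
  '
}

# Example use:
#   ./build/bin/llama-bench -m /srv/models/model.gguf -ngl 99 -t 2 | bench_tps
```

The tg rows are the decode figures recorded in the table above; the pp rows measure prompt processing and are worth tracking separately.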

The use case for these machines

The D700 Mac Pro is not a cheap alternative to an H100, and it is not a modern gaming GPU box (although it can actually run very well now that Vulkan is enabled). It is still useful, though, despite being a bit power hungry compared to modern options:

| Use case                        | Fit                              |
| ------------------------------- | -------------------------------- |
| Local coding assistant fallback | Good with a 7B Q4/Q5 model       |
| Private summarization endpoint  | Good with conservative context   |
| Multi-user chat service         | Poor                             |
| 13B+ experimentation            | CPU-only or use newer hardware   |
| Always-on home lab inference    | Good if power cost is acceptable |

The point of the D700 is not that it wins benchmarks. It is that a sunk-cost workstation can still be a reliable local inference endpoint when the model is sized correctly and the Vulkan path is configured well.

One thing worth thinking about, however, is the running cost. These old machines can draw 250-300 W under full load, so if you are running inference on them full time, it might actually be cheaper to get a Codex or Claude subscription. Do the math and do what is best for you.
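
The math is short enough to sketch in shell. The function name monthly_cost is hypothetical, and the 0.30 per kWh price in the example is a placeholder rather than a real tariff; substitute your own rate and duty cycle.

```shell
# Monthly energy use and cost for a constant load.
# Usage: monthly_cost <watts> <hours_per_day> <price_per_kwh> [days]
monthly_cost() {
  awk -v w="$1" -v h="$2" -v p="$3" -v d="${4:-30}" 'BEGIN {
    kwh = w * h * d / 1000
    printf "%.1f kWh, %.2f per month\n", kwh, kwh * p
  }'
}

# 300 W around the clock at a placeholder price of 0.30 per kWh
monthly_cost 300 24 0.30    # prints: 216.0 kWh, 64.80 per month
```

A machine that only serves occasional requests spends most of its time well below full load, so the around-the-clock figure is an upper bound rather than a typical bill.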

Chief Technology Officer writing about AI systems, software architecture, cyber security, cryptography, and the practical realities of technology leadership.

source & further reading

matthewgribben.com — original article
