A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.
Fast forward to 2026: with Qwen 3.5, Gemma and Qwen 3.6 out, I needed more than 16GB. So I got myself a refurbished RTX 3090 with 24GB. I could then run Qwen 3.6 Q4 quants, first at ~30 tok/s, then at 50-60 with MTP. Not bad, but it still felt limited while my 5080 sat barely used.
So I began digging into what kind of setup could take advantage of both cards together. I already had DDR4 sticks and SSDs ready; I only needed a motherboard capable of handling the two cards.
Enter the Asus Prime X570-Pro. The “Pro” is important: it is what ensures the x16 PCIe link can be split into two x8 links.
The 5080 being the monster it is, I bought a good quality PCIe 4 riser to plug it into the second slot.
BIOS #
The BIOS part was more complex than I anticipated. First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents using both cards together, and requires unnecessary kernel parameter trickery even to use just one of them.
The parameters that should be set:
- Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled
- Go to the Advanced tab -> PCI Subsystem Settings
  - Set Above 4G Decoding to Enabled
  - Set ReSize BAR Support to Auto or Enabled
- Still on the Advanced tab, set PCIEX16_1 Link Mode to Gen 4
- Set PCIEX16_2 Link Mode to Gen 4
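Once the OS is up, a quick way to confirm it really booted in UEFI mode is to look for the /sys/firmware/efi directory, which Linux only exposes on UEFI boots. A minimal sketch; the boot_mode helper and its optional argument are names made up for illustration:

```shell
# Print the firmware mode the running Linux system was booted in.
# $1 (optional, hypothetical) overrides the firmware sysfs directory,
# which makes the check easy to try against a fake directory tree.
boot_mode() {
  fw_dir="${1:-/sys/firmware}"
  if [ -d "$fw_dir/efi" ]; then
    echo "UEFI"
  else
    echo "legacy BIOS/CSM"
  fi
}

boot_mode
```

If this prints "legacy BIOS/CSM", the CSM setting above is probably still enabled or the OS was installed in MBR mode.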
kernel #
NVIDIA’s documentation is a mess. Here’s the link to the driver installation procedure, yes, with /tesla in the URL, because why not: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html
The two GPUs being different models, I unfortunately can’t use https://github.com/aikitoria/open-gpu-kernel-modules, beautiful as it is. I tested it and the feature does get enabled, but it was clear from the start that it would likely fail with different GPUs, let alone different generations.
Nevertheless, for the lucky readers owning 2 cards of the same type: once the patched driver is built and installed, don’t forget to:
- Uninstall nvidia-dkms-open
- Blacklist the new nova driver
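As a rough sketch, blacklisting boils down to a modprobe.d drop-in file. The helper below only generates such a file; its name, the output path and the nova_core module name in the example are assumptions, so check `lsmod | grep -i nova` for the module names your kernel actually loads, and use your distribution's package manager for the uninstall step.

```shell
# Write a modprobe.d-style blacklist file.
# $1: output file; remaining arguments: kernel module names to blacklist.
write_blacklist() {
  out="$1"
  shift
  for mod in "$@"; do
    echo "blacklist $mod"
  done > "$out"
}

# Example run. The module name is an assumption, see above.
write_blacklist "${TMPDIR:-/tmp}/blacklist-nova.conf" nova_core
cat "${TMPDIR:-/tmp}/blacklist-nova.conf"
```

The generated file would then be copied as root into /etc/modprobe.d/. On most distributions the initramfs has to be regenerated afterwards for the blacklist to apply early at boot.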
Only then will the freshly patched driver load at boot. You should see the following:
$ nvidia-smi topo -p2p r
GPU0 GPU1
GPU0 X OK
GPU1 OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
DR = Disabled by regkey
U = Unknown
If, like me, you own different NVIDIA cards, just use the nvidia-open driver.
Once rebooted with the nvidia driver loaded, check that both cards are properly detected:
$ nvidia-smi
Sat Jun 13 09:29:23 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02 KMD Version: 610.43.02 CUDA UMD Version: 13.3 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:07:00.0 On | N/A |
| 0% 34C P8 17W / 350W | 23646MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5080 On | 00000000:08:00.0 Off | N/A |
| 0% 31C P8 15W / 360W | 15861MiB / 16303MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
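For scripting, the same check can be done with nvidia-smi's query mode, which prints one CSV line per GPU. A minimal sketch; expect_gpus is a hypothetical helper, and the expected count of 2 simply matches this build:

```shell
# Succeed only if stdin lists exactly $1 GPUs, one per non-empty line.
expect_gpus() {
  count=$(grep -c . || true)
  echo "found $count GPU(s), expected $1"
  [ "$count" -eq "$1" ]
}

# Compact view of index, name, VRAM and PCI bus ID, then the count check.
# Skipped silently on machines without the NVIDIA driver installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name,memory.total,pci.bus_id --format=csv,noheader \
    | expect_gpus 2 || echo "unexpected GPU count" >&2
fi
```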
llama.cpp #
My build flags have to support both card generations. The relevant flag is CMAKE_CUDA_ARCHITECTURES="86;120", which enables both the Ampere and Blackwell architectures. Note the -DGGML_CUDA_NCCL=OFF flag: I found out that nccl was actually counterproductive, even if the llama-server logs say otherwise.
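For reference, a configure and build invocation carrying those two flags might look like the sketch below. -DGGML_CUDA=ON is llama.cpp's switch for the CUDA backend; the build directory name and the rest of the invocation are assumptions rather than the exact command line used here.

```shell
# Run from the root of a llama.cpp checkout.
# 86  = Ampere compute capability (RTX 3090)
# 120 = Blackwell compute capability (RTX 5080)
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;120" \
  -DGGML_CUDA_NCCL=OFF

# Compile with one job per CPU core.
cmake --build build --config Release -j "$(nproc)"
```

Listing both architectures means the CUDA kernels get compiled twice, so expect a longer build than with a single card.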
Now on to the startup options:
llama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \
-c 229376 \
-np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
-ctk q8_0 -ctv q8_0 --kv-unified \
--chat-template-kwargs '{"preserve_thinking": true}' \
--spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \
-sm tensor -ts 2,3 \
--port 8001 --host 0.0.0.0
The sauce:
- Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf: this model’s q8 quantization fits in the overall 39GB with a 230k context and KV-cache quant at q8!
- --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3: the MTP speculative boost with a hint from ngram
- -sm tensor: from the llama.cpp multi-GPU documentation
- -ts 2,3: the cards’ usage ratio, important to be able to fill up every VRAM corner!
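The 2,3 passed to -ts is simply the two VRAM sizes, 16GB and 24GB, reduced to their smallest whole-number ratio. This assumes llama.cpp lists the 5080 first; CUDA orders devices fastest-first by default, which can differ from the PCI bus order nvidia-smi shows. A small sketch of the arithmetic, with gcd and split_ratio as hypothetical helper names:

```shell
# Greatest common divisor of two positive integers (Euclid's algorithm).
gcd() {
  a=$1
  b=$2
  while [ "$b" -ne 0 ]; do
    t=$((a % b))
    a=$b
    b=$t
  done
  echo "$a"
}

# Reduce two VRAM sizes to the smallest integer ratio, formatted for -ts.
split_ratio() {
  g=$(gcd "$1" "$2")
  echo "$(( $1 / g )),$(( $2 / g ))"
}

split_ratio 16 24   # prints 2,3
```

The ratio only sets the starting point: KV cache and compute buffers also take VRAM, so the final values are worth checking against nvidia-smi once the model is loaded.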
With this setup, I am able to run a full Qwen3.6 model quantized at q8 at a whopping 80+ tokens/sec. Depending on the task, it can go as high as 90+.
2673.12.803.689 I slot create_check: id 0 | task 45808 | created context checkpoint 1 of 32 (pos_min = 12, pos_max = 12, n_tokens = 13, size = 149.677 MiB)
2673.13.869.654 I reasoning-budget: deactivated (natural end)
2673.14.095.592 I slot print_timing: id 0 | task 45808 | n_decoded = 100, tg = 81.84 t/s
2673.17.131.165 I slot print_timing: id 0 | task 45808 | n_decoded = 388, tg = 91.13 t/s
2673.18.058.712 I slot print_timing: id 0 | task 45808 | prompt eval time = 219.76 ms / 17 tokens ( 12.93 ms per token, 77.36 tokens per second)
2673.18.058.714 I slot print_timing: id 0 | task 45808 | eval time = 5185.10 ms / 457 tokens ( 11.35 ms per token, 88.14 tokens per second)
2673.18.058.715 I slot print_timing: id 0 | task 45808 | total time = 5404.85 ms / 474 tokens
2673.18.058.716 I slot print_timing: id 0 | task 45808 | graphs reused = 41669
2673.18.058.717 I slot print_timing: id 0 | task 45808 | draft acceptance = 0.77295 ( 320 accepted / 414 generated)
2673.18.058.728 I statistics ngram-mod: #calls(b,g,a) = 341 43646 1169, #gen drafts = 1169, #acc drafts = 1169, #gen tokens = 74496, #acc tokens = 44050, dur(b,g,a) = 1403.794, 706.959, 134.904 ms
2673.18.058.731 I statistics draft-mtp: #calls(b,g,a) = 341 42477 42477, #gen drafts = 42477, #acc drafts = 36208, #gen tokens = 127431, #acc tokens = 86553, dur(b,g,a) = 0.158, 264947.885, 44.505 ms
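The headline numbers in these logs are easy to re-derive: throughput is tokens divided by elapsed seconds, and draft acceptance is accepted draft tokens divided by generated ones. A quick sketch with awk, where rate and acceptance are hypothetical helper names and the inputs are copied from the print_timing lines above:

```shell
# Tokens per second from a token count and a duration in milliseconds.
rate() {
  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

# Share of drafted tokens that the main model accepted.
acceptance() {
  awk -v acc="$1" -v gen="$2" 'BEGIN { printf "%.5f\n", acc / gen }'
}

rate 457 5185.10     # eval time line: prints 88.14
rate 17 219.76       # prompt eval line: prints 77.36
acceptance 320 414   # draft acceptance line: prints 0.77295
```

With roughly three out of four drafted tokens accepted, most decoding steps yield more than one token, which is where the jump over plain decoding comes from.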
While your cards are computing, check they are actually running at full speed with the following command:
$ sudo lspci -vvv -s 07:00.0 | grep "LnkSta:"
For each PCIe port, you should see:
LnkSta: Speed 16GT/s, Width x8 (downgraded)
That is the expected output when the workload runs on an x16 link split into two x8 links.
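To check both cards in one go, the lspci call can be wrapped in a small loop. A sketch under assumptions: link_summary and check_links are hypothetical helper names, and the bus addresses passed in would be the ones nvidia-smi reported (07:00.0 and 08:00.0 on this machine).

```shell
# Reduce lspci's LnkSta line to "<speed> <width>".
link_summary() {
  sed -n 's/.*LnkSta:[[:space:]]*Speed \([^,]*\), Width \(x[0-9]*\).*/\1 \2/p'
}

# Print the negotiated link of every PCI address given as an argument.
# Needs root, since lspci hides link capabilities from regular users.
check_links() {
  for addr in "$@"; do
    printf '%s: ' "$addr"
    sudo lspci -vvv -s "$addr" | link_summary
  done
}

# Parsing demo on the line quoted above, prints: 16GT/s x8
echo "LnkSta: Speed 16GT/s, Width x8 (downgraded)" | link_summary
```

Calling `check_links 07:00.0 08:00.0` while a generation is running should then print one line per card. At idle the cards typically drop to a lower link speed to save power, so a slower reading outside of a workload is not a problem in itself.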