RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

A user combined an RTX 5080 and RTX 3090 on an Asus Prime X570-Pro motherboard to run Qwen 3.6 27B Q8 at over 80 tokens per second. The setup required disabling CSM, enabling Above 4G Decoding and ReSize BAR, and using the nvidia-open driver due to different GPU generations. This demonstrates a cost-effective approach to high-performance local LLM inference using heterogeneous NVIDIA GPUs.

A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups. Fast forward to 2026: with Qwen 3.5, Gemma, and Qwen 3.6, I needed more than 16GB. So I got myself a refurbished RTX 3090 with 24GB. I could then run Qwen 3.6 Q4 quants, first at ~30 tok/s, then at 50-60 with MTP. Not bad, but it still felt limited while my 5080 sat barely used.

So I began digging into what kind of setup could take advantage of those two cards together. I already had DDR4 sticks and SSDs ready; I only needed a motherboard capable of handling both cards. Enter the Asus Prime X570-Pro. The “Pro” is important: it is what ensures the x16 PCIe link can be split into two x8 links. The 5080 being the monster it is, I bought a good-quality PCIe 4 riser to plug it into the second slot.

BIOS

The BIOS part was more complex than I anticipated. First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents using both cards and requires unnecessary kernel-parameter trickery even for one of them.

The parameters that should be set:

- Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled
- Go to the Advanced tab
  - PCI Subsystem Settings
    - Set Above 4G Decoding to Enabled
    - Set ReSize BAR Support to Auto or Enabled
- Still on the Advanced tab
  - PCIEX16_1 Link Mode: Gen 4
  - PCIEX16_2 Link Mode: Gen 4

kernel

NVidia documentation is a mess. Here’s the link to the driver installation procedure, yes, with /tesla in the URL, because why not: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html

The two GPUs being different models, I unfortunately can’t set up this beauty: https://github.com/aikitoria/open-gpu-kernel-modules

I tested it and the feature is enabled, but it was clear from the start that it would likely fail with different GPUs, all the more so across different generations. Nevertheless, for the lucky readers owning two cards of the same type, once the patched driver is built and installed, don’t forget to:

- Uninstall nvidia-dkms-open
- Blacklist the new nova driver

Only then will the freshly patched driver load at boot. You should see the following:

```bash
$ nvidia-smi topo -p2p r
        GPU0    GPU1
 GPU0   X       OK
 GPU1   OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  DR   = Disabled by regkey
  U    = Unknown
```

If, like me, you own different NVidia cards, just use the nvidia-open driver.

Once rebooted with the nvidia driver loaded, check that it sees both cards:

```bash
$ nvidia-smi
Sat Jun 13 09:29:23 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02           KMD Version: 610.43.02      CUDA UMD Version: 13.3       |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0  On |                  N/A |
|  0%   34C    P8             17W /  350W |   23646MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5080        On  |   00000000:08:00.0 Off |                  N/A |
|  0%   31C    P8             15W /  360W |   15861MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```

llama.cpp

These are the build flags I use to support both card generations:

```bash
cmake -B build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;120" \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DGGML_CUDA_NCCL=OFF
```

The relevant flag is `CMAKE_CUDA_ARCHITECTURES="86;120"`, which enables both the Ampere and Blackwell architectures. Note the `-DGGML_CUDA_NCCL=OFF` flag: I found that NCCL was actually counterproductive, even if the llama-server logs say otherwise.
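The architecture list does not have to be typed from memory. As a rough sketch (not part of the original setup), the Python helper below turns compute-capability strings such as `8.6` (RTX 3090) and `12.0` (RTX 5080) into the value passed to `CMAKE_CUDA_ARCHITECTURES`. Both function names are hypothetical, and `detect_compute_caps` assumes a driver recent enough to support the `compute_cap` query field of `nvidia-smi`.

```python
import subprocess


def to_cmake_archs(compute_caps):
    """Convert compute capabilities like "8.6" into a CMake list like "86;120"."""
    archs = set()
    for cap in compute_caps:
        major, minor = cap.strip().split(".")
        archs.add(int(major) * 10 + int(minor))
    # Sort numerically so the result does not depend on GPU enumeration order.
    return ";".join(str(arch) for arch in sorted(archs))


def detect_compute_caps():
    """Ask nvidia-smi for each GPU's compute capability (assumes a recent driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


# Capabilities of the two cards in this build, written out by hand.
print(to_cmake_archs(["8.6", "12.0"]))  # 86;120
```

On a machine with both cards installed, `to_cmake_archs(detect_compute_caps())` should yield the same string without hard-coding anything.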
Now to the startup options:

```bash
llama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \
  -c 229376 \
  -np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
  -ctk q8_0 -ctv q8_0 --kv-unified \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \
  -sm tensor -ts 2,3 \
  --port 8001 --host 0.0.0.0
```

The sauce:

- `Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf` (https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF): this model’s Q8 quantization fits in the overall 39GB with a 230k context and the KV cache quantized at q8
- `--spec-type ngram-mod,draft-mtp --spec-draft-n-max 3`: the MTP speculative boost with a hint from ngram
- `-sm tensor`: from the llama.cpp multi-GPU documentation (https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md)
- `-ts 2,3`: the cards’ usage ratio, important to be able to fill up every corner of VRAM

Result

With this setup, I am able to run a full Qwen3.6 (https://github.com/QwenLM/Qwen3.6) model quantized at Q8 at a whopping 80+ tokens/sec; depending on the task, it can go as high as 90+.
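These figures can be cross-checked by hand: the throughput llama-server reports is tokens divided by elapsed time, and the draft acceptance rate is accepted draft tokens over generated ones. A minimal Python sketch of that arithmetic, using values copied from the timing lines of this run's server log (the helper names are hypothetical and not part of llama.cpp):

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput in tokens/s from a token count and a duration in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


def acceptance_rate(accepted, generated):
    """Fraction of speculative draft tokens that the main model accepted."""
    return accepted / generated


# Values copied from the llama-server timing lines for task 45808.
print(round(tokens_per_second(457, 5185.10), 2))  # generation: 88.14
print(round(tokens_per_second(17, 219.76), 2))    # prompt eval: 77.36
print(round(acceptance_rate(320, 414), 5))        # draft acceptance: 0.77295
```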
```
2673.12.803.689 I slot create check: id 0 | task 45808 | created context checkpoint 1 of 32 pos min = 12, pos max = 12, n tokens = 13, size = 149.677 MiB
2673.13.869.654 I reasoning-budget: deactivated natural end
2673.14.095.592 I slot print timing: id 0 | task 45808 | n decoded = 100, tg = 81.84 t/s
2673.17.131.165 I slot print timing: id 0 | task 45808 | n decoded = 388, tg = 91.13 t/s
2673.18.058.712 I slot print timing: id 0 | task 45808 | prompt eval time = 219.76 ms / 17 tokens 12.93 ms per token, 77.36 tokens per second
2673.18.058.714 I slot print timing: id 0 | task 45808 | eval time = 5185.10 ms / 457 tokens 11.35 ms per token, 88.14 tokens per second
2673.18.058.715 I slot print timing: id 0 | task 45808 | total time = 5404.85 ms / 474 tokens
2673.18.058.716 I slot print timing: id 0 | task 45808 | graphs reused = 41669
2673.18.058.717 I slot print timing: id 0 | task 45808 | draft acceptance = 0.77295 320 accepted / 414 generated
2673.18.058.728 I statistics ngram-mod: calls b,g,a = 341 43646 1169, gen drafts = 1169, acc drafts = 1169, gen tokens = 74496, acc tokens = 44050, dur b,g,a = 1403.794, 706.959, 134.904 ms
2673.18.058.731 I statistics draft-mtp: calls b,g,a = 341 42477 42477, gen drafts = 42477, acc drafts = 36208, gen tokens = 127431, acc tokens = 86553, dur b,g,a = 0.158, 264947.885, 44.505 ms
```

While your cards are computing, check that they are actually running at full speed with the following command:

```bash
$ sudo lspci -vvv -s 07:00.0 | grep "LnkSta:"
```

If you’re running the workload on an x16 link split into two x8 links, you should see the following for each PCIe port:

```
LnkSta: Speed 16GT/s, Width x8 downgraded
```
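To check both cards in one go, the `LnkSta` line can be parsed rather than eyeballed. The following Python sketch is an illustration under assumptions: `parse_lnksta` and `link_status` are hypothetical helper names, the bus addresses would be the ones `nvidia-smi` reports on your machine, and the regular expression is written to tolerate the parenthesised annotations that some `lspci` versions print after the speed and width.

```python
import re
import subprocess

# Matches "LnkSta:" only, not "LnkSta2:", and skips any "(ok)"-style annotations.
LNKSTA_RE = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_lnksta(text):
    """Return (speed in GT/s, lane width) from lspci output, or None if absent."""
    match = LNKSTA_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def link_status(bus_id):
    """Run lspci for one device; root is needed to read the link status."""
    out = subprocess.run(
        ["sudo", "lspci", "-vvv", "-s", bus_id],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_lnksta(out)


# Parsing the line quoted in this post: PCIe Gen 4 (16 GT/s) on 8 lanes.
print(parse_lnksta("LnkSta: Speed 16GT/s, Width x8 downgraded"))  # (16.0, 8)
```

With the bus IDs from this build, `link_status("07:00.0")` and `link_status("08:00.0")` would both be expected to report 16 GT/s at x8 while a model is generating; an idle card may report a lower speed because the link can downshift to save power.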