[ARTICLE · art-26085] src=imil.net pub= topic=large-language-models verified=true sentiment=↑ positive

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

A user combined an RTX 5080 and RTX 3090 on an Asus Prime X570-Pro motherboard to run Qwen 3.6 27B Q8 at over 80 tokens per second. The setup required disabling CSM, enabling Above 4G Decoding and ReSize BAR, and using the nvidia-open driver due to different GPU generations. This demonstrates a cost-effective approach to high-performance local LLM inference using heterogeneous NVIDIA GPUs.

6 min read · published Jun 13, 2026

A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.

Fast forward to 2026: Qwen 3.5, Gemma, Qwen 3.6... I needed more than 16GB. So I got myself a refurbished RTX 3090 with 24GB. I could then run Qwen 3.6 Q4 quants, first at ~30 tok/s, then 50-60 with MTP. Not bad, but it still felt limited while my 5080 was barely used.

So I began digging into what kind of setup could take advantage of those 2 cards together. I already had DDR4 sticks and SSDs ready; I only needed a mobo capable of handling the two cards.

Enter the Asus Prime X570-Pro. The “Pro” is important: it is what ensures the x16 PCIe lanes can be split into 2 x8.

The 5080 being the monster it is, I bought a good-quality PCIe 4 riser to plug it into the second slot.

BIOS #

The BIOS part was more complex than I anticipated. First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents the use of both cards and requires unnecessary kernel-parameter trickery even for one of them.

The parameters that should be set:

  • Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled
  • Go to the Advanced tab -> PCI Subsystem Settings
  • Set Above 4G Decoding to Enabled
  • Set ReSize BAR Support to Auto or Enabled.
  • Still on the Advanced tab -> PCIEX16_1 Link Mode: Gen 4
  • PCIEX16_2 Link Mode: Gen 4
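Since everything hinges on booting in UEFI mode, it can be worth confirming from the running system that CSM is really out of the picture. The following is a minimal sketch, not part of the original procedure: it relies on the fact that Linux only exposes /sys/firmware/efi when booted through UEFI. The boot_mode function name and its optional sysfs-root argument are hypothetical, added so the check can be pointed at a test directory.

```shell
# Report the firmware mode the running Linux system was booted in.
# The optional argument is the sysfs root (hypothetical helper parameter);
# it defaults to the real /sys.
boot_mode() {
    sysfs_root="${1:-/sys}"
    if [ -d "$sysfs_root/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS/CSM"
    fi
}

boot_mode
```

If this prints BIOS/CSM after changing the settings above, the OS was most likely installed in legacy mode and the bootloader needs to be reinstalled for UEFI.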

kernel #

NVIDIA documentation is a mess; here’s the link to the driver installation procedure, yes, with /tesla in the URL, because why not: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html

The two GPUs being different models, I unfortunately can’t set up this beauty (https://github.com/aikitoria/open-gpu-kernel-modules). I tested it and the feature is enabled, but it was clear from the start that it would likely fail with different GPUs, let alone different generations.

Nevertheless, for the lucky readers owning 2 cards of the same type, once the patched driver is built and installed, don’t forget to:

  • Uninstall nvidia-dkms-open
  • Blacklist the new nova driver

Only then will the freshly patched driver load at boot. You should see the following:

$ nvidia-smi topo -p2p r
 	GPU0	GPU1	
 GPU0	X	OK	
 GPU1	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  DR   = Disabled by regkey
  U    = Unknown

If, like me, you own different NVIDIA cards, just use the nvidia-open driver.

Once rebooted with the nvidia driver loaded, check that both cards are properly detected:

$ nvidia-smi 
Sat Jun 13 09:29:23 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02              KMD Version: 610.43.02     CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0  On |                  N/A |
|  0%   34C    P8             17W /  350W |   23646MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5080        On  |   00000000:08:00.0 Off |                  N/A |
|  0%   31C    P8             15W /  360W |   15861MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

llama.cpp #

I build llama.cpp with flags that support both card generations. The relevant flag is CMAKE_CUDA_ARCHITECTURES="86;120", which enables both the Ampere and Blackwell architectures. Note the -DGGML_CUDA_NCCL=OFF flag: I found out nccl was actually counterproductive, even if llama-server logs say otherwise.
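The author's full flag list is not reproduced here, so the following is only a sketch of what such a build could look like. It uses the two flags named above plus -DGGML_CUDA=ON, the standard switch for llama.cpp's CUDA backend. The arch_list and build_llama helper names are hypothetical; arch_list derives the architecture list from the compute capabilities reported by nvidia-smi (8.6 for the RTX 3090, 12.0 for the RTX 5080) rather than hard-coding it.

```shell
# Turn compute capabilities ("8.6", "12.0"), one per line on stdin, into the
# semicolon-separated list CMake expects ("86;120").
arch_list() {
    tr -d '.' | sort -n -u | paste -sd ';' -
}

# Configure and build; meant to be run from the root of a llama.cpp checkout.
# Assumption: only the flags below are needed, the original list may be longer.
build_llama() {
    archs=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | arch_list)
    cmake -B build -DGGML_CUDA=ON \
        -DCMAKE_CUDA_ARCHITECTURES="$archs" \
        -DGGML_CUDA_NCCL=OFF
    cmake --build build --config Release -j "$(nproc)"
}
```

Calling build_llama from the source tree runs the two cmake steps; on this machine the derived list should match the "86;120" value quoted above.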

Now to startup options:

llama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \
    -c 229376 \
    -np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \
    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
    -ctk q8_0 -ctv q8_0 --kv-unified \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \
    -sm tensor -ts 2,3 \
    --port 8001 --host 0.0.0.0

The sauce:

  • Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf: this model’s q8 quantization fits in the overall 39GB with a 230k context and KV-cache quant at q8!
  • --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3: the MTP speculative boost with a hint from ngram
  • -sm tensor: from the llama.cpp multi-GPU documentation
  • -ts 2,3: cards usage ratio, important to be able to fill up every VRAM corner!
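The 2,3 ratio lines up with the nominal VRAM sizes of the two cards (16GB and 24GB). Whether the author derived it that way is an assumption; the sketch below simply shows the arithmetic, with a hypothetical split_ratio helper that reduces two sizes to their smallest whole-number ratio.

```shell
# Reduce two VRAM sizes (in GB) to the smallest whole-number ratio,
# using Euclid's algorithm to find the greatest common divisor.
split_ratio() {
    a=$1
    b=$2
    x=$a
    y=$b
    while [ "$y" -ne 0 ]; do
        t=$((x % y))
        x=$y
        y=$t
    done
    echo "$((a / x)),$((b / x))"
}

split_ratio 16 24   # prints 2,3
```

Note that the order of the values follows llama.cpp's device order, which may not match the numbering nvidia-smi shows (CUDA enumerates the fastest device first by default), so it is worth checking the device list in the llama-server startup log before settling on a ratio.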

With this setup, I am able to run a full Qwen3.6 model quantized at q8 at a whopping 80+ tokens/sec; depending on the task it can go as high as 90+.

2673.12.803.689 I slot create_check: id  0 | task 45808 | created context checkpoint 1 of 32 (pos_min = 12, pos_max = 12, n_tokens = 13, size = 149.677 MiB)
2673.13.869.654 I reasoning-budget: deactivated (natural end)
2673.14.095.592 I slot print_timing: id  0 | task 45808 | n_decoded =    100, tg =  81.84 t/s
2673.17.131.165 I slot print_timing: id  0 | task 45808 | n_decoded =    388, tg =  91.13 t/s
2673.18.058.712 I slot print_timing: id  0 | task 45808 | prompt eval time =     219.76 ms /    17 tokens (   12.93 ms per token,    77.36 tokens per second)
2673.18.058.714 I slot print_timing: id  0 | task 45808 |        eval time =    5185.10 ms /   457 tokens (   11.35 ms per token,    88.14 tokens per second)
2673.18.058.715 I slot print_timing: id  0 | task 45808 |       total time =    5404.85 ms /   474 tokens
2673.18.058.716 I slot print_timing: id  0 | task 45808 |    graphs reused =      41669
2673.18.058.717 I slot print_timing: id  0 | task 45808 | draft acceptance = 0.77295 (  320 accepted /   414 generated)
2673.18.058.728 I statistics        ngram-mod: #calls(b,g,a) =  341  43646   1169, #gen drafts =   1169, #acc drafts =  1169, #gen tokens =  74496, #acc tokens = 44050, dur(b,g,a) = 1403.794, 706.959, 134.904 ms
2673.18.058.731 I statistics        draft-mtp: #calls(b,g,a) =  341  42477  42477, #gen drafts =  42477, #acc drafts = 36208, #gen tokens = 127431, #acc tokens = 86553, dur(b,g,a) = 0.158, 264947.885, 44.505 ms
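The draft acceptance figure in the log is simply accepted tokens divided by generated tokens. A small sketch to recompute it from the counters above follows; the acceptance helper is hypothetical, and reading "#acc tokens / #gen tokens" in the statistics lines as the equivalent cumulative ratio is an assumption about what those counters mean.

```shell
# Acceptance rate = accepted draft tokens / generated draft tokens.
acceptance() {
    awk -v acc="$1" -v gen="$2" 'BEGIN { printf "%.5f\n", acc / gen }'
}

acceptance 320 414        # per-request counters from the print_timing line
acceptance 86553 127431   # cumulative draft-mtp counters
acceptance 44050 74496    # cumulative ngram-mod counters
```

The first call reproduces the 0.77295 reported by llama-server; the other two give a rough idea of how each speculative source performs over the whole session.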

While your cards are computing, check they are actually running at full speed with the following command:

$ sudo lspci -vvv -s 07:00.0 | grep "LnkSta:"

For each PCIe port, you should see the following if you’re running the workload on an x16 slot split into 2 x8:

LnkSta:	Speed 16GT/s, Width x8 (downgraded)
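To check both cards in one go, the lspci call can be wrapped in a loop. This is a sketch with hypothetical helper names (parse_lnksta, check_links); the bus addresses in the usage line come from the nvidia-smi output earlier and will differ on other machines. 16GT/s is the PCIe Gen 4 per-lane rate, and "(downgraded)" is expected here because each card supports x16 but negotiates x8 on a split slot.

```shell
# Extract "speed width" from an lspci LnkSta line read on stdin,
# e.g. "LnkSta: Speed 16GT/s, Width x8 (downgraded)" -> "16GT/s x8".
parse_lnksta() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\).*Width \(x[0-9]*\).*/\1 \2/p'
}

# Print the negotiated link for each PCI address given as an argument.
check_links() {
    for dev in "$@"; do
        echo "$dev: $(sudo lspci -vvv -s "$dev" | parse_lnksta)"
    done
}
```

Usage: check_links 07:00.0 08:00.0. Run it while a generation is in progress, since idle GPUs typically drop to a lower link speed to save power.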
