# RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

> Source: <https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/>
> Published: 2026-06-13 09:55:32+00:00

A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.

Fast forward to 2026: with Qwen 3.5, Gemma and Qwen 3.6 around, I needed more than 16GB. So I got myself a refurbished RTX 3090 with 24GB. I could then run Qwen 3.6 Q4 quants, first at ~30 tok/s, then at 50-60 with MTP. Not bad, but it still felt limited, and my 5080 was barely used.

So I began digging into what kind of setup could take advantage of those 2 cards together. I already had DDR4 sticks and SSDs ready; I only needed a mobo capable of handling the two cards.

Enter the Asus Prime X570-Pro. The “Pro” is important: it is what ensures the x16 PCIe link can be split into 2x8.

The 5080 being the monster it is, I bought a good-quality PCIe 4 riser to plug it into the second slot.

## BIOS

The BIOS part was more complex than I anticipated. First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents the use of both cards and requires unnecessary kernel parameter trickery even for one of them.

The parameters that should be set:

- Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled
- Go to the Advanced tab -> PCI Subsystem Settings
- Set Above 4G Decoding to Enabled
- Set ReSize BAR Support to Auto or Enabled.
- Still on the Advanced tab -> PCIEX16_1 Link Mode: Gen 4
- PCIEX16_2 Link Mode: Gen 4
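
Once CSM is disabled, it may be worth confirming from the running system that the OS really booted in UEFI mode. A minimal sketch, assuming a Linux system, where the kernel only creates `/sys/firmware/efi` on UEFI boots; the `boot_mode` helper name is made up for this example:

```bash
# boot_mode [DIR]: print "UEFI" if the EFI firmware directory exists, "BIOS" otherwise.
# DIR defaults to /sys/firmware/efi, which Linux only creates on UEFI boots.
boot_mode() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}
```

Calling `boot_mode` without arguments checks the real path; the optional argument is only there to make the helper easy to try out.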

## kernel

NVidia documentation is a mess. Here’s the link to the driver installation procedure, yes, with `/tesla` in the URL, because why not: [https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html)

The two GPUs being different models, I unfortunately can’t set up this beauty: [https://github.com/aikitoria/open-gpu-kernel-modules](https://github.com/aikitoria/open-gpu-kernel-modules)
I tested it and the feature gets enabled, but it was clear from the start that it would likely fail with different GPUs, let alone different generations.

Nevertheless for the lucky readers owning 2 cards of the same type, once the patched driver is built / installed, don’t forget to:

- Uninstall `nvidia-dkms-open`
- **Blacklist** the new `nova` driver

Only then will the freshly patched driver load at boot. You should see the following:

``` bash
$ nvidia-smi topo -p2p r
 	GPU0	GPU1	
 GPU0	X	OK	
 GPU1	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  DR   = Disabled by regkey
  U    = Unknown
```
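
For the blacklist step mentioned above, the usual mechanism is a `blacklist <module>` line in a file under `/etc/modprobe.d/`. The sketch below only generates those lines. The module name `nova_core` used in the example comment is an assumption, not something taken from this setup, so check what is actually loaded on your kernel first; `blacklist_conf` and the file name are hypothetical too.

```bash
# blacklist_conf MODULE...: print one modprobe.d "blacklist" line per module name.
blacklist_conf() {
    for module in "$@"; do
        printf 'blacklist %s\n' "$module"
    done
}

# Example (module name is an assumption, verify it with `lsmod | grep -i nova`):
#   blacklist_conf nova_core | sudo tee /etc/modprobe.d/blacklist-nova.conf
```

Depending on the distribution, the initramfs may also need to be regenerated before the blacklist takes effect at boot.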

If, like me, you own different NVidia cards, just use the `nvidia-open` driver.

Once rebooted with the `nvidia` driver loaded, check that both cards are properly detected:

``` bash
$ nvidia-smi 
Sat Jun 13 09:29:23 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02              KMD Version: 610.43.02     CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0  On |                  N/A |
|  0%   34C    P8             17W /  350W |   23646MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5080        On  |   00000000:08:00.0 Off |                  N/A |
|  0%   31C    P8             15W /  360W |   15861MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```
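
The same information can be pulled in a script-friendly form with `nvidia-smi --query-gpu`. A minimal sketch, assuming the CSV output uses `, ` as separator; the `vram_free` helper is made up for this example and simply subtracts used from total memory:

```bash
# vram_free: read "index, name, memory.total, memory.used" CSV rows (MiB, no units)
# on stdin and print the free VRAM for each GPU.
vram_free() {
    awk -F', ' '{ printf "GPU%s %s: %d MiB free\n", $1, $2, $3 - $4 }'
}

# Real usage:
#   nvidia-smi --query-gpu=index,name,memory.total,memory.used \
#       --format=csv,noheader,nounits | vram_free
```

With the numbers from the table above, the 3090 line would read `GPU0 NVIDIA GeForce RTX 3090: 930 MiB free`, which shows how tightly the model and its context are packed.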

## llama.cpp

These are the build flags I use to support both card generations:

```
# cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES="86;120" -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA_NCCL=OFF
```

The relevant flag is `CMAKE_CUDA_ARCHITECTURES="86;120"`, which enables both the *Ampere* and *Blackwell* architectures. Note the `-DGGML_CUDA_NCCL=OFF` flag: I found out `nccl` was actually counterproductive, even if the `llama-server` logs say otherwise.
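
The `86` and `120` values are the cards' compute capabilities (8.6 for the RTX 3090, 12.0 for the RTX 5080) with the dot removed. A minimal sketch to derive the value for any mix of cards, assuming a driver recent enough for `nvidia-smi` to support the `compute_cap` query field; `cuda_archs` is a made-up helper name:

```bash
# cuda_archs: read compute capabilities (one per line, e.g. "8.6") on stdin and
# print them as a CMAKE_CUDA_ARCHITECTURES value, sorted and de-duplicated.
cuda_archs() {
    tr -d '. ' | sort -n -u | paste -sd ';' -
}

# Real usage:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | cuda_archs
```

Fed with `8.6` and `12.0`, it prints `86;120`, the value used in the build command above.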

Now to startup options:

```
llama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \
    -c 229376 \
    -np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \
    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
    -ctk q8_0 -ctv q8_0 --kv-unified \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \
    -sm tensor -ts 2,3 \
    --port 8001 --host 0.0.0.0
```

The sauce:

- [Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF): this model’s `q8` quantization fits in the overall 39GB with a `230k` context and KV-cache quant at `q8`!
- `--spec-type ngram-mod,draft-mtp --spec-draft-n-max 3`: the MTP speculative boost with a hint from `ngram`
- `-sm tensor`: from the [llama.cpp multi-GPUs documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md)
- `-ts 2,3`: cards usage ratio, important to be able to fill up every VRAM corner!
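
The `2,3` ratio matches 16GB:24GB, which suggests it was simply derived from the VRAM sizes; that reading is an assumption on my part. A minimal sketch of that arithmetic follows, with `gcd` and `split_ratio` as made-up helper names. Keep in mind that the values apply in llama.cpp's device order, which can differ from the PCI bus order shown by `nvidia-smi`, since CUDA enumerates the fastest device first by default; `llama-server --list-devices` should show the order actually used.

```bash
# gcd A B: greatest common divisor, using Euclid's algorithm.
gcd() {
    a=$1; b=$2
    while [ "$b" -ne 0 ]; do
        t=$((a % b)); a=$b; b=$t
    done
    echo "$a"
}

# split_ratio A B: reduce two VRAM sizes (in GB) to the smallest integer ratio "x,y".
split_ratio() {
    g=$(gcd "$1" "$2")
    echo "$(( $1 / g )),$(( $2 / g ))"
}

split_ratio 16 24   # prints 2,3
```

This is only a starting point: the KV cache and compute buffers are not necessarily spread the same way as the weights, so some trial and error on the ratio is to be expected.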

## Result

With this setup, I am able to run a full [Qwen3.6](https://github.com/QwenLM/Qwen3.6) model quantized at `q8`, at a whopping 80+ tokens/sec; depending on the task, it can go as high as 90+.

```
2673.12.803.689 I slot create_check: id  0 | task 45808 | created context checkpoint 1 of 32 (pos_min = 12, pos_max = 12, n_tokens = 13, size = 149.677 MiB)
2673.13.869.654 I reasoning-budget: deactivated (natural end)
2673.14.095.592 I slot print_timing: id  0 | task 45808 | n_decoded =    100, tg =  81.84 t/s
2673.17.131.165 I slot print_timing: id  0 | task 45808 | n_decoded =    388, tg =  91.13 t/s
2673.18.058.712 I slot print_timing: id  0 | task 45808 | prompt eval time =     219.76 ms /    17 tokens (   12.93 ms per token,    77.36 tokens per second)
2673.18.058.714 I slot print_timing: id  0 | task 45808 |        eval time =    5185.10 ms /   457 tokens (   11.35 ms per token,    88.14 tokens per second)
2673.18.058.715 I slot print_timing: id  0 | task 45808 |       total time =    5404.85 ms /   474 tokens
2673.18.058.716 I slot print_timing: id  0 | task 45808 |    graphs reused =      41669
2673.18.058.717 I slot print_timing: id  0 | task 45808 | draft acceptance = 0.77295 (  320 accepted /   414 generated)
2673.18.058.728 I statistics        ngram-mod: #calls(b,g,a) =  341  43646   1169, #gen drafts =   1169, #acc drafts =  1169, #gen tokens =  74496, #acc tokens = 44050, dur(b,g,a) = 1403.794, 706.959, 134.904 ms
2673.18.058.731 I statistics        draft-mtp: #calls(b,g,a) =  341  42477  42477, #gen drafts =  42477, #acc drafts = 36208, #gen tokens = 127431, #acc tokens = 86553, dur(b,g,a) = 0.158, 264947.885, 44.505 ms
```
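
The headline figures in this log can be recomputed from the raw counters, which is a handy sanity check when comparing runs. A minimal sketch using `awk` for the floating-point math; `tok_per_sec` and `acceptance` are made-up helper names, and the inputs are copied from the `print_timing` lines above:

```bash
# tok_per_sec MS TOKENS: throughput in tokens/second from a duration in milliseconds.
tok_per_sec() {
    awk -v ms="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

# acceptance ACCEPTED GENERATED: fraction of drafted tokens that were accepted.
acceptance() {
    awk -v a="$1" -v g="$2" 'BEGIN { printf "%.5f\n", a / g }'
}

tok_per_sec 5185.10 457   # eval time: prints 88.14
tok_per_sec 219.76 17     # prompt eval time: prints 77.36
acceptance 320 414        # draft acceptance: prints 0.77295
```

An acceptance rate around 0.77 means roughly three out of four drafted tokens are kept, which is where the jump from 50-60 to 80+ tok/s comes from.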

While your cards are computing, check they are actually running at full speed with the following command:

``` bash
$ sudo lspci -vvv -s 07:00.0 | grep "LnkSta:"
```

For each PCIe port, you should see:

```
LnkSta:	Speed 16GT/s, Width x8 (downgraded)
```

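That is what to expect if you’re running the workload on an x16 link split into 2x8: `16GT/s` is the PCIe Gen 4 rate, and `Width x8 (downgraded)` reflects the split.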
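
To check both cards in one go, the `LnkSta` line can be reduced to just speed and width. A minimal sketch, assuming the `lspci -vvv` line format shown above (some `lspci` versions add a note such as `(ok)` after the speed, which the pattern skips); `link_status` is a made-up helper name:

```bash
# link_status: read `lspci -vvv` output on stdin and print "<speed> <width>"
# for every LnkSta line found.
link_status() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\)[^,]*, Width \(x[0-9]*\).*/\1 \2/p'
}

# Real usage, with the bus IDs reported by nvidia-smi:
#   for dev in 07:00.0 08:00.0; do sudo lspci -vvv -s "$dev" | link_status; done
```

Both cards should report `16GT/s x8` under load. A lower speed at idle is not necessarily a problem, as GPUs commonly drop the link speed to save power, which is why the check has to run while the cards are computing.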
