{"slug": "rtx-5080-and-rtx-3090-setup-80-tok-s-on-qwen-3-6-27b-q8", "title": "RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8", "summary": "A user combined an RTX 5080 and RTX 3090 on an Asus Prime X570-Pro motherboard to run Qwen 3.6 27B Q8 at over 80 tokens per second. The setup required disabling CSM, enabling Above 4G Decoding and ReSize BAR, and using the nvidia-open driver due to different GPU generations. This demonstrates a cost-effective approach to high-performance local LLM inference using heterogeneous NVIDIA GPUs.", "body_md": "# RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8\n\nA year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.\n\nFast forward to 2026: with Qwen 3.5, Gemma, and Qwen 3.6, I needed more than 16GB. So I got myself a refurbished RTX 3090 with 24GB. I could then run Qwen 3.6 Q4 quants, first at ~30 tok/s, then 50-60 with MTP. Not bad. But it still felt limited while my 5080 was barely used.\n\nSo I began digging into what kind of setup could take advantage of those 2 cards together. I already had DDR4 sticks and SSDs ready, so I only needed a mobo capable of handling the two cards.\n\nEnter the Asus Prime X570-Pro. The “Pro” is important: it is what ensures the 16x PCIe lanes can be split into 2x8.\n\nThe 5080 being the monster it is, I bought a good quality PCIe 4 riser to plug it into the second slot.\n\n## BIOS\n\nThe BIOS part was more complex than I anticipated. 
First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents using both cards and requires unnecessary kernel parameter trickery even for one of them.\n\nThe parameters that should be set:\n\n- Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled\n- Go to the Advanced tab -> PCI Subsystem Settings\n- Set Above 4G Decoding to Enabled\n- Set ReSize BAR Support to Auto or Enabled.\n- Still on the Advanced tab -> PCIEX16_1 Link Mode: Gen 4\n- PCIEX16_2 Link Mode: Gen 4\n\n## kernel\n\nNVidia documentation is a mess. Here’s the link to the driver installation procedure, yes, with `/tesla` in the URL, because why not: [https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html)\n\nThe two GPUs being different models, I unfortunately can’t set up this beauty: [https://github.com/aikitoria/open-gpu-kernel-modules](https://github.com/aikitoria/open-gpu-kernel-modules)\nI tested it and the feature is enabled, but it was clear from the start that it would likely fail with different GPUs, let alone different generations.\n\nNevertheless, for the lucky readers owning 2 cards of the same type, once the patched driver is built / installed, don’t forget to:\n\n- Uninstall `nvidia-dkms-open`\n- **Blacklist** the new `nova` driver\n\nOnly then will the freshly patched driver load at boot. 
You should see the following:\n\n``` bash\n$ nvidia-smi topo -p2p r\n \tGPU0\tGPU1\t\n GPU0\tX\tOK\t\n GPU1\tOK\tX\t\n\nLegend:\n\n  X    = Self\n  OK   = Status Ok\n  CNS  = Chipset not supported\n  GNS  = GPU not supported\n  TNS  = Topology not supported\n  NS   = Not supported\n  DR   = Disabled by regkey\n  U    = Unknown\n```\n\nIf, like me, you own different NVidia cards, just use the `nvidia-open` driver.\n\nOnce rebooted with the `nvidia` driver loaded, check that both cards are properly detected:\n\n``` bash\n$ nvidia-smi \nSat Jun 13 09:29:23 2026       \n+-----------------------------------------------------------------------------------------+\n| NVIDIA-SMI 610.43.02              KMD Version: 610.43.02     CUDA UMD Version: 13.3     |\n+-----------------------------------------+------------------------+----------------------+\n| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |\n|                                         |                        |               MIG M. 
|\n|=========================================+========================+======================|\n|   0  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0  On |                  N/A |\n|  0%   34C    P8             17W /  350W |   23646MiB /  24576MiB |      0%      Default |\n|                                         |                        |                  N/A |\n+-----------------------------------------+------------------------+----------------------+\n|   1  NVIDIA GeForce RTX 5080        On  |   00000000:08:00.0 Off |                  N/A |\n|  0%   31C    P8             15W /  360W |   15861MiB /  16303MiB |      0%      Default |\n|                                         |                        |                  N/A |\n+-----------------------------------------+------------------------+----------------------+\n\n+-----------------------------------------------------------------------------------------+\n| Processes:                                                                              |\n|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |\n|        ID   ID                                                               Usage      |\n|=========================================================================================|\n+-----------------------------------------------------------------------------------------+\n```\n\n## llama.cpp\n\nThese are the build flags I use to support both card generations:\n\n```\n# cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=\"86;120\" -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA_NCCL=OFF\n```\n\nThe relevant flag is `CMAKE_CUDA_ARCHITECTURES=\"86;120\"`, which enables both the *Ampere* and *Blackwell* architectures. 
Note the `-DGGML_CUDA_NCCL=OFF` flag: I found out `nccl` was actually counterproductive, even if `llama-server` logs say otherwise.\n\nNow to the startup options:\n\n```\nllama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \\\n    -c 229376 \\\n    -np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \\\n    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \\\n    -ctk q8_0 -ctv q8_0 --kv-unified \\\n    --chat-template-kwargs '{\"preserve_thinking\": true}' \\\n    --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \\\n    -sm tensor -ts 2,3 \\\n    --port 8001 --host 0.0.0.0\n```\n\nThe sauce:\n\n- [Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF): this model’s `q8` quantization fits in the overall 39GB with a `230k` context and KV-cache quant at `q8`!\n- `--spec-type ngram-mod,draft-mtp --spec-draft-n-max 3`: the MTP speculative boost with a hint from `ngram`\n- `-sm tensor`: from the [llama.cpp multi-GPUs documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md)\n- `-ts 2,3`: cards usage ratio, important to be able to fill up every VRAM corner!\n\n## Result\n\nWith this setup, I am able to run a full [Qwen3.6](https://github.com/QwenLM/Qwen3.6) model quantized at `q8`, at a whopping 80+ tokens/sec. Depending on the task, it can go as high as 90+.\n\n```\n2673.12.803.689 I slot create_check: id  0 | task 45808 | created context checkpoint 1 of 32 (pos_min = 12, pos_max = 12, n_tokens = 13, size = 149.677 MiB)\n2673.13.869.654 I reasoning-budget: deactivated (natural end)\n2673.14.095.592 I slot print_timing: id  0 | task 45808 | n_decoded =    100, tg =  81.84 t/s\n2673.17.131.165 I slot print_timing: id  0 | task 45808 | n_decoded =    388, tg =  91.13 t/s\n2673.18.058.712 I slot print_timing: id  0 | task 45808 | prompt eval time =     219.76 ms /    17 tokens (   12.93 ms per token,    77.36 tokens 
per second)\n2673.18.058.714 I slot print_timing: id  0 | task 45808 |        eval time =    5185.10 ms /   457 tokens (   11.35 ms per token,    88.14 tokens per second)\n2673.18.058.715 I slot print_timing: id  0 | task 45808 |       total time =    5404.85 ms /   474 tokens\n2673.18.058.716 I slot print_timing: id  0 | task 45808 |    graphs reused =      41669\n2673.18.058.717 I slot print_timing: id  0 | task 45808 | draft acceptance = 0.77295 (  320 accepted /   414 generated)\n2673.18.058.728 I statistics        ngram-mod: #calls(b,g,a) =  341  43646   1169, #gen drafts =   1169, #acc drafts =  1169, #gen tokens =  74496, #acc tokens = 44050, dur(b,g,a) = 1403.794, 706.959, 134.904 ms\n2673.18.058.731 I statistics        draft-mtp: #calls(b,g,a) =  341  42477  42477, #gen drafts =  42477, #acc drafts = 36208, #gen tokens = 127431, #acc tokens = 86553, dur(b,g,a) = 0.158, 264947.885, 44.505 ms\n```\n\nWhile your cards are computing, check they are actually running at full speed with the following command:\n\n``` bash\n$ sudo lspci -vvv -s 07:00.0 | grep \"LnkSta:\"\n```\n\nFor each PCIe port, you should see:\n\n```\nLnkSta:\tSpeed 16GT/s, Width x8 (downgraded)\n```\n\nIf you’re running the workload on a 16x/2 split.", "url": "https://wpnews.pro/news/rtx-5080-and-rtx-3090-setup-80-tok-s-on-qwen-3-6-27b-q8", "canonical_source": "https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/", "published_at": "2026-06-13 09:55:32+00:00", "updated_at": "2026-06-13 10:20:07.192894+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "developer-tools"], "entities": ["NVIDIA", "RTX 5080", "RTX 3090", "Asus Prime X570-Pro", "Qwen 3.6", "CUDA"], "alternates": {"html": "https://wpnews.pro/news/rtx-5080-and-rtx-3090-setup-80-tok-s-on-qwen-3-6-27b-q8", "markdown": "https://wpnews.pro/news/rtx-5080-and-rtx-3090-setup-80-tok-s-on-qwen-3-6-27b-q8.md", "text": 
"https://wpnews.pro/news/rtx-5080-and-rtx-3090-setup-80-tok-s-on-qwen-3-6-27b-q8.txt", "jsonld": "https://wpnews.pro/news/rtx-5080-and-rtx-3090-setup-80-tok-s-on-qwen-3-6-27b-q8.jsonld"}}