Here's a fact that should stop every AI infrastructure engineer in their tracks: as of mid-2026, the de facto standard for serving the 671B-parameter DeepSeek-R1 in production still requires 8x H100 GPUs and roughly $200,000 of hardware. Meanwhile, an open-source project from MADSys Lab at Tsinghua University has been quietly running 236B-parameter MoE models on a single workstation since 2024, and has hit 286 tokens/s prefill on DeepSeek-R1 671B on commodity hardware. That project is kvcache-ai/ktransformers, and as of 2026-06-12 it has 17,264 Stars, 1,313 Forks, and an Apache-2.0 license.
The 2026 AI infrastructure conversation has been dominated by NVIDIA rack-scale systems and the ever-growing VRAM bill. KTransformers is the open-source counter-narrative: it lets you run frontier-class MoE models on a mix of consumer GPUs and CPU RAM, and it does this with five production-grade techniques that almost nobody talks about.
In 2026, Mixture-of-Experts (MoE) has become the default architecture for frontier open-weight models. DeepSeek-V3/R1, Qwen3-235B-A22B, Kimi-K2.5, GLM-4.7, and the new DeepSeek-V4-Flash are all MoE. The naive assumption is that MoE inference still needs H100-class GPUs, because the total parameter count is enormous (671B for DeepSeek-R1, 1T for Kimi-K2.5). But each token only activates a few experts, so the active parameter count is small. The CPU-GPU hybrid approach exploits this gap: it moves the "cold" experts to CPU RAM and keeps the "hot" experts on the GPU. KTransformers has turned this idea into a production framework that supports nine different MoE models as of v0.6.2 (released 2026-05-03). The 2026 ACM SIGOPS paper "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models" formally published the architecture.
What most people do: They treat the GPU as a black box and try to fit the entire MoE model into VRAM. When the model is too large, they either buy more GPUs or use a smaller model.
The hidden trick: KTransformers exposes four explicit expert placement strategies via the --kt-expert-placement-strategy flag. The frequency strategy records expert activation statistics, then places only the most frequently activated experts on the GPU while keeping cold experts in CPU RAM. You can also enable --kt-enable-dynamic-expert-update to redistribute experts at runtime when the prefill token count exceeds a threshold.
# Frequency-based placement, seeded from recorded activation statistics
python -m sglang.launch_server \
  --model /path/to/qwen3-next-80b \
  --kt-num-gpu-experts 8 \
  --kt-expert-placement-strategy frequency \
  --init-expert-location /path/to/activation_stats.pt

# Same launch, with dynamic expert update enabled (prefill threshold: 512 tokens)
python -m sglang.launch_server \
  --model /path/to/qwen3-next-80b \
  --kt-num-gpu-experts 8 \
  --kt-expert-placement-strategy frequency \
  --init-expert-location /path/to/activation_stats.pt \
  --kt-enable-dynamic-expert-update \
  --kt-gpu-prefill-token-threshold 512
The result: On Qwen3-Next-80B-A3B-Instruct-FP8 with 4x RTX 4090 + Intel Xeon Gold 6454S, the official benchmark table shows that at a 50% GPU expert ratio, the frequency strategy delivers 76.19 tokens/s, and dynamic-expert-update pushes that to 81.17 tokens/s (versus 65.25 tokens/s for the default uniform strategy). At an 80% GPU ratio, the frequency strategy hits 100.67 tokens/s.
Data sources: kvcache-ai/ktransformers GitHub (17,264 Stars, 1,313 Forks, Apache-2.0, last push 2026-06-07, v0.6.2 released 2026-05-03); benchmark table from doc/en/kt-kernel/experts-sched-Tutorial.md; HN "Show HN: KTransformers-236B Model and 1M Context LLM Inference" 20 points (story from 2024-08-29, 3 comments).
What most people do: They rebuild the KV cache from scratch for every request. For long-context workloads (a 100K token system prompt plus a 50K token conversation), this is a multi-minute cold start every single time.
The hidden trick: KTransformers' balance_serve engine implements a 3-layer KV cache hierarchy. Hot prefixes live on the GPU, warm prefixes live in CPU RAM, and cold prefixes live on disk. The attn.page_size and kvc2.cpu_memory_size_GB parameters control the split. Once you enable it, repeated requests that share a system prompt only compute the KV cache for the delta, not the full context.
attn:
  page_size: 16 # Size of a page in KV Cache
  chunk_size: 256
kvc2:
  gpu_only: false # false = Disk + CPU + GPU KV storage
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500 # Amount of CPU memory allocated for KV Cache
  disk_path: /mnt/data/kvc # Path to store KV Cache on disk
After editing the config, recompile with prefix cache mode enabled:
git submodule update --init --recursive

# Build with the balance_serve backend
USE_BALANCE_SERVE=1 bash ./install.sh

# Alternative: the same build with NUMA support enabled
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
The result: Multi-turn agent workflows and RAG pipelines with a stable system prompt reuse the cached prefix across thousands of requests. The CPU-GPU-Disk split means you can keep far more cached context than fits in GPU VRAM, with the disk layer acting as a transparent extension.
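
The "only compute the delta" behavior can be sketched in a few lines of Python. This is a simplified model that assumes reuse happens at whole-page granularity, with `PAGE_SIZE` mirroring the `attn.page_size` value from the config above; the function names are hypothetical and the real engine's matching logic may differ.

```python
from typing import List, Sequence, Tuple

PAGE_SIZE = 16  # mirrors attn.page_size in the config above


def split_pages(tokens: Sequence[int], page_size: int = PAGE_SIZE) -> List[Tuple[int, ...]]:
    """Split tokens into full pages; a trailing partial page is ignored in this sketch."""
    full = len(tokens) // page_size
    return [tuple(tokens[i * page_size:(i + 1) * page_size]) for i in range(full)]


def reusable_tokens(cached: Sequence[int], request: Sequence[int],
                    page_size: int = PAGE_SIZE) -> int:
    """Count the leading tokens of `request` that are covered by whole cached pages."""
    matched = 0
    for cached_page, request_page in zip(split_pages(cached, page_size),
                                         split_pages(request, page_size)):
        if cached_page != request_page:
            break  # the first differing page ends the shared prefix
        matched += page_size
    return matched


system_prompt = list(range(100))                       # 100 shared tokens
first = system_prompt + [1001, 1002, 1003]             # earlier request (already cached)
second = system_prompt + [2001, 2002, 2003, 2004]      # new request
hit = reusable_tokens(first, second)
print(f"reused={hit} tokens, recomputed={len(second) - hit} tokens")
```

With a 16-token page, 96 of the 100 shared tokens land in complete pages and are reused; only the last 8 tokens of the new request need fresh KV computation.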
Data sources: kvcache-ai/ktransformers GitHub (17,264 Stars, Apache-2.0); configuration format from doc/en/prefix_cache.md; release notes from doc/en/balance-serve.md documenting the v0.2.4 multi-concurrency architecture refactor.
What most people do: They run CPU matrix multiplications on AVX-512 instructions, which is the default in llama.cpp and most other inference stacks. On consumer CPUs, this caps MoE inference at 60-80 tokens/s.
The hidden trick: KTransformers v0.3+ ships native AMX (Intel Advanced Matrix Extensions) kernels for BF16 and INT8 quantization. AMX introduces 8 dedicated Tile registers (tmm0-tmm7) per CPU core, each holding up to 16 rows x 64 bytes. A single TDPBF16PS instruction performs 32,768 multiply-add operations in 16 CPU cycles, giving each core 2,048 multiply-add ops per cycle, which is 8x the throughput of AVX-512 on the same silicon.
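
The arithmetic behind those figures is easy to check. The short Python snippet below takes the numbers quoted above at face value (it does not independently measure anything) and derives the tile capacity and the per-cycle rate; the AVX-512 baseline is simply what the quoted 8x ratio implies.

```python
# Tile geometry quoted in the text
TILE_ROWS = 16
TILE_ROW_BYTES = 64
NUM_TILES = 8  # tmm0-tmm7

tile_bytes = TILE_ROWS * TILE_ROW_BYTES      # bytes held by one tile
tile_file_bytes = NUM_TILES * tile_bytes     # bytes across all eight tiles
bf16_per_tile = tile_bytes // 2              # BF16 values are 2 bytes each
int8_per_tile = tile_bytes // 1              # INT8 values are 1 byte each

# Throughput figures quoted in the text
ops_per_instruction = 32_768
cycles_per_instruction = 16
ops_per_cycle = ops_per_instruction // cycles_per_instruction
implied_avx512_ops_per_cycle = ops_per_cycle // 8  # from the quoted 8x ratio

print(f"tile: {tile_bytes} B, all tiles: {tile_file_bytes} B")
print(f"per tile: {bf16_per_tile} BF16 or {int8_per_tile} INT8 values")
print(f"ops/cycle: {ops_per_cycle}, implied AVX-512 baseline: {implied_avx512_ops_per_cycle}")
```

Each tile is 1 KiB, so the whole tile file is 8 KiB per core, which is why AMX kernels are written around small blocked sub-matrices that fit in those registers.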
# Build with the balance_serve backend
USE_BALANCE_SERVE=1 bash ./install.sh

# Launch the server with the AMX optimization rules
python ktransformers/server/main.py \
  --architectures Qwen3MoeForCausalLM \
  --model_path <model_dir> \
  --gguf_path <gguf_dir> \
  --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml \
  --backend_type balance_serve
The result: On a workstation with Xeon 4th Gen + RTX 4090, KTransformers with AMX hits 347 tokens/s prefill on Qwen3MoE-235B-A22B. On a consumer i9-14900KF + DDR5-4000, the smaller 30B-A3B variant runs smoothly, with a high-end gaming laptop as the lower bound. KTransformers also offers an AVX2-only path for non-AMX CPUs (selected with --kt-method), making the same MoE inference stack usable across Sapphire Rapids servers, EPYC workstations, and consumer desktops.
Data sources: kvcache-ai/ktransformers GitHub (17,264 Stars, Apache-2.0); AMX instruction details and 347 tokens/s prefill benchmark from doc/en/AMX.md; Intel AMX intrinsic reference from the same doc; HN "Show HN: KTransformers-671B DeepSeek-R1 on a Single Machine" 14 points (story from 2025-02-10, 0 comments at time of indexing).
What most people do: They run inference with a single request at a time, treating the LLM like a batch script. Throughput is limited to whatever one user can squeeze out of the GPU.
The hidden trick: KTransformers v0.2.4 introduced balance_serve, an SGLang-inspired C++ engine with three architectural layers: Server (handles OpenAI-compatible HTTP), Inference Engine (executes chunked prefill), and Scheduler (continuous batching in FCFS order). Combined with custom flashinfer kernels and variable batch size CUDA Graphs, this design lifts aggregate throughput by 130% under 4-way concurrency on DeepSeek-R1 0528. Intel engineers validated it on Xeon6 + MRDIMM-8800, going from 17 tokens/s single-user to 40 tokens/s aggregate output throughput, with the bottleneck shifting to the GPU side.
# Pull and start the v0.2.4 AVX512 image
docker pull approachingai/ktransformers:v0.2.4-AVX512
docker run -it --gpus all --privileged --shm-size 64g \
  --name ktrans --network=host -v /mnt:/mnt \
  approachingai/ktransformers:v0.2.4-AVX512 /bin/bash

# Open another shell in the running container
docker exec -it ktrans bash

# Inside the container: start the server with the balance_serve backend
python ktransformers/server/main.py \
  --architectures Qwen3MoeForCausalLM \
  --model_path <model_dir> \
  --gguf_path <gguf_dir> \
  --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
  --backend_type balance_serve

# Fire 4 concurrent streaming requests (adjust host and port to match your server)
for i in 1 2 3 4; do
  curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"model-name","messages":[{"role":"user","content":"Hello!"}],"stream":true}' &
done
wait
The result: A single KTransformers server can now serve an entire team's interactive LLM workloads. On a Xeon6 + MRDIMM-8800 testbed, the multi-concurrency path bumped total output throughput from 17 tokens/s to 40 tokens/s, a 2.35x lift, by amortizing GPU cost across concurrent users. The OpenAI-compatible /v1/chat/completions API means existing tooling (LangChain, LlamaIndex, Cursor, Continue.dev) drops in unchanged.
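
For readers who have not met continuous batching before, the toy Python simulation below shows the scheduling idea named above: requests are admitted in FCFS order, prompts are prefilled in fixed-size chunks, and a finished request frees its slot immediately for the next one in the queue. It is a conceptual sketch only. The `Request` class, the `schedule` function, and the one-step-per-chunk timing model are assumptions for illustration, not the balance_serve C++ implementation.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    name: str
    prompt_tokens: int   # prompt tokens still to prefill
    output_tokens: int   # output tokens still to decode


def schedule(requests: List[Request], max_batch: int, chunk_size: int) -> List[List[str]]:
    """Simulate FCFS continuous batching; returns the active request names per step."""
    waiting: Deque[Request] = deque(requests)  # FCFS queue, arrival order preserved
    running: List[Request] = []
    steps: List[List[str]] = []
    while waiting or running:
        # Admit waiting requests in arrival order whenever a batch slot is free.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps.append([r.name for r in running])
        for r in running:
            if r.prompt_tokens > 0:
                # Chunked prefill: at most chunk_size prompt tokens per step.
                r.prompt_tokens -= min(chunk_size, r.prompt_tokens)
            else:
                r.output_tokens -= 1  # decode one token
        # Finished requests leave now, so the next step can refill their slots.
        running = [r for r in running if r.prompt_tokens > 0 or r.output_tokens > 0]
    return steps


steps = schedule(
    [Request("a", 512, 2), Request("b", 256, 1), Request("c", 256, 1)],
    max_batch=2,
    chunk_size=256,
)
for i, batch in enumerate(steps, start=1):
    print(f"step {i}: {batch}")
```

Request "c" joins the batch the moment "b" finishes, without waiting for "a" to complete. That slot refilling is what keeps the GPU busy and is the intuition behind the aggregate-throughput gain.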
Data sources: kvcache-ai/ktransformers GitHub (17,264 Stars, 1,313 Forks, Apache-2.0); 130% throughput gain and 17 to 40 tokens/s benchmark from doc/en/balance-serve.md; v0.2.4 release notes from the same doc; HN "Show HN: KTransformers-236B Model and 1M Context LLM Inference on Local Machines" 20 points (2024-08-29).
What most people do: They fine-tune MoE models with ZeRO-Offload in DeepSpeed. It works, but the CPU offload makes training painfully slow because every optimizer step shuttles hundreds of GB of gradients through the PCIe bus.
The hidden trick: KTransformers v0.6.1 ships a ktransformers[sft] extra that integrates directly with LLaMA-Factory. The integration uses KT-Kernel's CPU-optimized INT8/INT4 quantization on the optimizer states, plus FSDP2 with intelligent sharding. The result is a 6-12x training speedup over ZeRO-Offload in benchmarked MoE SFT workloads, with roughly half the CPU memory.
# Create an isolated environment
conda create -n kt-sft python=3.11 -y
conda activate kt-sft

# Install PyTorch 2.9.1 from the CUDA 13.0 wheel index
pip install --extra-index-url https://download.pytorch.org/whl/cu130 \
  torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1

# Install LLaMA-Factory and its KTransformers requirements
cd /path/to/LLaMA-Factory
pip install -e .
pip install -r requirements/ktransformers.txt

# Launch LoRA SFT on 4 GPUs with the FSDP2 + KT INT8 config
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
  --config_file examples/ktransformers/accelerate/fsdp2_kt_int8.yaml \
  src/train.py \
  examples/ktransformers/train_lora/qwen3_5moe_lora_sft_kt.yaml
The result: On DeepSeek-V3 and DeepSeek-R1, KT SFT runs at 3.7 it/s with ~80GB total GPU memory on 4x RTX 4090. Qwen3-30B-A3B trains at 8+ it/s on a single RTX 4090 with ~24GB total. This makes it feasible to fine-tune frontier MoE models on one to four consumer-grade GPUs instead of an 8x H100 cluster.
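
As background on what INT8 quantization means in this context, here is a generic symmetric per-tensor scheme in plain Python. It is a textbook sketch, not KT-Kernel's CPU kernels, and the function names are hypothetical; it only shows how floats map to 8-bit codes plus one scale factor and back.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max_error={max_error:.6f}")
```

Each value shrinks from 4 bytes (FP32) to 1 byte plus a shared scale, at the cost of a rounding error bounded by half the scale.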
Data sources: kvcache-ai/ktransformers GitHub (17,264 Stars, Apache-2.0); 6-12x speedup claim and 3.7 it/s / 8+ it/s benchmarks from doc/en/SFT/KTransformers-Fine-Tuning_Quick-Start.md and the SFT introduction in the main README; integration PR at hiyouga/LLaMA-Factory#10430; HN Show HN 20+14 points across the two launch stories (2024-08-29 and 2025-02-10).
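Five production-grade techniques turn KTransformers from a research curiosity into a 2026 AI infrastructure workhorse: frequency-based expert placement across GPU and CPU, a three-layer GPU/CPU/disk prefix KV cache, native AMX kernels, multi-concurrency serving with balance_serve, and LLaMA-Factory-integrated MoE fine-tuning.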
If you have read the other articles in this series, these will feel familiar: "Agent Skills: 5 Hidden Uses in 49K Stars of Workflow Magic" shows a similar "framework hides 5 production tricks" pattern for engineering skills; "MemPalace: 5 Hidden Uses That Make It the Best-Benchmarked AI Memory System" tackles memory infrastructure with comparable depth; and "Goose's 5 Hidden Uses That Turn It Into a Production AI Agent Stack" demonstrates the same "production tricks" pattern for the agent orchestration layer.
What is the most underrated MoE inference optimization you have hit in 2026? Drop a comment with the throughput number, the hardware, and the model, and we will dig into it in a future article.