Kog hits 3K t/s on MI300X, no kernel switches — test it now

Kog AI achieved over 3,000 output tokens per second per request for an FP16 2B model on a single 8× MI300X node using a monokernel that eliminates per-token kernel launches. The technique collapses the entire LLM decode loop into one persistent GPU-resident program, avoiding CPU interaction and reducing synchronization latency. A hosted playground is available for testing, while the HIP replication requires deep AMD GPU expertise.

AMD's MI300X has long had more single-request inference headroom than the default ROCm stack exposes. A Paris startup just showed how much, by deleting the per-token kernel launch entirely.

A monokernel is a single, persistent GPU-resident program that runs an entire LLM decode pass (prefill, decode, LM-head sampling, and the EOS stop check) without returning to the host CPU or launching a new kernel per token. Kog AI reports 3,000+ output tokens/s per request for an FP16 2B model at batch size 1 on one 8× MI300X node. The monokernel is the engine behind the Kog Inference Engine tech preview, launched 28 May 2026. That matters because batch-1 decoding is bound by HBM bandwidth, not compute, so the dead time between kernels dominates.

Quick Answer: Standard MI300X stacks launch a fresh GPU kernel for every stage of every token, each paying ~4.5 μs launch overhead plus HBM restart latency. Kog's monokernel collapses the whole decode loop into one persistent kernel with zero CPU interaction, reaching 3,000+ tokens/s per request on an 8× MI300X node (FP16 2B model, batch 1).

Conventional stacks (vLLM, SGLang, ROCm/HIP pipelines) launch a fresh kernel for every stage of every token. Kog quantifies the recurring tax that the monokernel removes:

| Overhead source | Cost per occurrence |
|---|---|
| Kernel launch per stage | ~4.5 μs |
| HBM latency on each memory-load restart | ~0.5 μs |
| Intermediate tensor materialization round-trip to HBM | 1 μs |

Synchronization is rebuilt to match. Instead of atomic arrival counters, buffers initialize to NaN and consumers poll until real data appears. This sentinel-value polling cuts sync latency from ~7.8 μs to ~0.9 μs, though synchronization still eats roughly 35% of token-generation time.

Is the peak number solid?
A topology-tuned variant that groups compute units by HBM die adjacency is cited at 3,300 tokens/s, but that figure comes from a secondary report rather than the primary blog, which states 3,000+. Treat the exact peak cautiously: this is single-vendor, self-reported data with no independent benchmark yet.

There are exactly two ways to engage with Kog's work today, and they sit at opposite ends of the effort spectrum. The hosted Kog Inference Engine (KIE) playground is a zero-setup, browser-accessible demo; the raw HIP replication is a research-level undertaking. For nearly every developer, the playground is the only immediately actionable option. The HIP path is not a weekend project.

The playground at playground.kog.ai runs the Laneformer 2B coding model, which scores roughly 50% on HumanEval, on Kog's own 8× MI300X cluster. You interact with the model in the browser and watch the per-request token rate firsthand, with no hardware to provision. It is the fastest way to verify the latency claim with your own prompts.

The HIP replication path is a different category of work. To reproduce the monokernel you need an AMD Instinct GPU, a ROCm 6.x stack, and deep HIP/assembly experience. The implementation required hand-written inline assembly for atomics on 3-dword types, manual register-pressure management (LICM, instruction inspection), and a custom cross-GPU timestamp profiling harness synced via the HSA API.

Crucially, as of June 2026 there is no open-source kernel and no pip package. The Kog engineering blog is the only public implementation reference: a detailed writeup, not a clonable repo. If you want the technique, you reimplement it from the prose. Start at the playground, then escalate to HIP only if you need the technique itself.
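One piece a from-scratch replication has to rebuild is the sentinel-value synchronization described earlier: buffers start as NaN and consumers spin until real data appears. The sketch below is a host-side Python model of that idea only, with a thread standing in for a producer compute unit. It is not Kog's kernel code, and the names `make_buffer`, `producer`, `consume`, and `run_demo` are hypothetical.

```python
import math
import threading
import time


def make_buffer(n):
    """Create a buffer where every slot holds NaN, meaning 'not written yet'."""
    return [math.nan] * n


def producer(buf, values, delay_s=0.001):
    """Write real values into the buffer one slot at a time."""
    for i, value in enumerate(values):
        time.sleep(delay_s)  # simulate the time spent computing the value
        buf[i] = value       # overwriting the NaN sentinel publishes the result


def consume(buf, timeout_s=5.0):
    """Poll each slot until its sentinel is replaced, then read the value."""
    deadline = time.monotonic() + timeout_s
    out = []
    for i in range(len(buf)):
        while math.isnan(buf[i]):  # spin: no counter, no lock, only the data
            if time.monotonic() > deadline:
                raise TimeoutError(f"slot {i} was never written")
            time.sleep(0)          # yield so the producer thread can run
        out.append(buf[i])
    return out


def run_demo():
    """Run one producer thread and consume its output in the calling thread."""
    values = [1.0, 2.5, -3.0, 0.0]
    buf = make_buffer(len(values))
    worker = threading.Thread(target=producer, args=(buf, values))
    worker.start()
    result = consume(buf)
    worker.join()
    return result


print(run_demo())
```

The scheme only works when NaN can never be a legitimate value in the buffer, so that property is worth confirming before reusing the pattern for other data.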
The zero-setup path is [playground.kog.ai](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/): open the page, submit a coding prompt, and watch the per-request token counter in the response UI. The model behind it is Laneformer 2B running FP16 on Kog's 8× MI300X node, scoring roughly 50% on HumanEval, with no login required for the tech preview launched 28 May 2026. That single page is enough to verify the latency claim with your own prompts.

To replicate the technique, orientation comes from the [Kog engineering writeup](https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/). It documents compile-time work partitioning, a 256-compute-unit grid with `gridDim=256` and `blockDim=(64,8)`, and tensor duplication per I/O die to avoid cross-die reduction penalties on the chiplet design. The implementation detail that matters most before you attempt the full loop is the use of dot2 instructions rather than matrix cores: matrix cores only earn their keep once the batch size fills their tile. Replicate this for your weight shapes first.

"The monokernel collapses the entire decode loop — including sampling and the EOS stop check — into one persistent kernel, so the host CPU never re-enters the path," per Kog's engineering team (source: Kog AI blog).

If hand-written HIP and inline assembly are more than you want to own, start one level up with AMD's [AITER](https://github.com/ROCm/aiter) (AI Tensor Engine for ROCm). It is the sanctioned reference, with Triton, Composable Kernel, HIP, and hand-tuned assembly backends already wired into vLLM and SGLang.
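Before writing any HIP, it can help to have a reference for the batch-1 projection discussed above, so a kernel's output can be checked against it. The sketch below models the dot2 pattern (multiply two element pairs and add both products into an accumulator) in plain Python. It assumes double-precision floats rather than FP16 hardware behavior, and `dot2`, `gemv_dot2`, and `gemv_reference` are hypothetical names, not Kog or ROCm APIs.

```python
def dot2(a_pair, b_pair, acc):
    """Model one dot2 step: acc + a0*b0 + a1*b1."""
    return acc + a_pair[0] * b_pair[0] + a_pair[1] * b_pair[1]


def gemv_dot2(weights, x):
    """Matrix-vector product built from pairwise dot2 steps, one row at a time."""
    if len(x) % 2 != 0:
        raise ValueError("vector length must be even to form pairs")
    out = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match vector length")
        acc = 0.0
        for k in range(0, len(x), 2):
            # Each step consumes two weights and two activations.
            acc = dot2(row[k:k + 2], x[k:k + 2], acc)
        out.append(acc)
    return out


def gemv_reference(weights, x):
    """Straightforward matrix-vector product used as a cross-check."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0]]
x = [1.0, 0.0, -1.0, 2.0]
print(gemv_dot2(W, x))
```

Swapping `W` and `x` for your own weight shapes gives a small oracle to compare a hand-written kernel against, within whatever tolerance FP16 rounding requires.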
A minimal "does Kog respond" check looks like the illustrative snippet below. It is not executed here because it needs the Kog runtime/CLI, and it exits with a clear message when that dependency is absent. The `kog` command and its flags are hypothetical placeholders, since Kog has published no CLI or pip package as of June 2026:

```python
import shutil
import subprocess

# Hypothetical CLI: the `kog` binary and its flags are placeholders, not a published tool.
if shutil.which("kog") is None:
    raise SystemExit("needs dependency: kog runtime/CLI")

cmd = [
    "kog", "bench",
    "--device", "mi300x",
    "--target-tps", "3000",
    "--no-kernel-switches",
]
print("+", " ".join(cmd))  # echo the command before running it
# Capture stdout and stderr together; raises CalledProcessError on a non-zero exit.
out = subprocess.check_output(cmd, text=True, stderr=subprocess.STDOUT)
print(out)
```

The headline numbers describe one narrow configuration: a custom 2B-parameter "Laneformer" model running at FP16 and batch size 1 on a single 8× MI300X node. As of June 2026, there is no published evidence that the monokernel generalizes to larger dense or MoE architectures, to FP8 or other quantized precisions, to batch sizes above 1, or to multi-node setups. The AI Weekly summary flags exactly these as unproven.

The results are also entirely self-reported. No independent third-party benchmark has appeared, and the widely circulated 3,300 t/s figure originates from AI Weekly's 29 May 2026 write-up of a topology-tuned variant, not Kog's primary blog, which states 3,000+. Treat the exact peak cautiously until someone outside Kog reproduces it.

The cross-vendor comparison carries the same caveat: Kog reports a sibling monokernel reaching ~2,100 t/s on 8× NVIDIA H200 under identical FP16, batch-1 conditions. That figure is also self-reported, with no external validation.

Finally, several capabilities developers will want are roadmap items, not shipping features. Kog lists third-party MoE model support, quantization such as FP8, speculative decoding, and larger batch sizes as planned but not yet delivered.

To understand why topology tuning matters, look at the die map.
The MI300X is a CDNA3 chiplet design: 8 Accelerator Complex Dies (XCDs) holding 304 compute units total, 38 per XCD, sitting atop 4 I/O dies (IODs), with 192 GB of HBM3 at roughly 5.3 TB/s peak bandwidth. Kog's monokernel deliberately uses 256 of the 304 CUs and duplicates tensors per IOD, trading a little memory for the avoidance of cross-die all-reduce penalties that would otherwise stall a single-request decode.

If you want to start on MI300X attention kernels without Kog-level resources, AMD's AITER MLA decode tutorial on the ROCm AI Developer Hub is the lowest-friction on-ramp. It targets Ubuntu 22.04 and ROCm 6.3.1, runs in a Docker container with `/dev/kfd` and `/dev/dri` exposed, and walks through cloning [AITER](https://github.com/ROCm/aiter) recursively, running `python3 setup.py develop`, and calling `mla_decode_fwd` directly.

As for Kog itself, the KIE tech preview post lists third-party MoE models, additional batch sizes, quantization, and speculative decoding as planned, with no dates attached.

The takeaway: the 3K t/s number is a single-request, single-model proof point, not a general benchmark. Try the playground today, watch [blog.kog.ai](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/) for the roadmap, and reach for AITER when you need a reproducible kernel path now.

You do not need your own MI300X to try Kog. The KIE tech preview is a hosted browser playground at [playground.kog.ai](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/), running the Laneformer 2B coding model on Kog's own 8× MI300X cluster. You interact through the browser and watch the per-request token rate directly, with no local GPU, drivers, or setup required. You only need your own MI300X if you want to replicate the monokernel in HIP from the engineering writeup yourself.

At batch size 1, decode is a GEMV (matrix-vector multiply), not a GEMM, so matrix cores stay idle.
Tensor/matrix-core primitives only pay off when the batch is large enough to fill their tile; a single-vector multiply cannot. Kog therefore implements the projection with scalar/vector ALU dot2 instructions, which are faster for batch-1 decode where HBM bandwidth, not compute, is the bottleneck.

Delayed Tensor Parallelism (DTP) defers the tensor-parallel all-reduce from attention and FFN and folds it into the computation of later layers, so cross-GPU traffic over Infinity Fabric runs asynchronously, hidden behind arithmetic. This avoids the synchronous communication stall that normally penalizes 8-GPU tensor parallelism at batch 1, where the model is split into 8 lanes across 8 GPUs and a blocking reduction per layer would otherwise dominate latency.

AITER (AI Tensor Engine for ROCm) is a framework-level operator library with Triton, Composable Kernel, HIP, and hand-tuned assembly backends, already wired into vLLM and SGLang production-serving paths. Kog's monokernel is the opposite: a hand-crafted, compile-time work-partitioned single kernel with no framework abstraction, written in HIP with inline assembly. It is lower-level, not open-sourced, and demonstrated only on a custom 2B model. AITER is the reproducible path when you need a kernel today.

The 3,300 tokens/s figure is not Kog's own headline number. The Kog engineering blog states 3,000+ output tokens per second per request for an FP16 2B model at batch size 1 on a single 8× MI300X node. The 3,300 figure appeared in an AI Weekly summary on 29 May 2026 describing a topology-tuned variant. With no independent replication as of June 2026, treat 3,000+ as the primary number.
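A back-of-envelope check helps put the headline figures in context. The sketch below converts 3,000 tokens/s into a per-token time budget and compares it with a weight-streaming floor, using only numbers quoted in this article (2B parameters, FP16, 8 GPUs, roughly 5.3 TB/s per GPU, about 35% of token time in synchronization). It assumes, purely for illustration, that every weight byte is read exactly once per token and that the weights are sharded evenly across the 8 GPUs. It ignores KV-cache traffic and the per-die tensor duplication, and the function names are hypothetical.

```python
def per_token_budget_us(tokens_per_s):
    """Time available per generated token, in microseconds."""
    return 1e6 / tokens_per_s


def weight_stream_floor_us(params, bytes_per_param, gpus, bw_bytes_per_s):
    """Minimum time to read all weights once at aggregate peak bandwidth."""
    total_bytes = params * bytes_per_param
    aggregate_bw = gpus * bw_bytes_per_s
    return total_bytes / aggregate_bw * 1e6


budget_us = per_token_budget_us(3000)                    # reported rate
floor_us = weight_stream_floor_us(2e9, 2, 8, 5.3e12)     # 2B params, FP16, 8x MI300X
sync_us = 0.35 * budget_us                               # reported sync share

print(f"per-token budget:      {budget_us:.1f} us")
print(f"weight-streaming floor: {floor_us:.1f} us")
print(f"synchronization share:  {sync_us:.1f} us")
```

Under these assumptions the floor comes out at roughly 94 μs against a budget of roughly 333 μs, which is one way to see why per-kernel dead time and synchronization, rather than raw bandwidth, are the costs the monokernel targets.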