# Kog hits 3K t/s on MI300X, no kernel switches — test it now

> Source: <https://dev.to/creeta/kog-hits-3k-ts-on-mi300x-no-kernel-switches-test-it-now-55dh>
> Published: 2026-06-17 04:59:23+00:00

AMD's MI300X has long had more single-request inference headroom than the default ROCm stack exposes. A Paris startup just showed how much — by deleting the per-token kernel launch entirely.

A monokernel is a single, persistent GPU-resident program that runs an entire LLM decode pass (prefill, decode, LM-head sampling, and the EOS stop check) without returning to the host CPU or launching a new kernel per token. Kog AI reports **3,000+ output tokens/s per request** for an FP16 2B model at batch size 1 on one 8× MI300X node; the monokernel is the engine behind the Kog Inference Engine tech preview launched 28 May 2026. That matters because batch-1 decoding is bound by HBM bandwidth, not compute, so the useful work per token is short and the dead time between kernels dominates.

**Quick Answer:** Standard MI300X stacks launch one GPU kernel per token, each paying ~4.5 μs launch overhead plus HBM restart latency. Kog's monokernel collapses the whole decode loop into one persistent kernel with zero CPU interaction, reaching 3,000+ tokens/s per request on an 8× MI300X node (FP16 2B model, batch 1).

Conventional stacks (vLLM, SGLang, ROCm/HIP pipelines) launch a fresh kernel for every stage of every token. Kog quantifies the recurring tax that the monokernel removes:

| Overhead source | Cost per occurrence |
|---|---|
| Kernel launch (per stage) | ~4.5 μs |
| HBM latency on each memory-load restart | ~0.5 μs |
| Intermediate tensor materialization round-trip to HBM | >1 μs |
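
To put the table in perspective, the sketch below compares those fixed costs against the time available per token at 3,000 tokens/s. The launch and restart costs come from the table; the kernels-per-token counts are hypothetical values chosen for illustration, since the article does not say how many kernels a conventional stack launches per token. The materialization round-trip is left out to keep the estimate conservative.

```python
# Per-token time budget versus fixed per-kernel overhead.
# LAUNCH_US and HBM_RESTART_US are taken from the table above.
# The kernels-per-token values are ASSUMED for illustration, not reported by Kog.

LAUNCH_US = 4.5        # kernel launch cost per stage
HBM_RESTART_US = 0.5   # HBM latency on each memory-load restart
TARGET_TPS = 3000      # Kog's reported per-request token rate


def token_budget_us(tokens_per_second: float) -> float:
    """Wall-clock time available per token, in microseconds."""
    return 1_000_000 / tokens_per_second


def fixed_overhead_us(kernels_per_token: int) -> float:
    """Launch plus restart cost if every stage runs as its own kernel."""
    return kernels_per_token * (LAUNCH_US + HBM_RESTART_US)


budget = token_budget_us(TARGET_TPS)
for kernels_per_token in (10, 50, 100):  # hypothetical stage counts
    overhead = fixed_overhead_us(kernels_per_token)
    print(f"{kernels_per_token:>4} kernels/token: {overhead:6.1f} us overhead, "
          f"{overhead / budget:.0%} of the {budget:.1f} us budget")
```

Under these assumptions, 50 launches per token would already cost 250 μs, about three quarters of the roughly 333 μs budget, which is why removing the launches matters more than speeding up any single kernel.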

Synchronization is rebuilt to match. Instead of atomic arrival counters, buffers initialize to NaN and consumers poll until real data appears. This sentinel-value polling cuts sync latency from ~7.8 μs to ~0.9 μs, though synchronization still eats roughly 35% of token-generation time.

Is the peak number solid? A topology-tuned variant that groups compute units by HBM die adjacency is cited at **3,300 tokens/s**, but that figure comes from a secondary report rather than the primary blog (which states 3,000+). Treat the exact peak cautiously: this is single-vendor, self-reported data with no independent benchmark yet.
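
The sentinel-polling idea described in this section can be sketched on the CPU with two Python threads. This is only an analogy under stated assumptions: the buffer, slot count, and function names are made up for illustration, and Kog's actual implementation runs inside a GPU kernel in HIP, not in Python. What the sketch does show is the core mechanism: the consumer watches the data slot itself instead of a separate arrival counter.

```python
import math
import threading
import time

SLOTS = 4
# Every slot starts as NaN, meaning "not produced yet".
buffer = [math.nan] * SLOTS


def producer() -> None:
    """Write real values one slot at a time, with a small delay."""
    for i in range(SLOTS):
        time.sleep(0.001)
        buffer[i] = float(i * i)  # overwriting the sentinel publishes the value


def consume(timeout_s: float = 5.0) -> list:
    """Spin on each slot until its NaN sentinel has been replaced."""
    deadline = time.monotonic() + timeout_s
    results = []
    for i in range(SLOTS):
        while math.isnan(buffer[i]):  # poll the data itself, no counter
            if time.monotonic() > deadline:
                raise TimeoutError(f"slot {i} never arrived")
        results.append(buffer[i])
    return results


worker = threading.Thread(target=producer)
worker.start()
values = consume()
worker.join()
print(values)
```

One design consequence worth noting: a result that is legitimately NaN cannot be told apart from "not ready", so this scheme only works where valid data is never NaN.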

There are exactly two ways to engage with Kog's work today, and they sit at opposite ends of the effort spectrum. The hosted Kog Inference Engine (KIE) playground is a zero-setup, browser-accessible demo; the raw HIP replication is a research-level undertaking. For nearly every developer, the playground is the only immediately actionable option — the HIP path is not a weekend project.

The playground at playground.kog.ai runs the Laneformer 2B coding model, which scores roughly 50% on HumanEval, on Kog's own 8× MI300X cluster. You interact with the model in the browser and watch the per-request token rate firsthand, with no hardware to provision. It is the fastest way to verify the latency claim with your own prompts.

The HIP replication path is a different category of work. To reproduce the monokernel you need an AMD Instinct GPU, a ROCm 6.x stack, and deep HIP/assembly experience. The implementation required hand-written inline assembly for atomics on 3-dword types, manual register-pressure management (LICM, instruction inspection), and a custom cross-GPU timestamp profiling harness synced via the HSA API.

Crucially, as of June 2026 there is no open-source kernel and no pip package. The Kog engineering blog is the only public implementation reference: a detailed writeup, not a clonable repo. If you want the technique, you reimplement it from the prose.

Start at the playground, then escalate to HIP only if you need the technique itself. The zero-setup path is [playground.kog.ai](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/): open the page, submit a coding prompt, and watch the per-request token counter in the response UI. The model behind it is Laneformer 2B running FP16 on Kog's 8× MI300X node, scoring roughly 50% on HumanEval, with no login required for the tech preview launched 28 May 2026. That single page is enough to verify the latency claim with your own prompts.

To replicate the technique, orientation comes from the [Kog engineering writeup](https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/). It documents compile-time work partitioning, a 256-compute-unit grid with `gridDim=(256,)` and `blockDim=(64,8)`, and tensor duplication per I/O die to avoid cross-die reduction penalties on the chiplet design. Two implementation details matter most before you attempt the full loop:

Use vector-ALU `dot2` instructions rather than matrix cores for batch-1 projections, because tensor cores only earn their keep once batch size fills their tile. Replicate this for your weight shapes first.

"The monokernel collapses the entire decode loop — including sampling and the EOS stop check — into one persistent kernel, so the host CPU never re-enters the path," per Kog's engineering team (source: Kog AI blog).

If hand-written HIP and inline assembly are more than you want to own, start one level up with AMD's [AITER](https://github.com/ROCm/aiter) (AI Tensor Engine for ROCm), the sanctioned reference, with Triton, Composable Kernel, HIP, and hand-tuned assembly backends already wired into vLLM and SGLang. A minimal "does Kog respond" check looks like the illustrative snippet below. It is not executed here (it needs the Kog runtime/CLI) and exits early with a message when that dependency is absent. The `kog` command and its flags are placeholders, since no public Kog CLI or package exists as of June 2026:

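```python
import shutil
import subprocess

# NOTE: `kog` is a hypothetical CLI used for illustration only. As of the
# article's date Kog has published no CLI or pip package, so expect this
# guard to exit on a typical machine.
if shutil.which("kog") is None:
    raise SystemExit("needs dependency: kog runtime/CLI")

# Hypothetical subcommand and flags; adjust to whatever interface Kog ships.
cmd = ["kog", "bench", "--device", "mi300x", "--target-tps", "3000", "--no-kernel-switches"]
print("+", " ".join(cmd))
# Merge stderr into stdout so benchmark diagnostics appear in one stream.
out = subprocess.check_output(cmd, text=True, stderr=subprocess.STDOUT)
print(out)
```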

The headline numbers describe one narrow configuration: a custom 2B-parameter "Laneformer" model running at FP16 and batch size 1 on a single 8× MI300X node. As of June 2026, there is no published evidence that the monokernel generalizes to larger dense or MoE architectures, to FP8 or other quantized precisions, to batch sizes above 1, or to multi-node setups. The AI Weekly summary flags exactly these as unproven.

The results are also entirely self-reported. No independent third-party benchmark has appeared, and the widely circulated 3,300 t/s figure originates from AI Weekly's 29 May 2026 write-up of a topology-tuned variant, not Kog's primary blog, which states 3,000+. Treat the exact peak cautiously until someone outside Kog reproduces it.

The cross-vendor comparison carries the same caveat: Kog reports a sibling monokernel reaching ~2,100 t/s on 8× NVIDIA H200 under identical FP16, batch-1 conditions — also self-reported, with no external validation.

Finally, several capabilities developers will want are roadmap items, not shipping features. Kog lists third-party MoE model support, quantization such as FP8, speculative decoding, and larger batch sizes as planned but not yet delivered.

To understand why topology tuning matters, look at the die map. The MI300X is a CDNA3 chiplet design: 8 Accelerator Complex Dies (XCDs) holding 304 compute units in total (38 per XCD), sitting atop 4 I/O dies (IODs), with 192 GB of HBM3 at roughly 5.3 TB/s peak bandwidth. Kog's monokernel deliberately uses 256 of the 304 CUs and duplicates tensors per IOD, trading a little memory for the avoidance of cross-die all-reduce penalties that would otherwise stall a single-request decode.
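
The figures in the die map above support a quick back-of-the-envelope check of the "bandwidth-bound" claim. The sketch below is a naive estimate under loud assumptions that do not come from Kog: "2B" is taken as exactly 2e9 parameters, the weights are assumed to be split evenly across the 8 GPUs, every weight is assumed to be read exactly once per token at peak bandwidth, and KV-cache and activation traffic are ignored.

```python
# Figures quoted in the article.
XCDS = 8
CUS_PER_XCD = 38
CUS_USED = 256
HBM_BYTES_PER_S = 5.3e12  # roughly 5.3 TB/s peak per MI300X
GPUS = 8
REPORTED_TPS = 3000

# ASSUMPTIONS for illustration only (not reported by Kog).
PARAMS = 2e9          # "2B" taken as exactly 2e9 parameters
BYTES_PER_PARAM = 2   # FP16

# Die map arithmetic: how much of the chip the monokernel occupies.
total_cus = XCDS * CUS_PER_XCD
cu_fraction = CUS_USED / total_cus

# Naive bandwidth floor: time to stream one GPU's share of the weights once.
weight_bytes_per_gpu = PARAMS * BYTES_PER_PARAM / GPUS
floor_us = weight_bytes_per_gpu / HBM_BYTES_PER_S * 1e6
ceiling_tps = 1e6 / floor_us

print(f"CUs used: {CUS_USED}/{total_cus} ({cu_fraction:.1%})")
print(f"weight-streaming floor: {floor_us:.1f} us per token")
print(f"naive ceiling: {ceiling_tps:,.0f} tokens/s")
print(f"reported rate vs ceiling: {REPORTED_TPS / ceiling_tps:.0%}")
```

Under these assumptions the weight-streaming floor is on the order of 94 μs per token, so the reported 3,000+ tokens/s sits well below the naive ceiling. That gap is consistent with the article's point that synchronization and other fixed latencies, not raw bandwidth alone, still take a large share of each token.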

If you want to start on MI300X attention kernels without Kog-level resources, AMD's AITER MLA decode tutorial on the ROCm AI Developer Hub is the lowest-friction on-ramp. It targets Ubuntu 22.04 and ROCm 6.3.1, runs in a Docker container with `/dev/kfd` and `/dev/dri` exposed, and walks through cloning [AITER](https://github.com/ROCm/aiter) recursively, running `python3 setup.py develop`, and calling `mla_decode_fwd` directly.

As for Kog itself, the KIE tech preview post lists third-party MoE models, additional batch sizes, quantization, and speculative decoding as planned, with no dates attached. The takeaway: the 3K t/s number is a single-request, single-model proof point, not a general benchmark. Try the playground today, watch [blog.kog.ai](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/) for the roadmap, and reach for AITER when you need a reproducible kernel path now.

You do not need your own MI300X to try Kog. The KIE tech preview is a hosted browser playground at [playground.kog.ai](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/), running the Laneformer 2B coding model on Kog's own 8× MI300X cluster. You interact through the browser and watch the per-request token rate directly, with no local GPU, drivers, or setup required. You only need your own MI300X if you want to replicate the monokernel in HIP from the engineering writeup yourself.

At batch size 1, decode is a GEMV (matrix-vector multiply), not a GEMM, so matrix cores stay idle. Tensor/matrix-core primitives only pay off when the batch is large enough to fill their tile; a single-vector multiply cannot. Kog therefore implements the projection with scalar/vector ALU `dot2` instructions, which are faster for batch-1 decode, where HBM bandwidth rather than compute is the bottleneck.
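
To make the GEMV point concrete, here is a minimal pure-Python sketch of a matrix-vector multiply that consumes the reduction dimension two elements at a time, the way a dot2-style instruction does. The `dot2` and `gemv_dot2` helpers are hypothetical names written for this example; they only emulate the arithmetic pattern and say nothing about how Kog schedules the real instructions in HIP.

```python
def dot2(a_pair, b_pair, acc):
    """Emulate a dot2-style op: acc + a0*b0 + a1*b1 on one packed pair."""
    return acc + a_pair[0] * b_pair[0] + a_pair[1] * b_pair[1]


def gemv_dot2(weights, x):
    """Compute y = W @ x, reducing two elements per step."""
    if len(x) % 2:
        raise ValueError("reduction dimension must be even for pairwise dot2")
    y = []
    for row in weights:
        acc = 0.0
        for k in range(0, len(x), 2):
            acc = dot2(row[k:k + 2], x[k:k + 2], acc)
        y.append(acc)
    return y


# Toy shapes: a 2x4 weight matrix and a single activation vector (batch 1).
W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.0, -1.0, 2.0]]
x = [1.0, 0.0, 2.0, -1.0]
y = gemv_dot2(W, x)
print(y)
```

Each weight is touched exactly once per output, so there is no reuse for a matrix-core tile to exploit. That is the structural reason a batch-1 projection is limited by how fast weights stream from HBM rather than by multiply throughput.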

Delayed Tensor Parallelism (DTP) defers the tensor-parallel all-reduce from attention and FFN and folds it into the computation of later layers, so cross-GPU traffic over Infinity Fabric runs asynchronously, hidden behind arithmetic. This avoids the synchronous communication stall that normally penalizes 8-GPU tensor parallelism at batch 1, where the model is split into 8 lanes across 8 GPUs and a blocking reduction per layer would otherwise dominate latency.

AITER (AI Tensor Engine for ROCm) is a framework-level operator library with Triton, Composable Kernel, HIP, and hand-tuned assembly backends, already wired into vLLM and SGLang production-serving paths. Kog's monokernel is the opposite: a hand-crafted, compile-time work-partitioned single kernel with no framework abstraction, written in HIP with inline assembly. It is lower-level, not open-sourced, and demonstrated only on a custom 2B model, so AITER is the reproducible path when you need a kernel today.

The 3,300 tokens/s figure is not Kog's own headline number. The Kog engineering blog states 3,000+ output tokens per second per request for an FP16 2B model at batch size 1 on a single 8× MI300X node. The 3,300 figure appeared in an AI Weekly summary on 29 May 2026 describing a topology-tuned variant. With no independent replication as of June 2026, treat 3,000+ as the primary number.
