{"slug": "bringing-up-deepseek-v4-flash-on-amd-mi300x", "title": "Bringing Up DeepSeek-V4-Flash on AMD MI300X", "summary": "AMD's MI300X accelerator, with 192GB of HBM3 memory and roughly half the list price of NVIDIA's H100, remains underutilized due to software incompatibilities. As of early May 2026, running vLLM with DeepSeek-V4-Flash on the MI300X is not possible because the chip uses the non-standard \"fnuz\" FP8 dialect, which is incompatible with the OCP-standard FP8 supported by newer AMD chips and the software ecosystem. This software gap prevents cloud providers from leveraging the MI300X's competitive hardware specifications for AI inference workloads.", "body_md": "# Bringing up DeepSeek-V4-Flash on AMD MI300X\n\nAt [Doubleword](https://app.doubleword.ai) we are building an inference\ncloud designed for volume. To do that we have to reckon with the\nenveloping compute shortage.\n\nAMD’s MI300X launched in December 2023 (at AMD’s [“Advancing AI” event](https://www.amd.com/en/newsroom/press-releases/2023-11-15-amd-announces-amd-instinct-mi300-accelerator-launc.html), 6 December 2023) as AMD’s response to NVIDIA’s\nH100, arriving alongside H200 in the same generation. It is an odd duck\nin the world of high-end AI accelerators. While H100 prices are climbing\n(up 40% in five months on one-year rentals, with on-demand capacity sold\nout across every major NVIDIA part; see SemiAnalysis, [The Great GPU Shortage: Rental Capacity](https://newsletter.semianalysis.com/p/the-great-gpu-shortage-rental-capacity), April 2026), MI300X is perhaps still\nunderappreciated. 192GB of HBM3 per card against the H100’s 80GB,\ncomparable FP8 compute, list price roughly half. Yet you can rent one\non-demand today (from [Hotaisle](https://www.hotaisle.ai/), for instance)\nfor noticeably less than the equivalent NVIDIA capacity.\n\nThe reason is software. 
The problems with running AI workloads on AMD have\n[been written about\nelsewhere](https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200-benchmark-part-1-training)\nexhaustively, and there are signs the gap is closing on AMD’s newer chips (SemiAnalysis’s [InferenceX dashboard](https://inferencex.semianalysis.com/inference) tracks the latest AMD parts, MI350X and MI355X, against current NVIDIA generations).\nThat new focus on software hasn’t extended back to old parts. As of early May\n2026, running vLLM with DeepSeek-V4-Flash on MI300X just doesn’t work.\n\nOn paper MI300X is an excellent accelerator. We want it to work. This post is a worklog of all the sharp edges and winding paths we found when we tried to get it working.\n\n## FP8 dialect\n\nThe MI300X was part of the accelerator generation that kicked off the march toward lower bitwidths. LLM weights, and to a lesser extent activations and KV caches, are less sensitive to numerical imprecision than typical HPC workloads, so the Hopper generation of NVIDIA chips and the first Instinct chips added hardware support for sub-16-bit precision for the first time. The result is twice as many FLOPs applied to workloads that correspondingly transfer half as much data.\n\nThe problem is that there was disagreement on the best way to build an\nFP8 datatype. Graphcore and AMD proposed [one standard](https://www.graphcore.ai/posts/graphcore-and-amd-propose-8-bit-fp-ai-standard-with-qualcomm-support)\nin a [2022 preprint](https://arxiv.org/pdf/2206.02915), backed by\nQualcomm. Arm, Intel, and NVIDIA proposed [another](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1)\nthrough the Open Compute Project. 
In a rehash of some of the forks in\nthe road that led to IEEE 754 (this interview with [William Kahan](https://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html)\nis a great read on how an arithmetic standard actually gets\nmade, including which arguments win and which are forgotten), different providers built in\ndifferent and incompatible behaviours.\n\nPerhaps unsurprisingly given the list of backers on each side, the\nAMD / Graphcore standard didn’t make it. AMD’s newer MI350 and\nMI355X chips moved over to OCP-standard FP8. But MI300X still only\nworks in the `fnuz` dialect (`fnuz` means “finite, NaNs, unsigned zero”, i.e. no `-0` and no\n`inf`. These seem like sensible things to cut out for AI workloads at\nsmall floating-point range, where every bit matters, but the dialect\nnever quite took off, and later AMD generations went back to the more\nnormal-looking FP8), so the initial vLLM work that went into\nbringing up DeepSeek on AMD didn’t actually work for bringing DeepSeek\nup on MI300X.\n\nLots of vLLM’s FP8 paths are aware of `e4m3` versus `e5m2` but not of\n`fnuz` versus OCP. The two share their bit layout but differ in exponent\nbias by one, so the same byte read as the wrong dialect comes back off\nby exactly a factor of two. MI300X is the only major accelerator where\nthat distinction matters in practice. Throughout, we’ll note the relevant commits from the demo PRs in a public vLLM repo we put up for this post. 
[`236de4e64`](https://github.com/doublewordai/vllm-amd-blog-doubleword/commit/236de4e64) makes the\nDeepSeek v4 compressor and fused compress / quant / cache writes use the\nplatform FP8 dtype so scales and cache bytes agree, and\n[`bd06e5d87`](https://github.com/doublewordai/vllm-amd-blog-doubleword/commit/bd06e5d87) routes the sliding-window K-cache through a fnuz-aware fused quantise-and-insert helper.\n\n## Missing attention fast paths\n\nDeepSeek v4’s attention is [sparse](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro). Each query attends to a top-k subset of the\nKV cache picked by a learned indexer, with sliding-window context\nhandled separately.\n\nIt’s got a lot of moving pieces: KV compression, the indexer, the sliding-window path, FP8 caches feeding each. In a production deployment for maximum performance, each piece needs special attention (no pun intended) in the form of a tuned kernel.\n\nThe source of fast tuned kernels on AMD is [AITER](https://github.com/rocm/aiter).\nAITER is AMD’s tuned-kernel library, roughly the analog of what NVIDIA\nusers get from cuBLAS, cuDNN, FlashAttention, and Transformer Engine\ncombined. vLLM falls back to generic Triton when AITER doesn’t have a\npath for a given shape, and generic Triton attention is several times\nslower than a tuned kernel. AITER’s coverage for DSV4 is uneven, and\nwhat coverage exists tends to target later AMD parts (CDNA4) rather\nthan the CDNA3 (gfx942) cores in MI300X.\n\nThe fallout from this has two different shapes. Some pieces are missing AITER paths\nentirely on gfx942: paged MQA logits, sparse MLA prefill, sparse MLA\ndecode. For each we need to put in a ROCm-specific helper that calls into AITER\nwhere it exists and falls through to a Triton implementation where it\ndoesn’t. Some pieces have AITER paths that exist but break specifically on\ngfx942: AITER prefill MQA logits and AITER sparse prefill logits both\nfall here. 
The fix is to refuse to dispatch into them when\n`current_platform` reports gfx942 and let the Triton fallback handle\nthe call instead (see [`cb8a18556`](https://github.com/doublewordai/vllm-amd-blog-doubleword/commit/cb8a18556): paged MQA and sparse MLA fallbacks, AITER guards on gfx942, and correctness coverage).\n\n## HIP graphs\n\nHIP graphs are AMD’s analog of CUDA graphs, with effectively the same semantics: record the stream of operations once at warmup, replay the recorded graph on every subsequent step. The win is removing per-launch Python overhead from the decode loop, which matters a lot when you launch hundreds of small kernels per token. Since DeepSeek v4 has so many moving parts, there would be a lot of launch overhead if we didn’t leverage graphs.\n\nThe price is that the captured region has to be a pure function of its device inputs. Anything that reads from the host, allocates a ragged tensor whose shape depends on the live batch, or synchronises inside the captured region gets recorded with whatever value it had at warmup and replayed forever after.\n\nThe AITER tuned kernels compose with this by construction. AITER kernels are C++\nlaunches that take device pointers and sizes; they don’t allocate\nragged scratch from Python and they don’t read host scalars mid-stream.\nIt’s pretty easy to write a Triton kernel that doesn’t play nicely with capture; we did that a couple of times. [`22cc02230`](https://github.com/doublewordai/vllm-amd-blog-doubleword/commit/22cc02230) rebuilds the sparse MLA decode metadata as static, capture-safe tensors: no dynamic ragged allocations, no host-to-device scalar writes under capture.\n\n## Loose ends\n\nWe ran into a bunch of smaller issues:\n\n- An MoE routing bug where the expert-mask shape was gated on whether\nROCm AITER was globally enabled, not on whether the matmul about to\nbe called was actually AITER’s. 
With AITER globally on but MXFP4\nfalling through to the emulation backend, the kernel got the wrong\nmask and tokens routed to the wrong experts (`8b5f7aa2c`).\n\n- A Triton kernel that masks padded lanes against the global tensor\nbound rather than the logical block size. At high concurrency the\npadded lanes scribbled across the MoE routing bitmatrix (`c32932bb9`).\n\n## Tuning it up\n\nWith correctness sorted, we can do some basic optimization.\n\nThe first profile of a working DSV4-Flash on MI300X shows that the expensive layers are the sparse MLA path and the MXFP4 MoE path. This is good; if it wasn’t the case we’d be really screwed.\n\nHowever, after first bring-up, a meaningful slice of the time is not in the matmuls\nthemselves but in the bookkeeping and tuning around them. Sparse MLA decode rebuilds ragged metadata every step. The decode kernel\nwrites to a scratch tensor and then copies into the caller’s output buffer. The\nbf16 projection weight gets materialised every decode step instead of cached.\nOne static Triton launch shape covers both the small-batch ramp and saturated\nserving. The MXFP4 OGS tile shape is similarly a single static choice across\nregimes that don’t look anything alike. See [doublewordai/vllm-amd-blog-doubleword#2](https://github.com/doublewordai/vllm-amd-blog-doubleword/pull/2).\n\nOn our simple benchmark, fixing these takes the box from 2485 to 2699 output tok/s per GPU, about +8.6%.\n\n## Was it worth it?\n\nAfter bringing up the model, optimizing it, and testing it, we get pretty good numbers.\n\nThis is a win: MI300X rents for roughly half the price of the NVIDIA capacity it competes with, carries more than twice the HBM per card, and is available on-demand right now, even as H100 and H200 lead times stretch out. 
We haven’t done the maths to prove that we can get a win on tokens per second per dollar over NVIDIA hardware, but we’ve proven that with hard work we can get close enough to make it useful.\n\nMost of what made it so hard is temporary. The FP8 dialect problem is specific to CDNA3: MI350 and MI355X moved to OCP-standard FP8, so the off-by-a-factor-of-two trap does not exist on newer parts. The AITER coverage gaps will fill in over time as AMD’s kernel work catches up to its own hardware. And since we did this work, even while we prepared to open-source it, vLLM’s performance and stability on this model have improved.\n\nAMD’s hardware has been good for a while. The reason the software gap is\nfinally closing is partly AMD’s own focus, and partly that the cost of doing\nthis kind of work has dropped through the floor with the rise of agentic coding. (All of the fixes in this post live as demo PRs in a public\n[vLLM repo](https://github.com/doublewordai/vllm-amd-blog-doubleword) we put up to accompany it; the commits are linked\ninline throughout. We intend to upstream the parts that make sense for everyone.) 
As\na result of both factors, if you send your DeepSeek-V4-Flash requests to the\n[Doubleword API](https://app.doubleword.ai), the response might be AMD-powered.\n\nLast modified: 1 Jun 2026", "url": "https://wpnews.pro/news/bringing-up-deepseek-v4-flash-on-amd-mi300x", "canonical_source": "https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/", "published_at": "2026-06-02 17:52:48+00:00", "updated_at": "2026-06-02 20:16:27.964372+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-chips", "large-language-models", "artificial-intelligence", "ai-products"], "entities": ["Doubleword", "AMD", "MI300X", "NVIDIA", "H100", "H200", "Hotaisle", "SemiAnalysis"], "alternates": {"html": "https://wpnews.pro/news/bringing-up-deepseek-v4-flash-on-amd-mi300x", "markdown": "https://wpnews.pro/news/bringing-up-deepseek-v4-flash-on-amd-mi300x.md", "text": "https://wpnews.pro/news/bringing-up-deepseek-v4-flash-on-amd-mi300x.txt", "jsonld": "https://wpnews.pro/news/bringing-up-deepseek-v4-flash-on-amd-mi300x.jsonld"}}