DeepSeek V4 on Huawei's Ascend 950: A Real Stress Test for China's AI Chip Ecosystem

In April 2026, DeepSeek released its V4 model, a 1.6-trillion-parameter MoE architecture, and for the first time officially validated its inference on Huawei's Ascend 950PR chip, a significant milestone for China's domestic AI hardware ecosystem. The Ascend 950PR offers credible inference performance comparable to NVIDIA's China-specific H20, but it still faces severe supply chain bottlenecks, including limited advanced manufacturing capacity at SMIC and constraints in 2.5D advanced packaging. Despite these challenges, Huawei's custom HBM (HiBL and HiZQ) and its architectural innovations for LLM workloads are gradually closing the gap with NVIDIA's offerings.

In April 2026, DeepSeek released V4, a 1.6-trillion-parameter MoE model, and for the first time the technical report listed Huawei's Ascend NPU alongside NVIDIA in its validated hardware list. This is the story of what that means for the actual supply chain, the bottlenecks that remain, and where this is heading.

When DeepSeek released V4 on April 24, 2026, most of the attention went to the model's benchmark scores matching GPT-5 and Claude Opus. But a quieter, and arguably more consequential, event was buried in the fine print: DeepSeek V4 is the first top-tier model to fully validate inference on Huawei's Ascend 950PR chip. This wasn't an "it compiles" checkbox validation. The DeepSeek team:

The results on the 950PR are genuine:

A caveat: the H20 is NVIDIA's "China special", deliberately cut down to comply with U.S. export controls. This doesn't mean the 950PR beats the H100 or B200. But it does mean that for inference workloads on Chinese soil, domestic hardware is now a credible alternative, not a consolation prize.

What makes the 950 interesting isn't just the raw spec sheet; it's the architectural cleverness. Huawei realized that LLM inference has two fundamentally different phases, and designed two separate chips sharing the same die:

Prefill (reading the entire input and computing the KV cache) is compute-bound: it needs raw FLOPs, not memory bandwidth. A cheaper HBM works fine.

Decode (generating one token at a time) is memory-bandwidth-bound: the bottleneck is how fast you can feed weights to the compute units. Here, 4 TB/s of bandwidth makes a real difference.

The 950DT's 4 TB/s HiZQ 2.0 memory puts it in the same league as NVIDIA's H200 (141GB, 4.8 TB/s). It won't be available until Q4 2026, but that's when the training-side gap starts to close.

HBM accounts for roughly 50% of an AI chip's cost. Huawei's decision to develop its own HBM, HiBL (low-cost/Budget Line) and HiZQ (high-performance), isn't just about supply chain security.
It enables customization that off-the-shelf HBM can't provide.

The local HBM supply chain is making real progress:

The bottleneck: CXMT's HBM3 is still in testing. Raw materials only support sample runs, not mass production. The Huawei alliance is also working with Fujian Jinhua (福建晋华) and Wuhan Xinxin (武汉新芯) as secondary foundries, but these are supplementary capacity, not primary sources.

The pragmatic reality: HiBL 1.0 and HiZQ 2.0 are likely "self-developed" at the packaging and controller level, not at the DRAM die level. Huawei takes available DRAM dies, packages them with proprietary 2.5D stacking, and adds custom controllers. This is why HiBL 1.0's 1.6 TB/s bandwidth is achievable: it's bounded by the dies they can source, not by their design ambition.

HBM gets the headlines, but it's not the only constraint. Here are all five, ranked by severity:

The hardest bottleneck is advanced-node wafer capacity. SMIC's N+2 process (equivalent to 7nm, using DUV multipatterning since EUV is unavailable) has a monthly capacity of approximately 35,000-38,000 12-inch wafers. At ~92% yield, that translates to roughly 750,000 Ascend 950 chips per year. 750K sounds like a lot, but it serves the entire Chinese AI market, while NVIDIA ships millions of H100/B200 units annually. The capacity gap is wide. SMIC plans to double capacity to 70,000 wafers/month during 2026, but without EUV, each generation becomes exponentially harder. The 950DT uses the same N+2 process. The absolute ceiling of domestic advanced manufacturing will remain the binding constraint through at least 2028.

Ascend 950 requires 2.5D chiplet packaging (2 compute dies + 2 I/O dies + HBM). This isn't a "nice to have": without it, you can't assemble the chip. Packaging capacity is the tightest short-term bottleneck. New capacity from JCET and Tongfu's expansion won't meaningfully add supply until 2027. This is why "advanced packaging stocks" are the hottest semiconductor theme on China's A-share market in 2026.
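The wafer-capacity figures quoted above for SMIC's N+2 line can be turned into a back-of-envelope calculation. The sketch below is only an illustration: the function names are hypothetical, and `gross_chips_per_wafer` and `ascend_share` are assumptions that the article does not state. Only the wafers-per-month range, the ~92% yield, and the ~750,000 chips-per-year figure come from the text.

```python
def annual_chip_output(wafers_per_month, gross_chips_per_wafer,
                       yield_rate, ascend_share=1.0):
    """Estimate good chips per year from a fab line's monthly wafer starts.

    gross_chips_per_wafer and ascend_share are assumed inputs; the article
    gives neither value.
    """
    if not 0.0 <= yield_rate <= 1.0:
        raise ValueError("yield_rate must be between 0 and 1")
    if not 0.0 <= ascend_share <= 1.0:
        raise ValueError("ascend_share must be between 0 and 1")
    wafers_per_year = wafers_per_month * 12
    return wafers_per_year * ascend_share * gross_chips_per_wafer * yield_rate


def implied_good_chips_per_wafer(chips_per_year, wafers_per_month):
    """Back out good chips per wafer start implied by an annual output figure."""
    return chips_per_year / (wafers_per_month * 12)


if __name__ == "__main__":
    # Figures quoted in the article: 35,000-38,000 wafers/month, ~750,000 chips/year.
    for wafers in (35_000, 38_000):
        ratio = implied_good_chips_per_wafer(750_000, wafers)
        print(f"{wafers} wafers/month -> {ratio:.2f} good chips per wafer start")

    # Hypothetical inputs, for illustration only (20 chips/wafer, 10% allocation).
    print(annual_chip_output(35_000, gross_chips_per_wafer=20,
                             yield_rate=0.92, ascend_share=0.1))
```

With the quoted figures, the second helper gives roughly 1.6 to 1.8 good chips per wafer start averaged over the whole line, so the 750,000 estimate presumably reflects an allocation share and a chips-per-wafer count that the text does not spell out.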
The Atlas 950 SuperNode (8,192 cards, 160 cabinets, 1,000 square meters) requires a new interconnect protocol: Lingqu 2.0 / UnifiedBus. The predecessor, Lingqu 1.0, was validated on 384-card Atlas 900 systems (300+ deployed). Scaling from 384 to 8,192 cards, a roughly 21-fold increase, is a leap in complexity:

This is a 2026 Q4 delivery. The engineering risk is real, but Huawei's track record with Lingqu 1.0 (proven at scale) suggests this is a schedule risk, not a technology risk.

CANN was fully open-sourced in December 2025. DeepSeek V4's successful port is the single biggest validation event to date. But the developer count gap is stark: ~87,000 CANN developers vs. ~3 million CUDA developers. Huawei's strategy is "CUDA-to-CANN automated conversion tools" combined with PyTorch compatibility layers. This works for standard model architectures. Edge cases still require manual operator rewriting, the same 30 person-years of work that DeepSeek invested. For large enterprises with dedicated ML teams, this is doable. For smaller teams, it's a barrier.

Per-chip TDP is ~310W. At supernode scale, total power draw is in megawatts: 8,192 cards at ~310W each is about 2.5 MW for the accelerators alone. Full liquid cooling is mandatory, and green power alignment adds infrastructure complexity. This is solvable, since the technology exists, but deployment speed varies across data center operators.

Huawei has a clear 3-generation roadmap:

Each generation roughly doubles specs. Revenue hit $12 billion in 2026 (up 60% from $7.5B in 2025). The business is scaling.

The honest assessment: Ascend will not "catch up" to NVIDIA in absolute terms. The process gap (7nm DUV vs 3nm EUV+) is physical and cannot be willed away. But it doesn't need to catch up. The Chinese AI chip market is structurally dividing: Ascend takes ~50% of domestic demand; NVIDIA holds the high end through H20 and smuggled/cloud-accessible H100; and other domestic players (Cambricon, Moore Threads, Biren) split the remainder.

For anyone building AI products for the Chinese market: this is not a question of "whether to switch."
It's "when to switch." For anyone building for global markets: unaffected; continue with CUDA. Two technology worlds are solidifying: CUDA World and CANN World.

Before April 2026, Huawei could say "our chips work." After April 2026, DeepSeek proved it with a 1.6T-parameter model, real production traffic, and actual cost numbers. The credibility gap is closed.

The remaining bottlenecks are all physical or temporal: more chips, more packaging lines, more fab capacity, more time for the ecosystem to mature. None of these has a quick fix. But they also don't depend on any single breakthrough: they're a production scaling problem, and production scaling responds to money and time.

China's AI chip ecosystem just passed its most important stress test. The bottlenecks that remain are hard, but they're the kind of hard that follows linear progress curves, not the binary win/lose of "can this even work."