{"slug": "nvidia-already-won-training-the-real-fight-is-inference", "title": "Nvidia Already Won Training. The Real Fight Is Inference", "summary": "Nvidia dominates AI training hardware, but the inference market is wide open as companies like Cerebras, Groq, d-Matrix, Etched, and Taalas challenge Nvidia's GPU-based approach by designing chips optimized for low-latency inference, where real-time user experience is critical.", "body_md": "I have spent the last few weeks falling down a particular rabbit hole, and I have come out of it convinced that most conversations about AI hardware are arguing about the wrong thing: training, meaning who can assemble the biggest cluster and who can train the next frontier model. On that question, from the hardware perspective, there is genuinely nothing to discuss. If you are training a large model, you buy Nvidia. Full stop. The combination of raw GPU horsepower and the CUDA software ecosystem is so far ahead that the hardware choice is not a choice at all. Short of some genuinely new model architecture coming along to reset the board, that race was over before it started.\n\nBut training is a one-off. You build a model once, then you run it billions of times a day, for every user and every prompt, effectively forever (i.e. until you deprecate today’s frontier in a couple of months’ time). Running the model is called inference, and it is where the real money will be made in years to come. Unlike training, and critically for this text, it is also wide open from the hardware perspective.\n\nAnd here is the twist that pulled me down the hole. Inference looks like a single problem but is in fact two, and the half that matters most is a race against the clock. Get it wrong and the user watches a cursor blink; get it right and the answer feels instant. 
That clock, inference-time latency, turns out to be something Nvidia’s general-purpose GPU was never really built to win, and a wave of companies has noticed.\n\nWhat follows is a field guide to five of them: Cerebras, Groq, d-Matrix, Etched and Taalas. Some attack the problem by rethinking memory; others rethink the shape of the chip itself. None of them is trying to beat Nvidia at training. They are all going after the half Nvidia left on the table.\n\nLet us start with the incumbent, because everything else is a reaction to it. You can yawn along if you know GPUs inside-out, but bear with me.\n\nA GPU is a general-purpose parallel calculator on steroids. It was originally built to draw video game graphics, which means doing the same simple maths on millions of pixels in parallel. That happens to be roughly the same shape of work as running a neural network (endless matrix multiplications), which is why GPUs were repurposed for AI. The key word is *general*. An Nvidia chip will happily run a language model today, a protein-folding model tomorrow, and a weather simulation next week. It does not care what you throw at it, as long as it’s parallelizable.\n\nThat flexibility is the first pillar of Nvidia’s dominance. The second is CUDA, the software layer that lets programmers tell the GPU what to do. Over nearly two decades, Nvidia has built an enormous ecosystem on top of it: libraries, compilers, tools, and the accumulated habits of millions of developers. Almost every AI framework in the world is tuned to run on CUDA first. So even a competitor with faster silicon has to convince developers to abandon the ecosystem and community they already know, which is a much harder sell than simply being quick.\n\nTogether, raw parallel power and the CUDA moat are why training belongs to Nvidia and is not really contestable. But the moment we move from building a model to running it, the picture changes completely. 
To see why, we need to look at what actually happens when you use an AI model for inference.\n\nLet’s discuss latency. Training is a batch job measured in weeks, so a few stray milliseconds here or there vanish against the total. Inference is the opposite: a real person is waiting for every token, and each millisecond of delay is felt immediately. This single fact is the lever almost every challenger in this article pulls.\n\nBut latency hides a subtlety, because LLM inference is not one kind of work. It is two. When you serve many users at once, you are constantly juggling two very different phases.\n\nThe first part, prefill, helps the LLM understand what your input is about. When a new request arrives, the model processes the entire prompt in one go, essentially as large matrix multiplications in which the attention mechanism digests everything you typed, every token at once, in parallel. It hammers the chip hard for a short burst, and it batches beautifully: you can stack many users’ prompts together and chew through them at once. This is massively parallel work, exactly what a GPU was born for.\n\nThe second part, decode, is slow and sequential. Once the prompt is understood, the model generates the answer one token at a time, and each new token depends on the one before it. There is no batching your way out of the sequence; it is inherently step by step. This part is also where the so-called memory wall kicks in (more on it later). Constant movement of model weights, together with the ever-growing KV cache, makes the process bound by memory speed rather than by compute.\n\nHere is the tension. If you run both phases on the same GPU, a flood of new prefill work can stall the ongoing decode, so existing users suddenly see their answers pause mid-sentence, a problem called *prefill interference*. 
Here’s where it gets interesting. An increasingly popular fix is so-called ‘disaggregation’: stop forcing one chip to serve both stages, and instead run the prefill stage on parallel-hungry hardware and the decode stage on hardware tuned for fast, predictable, sequential work.\n\nAs indicated before, when a chip runs an LLM, the bottleneck is usually not the compute. Modern chips can multiply numbers extremely fast. The bottleneck is fetching the numbers to multiply. The model’s weights, tens or hundreds of gigabytes of them, have to be hauled from memory into the part of the chip that does the maths, over and over again, for every single token the model produces. The calculating units end up sitting idle, waiting for data to arrive. Engineers call this the memory wall, and it is the crack that every company below is trying to prise open.\n\nTo see why memory is the problem, you need to understand the three main kinds of memory a chip can use, because the difference between them comes up again and again:\n\nDRAM (Dynamic Random Access Memory) is the cheap, dense, high-capacity memory that sits furthest from the processor. Its variants, LPDDR in phones and laptops and GDDR in consumer graphics cards, can hold enormous amounts of data very cheaply. The catch is speed. A DRAM access takes on the order of 100+ nanoseconds, and, more importantly, the bus connecting it to the chip cannot move data fast enough to keep thousands of GPU calculating units fed. For a model that streams tens of gigabytes of weights through the chip for every token, plain DRAM is a drinking straw bolted to a fire hydrant. That is why no serious AI accelerator feeds its compute from ordinary DRAM, and why HBM had to be invented.\n\nHBM (High Bandwidth Memory) is the answer the industry reached for. It is high-density memory stacked into tall micro-towers and placed right beside the chip on the same package, though not inside the chip itself. 
It is also the reigning king of AI memory in 2026, and its supply is a textbook oligopoly: as of [June 2026](https://siliconanalysts.com/tools/hbm-analysis), SK Hynix leads with approximately 50–55% market share, followed by Samsung at 35–40% and Micron at 5–10%, which is essentially the whole market split between three firms, with no fourth player in sight. That concentration is why a shortage of HBM, rather than a shortage of compute, is one of the tightest constraints on the entire AI build-out. Yet because the data still has to leave the memory stack and travel into the die, HBM left just enough room on the table, in latency and in energy, for a wave of challengers to attack.\n\nSRAM (Static Random Access Memory) is the memory built directly onto the chip itself and has been around for [many decades](https://en.wikipedia.org/wiki/Static_random-access_memory). It is fundamentally different from DRAM in how it stores bits: instead of capacitors that must be constantly refreshed, SRAM uses stable transistor circuits, which makes it both extremely fast and very energy efficient per access. A read or write can happen in a few nanoseconds or less, and because it is on die, there is no external bus bottleneck. In practice, this makes SRAM the only memory fast enough to keep modern tensor cores or matrix units continuously fed at full utilisation.\n\nSo the trade-off at the heart of every design below is simple to state. DRAM is out of the picture. SRAM is blindingly fast but tiny. HBM is large but slower. Nvidia bet on HBM, which gives it capacity and flexibility. A handful of companies looked at that single compromise and picked a different way around the wall.\n\nCerebras looked at the memory wall and asked a wonderfully blunt question: on a GPU the HBM already sits just millimetres away on the same package, and even that tiny journey is the bottleneck. 
So why not go to the logical extreme and design an enormous slab of silicon that can fit enough SRAM to accommodate LLM workloads?\n\nNormally a single round 300mm silicon wafer is chopped into hundreds of small, identical chips, and there are good reasons for that, the main one being yield. Manufacturing scatters microscopic defects across the wafer, and because a single defect can kill an entire chip, the bigger the chip the more likely it is to be born dead. Cut the wafer into hundreds of small dies and a few faulty ones barely matter; try to make the whole wafer one chip and, statistically, dozens of defects would render it useless. That defect arithmetic is the real reason nobody had built a wafer-sized chip before. Cerebras’s trick was to [design around it](https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem): the wafer is paved with hundreds of thousands of tiny identical cores plus redundant wiring, so when a defect knocks out a core the chip simply routes around it, a scheme Cerebras says gives it about a hundred times the defect tolerance of a conventional design.\n\nThe result is the Wafer Scale Engine 3, a single chip the size of a dinner plate. [It holds 44 gigabytes](https://arxiv.org/abs/2503.11698) (!) of SRAM right there on the wafer. That number is the whole point. SRAM has always been the fast-but-tiny tier; a high-end GPU carries only tens of megabytes of it on the die. By treating an entire wafer as one chip, Cerebras fits roughly a thousand times more on-chip memory than a conventional processor, enough to keep a whole LLM in the fastest memory there is rather than streaming it from HBM. Because of that, the internal bandwidth reaches an almost comical 21 petabytes per second. 
Set that beside the 4.8 terabytes per second of Nvidia’s H200 and Cerebras has roughly 4,375 times the bandwidth.\n\nThe downside, however, is that 44 gigabytes of lightning-fast SRAM is still only 44 gigabytes, enough for a medium-sized model at full precision but not a large one. A model much larger than about 30 billion parameters (depending on quantization) will not fit. What do you do? You wire a number of wafers together, which forces the data to leave the comfort of SRAM, severely hurts latency, and reintroduces the inter-chip delay the single-wafer design existed to abolish. Cerebras is a magnificent answer to the memory wall as long as your models are small enough to live on a single wafer.\n\nGroq attacked a different weakness: a GPU is fast on average, but its timing is unpredictable, which leads to inefficiencies.\n\nGPUs are designed to execute massive amounts of work in parallel. Because different threads and memory accesses finish at different moments, the hardware schedules work dynamically as it runs, switching between tasks, hiding memory delays, and leaning on caches to stay busy. That flexibility is exactly what makes GPUs so good at the parallel prefill phase. But it also makes their timing variable, and variability is poison for the sequential decode phase, where you want every step to take the same predictable, minimal amount of time.\n\nGroq’s chip, which it calls an LPU (Language Processing Unit), takes the opposite approach to a GPU. Rather than deciding what to do at runtime, almost every operation is [planned in advance by the compiler](https://medium.com/the-low-end-disruptor/groqs-deterministic-architecture-is-rewriting-the-physics-of-ai-inference-bb132675dce4), with the entire execution schedule fixed before the program runs and data movement and computation orchestrated cycle by cycle. 
By trading the GPU’s dynamic improvisation for compile-time planning, the LPU delivers strikingly predictable, optimized performance, resulting in very low latency.\n\nLike Cerebras, Groq keeps its memory on-chip as SRAM, and like Cerebras it pays the SRAM capacity tax: each LPU holds only a few hundred megabytes, far too little for a whole model, so a large model must be spread across hundreds of chips wired together. Where Nvidia puts a big pool of HBM next to each chip, Groq uses a sea of small, fast, tightly choreographed chips instead.\n\nThe strongest evidence that this approach matters did not come from a benchmark. It came from Nvidia itself. In December 2025 Nvidia struck a roughly [20-billion-dollar](https://techcrunch.com/2026/05/29/after-nvidias-20b-not-acqui-hire-ai-chip-startup-groq-reportedly-raising-650m/) licensing deal with Groq, folding its technology and some of its engineering team into its own platform. When the incumbent pays twenty billion dollars to bring a challenger’s idea in-house, that is the market confirming the idea was real. 
And at GTC 2026 the deal turned into a product that is the disaggregation idea made literal: the Groq technology reappeared as the [Groq 3 LPX](https://developer.nvidia.com/blog/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator-for-the-nvidia-vera-rubin-platform), a rack-scale accelerator co-designed with Nvidia’s next-generation [Vera Rubin platform](https://futurumgroup.com/insights/nvidia-gtc-2026-day-1-can-nvidias-ecosystem-accelerate-the-inference-inflection/), where the Rubin GPUs handle the parallel prefill and the Groq LPX chips take over the latency-sensitive decode. Jensen Huang called it extreme co-design and claimed the pairing delivers up to 35 times higher throughput per megawatt on trillion-parameter models.\n\nd-Matrix accepted the memory wall as the central problem, like everyone else, but proposed the cleverest sidestep of all. If the costly part is carrying the weights from memory to the compute units, then stop carrying them: do the calculation right where the data already lives.\n\nThis approach, called in-memory computing, challenges one of the oldest assumptions in computer design, the von Neumann architecture, in which memory and processing are separate places and data constantly commutes between them. That commute is exactly what the memory wall is made of. d-Matrix’s [Digital In-Memory Compute (DIMC)](https://www.servethehome.com/d-matrix-corsair-in-memory-computing-for-ai-inference-at-hot-chips-2025/) builds arithmetic directly into the SRAM structures that store the model’s weights, so many operations happen where the data already sits rather than being shuttled across the chip. In a sense it goes one step beyond both HBM, which sits beside the compute die, and conventional SRAM, which sits on the die but still keeps memory and compute as separate neighbourhoods.\n\nAgainst Nvidia, the pitch is not maximum general-purpose performance but maximum inference efficiency. 
Where a GPU is a versatile engine that continually moves weights and activations back and forth, d-Matrix’s Corsair accelerator is purpose-built to barely move them at all, cutting energy use and squeezing out latency.\n\nThe first three challengers all, in their different ways, attack the memory. The last two attack something else entirely: the very idea that a chip should be able to run more than one thing.\n\nEvery chip we have discussed so far, even the strange ones, is still programmable; it can run different kinds of neural network. Etched asked whether that flexibility is worth paying for at all, given that essentially every important AI model today is built on a single architecture, the transformer.\n\nTo see the bet, you need the difference between a GPU and an ASIC. A GPU is general-purpose: it can run many kinds of workload, which makes it wonderfully flexible, but some of its silicon and power is always spent on capabilities a given task never uses. An ASIC (Application-Specific Integrated Circuit) is the opposite. It is built for one narrow class of computation, and because it spends nothing on generality it can be dramatically faster and more efficient at its one job, while being useless for anything else. A GPU is a versatile workshop full of tools; an ASIC is a single machine built to perform one production step superbly.\n\nEtched’s chip, Sohu, is an ASIC that runs transformers and only transformers. It physically cannot run the older kinds of neural network, such as the convolutional networks used for images or the recurrent networks that preceded transformers. In exchange, Etched claims Sohu is many times faster and cheaper on transformer inference, [citing figures](https://theaiworld.org/news/etcheds-500m-sohu-chip-takes-aim-at-nvidia) of roughly 15 times the speed, a tenth of the cost and a ninth of the power for generating tokens from a large Llama model. 
These are the company’s own claims rather than independent benchmarks, so they deserve a pinch of salt, but the direction is clear.\n\nTellingly, and unlike the first three, Etched does not abandon HBM at all: Sohu reportedly uses 144 gigabytes of HBM3E per chip, even more than an Nvidia H100. That is the giveaway that Etched is not fighting the memory war. Its bet is pure specialisation. Against Nvidia’s promise to run anything, it offers the opposite: we run one thing, perfectly. The risk is just as pure: if the transformer is ever dethroned by a new architecture, a chip that can only run transformers becomes very expensive scrap.\n\nAnd then there is Taalas, which takes the logic of specialisation to its absolute, and slightly mind-bending, conclusion. Etched builds hardware for a class of models. Taalas builds hardware for one specific trained model.\n\nOn every other chip, including Etched’s, the weights live in memory and are fetched whenever they are needed, and moving them is the single biggest contributor to the memory wall. Taalas attacks that at the root. Instead of storing the weights as data, it [etches them directly into the chip’s circuitry](https://www.sdxcentral.com/news/chip-designer-taalas-bets-on-hard-wired-ai-chips/). They are no longer loaded from SRAM or HBM; they become part of the hardware itself. What was once a memory access becomes a signal propagating through fixed logic. This does not abolish memory entirely: the chip still needs working memory for activations and the KV cache, but it removes the need to haul billions of parameters across the chip for every token, which was the largest source of movement of all. The result is not just an efficiency gain but a step change in throughput: early systems for the Llama 3.1 8B model [have demonstrated](https://www.eetimes.com/taalas-specializes-to-extremes-for-extraordinary-token-speed/) well over 17,000 (!) 
tokens per second, versus roughly 2,000 for Cerebras, 600 for Groq, and around 350 for Nvidia Blackwell-generation hardware.\n\nThe trade-off is the most extreme in this article, and it follows straight from the design: the chip is welded to a single model. You cannot load new weights, fine-tune it, or upgrade to a newer model without manufacturing new hardware. Taalas’s answer is speed of manufacturing: rather than designing a fresh processor each time, it claims it can work with TSMC to alter only a small number of layers, turning a trained model into a production chip in roughly two months.\n\nThis is the mirror image of Nvidia’s philosophy. Nvidia sells a flexible processor and lets software evolve on top of it; new models arrive as code. Taalas flips the relationship so that the model itself becomes the hardware. Nvidia bets that the future is uncertain and flexibility is priceless. Taalas bets that for sufficiently popular and stable models, the efficiency of hardwiring them into silicon is worth losing the ability to change them.\n\nStep back from the five companies and a beautiful pattern appears. They are not five random ideas. They are points along a single line, a gradual trade of flexibility for the speed and efficiency that inference rewards.\n\n· **NVIDIA:** General-purpose GPU platform optimised for massive parallelism across both training and inference. 
Unbeatable flexibility, but not built for deterministic low latency.\n\n· **Cerebras:** Wafer-scale chip with enormous on-chip SRAM and bandwidth, keeping a whole model in the fastest memory there is, best suited to small and medium models at extreme speed.\n\n· **Groq:** Deterministic dataflow architecture that fixes the schedule in advance, purpose-made for ultra-low-latency, token-by-token decoding.\n\n· **d-Matrix:** Digital in-memory compute, performing arithmetic directly within the memory array to minimise data movement.\n\n· **Etched:** An ASIC that runs only transformers, trading all flexibility for raw specialisation, and the one challenger that keeps HBM rather than fighting the memory war.\n\n· **Taalas:** The model compiled into silicon, its weights fixed in the wiring, extraordinarily efficient for a single model and useless for any other.\n\nTwo things jump out from this map. The first is that nobody here is seriously claiming they will replace Nvidia for the enormous, flexible job of training frontier models, where raw power and CUDA still rule absolutely. They are going after the daily running cost of AI, the part that happens billions of times a day, because that is where latency is won or lost and where Nvidia’s generality is least necessary.\n\nThe second is that the whole spectrum is really one long argument about a single question: how much flexibility are you willing to surrender to escape the memory wall? Nvidia surrenders none and pays in efficiency. Taalas surrenders everything and is rewarded with extreme token generation speeds. Everyone else is somewhere in between, each convinced they have found the sweet spot.\n\n[Nvidia Already Won Training. 
The Real Fight Is Inference](https://pub.towardsai.net/nvidia-already-won-training-the-real-fight-is-inference-a7dcf1cb8e72) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/nvidia-already-won-training-the-real-fight-is-inference", "canonical_source": "https://pub.towardsai.net/nvidia-already-won-training-the-real-fight-is-inference-a7dcf1cb8e72?source=rss----98111c9905da---4", "published_at": "2026-06-30 16:31:00+00:00", "updated_at": "2026-06-30 16:56:56.644880+00:00", "lang": "en", "topics": ["ai-chips", "ai-infrastructure", "ai-products", "ai-research", "ai-startups"], "entities": ["Nvidia", "Cerebras", "Groq", "d-Matrix", "Etched", "Taalas", "CUDA"], "alternates": {"html": "https://wpnews.pro/news/nvidia-already-won-training-the-real-fight-is-inference", "markdown": "https://wpnews.pro/news/nvidia-already-won-training-the-real-fight-is-inference.md", "text": "https://wpnews.pro/news/nvidia-already-won-training-the-real-fight-is-inference.txt", "jsonld": "https://wpnews.pro/news/nvidia-already-won-training-the-real-fight-is-inference.jsonld"}}