Three HPC Gurus Ask: Do We Still Need GPUs?

Three high-performance computing experts—Jack Dongarra, Torsten Hoefler, and Satoshi Matsuoka—are questioning the necessity of GPUs in AI and scientific computing in a forthcoming paper, prompted by the emergence of the all-CPU supercomputer LineShine, which ranks as the fastest AI/HPC system in the world. The paper argues that modern CPUs with matrix engines and high-bandwidth memory can rival GPUs, challenging Nvidia's dominance in AI training and HPC workloads.

Yes, that simple question is heretical in the modern Nvidia world that has come to dominate AI training and, to a certain extent, HPC simulation and modeling. But it is also a logical question, given that CPUs are in many cases starting to look more like GPUs, with their hybrid vector and matrix math engines, mixed precision support, in some cases HBM stacked high bandwidth memory as well as fatter DRAM main memory, and integrated interconnects.
And so when Jack Dongarra (https://www.linkedin.com/in/jack-dongarra-028544/), of the University of Tennessee and of Oak Ridge National Laboratory for 36 years, Torsten Hoefler (https://www.linkedin.com/in/torsten-hoefler/), of ETH Zurich and chief architect for AI/ML at CSCS, and Satoshi Matsuoka, of RIKEN lab and of Tokyo Institute of Technology, ask that question rhetorically and answer it, people listen.
That question is answered in a forthcoming paper called Do We Still Need GPUs? Rethinking AI and Scientific Computing on Matrix-Enhanced CPUs, which will be on arXiv as well as in the flagship publication of the Association for Computing Machinery, and which you can read for yourself at this link until it is published: https://www.dropbox.com/scl/fi/u389vin4cnuvk2s3x7hcg/Do-We-Still-Need-GPUs-CACM-Revised.pdf?rlkey=ha2qziv1bka45l487zmqfswfb&e=1&dl=0
And that question was prompted by the existence of a new all-CPU supercomputer called “LineShine” that is the fastest AI/HPC supercomputer in the world, according to the latest Top500 rankings that came out this month (https://www.nextplatform.com/hpc/2026/06/23/amd-and-nvidia-are-neck-and-neck-in-hpc-supercomputing/5260202).
I did a deep dive on the processors, memory, and interconnects of the LineShine machine and its Chinese-made LX2 Arm server CPU here (https://www.nextplatform.com/hpc/2026/06/25/a-deep-dive-on-chinas-lineshine-all-cpu-exaflops-class-supercomputer/5262439), which you need to read to get this paper that the three HPC gurus are going to put out there.
You Go To War With The Compute Engines You Have
Here is the thesis, which I will embellish. With enthusiasm.
The paper compares and contrasts the A64FX processor and the architecture of the “Fugaku” supercomputer at RIKEN Lab, which went into full operation in March 2021, with the LineShine machine, which appears to have been fired up last fall. Both AI/HPC supercomputers are all-CPU designs, as was Fugaku’s predecessor at RIKEN, the “Project Keisoku” K supercomputer that went into full production in September 2012. For your reference, I did a deep dive on the K system back at The Register here (https://www.theregister.com/2009/11/27/fujitsu_venus_tofu/). Fujitsu moved from Sparc to Arm architecture with the Fugaku machine; the deep dive on the A64FX processor is here (https://www.nextplatform.com/2018/08/24/fujitsus-a64fx-arm-chip-waves-the-hpc-banner-high/) and the one on the Tofu D companion interconnect is there (https://www.nextplatform.com/2018/09/14/slicing-into-the-post-k-supercomputers-tofu-d-interconnect/).
Here is the funny bit as it relates to the K machine. It was not an all-CPU machine by choice. Back in 2008, the idea was for Japan to engage its three big supercomputer and system makers to create a hybrid machine that combined CPU compute made by Fujitsu with vector accelerators from Hitachi, with NEC working on a multiple dimension mesh/torus interconnect to link all of the CPU and vector nodes to each other to share work.
But in May 2009, with the Great Recession roaring and Hitachi and NEC unsure how they could afford to do the K system development and manufacturing, they both pulled out of the deal (https://www.theregister.com/2009/05/14/nec_pulls_out_of_keisoku/), leaving Fujitsu to create its very good “Venus” Sparc64-VIIIfx processor, which had big fat vector engines on it. The Tofu interconnect that Fujitsu finished after some initial development by NEC was also very good, and the resulting K machine was not only the fastest supercomputer in the world, but it was also the most efficient machine across many workloads. In fact, Fugaku cannot beat its computational efficiency, even with the third generation Tofu D 6D mesh torus interconnect. It is hard to squeeze all the flops out of progressively larger machines.
Why single out Fugaku and LineShine for the comparison? Well, both have been used to support trillion parameter GenAI models, and both are also supporting traditional ModSim codes as well as mixing AI and HPC workloads to get real stuff done.
The paper’s authors correctly point out that GPUs became compute engines in the first place because CPUs were not delivering enough flops at multiple precisions, and their memory subsystems did not supply enough bandwidth even if they did have a lot of math embedded into their designs. It was the combination of a lot of vector math and then even more powerful tensor math plus fast GDDR and then HBM stacked memory, which is skinny on capacity but enough to do useful work once solvers were parallelized, that made the GPU indispensable. The CPU makers were happy to sell two things instead of one, and eventually Nvidia was happy to sell CPUs and not just GPUs.
But all things being equal, HPC shops would have simply preferred to stick with scale out networks and have CPUs with turbocharged math capabilities. Gradually, ever so slowly, this is happening.
The paper’s authors call out the SVE vector extensions that were added by Arm with its Neoverse Armv8.2-A architecture that debuted with the Fujitsu A64FX processor in 2016 and the improved SVE2 vector units that were added with the Armv9-A architecture in 2019. The Arm SME matrix unit was also added with the Armv9-A architecture, and improved SME2 matrix engines were added to the Armv9.4-A architecture in 2022. They also acknowledge that Intel added its AMX matrix units to its Xeon server spec in 2020, and that its “Sapphire Rapids” CPUs, notably used in the “Aurora” supercomputer at Argonne National Laboratory, were the first Xeons to implement AMX. Intel’s AVX-512 vector units debuted way back in 2016 with its “Knights Landing” Xeon Phi accelerators and were eventually brought over to the Xeons proper.
I will point out that IBM Power10 and Power11 server chips as well as its z16 and z17 mainframe processors all have matrix engines – and got them into production long before any Arm or Xeon CPU. IBM’s two families of processors are also the only ones that support native decimals with two digits after the decimal point, which is what I call “money math,” where you don’t want rounding issues. AMD certainly has matrix math units it can add to its Epyc processor cores or within the Epyc chip package, but thus far it has not.
The Power10, shipped in September 2021, and the Power11, shipped in July 2025, add vector and matrix units to each core, but the “Telum” and “Telum-II” mainframe chips have them on the die outside of the cores. The Arm and Xeon designs put matrix units in every core as well. For IBM, GenAI inference is its preferred HPC workload, and its systems are architected as such so you don’t need GPUs. IBM got tired of losing money as a prime contractor for HPC systems for the big national labs in the United States and Europe.
There are three ways to skin this cat without leaving the security perimeter of a server node, and the third one, which IBM also does, is to have much larger matrix math units sitting on PCI-Express cards under the server skins. This is Big Blue’s Spyre accelerator (https://www.nextplatform.com/ai/2025/10/10/ibm-ships-homegrown-spyre-accelerators-embraces-anthropic-for-ai-push/1642396), which came out of IBM Research. Eventually, perhaps with Power12, IBM will put the same exact matrix unit on all three compute engines. As it is, the MMA used in Power10 and Power11 is very different from the matrix units used in the z16, z17, and Spyre.
Why Did We Go To GPUs In The First Place?
GPUs were first coming into the HPC space in the late 2000s, many years before statistical AI became a thing and long before GenAI exploded on the scene as models grew in data scale and type – and had the compute scale to build massive neural networks with emergent behavior. Back then, you paid 3X as much for a GPU accelerated machine to get around 3X the performance compared to a CPU-only system. There was not a very good price/performance argument initially. The main benefit you got was higher memory bandwidth, which meant time to answer improved, and lower energy costs for FP64 and FP32 compute. Eventually, as the HPC software stack matured, the performance delta compared to CPUs grew and the price/performance also got better for GPU systems.
There is a gap, which you can see in the Top500 rankings. But that gap comes at a pretty steep cost: You have to split your code so numerically intense parallel routines are offloaded to the GPU and serial work remains on the CPUs. The GPU is a distinct device, and you have to move data back and forth between these devices. The latter takes energy, the former takes money. Perhaps with GenAI coding assistants, this hybrid computing model can be made easier to implement.
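To make the offload trade-off concrete, here is a toy model in Python. It is a minimal sketch and not anything taken from the paper: the function names are made up for illustration, and the runtimes, speedup, data volume, and link bandwidth in the usage lines are hypothetical numbers chosen only to show both outcomes. The only figure that comes from the discussion above is the early 3X price for 3X performance ratio.

```python
# Toy model of GPU offload versus CPU-only execution.
# All function names and example inputs are hypothetical illustrations.


def relative_price_performance(cost_ratio: float, perf_ratio: float) -> float:
    """Performance per dollar of system B relative to system A.

    A value of 1.0 means no price/performance advantage either way.
    """
    return perf_ratio / cost_ratio


def cpu_only_seconds(parallel_s: float, serial_s: float) -> float:
    """Everything, parallel and serial, runs on the host CPU."""
    return parallel_s + serial_s


def offload_seconds(
    parallel_s: float,
    serial_s: float,
    gpu_speedup: float,
    bytes_moved: float,
    link_bytes_per_s: float,
) -> float:
    """Parallel work runs on the GPU, serial work stays on the CPU,
    and the data has to cross the host-to-device link."""
    transfer_s = bytes_moved / link_bytes_per_s
    return parallel_s / gpu_speedup + serial_s + transfer_s


def offload_pays_off(
    parallel_s: float,
    serial_s: float,
    gpu_speedup: float,
    bytes_moved: float,
    link_bytes_per_s: float,
) -> bool:
    """True when the offloaded run finishes sooner than the CPU-only run."""
    hybrid = offload_seconds(
        parallel_s, serial_s, gpu_speedup, bytes_moved, link_bytes_per_s
    )
    return hybrid < cpu_only_seconds(parallel_s, serial_s)


# Early GPU systems: about 3X the cost for about 3X the performance.
print(relative_price_performance(cost_ratio=3.0, perf_ratio=3.0))

# Hypothetical job dominated by parallel math: offload wins easily.
print(offload_pays_off(90.0, 10.0, 10.0, 64e9, 32e9))

# Hypothetical job dominated by serial work: data movement eats the gain.
print(offload_pays_off(1.0, 10.0, 10.0, 64e9, 32e9))
```

The model ignores overlap of transfers with compute and says nothing about energy, but it shows why a code with a small parallel fraction and a lot of data to move can be better off staying on the CPU.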
This GPU offload is not for everyone, which is why only about half of the Top500 machines tend to be accelerated in some fashion. The other half are still CPU-only machines, after a decade and a half of heavy commercialization of GPU-accelerated machines. Yes, they are less power efficient and less impressive, but you don’t have to pay Nvidia’s CUDA tax on its GPU hardware, and you don’t even have to go to very expensive HBM memory, either.
Here is how Dongarra, Hoefler, and Matsuoka see the situation today, particularly in the wake of the LineShine machine and its LX2 processor being launched.
“The strongest argument, then, is not that GPUs are useless. They are not. GPUs are extremely powerful and will remain important. The argument is that GPUs are not fundamentally required if the CPU evolves to include the architectural features that made GPUs attractive. A CPU with SVE/AVX, SME/AMX, HBM, multiple precision formats, and matrix-multiply capability is no longer a conventional CPU. It is a general-purpose processor with accelerator-class numerical machinery.”
“This shift could be especially important for the convergence of AI and scientific computing. Future scientific applications will not simply run simulations or train neural networks separately. They will combine simulation, data assimilation, optimization, uncertainty quantification, and machine learning in tightly coupled workflows. These workflows need both AI-style tensor throughput and traditional HPC capabilities: MPI communication, double precision, sparse solvers, adaptive algorithms, file I/O, and complex control logic. A CPU with integrated matrix acceleration may be a cleaner platform for this convergence than a system that requires constant movement between a host CPU and a discrete GPU.”
You can read the paper to get some insight into how the trillion parameter inference models work on Fugaku and LineShine, and see how the addition of the matrix math unit with the custom LX2 Arm server CPU in LineShine makes all the difference. LineShine is installed at NSC Shenzhen in China.
But here is what I want you to think about. The LX2 chip is probably etched by Semiconductor Manufacturing International Corp (SMIC), China’s wannabe competitor to Taiwan Semiconductor Manufacturing Co, probably using an advanced 7 nanometer process node, and given the size of the chip, that means it can only run at 1.55 GHz to stay in a 650 watt thermal envelope. If the LX2 were etched in a 3 nanometer process at TSMC, the clock speed could go higher, the chip could be smaller, and the cost per chip might go down a little. Such a move would almost certainly cut the number of processors needed, and therefore the number of cores needed, in half to get to the same 2.74 exaflops peak theoretical performance as the actual LineShine has. And instead of burning 42.2 megawatts of power to do it, it might do it at something closer to 25 megawatts.
On the High Performance Linpack test, the real LineShine is rated at 2.2 exaflops, and if you do the math, that is 52.1 gigaflops per watt. But if the power draw could be dropped to 25 megawatts by a few process node jumps, then at 3 nanometers for the LX2 CPU, LineShine would be at around 88 gigaflops per watt. “El Capitan,” the number two machine on the June Top500 list that is installed at Lawrence Livermore National Laboratory, is “only” at 60.9 gigaflops per watt.
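The efficiency arithmetic above is easy to check. Here is a minimal Python sketch that uses only the rounded figures quoted in the text: 2.2 exaflops on HPL, 2.74 exaflops peak, 42.2 megawatts as measured, and the hypothetical 25 megawatts for a 3 nanometer shrink. The function name is made up for illustration, and because the inputs are rounded, the results are approximate.

```python
# Back-of-the-envelope check of the LineShine efficiency figures.
# Inputs are the rounded numbers quoted in the article; the 25 MW case
# is the article's hypothetical, not a measured value.


def gigaflops_per_watt(exaflops: float, megawatts: float) -> float:
    """Convert sustained exaflops and megawatts into gigaflops per watt."""
    flops = exaflops * 1e18   # exaflops -> flops
    watts = megawatts * 1e6   # megawatts -> watts
    return flops / watts / 1e9


HPL_EXAFLOPS = 2.2            # sustained High Performance Linpack result
PEAK_EXAFLOPS = 2.74          # peak theoretical performance
ACTUAL_MEGAWATTS = 42.2       # power draw of the real machine
HYPOTHETICAL_MEGAWATTS = 25.0 # assumed draw after a process shrink

actual = gigaflops_per_watt(HPL_EXAFLOPS, ACTUAL_MEGAWATTS)
shrunk = gigaflops_per_watt(HPL_EXAFLOPS, HYPOTHETICAL_MEGAWATTS)
hpl_fraction = HPL_EXAFLOPS / PEAK_EXAFLOPS

print(f"actual:       {actual:.1f} gigaflops per watt")
print(f"hypothetical: {shrunk:.1f} gigaflops per watt")
print(f"HPL as a share of peak: {hpl_fraction:.0%}")
```

With these inputs the real machine works out to roughly 52.1 gigaflops per watt, the hypothetical 25 megawatt case lands in the high 80s, and HPL comes in at about 80 percent of peak.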
The most power efficient machines on the June list are based on the combination of Nvidia “Grace” Arm server CPUs and “Hopper” H100 and H200 GPU accelerators, and they run at around 70 gigaflops per watt on HPL. I would love to be able to do price/performance per watt analysis on these machines, but we don’t know what LineShine costs.
Here is another thing to think about in what might be a post-GPU world. If the balance for agentic AI is moving towards one CPU for every one GPU, then embedding matrix math in the CPU can create a “superchip,” as Nvidia calls its Grace-Hopper, Grace-Blackwell, and soon Vera-Rubin hybrids, which are two or three distinct chips combined on a single system board. Why have an offload model at all if you can just integrate these functions?
One last thing: I have argued for 2,048-bit vector engines. Intel’s chief Xeon architect, who is no longer with the company, looked at me like I was crazy. I am beginning to think maybe 4,096 bits might suit the future just fine. . . .
So, given all of this, don’t be surprised when Nvidia adds tensor cores to a future Arm server CPU. It is going to happen. And if Nvidia is wise – and it is most definitely wise beyond its years – then it will be the same tensor cores used in its GPU accelerators so that the turbocharged CPU can run the same CUDA-X software stack.