Nvidia Extends Its Grip On The AI Datacenter Outwards

wpnews.pro

Nvidia wants the modern AI datacenter to be more like an Apple product, and with announcements it just made at the Computex conference in Taiwan, the company is getting closer to achieving that ambitious goal.

Creating a platform that not only spans the datacenter, but includes the facility itself and reaches out to interface with the power grids that feed datacenters, is not a trivial thing. And making one that meshes with rowscale and rackscale systems and their control planes, which in turn mesh with the software stacks running on individual server nodes and the CPUs and accelerators at an even finer granularity with their own libraries and monitors, is a Herculean task. It is one that a hyperscaler or a cloud builder could have done in recent years, and maybe AI model builders are crafting one behind the scenes at this moment. Nvidia, which has a damned near complete AI platform generating hundreds of billions of dollars a year, can also do it, and this is what the company’s DSX effort is all about.

DSX adds another outer matryoshka doll in an already very deep software hierarchy created by Nvidia, which I will remind you once again is a software company – 75 percent of its nearly 40,000 employees work on software – that happens to design the most sophisticated hardware systems the world has yet seen with the other 25 percent of the employees. We do not know what DSX stands for any more than we knew what the DGX server brand stood for. (We guess that DGX was short for Datacenter GPU with X86 host, and that DSX means Datacenter Simulation with X for variable meaning monitoring, management, and simulation.)

The inner matryoshka doll in the Nvidia software hierarchy comprises the more than 900 domain-specific libraries and software stacks for accelerating computing on GPUs that Nvidia calls CUDA-X. This is a horizontal stack that includes key libraries such as cuBLAS, cuDNN, cuSolver, cuDF, cuOpt, cuML, cuGraph, and others, plus a collection of vertical stacks that have optimizations for AI, HPC, data analytics, genomics, quantum physics, chip design, and others. AI Enterprise for AI training and inference and Rapids for data analytics are two examples of vertical software stacks built upon CUDA-X. These are separately licensed; you get CUDA-X by virtue of buying an Nvidia GPU, and if you want tech support for it, you buy a stack license, most likely AI Enterprise. For the HPC folks, there is something called the HPC SDK that rolls up all of the needed libraries for classical ModSim work as well as the Nvidia compilers (from its acquisition of The Portland Group way back in July 2013).

Higher up one level, there is a matryoshka doll layer called Dynamo, which builds on that foundation and creates models and highly tuned engines for running AI inference. (There really is not an analog from Nvidia for AI training, but it could easily make one. Most of the big AI model builders have their own way of stacking up their training cluster software.) Nvidia calls Dynamo an operating system for AI inference, but we think that is a misuse of that term. Dynamo is a platform layer in the stack that does inference. And regardless of the fact that Nvidia is talking about something called DSX OS now, we think this stretches the definition of an OS a bit. But not as much as its Dynamo claim. If and when the datacenter and all aspects of the facility – power, cooling, and space – are under software control, perhaps you can call this an OS. We think that it is just management and modeling software. Take that for what you will.

Here is what the DSX OS software stack looks like:

And here is a much more detailed version of the stack:

The initial DSX software has evolved in stages, and was originally called Omniverse DSX Blueprint when it was previewed back in October 2025. It was basically a digital twin for designing and operating megawatt and gigawatt scale datacenters with Nvidia gear inside. This is now called DSX Sim.

As you well know, these days every stack starts with a common API so elements of the stack can talk to each other, and DSX OS is no different. DSX Exchange, the second module in the DSX OS stack, which was announced back in March, is the API hub, allowing for all of the elements of the infrastructure to talk to each other. This is important because to maximize the efficiency of the GPUs in the servers in an AI datacenter, you have to do it within the context of the datacenter and the power it can use at any given second. Some control telemetry flows from the bottom up, and some flows from the top down.

DSX Flex, the third module of the DSX OS stack, adds connectivity from that central API in the datacenter out to the power grids and their operations control systems so AI datacenters don’t ask for what they cannot have.
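To make the two directions of flow concrete, here is a minimal sketch of the idea in Python. This is not the DSX Exchange or DSX Flex API, which we have not seen; the `Rack` and `Hub` classes, the method names, and the proportional scaling policy are all hypothetical and chosen only for illustration. Racks report their measured draw upward, and when a grid limit arrives, the hub pushes proportionally scaled caps back down.

```python
from dataclasses import dataclass, field


@dataclass
class Rack:
    """Hypothetical rack record; not an Nvidia data structure."""
    name: str
    draw_watts: float                 # latest measured draw (bottom-up telemetry)
    cap_watts: float = float("inf")   # limit pushed from above (top-down control)


@dataclass
class Hub:
    """Hypothetical central hub that sees every rack and the grid limit."""
    racks: dict = field(default_factory=dict)

    def report(self, name: str, draw_watts: float) -> None:
        # Bottom-up: a rack reports its current power draw.
        rack = self.racks.setdefault(name, Rack(name, draw_watts))
        rack.draw_watts = draw_watts

    def total_draw(self) -> float:
        return sum(r.draw_watts for r in self.racks.values())

    def apply_grid_limit(self, limit_watts: float) -> dict:
        # Top-down: scale every rack's cap so the total fits the grid limit.
        # Proportional scaling is an assumed policy, picked for simplicity.
        total = self.total_draw()
        scale = min(1.0, limit_watts / total) if total > 0 else 1.0
        for rack in self.racks.values():
            rack.cap_watts = rack.draw_watts * scale
        return {r.name: r.cap_watts for r in self.racks.values()}


hub = Hub()
hub.report("rack-a", 120_000)
hub.report("rack-b", 80_000)
caps = hub.apply_grid_limit(150_000)  # the grid can only supply 150 kW
```

A real control plane would presumably weigh workload priority and thermal headroom rather than scaling everything evenly, but the shape of the exchange, telemetry up and limits down, is the point.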

Both of these pieces of software have been available with Nvidia’s current platforms, which are based on the “Grace” Arm server CPU and the “Blackwell” B200 and B300 GPU accelerators and a judicious mix of NVSwitch memory interconnects, BlueField DPUs, and QuantumX InfiniBand or SpectrumX Ethernet scale out networks, plus the new STX flash storage layer integrated into racks.

With the next generation of Nvidia platform, which is based on the “Vera” CPU and “Rubin” GPU plus upgraded networks and storage and which is due later this year, Nvidia is adding more modules to the DSX OS stack – and ones that hook into the dynamic power features of the Vera-Rubin platform.

This fourth module is called MaxLPS, and it is not at all clear what LPS stands for. What is clear is what it does. As Ian Buck, vice president of AI and HPC at Nvidia, explained in a prebriefing ahead of the GTC Taipei event this week (which coincided with the annual Computex conference in Taiwan), MaxLPS is all about getting the most performance for the most power drawn by the datacenter.

“DSX MaxLPS is a suite of technologies specifically designed to work with Vera-Rubin's next-generation hardware dynamic power features to orchestrate and maximize compute throughput, and to optimize the number of datacenter factory GPUs in a power limited world,” Buck explained. “MaxLPS helps operators recover all the stranded power. It does this by real-time monitoring of power and provisioning for every GPU, every rack, and every row across the entire datacenter. By doing so, datacenter operators can now populate even more GPUs and CPUs in a fixed power datacenter and know that they will operate safely and within their total power budget. With MaxLPS, AI factories can safely deploy up to 40 percent more GPUs within the same power envelope. That's 40 percent more compute, 40 percent more tokens, and 40 percent more revenue than was possible before.”
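A back-of-the-envelope sketch shows how recovering stranded power can turn into more GPUs under a fixed budget. The wattages below are hypothetical numbers we picked so the arithmetic lands near the 40 percent Buck cites; they are not Nvidia figures, and the function name is ours. The assumption is that static provisioning reserves a worst-case figure per GPU, while monitored provisioning can reserve a lower enforced cap.

```python
def gpus_that_fit(budget_watts: int, reserved_watts_per_gpu: int) -> int:
    """Number of GPUs whose reserved power fits inside a fixed budget."""
    return budget_watts // reserved_watts_per_gpu


# All three figures are hypothetical, for illustration only.
BUDGET_W = 1_000_000   # power allotted to GPUs in the facility
NAMEPLATE_W = 1_400    # worst-case reservation per GPU without monitoring
CAPPED_W = 1_000       # enforced per-GPU cap with real-time monitoring

static_count = gpus_that_fit(BUDGET_W, NAMEPLATE_W)
capped_count = gpus_that_fit(BUDGET_W, CAPPED_W)

# Fractional increase in GPU count within the same power envelope.
uplift = capped_count / static_count - 1
print(f"{static_count} -> {capped_count} GPUs ({uplift:.1%} more)")
```

The gap between the two reservations is the stranded power: capacity that is set aside on paper but rarely drawn. How much of it can be safely reclaimed depends on how tightly the caps can be enforced, which is the part the hardware has to guarantee.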

That comparison is to prior Grace-Blackwell platforms, and this is one of the reasons why, according to Buck, the order book is full for Vera-Rubin platforms. The idea is to reduce time to revenue and to boost tokens per watt when operating AI supercomputers running inference at scale.

Here is what the block diagram for the pieces of the DSX OS stack looks like and how they all feed into each other:

Nvidia says that DSX OS is both open and modular, and it is absolutely intended that all of the various management tools for all elements of the datacenter and power grids can plug into it. Open does not mean open source, so don’t jump to that conclusion. Not everyone is going to open up their tools to plug into DSX OS, and it is not clear that Nvidia will open up all of the code behind the stack.

There are open source pieces of the stack that are not called out as DSX OS modules but are almost certainly part of it. For instance, NVSentinel, a tool for managing the health of GPUs and automagically remediating issues for applications running in a Kubernetes container environment, is open source. So is the KAI Scheduler for AI workloads running on Kubernetes, and so is the Nvidia Cloud Functions control plane and compute framework for AI stacks. The DSX Exchange API hub and integration module is open source. Nvidia tends to favor the Apache 2.0 license for its systems software.

Open or not, we think a lot of neoclouds, sovereigns, and enterprises are not even going to try to reinvent the wheel and will adopt DSX OS to manage their AI datacenters with Nvidia iron. There is even a chance that others will grab DSX Exchange and put it at the heart of their datacenter management plane. (We are talking to you, AMD.) It would be good to have one API messaging integration from the datacenter down to the AI systems and out to the power grid. But hell might have to start getting cold – meaning key Nvidia and AMD customers start making a fuss – before this happens.

Heaven only knows what commercial support will cost for this, but assume it will not be cheap. Nvidia still wants a software business in its P&L someday.

source & further reading

nextplatform.com: original article