AI has completely shifted how the tech industry thinks about infrastructure.
Though people usually think of the GPU crunch when they consider infrastructure bottlenecks, chips are only one cog in a much larger machine. As some of these components become more capable, the rest of the system needs to keep up. That's why CoreWeave decided to build around these problems.
"That concept has now created the need for us to think about the data center from an AI-centric perspective at a rack level, versus just at a machine level," Corey Sanders, SVP of product at CoreWeave, told The Deep View.
The Deep View sat down with CoreWeave for an exclusive interview to discuss new innovations built specifically around Nvidia's Vera Rubin NVL72, the chip giant's 72-GPU system built for AI supercomputing.
The biggest innovations lie in two areas, said Jacob Yundt, senior director of compute architecture at CoreWeave: cooling and control.
Valvey: Yundt's team developed a programmable valve system, affectionately named "Valvey," that controls how coolant flows through server racks, using software to monitor pressure, flow rate and leaks. This dramatically reduces downtime caused by rack failures by isolating incidents and preventing a domino effect of shutdowns.
Racky: The team also created a single controller, called "Racky," that sits on top of each rack to control everything in one interface, including power, cooling, and environmental sensor data. This creates a single unified system that lets customers manage their racks more easily and scale up their infrastructure more smoothly.
While these updates may sound esoteric, the downstream impacts matter, said Sanders. The goal is to make managing these systems a more seamless, unified process and to save time and costs by minimizing downtime. "One of the things I'm excited about with Vera Rubin and our partnership with Nvidia is that it changes the shape of what's possible with our end customers," said Sanders. "They can experiment a lot more. They can take more chances. They can innovate more."
It's something that's vital as AI agents force these systems to work harder. And given that CoreWeave provides services to nine out of the 10 leading model providers, "the rack really matters for our customers and their workloads," said Sanders.
"I think the future of AI workloads will span even multiple data centers and potentially even span multiple clouds," Sanders added.
Our Deeper View
AI has created a ripple effect that's causing growing pains for practically every layer of the tech stack, from the software, to the chips and server racks, to the energy itself. However, it's impossible to ignore that several of these innovations revolve around a central entity: Nvidia. The chip giant, for instance, was part of OpenAI's coalition for developing Multipath Reliable Connection, an open standard for networking designed to make GPU clusters faster, more reliable and more efficient. Though AI has undoubtedly created a broad motivation to innovate, Nvidia has a hand in innovations spanning practically every layer of its self-described "five-layer cake," from open-source standards to models to rack design itself. That reach allows it to further solidify its dominance as the kingpin of the tech that sits at the foundation of the AI industry.